Solved: "OCR Extract" action doesn't work well (alfresco-simple-ocr + pdfsandwich) - Page 2

hisayo-s · ‎30 Aug 2018

Thank you for your kindness.

However, my environment consists of CentOS 7, Alfresco 5.2, and OCRmyPDF (Docker).

The scripts you have posted don't match my environment.

As I am very new to docker, I don't know how to change the scripts.

hisayo-s · ‎30 Aug 2018

Comparing pdfsandwich with OCRmyPDF, pdfsandwich recognizes Japanese characters better than OCRmyPDF does.

So I will focus on using pdfsandwich.

Thank you very much for your help.

angelborroy · ‎30 Aug 2018

Did you test with these instructions?

https://github.com/keensoft/alfresco-simple-ocr/blob/master/docker/pdfsandwich-1.6-centos-7/Dockerfi...

I don't know if they still work with the latest CentOS releases, but they can be a starting point.

Hyland Developer Evangelist

fedorow · ‎27 Jun 2019

I tried to work through this solution. My deployment:

Alfresco 6.1.2-ga / Share 6.1.0

jbarlow83/ocrmypdf:v8.2.3 or v7.0.0

api-explorer-6.1.0-ea.war or 6.0.7-ga

And I got a "Failed to copy" error.

The file /usr/local/tomcat/temp/Alfresco/OCRTransformWorker_source_5503547424193883468.pdf existed, but /usr/local/tomcat/temp/Alfresco/OCRTransformWorker_source_5503547424193883468_ocr.pdf did not.

I think I should change

INPUT_DIR=/ocr_input
OUTPUT_DIR=/ocr_output

but I don't understand how. The "ocrmypdf" container doesn't contain these directories.

Log:

alfresco_1 | Exception in thread "defaultAsyncAction1" java.lang.RuntimeException: java.lang.RuntimeException: org.alfresco.service.cmr.repository.ContentIOException: 05270018 Failed to copy content from file:
alfresco_1 | writer: ContentAccessor[ contentUrl=store://2019/6/27/18/13/0081dc19-8750-4ddb-ac3c-396b4ba1a859.bin, mimetype=application/pdf, size=0, encoding=UTF-8, locale=en_US]
alfresco_1 | file: /usr/local/tomcat/temp/Alfresco/OCRTransformWorker_source_5503547424193883468_ocr.pdf
alfresco_1 | at es.keensoft.alfresco.ocr.OCRExtractAction.executeImplInternal(OCRExtractAction.java:183)
alfresco_1 | at es.keensoft.alfresco.ocr.OCRExtractAction.access$200(OCRExtractAction.java:38)
alfresco_1 | at es.keensoft.alfresco.ocr.OCRExtractAction$1.execute(OCRExtractAction.java:164)
alfresco_1 | at es.keensoft.alfresco.ocr.OCRExtractAction$1.execute(OCRExtractAction.java:161)
alfresco_1 | at org.alfresco.repo.transaction.RetryingTransactionHelper.doInTransaction(RetryingTransactionHelper.java:450)
alfresco_1 | at es.keensoft.alfresco.ocr.OCRExtractAction.executeInNewTransaction(OCRExtractAction.java:169)
alfresco_1 | at es.keensoft.alfresco.ocr.OCRExtractAction.access$100(OCRExtractAction.java:38)
alfresco_1 | at es.keensoft.alfresco.ocr.OCRExtractAction$ExtractOCRTask.run(OCRExtractAction.java:151)
alfresco_1 | at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
alfresco_1 | at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
alfresco_1 | at java.base/java.lang.Thread.run(Thread.java:834)
alfresco_1 | Caused by: java.lang.RuntimeException: org.alfresco.service.cmr.repository.ContentIOException: 05270018 Failed to copy content from file:
alfresco_1 | writer: ContentAccessor[ contentUrl=store://2019/6/27/18/13/0081dc19-8750-4ddb-ac3c-396b4ba1a859.bin, mimetype=application/pdf, size=0, encoding=UTF-8, locale=en_US]
alfresco_1 | file: /usr/local/tomcat/temp/Alfresco/OCRTransformWorker_source_5503547424193883468_ocr.pdf
alfresco_1 | at es.keensoft.alfresco.ocr.OCRTransformWorker.transform(OCRTransformWorker.java:86)
alfresco_1 | at es.keensoft.alfresco.ocr.OCRExtractAction.executeImplInternal(OCRExtractAction.java:181)
alfresco_1 | ... 10 more
alfresco_1 | Caused by: org.alfresco.service.cmr.repository.ContentIOException: 05270018 Failed to copy content from file:
alfresco_1 | writer: ContentAccessor[ contentUrl=store://2019/6/27/18/13/0081dc19-8750-4ddb-ac3c-396b4ba1a859.bin, mimetype=application/pdf, size=0, encoding=UTF-8, locale=en_US]
alfresco_1 | file: /usr/local/tomcat/temp/Alfresco/OCRTransformWorker_source_5503547424193883468_ocr.pdf
alfresco_1 | at org.alfresco.repo.content.AbstractContentWriter.putContent(AbstractContentWriter.java:491)
alfresco_1 | at es.keensoft.alfresco.ocr.OCRTransformWorker.transform(OCRTransformWorker.java:83)
alfresco_1 | ... 11 more
alfresco_1 | Caused by: java.io.FileNotFoundException: /usr/local/tomcat/temp/Alfresco/OCRTransformWorker_source_5503547424193883468_ocr.pdf (No such file or directory)
alfresco_1 | at java.base/java.io.FileInputStream.open0(Native Method)
alfresco_1 | at java.base/java.io.FileInputStream.open(FileInputStream.java:219)
alfresco_1 | at java.base/java.io.FileInputStream.<init>(FileInputStream.java:157)
alfresco_1 | at org.alfresco.repo.content.AbstractContentWriter.putContent(AbstractContentWriter.java:485)
alfresco_1 | ... 12 more
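The trace above wraps a single root cause in several layers of exceptions. A quick way to cut through a trace like this is to print only the last "Caused by:" line. This is a minimal sketch, assuming the log has been saved to a file and that each line carries the "alfresco_1 | " prefix that docker-compose adds; the function name root_cause is hypothetical.

```shell
#!/bin/bash
# Print the innermost "Caused by:" line of a Java stack trace.
# Usage: root_cause LOGFILE
root_cause() {
  # Keep only the "Caused by:" lines, take the last one (the root cause),
  # then strip the "service_1 | " prefix that docker-compose prepends.
  grep 'Caused by:' "$1" | tail -n 1 | sed 's/^[^|]*| *//'
}
```

Run against the trace in this post, it prints the java.io.FileNotFoundException line, which points at the missing _ocr.pdf file rather than at the content store.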

fedorow · ‎2 Jul 2019

So, to make it work on Alfresco/Share CE 6.1.2-ga/6.1.0, I created a shared volume between the alfresco and ocrmypdf containers. I replaced /ocr_input and /ocr_output with a single directory, /ocr, and mapped it as a volume in both containers.

There is only one problem: asynchronous mode for the rule gives me an error, so I turned it off.

Thanks, Angel!

docker-compose.yml

...
services:
   alfresco:
      ...
      volumes:
         - ocr:/ocr
      ...

   ocrmypdf:
      ...
      volumes:
         - ocr:/ocr
   ...
volumes:
   ...
   ocr:
      driver: local
...
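Whether both containers really see the same directory can be checked by writing a marker file through one container and looking for it through the other. This is a minimal sketch, not part of alfresco-simple-ocr: the function name probe_shared_dir is hypothetical, and it assumes touch, test and rm are available as executables inside both images.

```shell
#!/bin/bash
# probe_shared_dir WRITER READER DIR
# WRITER and READER are command prefixes that run a command inside each
# container; they are deliberately left unquoted so they split into words.
probe_shared_dir() {
  local writer="$1" reader="$2" dir="$3"
  local probe="$dir/.ocr_probe_$$"
  local status=0
  # Create a marker file through the first container.
  $writer touch "$probe" || return 1
  # The volume is shared only if the second container can see the marker.
  $reader test -e "$probe" || status=1
  # Clean up the marker in either case.
  $writer rm -f "$probe"
  return "$status"
}

# Example, assuming the service names from the compose file above:
# probe_shared_dir "docker-compose exec -T alfresco" \
#                  "docker-compose exec -T ocrmypdf" /ocr && echo "shared"
```

A non-zero return means either the directory is not writable from the first container or the second container does not see the file.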

bin/ocrmypdf.sh

(Also remove the {} from $OUTPUT_FILE_PARAM in the command that copies the output file.)

#!/bin/bash

INPUT_DIR=/ocr
OUTPUT_DIR=/ocr

# ocrmypdf hostname
OCRMYPDF_SERVER="ocrmypdf"

# identify parameters, input and output file
array=( "$@" )
len=${#array[@]}
ARGS=${array[@]:0:$len-2}

LAST_ARGS="${@: -2}"
INPUT_FILE_PARAM=`echo "$LAST_ARGS" | cut -d ' ' -f 1`
OUTPUT_FILE_PARAM=`echo "$LAST_ARGS" | cut -d ' ' -f 2`

# extract filenames
INPUT_FILE=$(basename "$INPUT_FILE_PARAM")
OUTPUT_FILE=$(basename "$OUTPUT_FILE_PARAM")

# copy and SSH commands (plain cp is enough because /ocr is a shared volume)
SCP=cp
SSH=ssh
USER=root

# copy original pdf to the shared directory
$SCP "$INPUT_FILE_PARAM" "$INPUT_DIR"

# execute ocrmypdf program
$SSH $USER@$OCRMYPDF_SERVER "/usr/bin/ocr.sh $ARGS $INPUT_DIR/$INPUT_FILE $OUTPUT_DIR/$OUTPUT_FILE"

# copy transformed pdf back to alfresco path
$SCP "$OUTPUT_DIR/$OUTPUT_FILE" "$OUTPUT_FILE_PARAM"

# remove temporary files
rm -f "$INPUT_DIR/$INPUT_FILE" "$OUTPUT_DIR/$OUTPUT_FILE"
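The script finds the input and output paths by joining the last two arguments into one string and cutting it on spaces, so a path that contains a space would be split in the wrong place. This is a minimal sketch of an alternative that uses bash positional-parameter slicing instead. The function name split_ocr_args and the OCR_ARGS array are hypothetical, not part of alfresco-simple-ocr.

```shell
#!/bin/bash
# split_ocr_args [OPTIONS...] INPUT OUTPUT
# Sets OCR_ARGS (array of options), INPUT_FILE_PARAM and OUTPUT_FILE_PARAM.
split_ocr_args() {
  if (( $# < 2 )); then
    echo "usage: split_ocr_args [OPTIONS...] INPUT OUTPUT" >&2
    return 1
  fi
  # Everything except the last two arguments is treated as options.
  OCR_ARGS=( "${@:1:$#-2}" )
  # The last two arguments are taken whole, spaces included.
  INPUT_FILE_PARAM="${@: -2:1}"
  OUTPUT_FILE_PARAM="${@: -1}"
}
```

The local copy steps can then use "$INPUT_FILE_PARAM" and "$OUTPUT_FILE_PARAM" directly. The command string passed to ssh is still re-split by the remote shell, so this only helps the steps that run locally.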

SriramG · ‎18 Jun 2020

With the approach suggested by Fedorow, I was able to make OCR work with Alfresco 6.1.0. I changed /ocr_input and /ocr_output to /usr/local/tomcat/ocr_input and /usr/local/tomcat/ocr_output so that the alfresco container can access these folders without any access issues.

Thanks Fedorow

docker-compose.yml

...
services:
   alfresco:
      ...
      volumes:
         - ocr-input:/usr/local/tomcat/ocr_input
         - ocr-output:/usr/local/tomcat/ocr_output
      ...

   ocrmypdf:
      ...
      volumes:
         - ocr-input:/usr/local/tomcat/ocr_input
         - ocr-output:/usr/local/tomcat/ocr_output
   ...
volumes:
   ...
   ocr-input:
      external: true
   ocr-output:
      external: true
...

bin/ocrmypdf.sh

#!/bin/bash

INPUT_DIR=/usr/local/tomcat/ocr_input
OUTPUT_DIR=/usr/local/tomcat/ocr_output

# ocrmypdf hostname
OCRMYPDF_SERVER="ocrmypdf"

# identify parameters, input and output file
array=( "$@" )
len=${#array[@]}
ARGS=${array[@]:0:$len-2}

LAST_ARGS="${@: -2}"
INPUT_FILE_PARAM=`echo "$LAST_ARGS" | cut -d ' ' -f 1`
OUTPUT_FILE_PARAM=`echo "$LAST_ARGS" | cut -d ' ' -f 2`

# extract filenames
INPUT_FILE=$(basename "$INPUT_FILE_PARAM")
OUTPUT_FILE=$(basename "$OUTPUT_FILE_PARAM")

# copy and SSH commands (plain cp is enough because the directories are shared volumes)
SCP=cp
SSH=ssh
USER=root

# copy original pdf to the shared input directory
$SCP "$INPUT_FILE_PARAM" "$INPUT_DIR"

# execute ocrmypdf program
$SSH $USER@$OCRMYPDF_SERVER "/usr/bin/ocr.sh $ARGS $INPUT_DIR/$INPUT_FILE $OUTPUT_DIR/$OUTPUT_FILE"

# copy transformed pdf back to alfresco path
$SCP "$OUTPUT_DIR/$OUTPUT_FILE" "$OUTPUT_FILE_PARAM"

# remove temporary files
rm -f "$INPUT_DIR/$INPUT_FILE" "$OUTPUT_DIR/$OUTPUT_FILE"

After the above changes I was able to successfully run OCR with Alfresco 6.1.
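The FileNotFoundException reported earlier in this thread appeared because the _ocr.pdf file was never produced, yet the copy-back step ran anyway. This is a minimal sketch of a guard for that step, assuming the same roles for the source and destination paths as in the script above. The function name copy_back is hypothetical.

```shell
#!/bin/bash
# copy_back SRC DEST
# Copy the OCR result to the path Alfresco expects, but only if the result
# exists and is not empty; otherwise report the problem and return non-zero.
copy_back() {
  local src="$1" dest="$2"
  if [ ! -s "$src" ]; then
    echo "OCR output missing or empty: $src" >&2
    return 1
  fi
  cp -- "$src" "$dest"
}

# Example, using the variable names from the script above:
# copy_back "$OUTPUT_DIR/$OUTPUT_FILE" "$OUTPUT_FILE_PARAM" || exit 1
```

With a guard like this, a failed OCR run shows up as an explicit message on the script's stderr instead of the script continuing silently.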

As we are running our Alfresco instance on Kubernetes with a Helm deployment, I need to configure the volumes in the values.yaml file, but I am not sure how. Does anyone know how to make a similar configuration in Kubernetes?

Any help appreciated.

EddieMay · ‎22 Jun 2020

Hi @SriramG,

Thanks for updating us on how you resolved your issue - really helpful.

Maybe start a new thread for your question about configuring volumes?

Cheers,

Digital Community Manager, Alfresco Software.