Thank you for your kindness.
However, my environment consists of CentOS 7, Alfresco 5.2, and OCRmyPDF (Docker).
The scripts you posted don't match my environment.
As I am very new to Docker, I don't know how to change the scripts.
In my comparison, pdfsandwich recognizes Japanese characters better than OCRmyPDF.
So I will focus on using pdfsandwich.
Thank you very much for your help.
Did you test with these instructions?
I don't know if they still work with the latest CentOS releases, but they can be a starting point.
I tried to follow this solution. My deployment:
Alfresco 6.1.2-ga / Share 6.1.0
jbarlow83/ocrmypdf:v8.2.3 or v7.0.0
api-explorer-6.1.0-ea.war or 6.0.7-ga
And I got "Failed to copy".
The file /usr/local/tomcat/temp/Alfresco/OCRTransformWorker_source_5503547424193883468.pdf exists, but /usr/local/tomcat/temp/Alfresco/OCRTransformWorker_source_5503547424193883468_ocr.pdf does not.
I think I should change
INPUT_DIR=/ocr_input
OUTPUT_DIR=/ocr_output
but I don't understand how. The "ocrmypdf" container doesn't contain these directories.
Log:
alfresco_1 | Exception in thread "defaultAsyncAction1" java.lang.RuntimeException: java.lang.RuntimeException: org.alfresco.service.cmr.repository.ContentIOException: 05270018 Failed to copy content from file:
alfresco_1 | writer: ContentAccessor[ contentUrl=store://2019/6/27/18/13/0081dc19-8750-4ddb-ac3c-396b4ba1a859.bin, mimetype=application/pdf, size=0, encoding=UTF-8, locale=en_US]
alfresco_1 | file: /usr/local/tomcat/temp/Alfresco/OCRTransformWorker_source_5503547424193883468_ocr.pdf
alfresco_1 | at es.keensoft.alfresco.ocr.OCRExtractAction.executeImplInternal(OCRExtractAction.java:183)
alfresco_1 | at es.keensoft.alfresco.ocr.OCRExtractAction.access$200(OCRExtractAction.java:38)
alfresco_1 | at es.keensoft.alfresco.ocr.OCRExtractAction$1.execute(OCRExtractAction.java:164)
alfresco_1 | at es.keensoft.alfresco.ocr.OCRExtractAction$1.execute(OCRExtractAction.java:161)
alfresco_1 | at org.alfresco.repo.transaction.RetryingTransactionHelper.doInTransaction(RetryingTransactionHelper.java:450)
alfresco_1 | at es.keensoft.alfresco.ocr.OCRExtractAction.executeInNewTransaction(OCRExtractAction.java:169)
alfresco_1 | at es.keensoft.alfresco.ocr.OCRExtractAction.access$100(OCRExtractAction.java:38)
alfresco_1 | at es.keensoft.alfresco.ocr.OCRExtractAction$ExtractOCRTask.run(OCRExtractAction.java:151)
alfresco_1 | at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
alfresco_1 | at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
alfresco_1 | at java.base/java.lang.Thread.run(Thread.java:834)
alfresco_1 | Caused by: java.lang.RuntimeException: org.alfresco.service.cmr.repository.ContentIOException: 05270018 Failed to copy content from file:
alfresco_1 | writer: ContentAccessor[ contentUrl=store://2019/6/27/18/13/0081dc19-8750-4ddb-ac3c-396b4ba1a859.bin, mimetype=application/pdf, size=0, encoding=UTF-8, locale=en_US]
alfresco_1 | file: /usr/local/tomcat/temp/Alfresco/OCRTransformWorker_source_5503547424193883468_ocr.pdf
alfresco_1 | at es.keensoft.alfresco.ocr.OCRTransformWorker.transform(OCRTransformWorker.java:86)
alfresco_1 | at es.keensoft.alfresco.ocr.OCRExtractAction.executeImplInternal(OCRExtractAction.java:181)
alfresco_1 | ... 10 more
alfresco_1 | Caused by: org.alfresco.service.cmr.repository.ContentIOException: 05270018 Failed to copy content from file:
alfresco_1 | writer: ContentAccessor[ contentUrl=store://2019/6/27/18/13/0081dc19-8750-4ddb-ac3c-396b4ba1a859.bin, mimetype=application/pdf, size=0, encoding=UTF-8, locale=en_US]
alfresco_1 | file: /usr/local/tomcat/temp/Alfresco/OCRTransformWorker_source_5503547424193883468_ocr.pdf
alfresco_1 | at org.alfresco.repo.content.AbstractContentWriter.putContent(AbstractContentWriter.java:491)
alfresco_1 | at es.keensoft.alfresco.ocr.OCRTransformWorker.transform(OCRTransformWorker.java:83)
alfresco_1 | ... 11 more
alfresco_1 | Caused by: java.io.FileNotFoundException: /usr/local/tomcat/temp/Alfresco/OCRTransformWorker_source_5503547424193883468_ocr.pdf (No such file or directory)
alfresco_1 | at java.base/java.io.FileInputStream.open0(Native Method)
alfresco_1 | at java.base/java.io.FileInputStream.open(FileInputStream.java:219)
alfresco_1 | at java.base/java.io.FileInputStream.<init>(FileInputStream.java:157)
alfresco_1 | at org.alfresco.repo.content.AbstractContentWriter.putContent(AbstractContentWriter.java:485)
alfresco_1 | ... 12 more
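Reading the trace from the bottom up, the innermost cause is the java.io.FileNotFoundException: the `_ocr.pdf` output never appeared in Alfresco's temp directory, so the copy into the content store fails. As a minimal sketch, a wrapper script could check for the output before copying it back, so the failure surfaces with a clearer message. The function name `require_output` is hypothetical and not part of the OCR addon.

```shell
#!/bin/bash
# Hypothetical helper (not part of the OCR addon): fail early when the OCR
# step did not produce a usable output file.
require_output() {
  local file="$1"
  # -s is true only for an existing file with a size greater than zero
  if [ ! -s "$file" ]; then
    echo "OCR output missing or empty: $file" >&2
    return 1
  fi
}
```

In a wrapper script this might be called as `require_output "$OUTPUT_DIR/$OUTPUT_FILE" || exit 1` just before the copy-back step.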
So, to make it work on Alfresco/Share CE 6.1.2-ga/6.1.0, I created a shared volume between the alfresco and ocrmypdf containers. I replaced /ocr_input and /ocr_output with a single directory, /ocr, and mapped it as a volume in both containers.
There is only one problem: asynchronous mode for the rule gives me an error, so I turned it off.
Thanks, Angel!
docker-compose.yml
...
services:
  alfresco:
    ...
    volumes:
      - ocr:/ocr
    ...
  ocrmypdf:
    ...
    volumes:
      - ocr:/ocr
    ...
volumes:
  ...
  ocr:
    driver: local
  ...
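Since both services mount the same `ocr` volume at /ocr, a quick sanity check is whether that directory exists and is writable from inside each container. The following is a minimal sketch of such a check; the function name `check_shared_dir` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: verify that a directory exists and is writable by
# the current user, e.g. the shared /ocr mount inside a container.
check_shared_dir() {
  local dir="$1"
  if [ ! -d "$dir" ]; then
    echo "missing: $dir"
    return 1
  fi
  if [ ! -w "$dir" ]; then
    echo "not writable: $dir"
    return 1
  fi
  echo "ok: $dir"
}
```

Running the same two tests (`[ -d /ocr ]` and `[ -w /ocr ]`) in a shell inside each container should show whether both services see the mount.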
bin/ocrmypdf.sh
(also remove the {} from $OUTPUT_FILE_PARAM in the command that copies the output file back)
#!/bin/bash
# shared directory mounted in both the alfresco and ocrmypdf containers
INPUT_DIR=/ocr
OUTPUT_DIR=/ocr
# ocrmypdf hostname
OCRMYPDF_SERVER="ocrmypdf"
# identify parameters: the last two arguments are the input and output files
array=( "$@" )
len=${#array[@]}
ARGS="${array[*]:0:$len-2}"
INPUT_FILE_PARAM="${array[$len-2]}"
OUTPUT_FILE_PARAM="${array[$len-1]}"
# extract filenames
INPUT_FILE=$(basename "$INPUT_FILE_PARAM")
OUTPUT_FILE=$(basename "$OUTPUT_FILE_PARAM")
# SSH parameters (files are copied with plain cp because the volume is shared)
SCP=cp
SSH=ssh
USER=root
# copy original pdf to the shared directory
$SCP "$INPUT_FILE_PARAM" "$INPUT_DIR"
# execute ocrmypdf program
$SSH $USER@$OCRMYPDF_SERVER "/usr/bin/ocr.sh $ARGS $INPUT_DIR/$INPUT_FILE $OUTPUT_DIR/$OUTPUT_FILE"
# copy transformed pdf back to alfresco path
$SCP "$OUTPUT_DIR/$OUTPUT_FILE" "$OUTPUT_FILE_PARAM"
# remove temporary files
rm -f "$INPUT_DIR/$INPUT_FILE" "$OUTPUT_DIR/$OUTPUT_FILE"
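The script treats every argument except the last two as options for the OCR program, and the last two as the input and output paths. As a minimal sketch, that split can be written with array indexing. The function name `split_args` and the sample arguments are illustrative only.

```shell
#!/bin/bash
# Illustrative only: split "options... input output" the way the wrapper does.
split_args() {
  local args=( "$@" )
  local n=${#args[@]}
  if [ "$n" -lt 2 ]; then
    echo "usage: split_args [options...] input output" >&2
    return 1
  fi
  # everything before the last two arguments, joined with spaces
  echo "options: ${args[*]:0:n-2}"
  echo "input:   ${args[n-2]}"
  echo "output:  ${args[n-1]}"
}

split_args -l eng /tmp/in.pdf /tmp/in_ocr.pdf
```

Indexing the array directly avoids splitting a joined string on spaces, so a path that contains a space would still arrive as one value.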
With the approach suggested by Fedorow, I was able to make OCR work with Alfresco 6.1.0. I changed /ocr_input and /ocr_output to /usr/local/tomcat/ocr_input and /usr/local/tomcat/ocr_output so that the alfresco container can access these folders without any access issues.
Thanks, Fedorow.
docker-compose.yml
...
services:
  alfresco:
    ...
    volumes:
      - ocr-input:/usr/local/tomcat/ocr_input
      - ocr-output:/usr/local/tomcat/ocr_output
    ...
  ocrmypdf:
    ...
    volumes:
      - ocr-input:/usr/local/tomcat/ocr_input
      - ocr-output:/usr/local/tomcat/ocr_output
    ...
volumes:
  ...
  ocr-input:
    external: true
  ocr-output:
    external: true
  ...
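Because both volumes are declared `external: true`, Compose does not create them, so they have to exist before the stack starts. The following is a minimal sketch of creating them only when missing. The function name `ensure_volume` is hypothetical, and the overridable `DOCKER` variable is an assumption added so the logic can be exercised without a Docker daemon.

```shell
#!/bin/bash
# Assumption: DOCKER can be overridden (e.g. for a dry run); defaults to docker.
DOCKER="${DOCKER:-docker}"

# Hypothetical helper: create a named volume only if it does not exist yet.
ensure_volume() {
  local name="$1"
  if "$DOCKER" volume inspect "$name" >/dev/null 2>&1; then
    echo "volume $name already exists"
  else
    "$DOCKER" volume create "$name" >/dev/null && echo "created volume $name"
  fi
}

# Usage, run on the Docker host before starting the stack:
#   ensure_volume ocr-input
#   ensure_volume ocr-output
```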
bin/ocrmypdf.sh
#!/bin/bash
# directories mounted in both the alfresco and ocrmypdf containers
INPUT_DIR=/usr/local/tomcat/ocr_input
OUTPUT_DIR=/usr/local/tomcat/ocr_output
# ocrmypdf hostname
OCRMYPDF_SERVER="ocrmypdf"
# identify parameters: the last two arguments are the input and output files
array=( "$@" )
len=${#array[@]}
ARGS="${array[*]:0:$len-2}"
INPUT_FILE_PARAM="${array[$len-2]}"
OUTPUT_FILE_PARAM="${array[$len-1]}"
# extract filenames
INPUT_FILE=$(basename "$INPUT_FILE_PARAM")
OUTPUT_FILE=$(basename "$OUTPUT_FILE_PARAM")
# SSH parameters (files are copied with plain cp because the volumes are shared)
SCP=cp
SSH=ssh
USER=root
# copy original pdf to the shared input directory
$SCP "$INPUT_FILE_PARAM" "$INPUT_DIR"
# execute ocrmypdf program
$SSH $USER@$OCRMYPDF_SERVER "/usr/bin/ocr.sh $ARGS $INPUT_DIR/$INPUT_FILE $OUTPUT_DIR/$OUTPUT_FILE"
# copy transformed pdf back to alfresco path
$SCP "$OUTPUT_DIR/$OUTPUT_FILE" "$OUTPUT_FILE_PARAM"
# remove temporary files
rm -f "$INPUT_DIR/$INPUT_FILE" "$OUTPUT_DIR/$OUTPUT_FILE"
After the above changes I was able to successfully run OCR with Alfresco 6.1.
We run our Alfresco instance on Kubernetes with a Helm deployment, so I need to configure the volumes in the values.yaml file, but I am not sure how. Does anyone have an idea how to make a similar configuration in Kubernetes?
Any help is appreciated.
Hi @SriramG,
Thanks for updating us on how you resolved your issue - really helpful.
Maybe start a new thread for your question about configuring volumes?
Cheers,