Hey guys,
OCR is not being generated from within the Alfresco platform.
See the logs below:
tail -f /opt/alfresco/tomcat/logs/catalina.out
command: /opt/alfresco/scripts/ocrmypdf.sh --verbose 1 --force-ocr -l por+eng /opt/alfresco/tomcat/temp/Alfresco/OCRTransformWorker_source_5414022605894367601.pdf /opt/alfresco/tomcat/temp/Alfresco/OCRTransformWorker_source_5414022605894367601_ocr.pdf
succeeded: false
exit code: 1
out:
err: Traceback (most recent call last):
File "/usr/bin/ocrmypdf", line 11, in <module>
load_entry_point('ocrmypdf==6.1.2', 'console_scripts', 'ocrmypdf')()
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 480, in load_entry_po
at es.keensoft.alfresco.ocr.OCRExtractAction.executeImplInternal(OCRExtractAction.java:183)
at es.keensoft.alfresco.ocr.OCRExtractAction.access$200(OCRExtractAction.java:38)
at es.keensoft.alfresco.ocr.OCRExtractAction$1.execute(OCRExtractAction.java:164)
at es.keensoft.alfresco.ocr.OCRExtractAction$1.execute(OCRExtractAction.java:161)
at org.alfresco.repo.transaction.RetryingTransactionHelper.doInTransaction(RetryingTransactionHelper.java:450)
at es.keensoft.alfresco.ocr.OCRExtractAction.executeInNewTransaction(OCRExtractAction.java:169)
at es.keensoft.alfresco.ocr.OCRExtractAction.access$100(OCRExtractAction.java:38)
at es.keensoft.alfresco.ocr.OCRExtractAction$ExtractOCRTask.run(OCRExtractAction.java:151)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: org.alfresco.service.cmr.repository.ContentIOException: 08140019 Failed to perform OCR transformation:
Execution result:
os: Linux
command: /opt/alfresco/scripts/ocrmypdf.sh --verbose 1 --force-ocr -l por+eng /opt/alfresco/tomcat/temp/Alfresco/OCRTransformWorker_source_5414022605894367601.pdf /opt/alfresco/tomcat/temp/Alfresco/OCRTransformWorker_source_5414022605894367601_ocr.pdf
succeeded: false
exit code: 1
out:
err: Traceback (most recent call last):
File "/usr/bin/ocrmypdf", line 11, in <module>
load_entry_point('ocrmypdf==6.1.2', 'console_scripts', 'ocrmypdf')()
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 480, in load_entry_po
at es.keensoft.alfresco.ocr.OCRTransformWorker.transform(OCRTransformWorker.java:86)
at es.keensoft.alfresco.ocr.OCRExtractAction.executeImplInternal(OCRExtractAction.java:181)
... 10 more
Caused by: org.alfresco.service.cmr.repository.ContentIOException: 08140019 Failed to perform OCR transformation:
Execution result:
os: Linux
command: /opt/alfresco/scripts/ocrmypdf.sh --verbose 1 --force-ocr -l por+eng /opt/alfresco/tomcat/temp/Alfresco/OCRTransformWorker_source_5414022605894367601.pdf /opt/alfresco/tomcat/temp/Alfresco/OCRTransformWorker_source_5414022605894367601_ocr.pdf
succeeded: false
exit code: 1
out:
err: Traceback (most recent call last):
File "/usr/bin/ocrmypdf", line 11, in <module>
load_entry_point('ocrmypdf==6.1.2', 'console_scripts', 'ocrmypdf')()
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 480, in load_entry_po
at es.keensoft.alfresco.ocr.OCRTransformWorker.transform(OCRTransformWorker.java:79)
root@pmituiutaba:/opt/alfresco/logs# gs --version
9.26
root@pmituiutaba:/opt/alfresco/logs# pip3 --version
pip 20.2.3 from /usr/local/lib/python3.6/dist-packages/pip (python 3.6)
root@pmituiutaba:/opt/alfresco/logs# tesseract --version
tesseract 4.0.0-beta.1
leptonica-1.75.3
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
Found AVX
Found SSE
root@pmituiutaba:/opt/alfresco/logs# ocrmypdf --version
6.1.2
root@pmituiutaba:/opt/alfresco/logs# cat /etc/os-release
NAME="Ubuntu"
VERSION="18.04.5 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.5 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic
cat alfresco.log | grep -i "Current version"
2020-09-15 00:04:09,348 INFO [org.alfresco.service.descriptor.DescriptorService] [localhost-startStop-1] Alfresco Content Services started (Community). Current version: 6.1.1 (r9d03d2fd-b168) schema 12,001. Originally installed version: 6.1.1 (r9d03d2fd-b168) schema 12,001.
cat /etc/sudoers
#
# This file MUST be edited with the 'visudo' command as root.
#
# Please consider adding local content in /etc/sudoers.d/ instead of
# directly modifying this file.
#
# See the man page for details on how to write a sudoers file.
#
Defaults env_reset
Defaults mail_badpass
Defaults secure_path="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin"
# Host alias specification
# User alias specification
# Cmnd alias specification
# User privilege specification
root ALL=(ALL:ALL) ALL
alfresco ALL=(ALL) NOPASSWD: ALL
# Members of the admin group may gain root privileges
%admin ALL=(ALL) ALL
# Allow members of group sudo to execute any command
%sudo ALL=(ALL:ALL) ALL
# See sudoers(5) for more information on "#include" directives:
#includedir /etc/sudoers.d
cat /opt/alfresco/tomcat/shared/classes/alfresco-global.properties | grep -i "ocr"
#### OCR with OCRmyPDF
ocr.command=/opt/alfresco/scripts/ocrmypdf.sh
ocr.output.verbose=false
ocr.output.file.prefix.command=
ocr.extra.commands=--verbose 1 --force-ocr -l por+eng
ocr.server.os=linux
/opt/alfresco/modules/share# l
total 12K
-rw-r--r-- 1 root root 12K Sep 14 18:48 simple-ocr-share-2.3.1.jar
/opt/alfresco/modules/platform# l
total 28K
-rw-r--r-- 1 root root 28K Sep 14 18:48 simple-ocr-repo-2.3.1.jar
Can you help please?
Thanks a lot!
/opt/alfresco/scripts/ocrmypdf.sh --verbose 1 --force-ocr -l por+eng /***/***src.pdf /***/***target.pdf
Hi kaynezhang,
Running through the linux shell, it worked perfectly.
See the log:
./ocrmypdf.sh --verbose 1 --force-ocr -l por+eng /home/jbrasil/teste33.pdf /home/jbrasil/teste33-v2.pdf
DEBUG - ocrmypdf 6.1.2
DEBUG - tesseract 4.0.0-beta.1
DEBUG - qpdf 8.0.2
DEBUG - PyMuPDF not installed
DEBUG - os.symlink(/home/jbrasil/teste33.pdf, /tmp/com.github.ocrmypdf.l22048pv/origin)
________________________________________
Tasks which will be run:
Task enters queue = 'ocrmypdf.pipeline.triage'
DEBUG - os.symlink(/tmp/com.github.ocrmypdf.l22048pv/origin, /tmp/com.github.ocrmypdf.l22048pv/origin.pdf)
Completed Task = 'ocrmypdf.pipeline.triage'
Task enters queue = 'ocrmypdf.pipeline.repair_and_parse_pdf'
DEBUG - Beginning qpdf repair...
DEBUG - Repair OK; beginning parse...
DEBUG - <PdfInfo('...'), page count=1>
Completed Task = 'ocrmypdf.pipeline.repair_and_parse_pdf'
Task enters queue = 'ocrmypdf.pipeline.pre_split_pages'
Task enters queue = 'ocrmypdf.pipeline.generate_postscript_stub'
Completed Task = 'ocrmypdf.pipeline.pre_split_pages'
Task enters queue = 'ocrmypdf.pipeline.split_page'
Completed Task = 'ocrmypdf.pipeline.generate_postscript_stub'
Completed Task = 'ocrmypdf.pipeline.split_page'
Task enters queue = 'ocrmypdf.pipeline.ocr_or_skip'
INFO - 1: page already has text! – rasterizing text and running OCR anyway
DEBUG - os.symlink(/tmp/com.github.ocrmypdf.l22048pv/000001.page.pdf, /tmp/com.github.ocrmypdf.l22048pv/000001.ocr.page.pdf)
Completed Task = 'ocrmypdf.pipeline.ocr_or_skip'
Task enters queue = 'ocrmypdf.pipeline.orient_page'
DEBUG - os.symlink(/tmp/com.github.ocrmypdf.l22048pv/000001.ocr.page.pdf, /tmp/com.github.ocrmypdf.l22048pv/000001.ocr.oriented.pdf)
Completed Task = 'ocrmypdf.pipeline.orient_page'
Task enters queue = 'ocrmypdf.pipeline.rasterize_with_ghostscript'
Task enters queue = 'ocrmypdf.pipeline.skip_page'
Uptodate Task = 'ocrmypdf.pipeline.skip_page'
WARNING:
In Task 'ocrmypdf.pipeline.skip_page':
No jobs were run because no file names matched.
Please make sure that the regular expression is correctly specified.
DEBUG - Rasterize 000001.ocr.oriented.pdf with png16m
DEBUG -
Completed Task = 'ocrmypdf.pipeline.rasterize_with_ghostscript'
Task enters queue = 'ocrmypdf.pipeline.preprocess_remove_background'
DEBUG - os.symlink(/tmp/com.github.ocrmypdf.l22048pv/000001.page.png, /tmp/com.github.ocrmypdf.l22048pv/000001.pp-background.png)
Completed Task = 'ocrmypdf.pipeline.preprocess_remove_background'
Task enters queue = 'ocrmypdf.pipeline.preprocess_deskew'
DEBUG - os.symlink(/tmp/com.github.ocrmypdf.l22048pv/000001.pp-background.png, /tmp/com.github.ocrmypdf.l22048pv/000001.pp-deskew.png)
Completed Task = 'ocrmypdf.pipeline.preprocess_deskew'
Task enters queue = 'ocrmypdf.pipeline.preprocess_clean'
DEBUG - os.symlink(/tmp/com.github.ocrmypdf.l22048pv/000001.pp-deskew.png, /tmp/com.github.ocrmypdf.l22048pv/000001.pp-clean.png)
Completed Task = 'ocrmypdf.pipeline.preprocess_clean'
Task enters queue = 'ocrmypdf.pipeline.select_visible_page_image'
Task enters queue = 'ocrmypdf.pipeline.select_ocr_image'
DEBUG - os.symlink(/tmp/com.github.ocrmypdf.l22048pv/000001.page.png, /tmp/com.github.ocrmypdf.l22048pv/000001.image)
Completed Task = 'ocrmypdf.pipeline.select_visible_page_image'
Task enters queue = 'ocrmypdf.pipeline.select_image_layer'
DEBUG - os.symlink(/tmp/com.github.ocrmypdf.l22048pv/000001.pp-clean.png, /tmp/com.github.ocrmypdf.l22048pv/000001.ocr.png)
Completed Task = 'ocrmypdf.pipeline.select_ocr_image'
Task enters queue = 'ocrmypdf.pipeline.ocr_tesseract_textonly_pdf'
DEBUG - 1: convert
DEBUG - ['tesseract', '-l', 'por+eng', '-c', 'textonly_pdf=1', '/tmp/com.github.ocrmypdf.l22048pv/000001.ocr.png', '/tmp/com.github.ocrmypdf.l22048pv/000001.text', 'pdf', 'txt']
DEBUG - 1: convert done
Completed Task = 'ocrmypdf.pipeline.select_image_layer'
Completed Task = 'ocrmypdf.pipeline.ocr_tesseract_textonly_pdf'
Task enters queue = 'ocrmypdf.pipeline.combine_layers'
Completed Task = 'ocrmypdf.pipeline.combine_layers'
Task enters queue = 'ocrmypdf.pipeline.merge_pages_ghostscript'
DEBUG - Final pages: /tmp/com.github.ocrmypdf.l22048pv/000001.rendered.pdf
/tmp/com.github.ocrmypdf.l22048pv/pdfa.ps
DEBUG - Ghostscript had to remove PDF 'overprinting' from the input file to complete PDF/A conversion.
Completed Task = 'ocrmypdf.pipeline.merge_pages_ghostscript'
Task enters queue = 'ocrmypdf.pipeline.copy_final'
Completed Task = 'ocrmypdf.pipeline.copy_final'
INFO - Output file is a PDF/A-2B (as expected)
WARNING - The output file size is 3.38× larger than the input file.
Possible reasons for this include:
The optional dependency PyMuPDF is not installed.
The argument --force-ocr was issued.
DEBUG - <PdfInfo('...'), page count=1>
l /home/jbrasil/
total 116K
-rw-r--r-- 1 root root 26K Sep 14 18:56 teste33.pdf
-rw-r--r-- 1 root root 86K Sep 15 09:01 teste33-v2.pdf
It just doesn't generate through the Alfresco platform.
Can you help?
Thank you.
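One way to narrow this down: the shell test above appears to have been run from an interactive root session (the earlier prompts show root@pmituiutaba), whereas Alfresco launches the wrapper from the Tomcat process with its own user and environment. A minimal sketch of re-running the wrapper with a stripped environment, so that differences in PATH, HOME or locale become visible, could look like this. The function name run_clean_env and the minimal PATH value are illustrative assumptions, not part of the OCR addon:

```shell
# Run a command with almost no inherited environment.
# env -i drops every inherited variable; only PATH, HOME and LANG are passed on.
# The PATH value below is an assumed minimal default, adjust it as needed.
run_clean_env() {
  env -i PATH="/usr/sbin:/usr/bin:/sbin:/bin" HOME="${HOME:-/}" LANG=C "$@"
}

# Example call (the user name "alfresco" is assumed from the sudoers entry):
# sudo -u alfresco /bin/sh -c '. ./clean_env.sh; run_clean_env /opt/alfresco/scripts/ocrmypdf.sh --verbose 1 --force-ocr -l por+eng /home/jbrasil/teste33.pdf /tmp/teste33-ocr.pdf'
```

If the command fails this way but succeeds in the normal shell, the full traceback printed to the terminal should show what the truncated log entry hides.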
How did you install Alfresco? Did you install it manually or with Docker?
Hi kaynezhang,
I installed using the loftuxab script.
alfinstall.sh
https://github.com/loftuxab/alfresco-ubuntu-install
I have always installed with this script.
I never had a problem. This is the first time this type of error has occurred.
Anything else that needs to be investigated?
Thanks a lot.
Your installation is OK. The error suggests that the Python script can't load the tesseract library correctly. But you can run the command successfully directly in the shell, which is very strange.
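The truncated traceback in catalina.out stops inside the load_entry_point call of pkg_resources, which is the step where the ocrmypdf distribution and its declared dependencies are resolved and the entry point module is imported. As a rough check, assuming the failure is a resolution problem (the log does not confirm this), the same resolution can be triggered by hand with the python3 that the service sees. check_dist is a hypothetical helper name:

```shell
# Ask pkg_resources to resolve a requirement such as "ocrmypdf==6.1.2".
# Prints the exception type and message on failure and returns non-zero.
check_dist() {
  python3 - "$1" <<'PY'
import sys

try:
    import pkg_resources
    # require() raises DistributionNotFound or VersionConflict when the
    # distribution or one of its dependencies cannot be satisfied.
    pkg_resources.require(sys.argv[1])
except Exception as exc:
    print("{}: {}".format(type(exc).__name__, exc), file=sys.stderr)
    sys.exit(1)

print("resolved: " + sys.argv[1])
PY
}

# Example call:
# check_dist "ocrmypdf==6.1.2"
```

Running it once as root and once as the Tomcat user would show whether both see the same set of installed Python packages.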
Hi kaynezhang,
Very strange. We have other servers with Alfresco running the same version.
See the script:
/opt/alfresco/scripts
cat ocrmypdf.sh
#!/usr/bin/env bash
# set -o xtrace # Uncomment for debugging / troubleshooting
sudo ocrmypdf "$@"
Theoretically, it is right.
I do not know what happened...
Thanks.
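Since catalina.out cuts the Python traceback short, a temporary debugging variant of the wrapper that writes the calling user, PATH and the full stderr to a file may help. This is only a sketch: the log location, the OCR_WRAPPER_LOG variable and the run_logged function are assumptions for illustration, and stderr is diverted to the log file instead of being returned to Alfresco.

```shell
#!/usr/bin/env bash
# Debugging sketch only. LOG, OCR_WRAPPER_LOG and run_logged are assumed names.
LOG="${OCR_WRAPPER_LOG:-/tmp/ocrmypdf-wrapper.log}"

run_logged() {
  # Record who is calling, with which PATH, and the exact command line.
  {
    printf '=== %s ===\n' "$(date)"
    printf 'user: %s\n' "$(id -un)"
    printf 'PATH: %s\n' "$PATH"
    printf 'cmd: %s\n' "$*"
  } >>"$LOG"
  # Run the command; its stderr is appended to the log, stdout is left alone.
  "$@" 2>>"$LOG"
  run_logged_status=$?
  printf 'exit code: %d\n' "$run_logged_status" >>"$LOG"
  return "$run_logged_status"
}

# In the real wrapper the last line would then become:
# run_logged sudo ocrmypdf "$@"
```

After triggering the OCR action once from Alfresco, the log file should contain the complete traceback together with the user and PATH that the Tomcat process actually used.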