
Indexing images with text in Alfresco with Tesseract-ocr

Blog Post created by miguelrodriguez Employee on Oct 11, 2017

  • Purpose

The purpose of this blog is to show how to scan images containing text so that the text is indexed and searchable by Alfresco. The following file types are supported: PNG, BMP, JPEG, GIF, TIFF and PDF (containing images).

 

For this exercise we are going to use a Linux OS, but the same approach should also work on Windows.

 

To scan images we are going to use Tesseract-ocr (tesseract). This package contains an OCR engine, libtesseract, and a command line program, tesseract.

Tesseract has Unicode (UTF-8) support and can recognize more than 100 languages "out of the box".

 

  • Tesseract

Since we are using tesseract-ocr, we need to install the tesseract software (version 3 or greater) for our Linux distribution.

Please follow the instructions explained here: Installing Tesseract
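
Once tesseract is installed, it is worth confirming that the version meets the "3 or greater" requirement. The sketch below is one way to do that, assuming the first line of the tesseract -v output looks like "tesseract 3.04.01". The function name tesseract_major is made up for this example.

```shell
#!/bin/sh
# Print the major version number found on the first line of `tesseract -v` output.
# Assumption: that line looks like "tesseract 3.04.01" (an optional "v" is tolerated).
tesseract_major() {
    printf '%s\n' "$1" |
        sed -n '1s/^tesseract[[:space:]]*v\{0,1\}\([0-9][0-9]*\).*/\1/p'
}

# Only query the binary when it is actually on the PATH.
if command -v tesseract >/dev/null 2>&1; then
    # Some releases print the version banner to stderr, hence 2>&1.
    banner=$(tesseract -v 2>&1)
    major=$(tesseract_major "$banner")
    if [ "${major:-0}" -ge 3 ]; then
        echo "tesseract major version $major: OK"
    else
        echo "tesseract major version ${major:-unknown}: check the installation" >&2
    fi
fi
```

If the banner has a different shape on your distribution, the function prints nothing and the script reports the version as unknown rather than guessing.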

 

  • Transformation context file

Create a file named transformer-context.xml in Alfresco's extension folder (i.e. tomcat/shared/classes/alfresco/extension) with the following content:

 

<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE beans PUBLIC '-//SPRING//DTD BEAN//EN' 'http://www.springframework.org/dtd/spring-beans.dtd'>
<!-- Licensed to the Apache Software Foundation (ASF) under one or more contributor
     license agreements. See the NOTICE file distributed with this work for additional
     information regarding copyright ownership. The ASF licenses this file to
     You under the Apache License, Version 2.0 (the "License"); you may not use
     this file except in compliance with the License. You may obtain a copy of
     the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required
     by applicable law or agreed to in writing, software distributed under the
     License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS
     OF ANY KIND, either express or implied. See the License for the specific
     language governing permissions and limitations under the License. -->

<beans>

     <!-- Transforms from TIFF to plain text using Tesseract
           and a custom script -->

     <bean id="transformer.worker.ocr.tiff"
          class="org.alfresco.repo.content.transform.RuntimeExecutableContentTransformerWorker">

          <property name="mimetypeService">
               <ref bean="mimetypeService" />
          </property>
          <property name="checkCommand">
               <bean class="org.alfresco.util.exec.RuntimeExec">
                    <property name="commandsAndArguments">
                         <map>
                              <entry key=".*">
                                   <list>
                                        <value>${tesseract.exe}</value>
                                        <value>-v</value>
                                   </list>
                              </entry>
                         </map>
                    </property>
                    <property name="errorCodes">
                         <value>2</value>
                    </property>
               </bean>
          </property>

          <property name="transformCommand">
               <bean class="org.alfresco.util.exec.RuntimeExec">
                    <property name="commandsAndArguments">
                         <map>
                              <entry key=".*">
                                   <list>
                                        <value>${ocr.script}</value>
                                        <value>${source}</value>
                                        <value>${target}</value>
                                   </list>
                              </entry>
                         </map>
                    </property>
                    <property name="errorCodes">
                         <value>1,2</value>
                    </property>
                    <property name="waitForCompletion">
                         <value>true</value>
                    </property>
               </bean>
          </property>
          <property name="transformerConfig">
               <ref bean="transformerConfig" />
          </property>
     </bean>

     <bean id="transformer.ocr.tiff"
          class="org.alfresco.repo.content.transform.ProxyContentTransformer"
          parent="baseContentTransformer">

          <property name="worker">
               <ref bean="transformer.worker.ocr.tiff" />
          </property>
     </bean>

     <!-- Transforms from PDF to TIFF using Ghostscript -->
     <bean id="transformer.worker.pdf.tiff"
          class="org.alfresco.repo.content.transform.RuntimeExecutableContentTransformerWorker">

          <property name="mimetypeService">
               <ref bean="mimetypeService" />
          </property>
          <property name="checkCommand">
               <bean name="transformer.ImageMagick.CheckCommand" class="org.alfresco.util.exec.RuntimeExec">
                    <property name="commandsAndArguments">
                         <map>
                              <entry key=".*">
                                   <list>
                                        <value>${ghostscript.exe}</value>
                                        <value>-v</value>
                                   </list>
                              </entry>
                         </map>
                    </property>
               </bean>
          </property>

          <property name="transformCommand">
               <bean class="org.alfresco.util.exec.RuntimeExec">
                    <property name="commandsAndArguments">
                         <map>
                              <entry key=".*">
                                   <list>
                                        <value>${ghostscript.exe}</value>
                                        <value>-o</value>
                                        <value>${target}</value>
                                        <value>-sDEVICE=tiff24nc</value>
                                        <value>-r300</value>
                                        <value>${source}</value>
                                   </list>
                              </entry>
                         </map>
                    </property>
                    <property name="errorCodes">
                         <value>1,2</value>
                    </property>
                    <property name="waitForCompletion">
                         <value>true</value>
                    </property>
               </bean>
          </property>
          <property name="transformerConfig">
               <ref bean="transformerConfig" />
          </property>
     </bean>

     <bean id="transformer.pdf.tiff"
          class="org.alfresco.repo.content.transform.ProxyContentTransformer"
          parent="baseContentTransformer">

          <property name="worker">
               <ref bean="transformer.worker.pdf.tiff" />
          </property>
     </bean>

</beans>

 

We can see we are using a few variables here:

  • tesseract.exe: this is the tesseract binary file, normally installed as /usr/bin/tesseract
  • ocr.script: this is the script we call to transform images to text, installed in the Alfresco home folder as ocr.sh
  • ghostscript.exe: this is the Ghostscript binary file, usually gs
  • source: this is the source image file
  • target: this is the resulting text file
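
To make the two transformCommand definitions easier to read, here is a rough sketch of the command lines they expand to, written as shell functions. The GS and TESSERACT variables and the function names are illustrative only; they stand in for the ghostscript.exe and tesseract.exe properties. The tesseract call mirrors what the ocr.sh script described below ends up running.

```shell
#!/bin/sh
# Illustrative stand-ins for the ${ghostscript.exe} and ${tesseract.exe} properties.
GS=${GS:-gs}
TESSERACT=${TESSERACT:-tesseract}

# Same argument list as the transformCommand of transformer.worker.pdf.tiff:
# render the PDF ($1) to a 300 dpi, 24-bit colour TIFF ($2).
pdf_to_tiff() {
    "$GS" -o "$2" -sDEVICE=tiff24nc -r300 "$1"
}

# OCR a TIFF ($1) into a text file ($2). Tesseract appends ".txt" to the
# output base name itself, so the extension is stripped from $2 first.
tiff_to_text() {
    "$TESSERACT" "$1" "${2%.*}" -l eng
}

# Example (not executed here):
#   pdf_to_tiff scan.pdf /tmp/scan.tiff && tiff_to_text /tmp/scan.tiff /tmp/scan.txt
```

Running the two functions by hand on a sample PDF is a quick way to check that Ghostscript and tesseract work outside Alfresco before wiring them into the transformers.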

 

  • OCR Script

The next step is to create the ocr.sh script. The location of the script is also referenced in the alfresco-global.properties file by the property ocr.script, as shown later in this blog.

 

Assuming Alfresco is installed in /opt/alfresco, create a file named /opt/alfresco/ocr.sh with the following content:

#!/bin/sh
# Save arguments to variables
SOURCE="$1"
TARGET="$2"
TMPDIR=/tmp/tesseract
FILENAME=`basename "$SOURCE"`
OCRFILE="$FILENAME.tif"
# Use the OS libraries rather than the ones bundled with Alfresco
export LD_LIBRARY_PATH=/usr/lib

# Create temp directory if it doesn't exist
mkdir -p "$TMPDIR"

# to see what happens
# echo "from $SOURCE to $TARGET" >>/tmp/ocrtransform.log

cp -f "$SOURCE" "$TMPDIR/$OCRFILE"

# Call tesseract. It appends .txt to the output base name,
# so the extension is stripped from $TARGET first
/usr/bin/tesseract "$TMPDIR/$OCRFILE" "${TARGET%.*}" -l eng
rm -f "$TMPDIR/$OCRFILE"

A couple of points to consider here:

  • We are using LD_LIBRARY_PATH to point to the OS library path so that tesseract finds the libraries it requires. Without it, tesseract would use the library path defined by Alfresco, which points to the commons/lib folder, and the library versions there may not be the ones tesseract requires.
  • We are defining the location of the tesseract binary file as /usr/bin/tesseract. If installed on a different location then adjust the path to tesseract accordingly.

 

Finally, make sure the ocr.sh file has the executable permission set. You can set it with the following command: chmod 755 /opt/alfresco/ocr.sh
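
One detail of the script that is easy to miss: tesseract takes an output base name and appends .txt to it, which is why the script strips the extension from $TARGET before the call. The small sketch below shows that expansion on its own. The function name output_base and the example path are made up for illustration.

```shell
#!/bin/sh
# Output base name handed to tesseract: the target path minus its last extension.
output_base() {
    printf '%s\n' "${1%.*}"
}

# Hypothetical target path, shaped like the temp files Alfresco generates.
target=/opt/alfresco/tomcat/temp/Alfresco/example_target_123.txt
base=$(output_base "$target")

echo "tesseract is given: $base"
echo "tesseract writes:   $base.txt"
```

Because the target Alfresco passes in ends in .txt, the file tesseract writes ends up at exactly the path Alfresco expects.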

 

  • Tesseract properties

The next step is to define a set of properties for tesseract in alfresco-global.properties.

# OCR Script
ocr.script=/opt/alfresco/ocr.sh

#GS executable
ghostscript.exe=gs

#Tesseract executable
tesseract.exe=tesseract

# Define a default priority for this transformer
content.transformer.ocr.tiff.priority=10

# List the transformations that are supported
content.transformer.ocr.tiff.extensions.tiff.txt.supported=true
content.transformer.ocr.tiff.extensions.tiff.txt.priority=10
content.transformer.ocr.tiff.extensions.jpg.txt.supported=true
content.transformer.ocr.tiff.extensions.jpg.txt.priority=10
content.transformer.ocr.tiff.extensions.png.txt.supported=true
content.transformer.ocr.tiff.extensions.png.txt.priority=10
content.transformer.ocr.tiff.extensions.gif.txt.supported=true
content.transformer.ocr.tiff.extensions.gif.txt.priority=10

# Define a default priority for this transformer
content.transformer.pdf.tiff.available=true
content.transformer.pdf.tiff.priority=10
# List the transformations that are supported
content.transformer.pdf.tiff.extensions.pdf.tiff.supported=true
content.transformer.pdf.tiff.extensions.pdf.tiff.priority=10

content.transformer.complex.Pdf2OCR.available=true
# Commented to be compatible with Alfresco 5.x
# content.transformer.complex.Pdf2OCR.failover=ocr.pdf
content.transformer.complex.Pdf2OCR.pipeline=pdf.tiff|tiff|ocr.tiff
content.transformer.complex.Pdf2OCR.extensions.pdf.txt.supported=true
content.transformer.complex.Pdf2OCR.extensions.pdf.txt.priority=10

# Disable the OOTB transformers
content.transformer.double.ImageMagick.extensions.pdf.tiff.supported=false
content.transformer.complex.PDF.Image.extensions.pdf.tiff.supported=false
content.transformer.ImageMagick.extensions.pdf.tiff.supported=false
content.transformer.PdfBox.extensions.pdf.txt.supported=false
content.transformer.TikaAuto.extensions.pdf.txt.supported=false

 

The main property to consider is ocr.script, which points to the location of the ocr.sh file. Adjust it to match your installation. All other properties can be left as they are.
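
A typo in ocr.script is an easy mistake to make, so a quick sanity check can save a restart. The sketch below reads a key from a properties file and checks that the script it names is executable. It only handles plain key=value lines, the get_prop helper is hypothetical, and the location of alfresco-global.properties is an assumption that may differ on your installation.

```shell
#!/bin/sh
# Print the value of a key from a properties file.
# Only plain "key=value" lines are handled; commented-out lines are skipped,
# and if a key is repeated the last value is printed.
get_prop() {
    # $1 = properties file, $2 = key
    awk -F= -v key="$2" '
        /^[[:space:]]*#/ { next }
        $1 == key { sub(/^[^=]*=/, ""); value = $0; found = 1 }
        END { if (found) print value }
    ' "$1"
}

# Assumed location of the properties file; adjust to your installation.
props=/opt/alfresco/tomcat/shared/classes/alfresco-global.properties
if [ -f "$props" ]; then
    script=$(get_prop "$props" ocr.script)
    if [ -x "$script" ]; then
        echo "ocr.script is executable: $script"
    else
        echo "ocr.script is missing or not executable: $script" >&2
    fi
fi
```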

 

  • Debugging

There are two areas we can debug:

  1. The Alfresco transformation service
  2. Tesseract execution

 

Alfresco Transformation Service

To debug the transformation service edit the file tomcat/shared/classes/alfresco/extension/custom-log4j.properties and add the following line at the bottom:

 

log4j.logger.org.alfresco.repo.content.transform=trace

 

Alfresco needs restarting to pick up this debug entry.
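
If you script your environment setup, the logger line can be added without duplicating it on repeated runs. This is a minimal sketch; the function name enable_transform_trace is made up for this example, and the file path is passed in by the caller.

```shell
#!/bin/sh
# Append the transformation trace logger line to a log4j properties file,
# unless an identical line is already present.
enable_transform_trace() {
    line='log4j.logger.org.alfresco.repo.content.transform=trace'
    # -x matches whole lines only, -F treats the dots literally.
    grep -qxF "$line" "$1" 2>/dev/null || printf '%s\n' "$line" >> "$1"
}

# Example (not executed here):
#   enable_transform_trace /opt/alfresco/tomcat/shared/classes/alfresco/extension/custom-log4j.properties
```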

 

Tesseract execution

To get some execution information from tesseract edit the file /opt/alfresco/ocr.sh and uncomment the following entry by removing the '#' from the beginning of the line:

 

# echo "from $SOURCE to $TARGET" >>/tmp/ocrtransform.log

 

Now, when an image file containing text is loaded into Alfresco, we can see entries similar to the following in the alfresco.log file, showing the ocr.sh script being called.

2017-10-10 15:20:17,182  DEBUG [content.transform.RuntimeExecutableContentTransformerWorker] [http-bio-8443-exec-6] Transformation completed: 
   source: ContentAccessor[ contentUrl=store:///opt/alfresco/tomcat/temp/Alfresco/ComplextTransformer_intermediate_pdf_9017478201188837562.tiff, mimetype=image/tiff, size=24925880, encoding=UTF-8, locale=en_GB]
   target: ContentAccessor[ contentUrl=store://2017/10/10/15/20/d3b4b9aa-ad28-4c8c-ae86-f99938bf4125.bin, mimetype=text/plain, size=1173, encoding=UTF-8, locale=en_GB]
   options: {maxSourceSizeKBytes=-1, pageLimit=-1, use=index, timeoutMs=120000, maxPages=-1, contentReaderNodeRef=null, sourceContentProperty=null, readLimitKBytes=-1, contentWriterNodeRef=null, targetContentProperty=null, includeEmbedded=null, readLimitTimeMs=-1}
   result: 
Execution result: 
   os:         Linux
   command:    /opt/alfresco/ocr.sh /opt/alfresco/tomcat/temp/Alfresco/RuntimeExecutableContentTransformerWorker_source_5734790636289670188.tiff /opt/alfresco/tomcat/temp/Alfresco/RuntimeExecutableContentTransformerWorker_target_1506982845420553983.txt
   succeeded:  true
   exit code:  0
   out:        
   err:        Tesseract Open Source OCR Engine v3.04.01 with Leptonica
Page 1
 2017-10-10 15:20:17,183  TRACE [content.transform.TransformerLog] [http-bio-8443-exec-6] 4.1.2         tiff txt  INFO <<TemporaryFile>> 23.7 MB 1,950 ms ocr.tiff<<Runtime>>
 2017-10-10 15:20:17,183  TRACE [content.transform.TransformerDebug] [http-bio-8443-exec-6] 4.1.2         Finished in 1,950 ms

 

We can also take a look at the /tmp/ocrtransform.log file to see what files have been processed.

from /opt/alfresco/tomcat/temp/Alfresco/RuntimeExecutableContentTransformerWorker_source_5734790636289670188.tiff to /opt/alfresco/tomcat/temp/Alfresco/RuntimeExecutableContentTransformerWorker_target_1506982845420553983.txt
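
When the log grows, pulling out just the source paths gives a quick overview of what has been processed. The sketch below assumes every entry has the "from <source> to <target>" shape shown above and that the paths themselves do not contain " to ". The function name list_sources is made up for this example.

```shell
#!/bin/sh
# Print the source path from each "from <source> to <target>" line of the log.
# Lines that do not match that shape are ignored.
list_sources() {
    sed -n 's/^from \(.*\) to .*$/\1/p' "$1"
}

log=/tmp/ocrtransform.log
if [ -f "$log" ]; then
    echo "files processed so far: $(list_sources "$log" | wc -l)"
fi
```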

 

That's it, you should now be able to search for the text contained in the image files.

 

  • References

Most of the information on this blog comes from this GitHub repository https://github.com/bchevallereau/alfresco-tesseract, with some additional adjustments and inclusions.
