
Content transformer for PDF

Question asked by yosrioh on Mar 28, 2016
Latest reply on Jan 16, 2017 by rodex
Hello,
I am trying to build a transformer that converts a scanned PDF into a text PDF. The transformer should run automatically whenever a PDF is uploaded to Alfresco.
I'm using Tesseract and Alfresco 5.0.

After some research, I found a post on the Seedim forum that explains how to do this: http://www.seedim.com.au/content/alfresco-search-pdf-images-using-transformations-and-tesseract-ocr

First, I added a transformer definition in /opt/alfresco-community/tomcat/shared/classes/alfresco/extension/,
in a file named PDFimage-transform-context.xml:

<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE beans PUBLIC '-//SPRING//DTD BEAN//EN' 'http://www.springframework.org/dtd/spring-beans.dtd'>

<beans>
  <bean id="transformer.worker.pdfimg2ocrtxt" class="org.alfresco.repo.content.transform.RuntimeExecutableContentTransformerWorker">
    <property name="mimetypeService">
      <ref bean="mimetypeService" />
    </property>
    <property name="checkCommand">
      <bean class="org.alfresco.util.exec.RuntimeExec">
        <property name="commandsAndArguments">
          <map>
            <entry key=".*">
              <list>
                <value>ls</value>
                <value>/opt/alfresco-community/pdf.sh</value>
              </list>
            </entry>
          </map>
        </property>
      </bean>
    </property>
    <property name="transformCommand">
      <bean class="org.alfresco.util.exec.RuntimeExec">
        <property name="commandsAndArguments">
          <map>
            <entry key=".*">
              <list>
                <value>/opt/alfresco-community/pdf.sh</value>
                <value>${source}</value>
                <value>${target}</value>
              </list>
            </entry>
          </map>
        </property>
        <property name="errorCodes">
          <value>1,2,3</value>
        </property>
      </bean>
    </property>
  </bean>

  <bean id="transformer.pdfimg2ocrtxt" class="org.alfresco.repo.content.transform.ProxyContentTransformer" parent="baseContentTransformer">
    <property name="worker">
      <ref bean="transformer.worker.pdfimg2ocrtxt" />
    </property>
  </bean>
</beans>
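
One thing worth noting about the checkCommand above: running ls on the script only confirms that the file exists, not that the account running Alfresco is allowed to execute it. A minimal sketch of a stricter check follows; check_transformer_script is a hypothetical helper name of my own, not part of Alfresco, and it would need to be run as the same OS user that runs Tomcat for the result to mean anything.

```shell
#!/bin/bash
# Hypothetical helper (not part of Alfresco): verify that a transformer
# script exists and is executable by the current user.
check_transformer_script() {
  local script=$1
  if [ ! -f "$script" ]; then
    echo "not found: $script"
    return 1
  fi
  if [ ! -x "$script" ]; then
    echo "not executable: $script"
    return 2
  fi
  echo "ok: $script"
  return 0
}

# Example call with the path used in the bean definition:
# check_transformer_script /opt/alfresco-community/pdf.sh
```

A return code of 2 would point to a permissions problem rather than a missing file.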

I wrote a script (pdf.sh) and placed it in /opt/alfresco-community.
The script works fine when I launch it from the terminal:

#!/bin/bash

# Alfresco passes the source PDF as $1 and the target text file as $2.
SOURCE=$1
TARGET=$2
TMPDIR=/home/yosri/tmp
LOGFILE=/home/yosri/logfile.txt
TEMP_PDFTXT_FILE=$TMPDIR/pdftext.txt

# Start from a clean state so text left over from an earlier run is not reused.
mkdir -p "$TMPDIR"
rm -f "$TEMP_PDFTXT_FILE" "$TMPDIR/res.txt"

# First try to extract an existing text layer.
echo "running command: pdftotext -nopgbrk $SOURCE $TEMP_PDFTXT_FILE"
pdftotext -nopgbrk "$SOURCE" "$TEMP_PDFTXT_FILE"
FILESIZE=$(stat -c%s "$TEMP_PDFTXT_FILE")
echo "Size of $TEMP_PDFTXT_FILE = $FILESIZE bytes." >> "$LOGFILE"

# If the file exists and is not empty, use it as the result of the transformation and exit.
if [ -s "$TEMP_PDFTXT_FILE" ]; then
    echo "Found word list in $TEMP_PDFTXT_FILE" >> "$LOGFILE"
    cat "$TEMP_PDFTXT_FILE" >> "$TARGET"
    rm -f "$TEMP_PDFTXT_FILE"
    exit 0
fi

# No text layer: render each page to a JPEG inside $TMPDIR rather than the current directory.
gs -dSAFER -dBATCH -dNOPAUSE -sDEVICE=jpeg -r300 -dTextAlphaBits=4 -o "$TMPDIR/out_%04d.jpg" -f "$SOURCE"

# OCR each page and append its text to res.txt.
for f in "$TMPDIR"/out_*.jpg; do
  [ -e "$f" ] || continue
  tesseract "$f" "${f%.*}" -l eng
  cat "${f%.*}.txt" >> "$TMPDIR/res.txt"
  rm -f "${f%.*}.txt" "$f"
done

# Write the combined text of all pages to the target.
cat "$TMPDIR/res.txt" >> "$TARGET"
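
The script keeps its intermediate files under one fixed TMPDIR with fixed names (pdftext.txt, res.txt), so two transformations running at the same time could overwrite each other's files. One possible way around that, sketched below, is to give every run its own scratch directory. run_in_scratch_dir is a hypothetical helper name, and this is only an illustration of the idea, not something taken from the Seedim post.

```shell
#!/bin/bash
# Hypothetical helper: run a command inside a fresh scratch directory
# and remove that directory afterwards, whatever the command's exit status.
run_in_scratch_dir() {
  local scratch status
  scratch=$(mktemp -d) || return 1
  # The subshell keeps the caller's working directory unchanged.
  ( cd "$scratch" && "$@" )
  status=$?
  rm -rf "$scratch"
  return "$status"
}

# Example: each call sees its own empty directory.
# run_in_scratch_dir pwd
```

The OCR steps could then write their page images and text files relative to the current directory without colliding with another run.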


Finally, I added the transformer priority settings to alfresco-global.properties:



content.transformer.pdfimg2ocrtxt.priority=30
content.transformer.pdfimg2ocrtxt.extensions.pdf.txt.supported=true
content.transformer.pdfimg2ocrtxt.extensions.pdf.txt.priority=30
content.transformer.pdfimg2ocrtxt.extensions.pdf.txt.maxSourceSizeKBytes.use.index=9999
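
A misspelled key in these settings would most likely go unnoticed, so it may be worth confirming that each one is present and uncommented in the file Alfresco actually reads. A minimal sketch, assuming a hypothetical helper named missing_props; the properties path in the example comment is also an assumption based on the install directory used above.

```shell
#!/bin/bash
# Hypothetical helper: report which of the given keys have no
# uncommented "key=value" line in a properties file.
missing_props() {
  local file=$1 key status=0
  shift
  for key in "$@"; do
    # Compare the text before the first "=" exactly, so commented-out
    # lines ("#key=...") and longer keys do not count as a match.
    if ! awk -F= -v k="$key" '$1 == k { found = 1 } END { exit !found }' "$file"; then
      echo "missing: $key"
      status=1
    fi
  done
  return "$status"
}

# Example (path is an assumption):
# missing_props /opt/alfresco-community/tomcat/shared/classes/alfresco-global.properties \
#   content.transformer.pdfimg2ocrtxt.priority \
#   content.transformer.pdfimg2ocrtxt.extensions.pdf.txt.supported
```

The function prints one line per missing key and returns non-zero if any key is absent.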


But when I upload a PDF containing scanned images, it is still not indexed.
To check whether the transformer is loaded, I added some extra code to its definition, and Alfresco would not start, so the transformer file is being loaded.
Can anyone help me, please? Did I miss something?
