
Server Side OCR with ABBYY Recognition Server

Question asked by abruzzi on Jun 9, 2009
Latest reply on Aug 1, 2012 by wmay
Just thought I'd post an FYI for anyone looking to integrate server-side OCR into Alfresco: we've found a pretty good solution, ABBYY Recognition Server.  It doesn't integrate out of the box, but since it provides a SOAP interface, it's pretty easy to whip up a script and a transformer.  Since I know PHP, and PHP has some easy SOAP functionality, I wrote my script in PHP (excuse the ugly code):

<?php

class files_object{
   public $FileName;
   public $FileContents;
}

class ProcessFile{

   public $location = "ogre.co.dac.int";
   public $workflowName;
   public $file;
}



// Show usage and exit successfully when help is requested.
if (isset($argv[1]) && ($argv[1] == "--help" || $argv[1] == "-h")) {
   print "Usage: php ocr.php <source file> <destination file> <ocr workflow name>\n";
   exit(0);
} else {

   // Missing arguments default to null so the check below can report them.
   $input_file_name = $argv[1] ?? null;
   $output_file_name = $argv[2] ?? null;
   $workflow = $argv[3] ?? null;
   
   if (is_null($input_file_name) || is_null($output_file_name) || is_null($workflow) ) {
      print "Usage: php ocr.php <source file> <destination file> <ocr workflow name>\n";
      // exit() sets the process exit status, so the caller can see the failure.
      exit(2);
   }

   if(!file_exists($input_file_name) || !is_readable($input_file_name)) {
      print "Input file cannot be read or does not exist. Exiting.\n";
      exit(2);
   } else {

      // Read the whole source file into a binary-safe string.
      $input_file_content = file_get_contents($input_file_name);

      $file = new files_object;
      $file->FileName = basename($input_file_name);
      $file->FileContents = $input_file_content;


      $soap_process = new ProcessFile;

      $soap_process->workflowName = $workflow;
      $soap_process->file = $file;

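      try {
         // WSDL of the Recognition Server SOAP service; replace the host with your own server.
         $client = new SoapClient("http://ogre.co.dac.int/RecognitionWS/RSSoapService.asmx?wsdl");
         $results = $client->ProcessFile($soap_process);
      } catch (SoapFault $fault) {
         fwrite(STDERR, "OCR request failed: " . $fault->getMessage() . "\n");
         exit(1);
      }

      // This path assumes the workflow returns one output document with one file.
      $container = $results->ProcessFileResult->InputFiles->InputFile->OutputDocuments->OutputDocument->Files->FileContainer;
      $content = $container->FileContents;
      $name = $container->FileName;

      // Write the recognized document to the target path in binary mode.
      $output_filehandle = fopen($output_file_name, "wb");
      fwrite($output_filehandle, $content);
      fclose($output_filehandle);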
   }
}
?>

The script is invoked from the command line as:

php ocr.php {source file} {target file} {workflow name}
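One caveat about the response handling in ocr.php: it reads a single FileContainer through a chain of properties. PHP's SoapClient normally decodes a repeated element as a plain object when there is one occurrence and as an array when there are several, so that chain could break if a workflow ever returned more than one file. Here is a minimal sketch of a more tolerant reader. The helper names as_list and collect_file_containers are made up for illustration, and the property names are simply the ones the script already uses; I have not checked them against the Recognition Server WSDL beyond that.

```php
<?php
// Hypothetical helper: returns $value as a list, whether SoapClient decoded
// it as a single object (one element) or as an array (several elements).
function as_list($value): array
{
    if ($value === null) {
        return [];
    }
    return is_array($value) ? $value : [$value];
}

// Hypothetical helper: collects every FileContainer found in a ProcessFile
// response. Property names are taken from the ocr.php script, not from the WSDL.
function collect_file_containers($results): array
{
    $containers = [];
    foreach (as_list($results->ProcessFileResult->InputFiles->InputFile ?? null) as $inputFile) {
        foreach (as_list($inputFile->OutputDocuments->OutputDocument ?? null) as $document) {
            foreach (as_list($document->Files->FileContainer ?? null) as $container) {
                $containers[] = $container;
            }
        }
    }
    return $containers;
}
```

In the script you would call collect_file_containers($results) and write out the first entry, or loop over all of them. Another option is to pass 'features' => SOAP_SINGLE_ELEMENT_ARRAYS to the SoapClient constructor, which makes repeated elements come back as arrays even when there is only one.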

(Note: Recognition Server can define multiple workflows.  Currently I have OCRtoPDF, which returns a PDF document, and OCRtoTXT, which returns plain text.)
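If you would rather not pass the workflow name by hand, one possibility is to derive it from the target file's extension. This is only a sketch under an assumption I have not verified: that the target path the script receives ends in an extension matching the output type (.pdf or .txt). The constant WORKFLOWS_BY_EXTENSION and the function workflow_for_target are hypothetical names; OCRtoPDF and OCRtoTXT are the two workflows mentioned in the note above.

```php
<?php
// Hypothetical mapping from target file extension to Recognition Server workflow.
const WORKFLOWS_BY_EXTENSION = [
    'pdf' => 'OCRtoPDF',
    'txt' => 'OCRtoTXT',
];

// Returns the workflow name for a target path, or null when the extension
// has no workflow assigned (including paths with no extension at all).
function workflow_for_target(string $target_file_name): ?string
{
    $extension = strtolower(pathinfo($target_file_name, PATHINFO_EXTENSION));
    return WORKFLOWS_BY_EXTENSION[$extension] ?? null;
}
```

A null result would be treated like a missing argument: print the usage message and exit with a non-zero status.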

Then we simply created a new context file (this example is for tiff->pdf and tiff->txt):


<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE beans PUBLIC '-//SPRING//DTD BEAN//EN' 'http://www.springframework.org/dtd/spring-beans.dtd'>

<beans>

   <bean id="transformer.TIFF.OCR" class="org.alfresco.repo.content.transform.RuntimeExecutableContentTransformer" parent="baseContentTransformer">
      <property name="transformCommand">
         <bean class="org.alfresco.util.exec.RuntimeExec">
            <property name="commandMap">
                <map>
                    <entry key="Linux">
                        <value>php /srv/alfresco/bin/ocr.php ${source} ${target} OCRtoPDF</value>
                    </entry>
                </map>
            </property>
         </bean>
      </property>
      <property name="explicitTransformations">
         <list>
            <bean class="org.alfresco.repo.content.transform.ExplictTransformationDetails" >
                <property name="sourceMimetype"><value>image/tiff</value></property>
                <property name="targetMimetype"><value>application/pdf</value></property>
            </bean>
         </list>
      </property>
   </bean>

   <bean id="transformer.TIFF.TXT" class="org.alfresco.repo.content.transform.RuntimeExecutableContentTransformer" parent="baseContentTransformer">
      <property name="transformCommand">
         <bean class="org.alfresco.util.exec.RuntimeExec">
            <property name="commandMap">
                <map>
                    <entry key="Linux">
                        <value>php /srv/alfresco/bin/ocr.php ${source} ${target} OCRtoTXT</value>
                    </entry>
                </map>
            </property>
         </bean>
      </property>
      <property name="explicitTransformations">
         <list>
            <bean class="org.alfresco.repo.content.transform.ExplictTransformationDetails" >
                <property name="sourceMimetype"><value>image/tiff</value></property>
                <property name="targetMimetype"><value>text/plain</value></property>
            </bean>
         </list>
      </property>
   </bean>

</beans>


We obviously have other MIME conversions, but this is the core of it.  The nice thing is that the image-to-text transforms mean that any time a JPEG, TIFF, or PNG is uploaded, the image->txt conversion fires off and indexes any text found in the image while leaving the document intact in its original version.

Recognition Server seems to have pretty good accuracy and can be spread over multiple systems to speed things up.  The other benefit is that it is surprisingly inexpensive.  Since Kofax client-side OCR is the only Alfresco "supported" OCR, hopefully ABBYY RS will be a good alternative if you need or prefer server-side OCR.

Geof
