
OCR in place, document versioning

Question asked by abruzzi on May 20, 2014
Latest reply on May 20, 2014 by romschn
We have an OCR server (ABBYY Recognition Server) that I have integrated into Alfresco using the ABBYY SOAP interface. We have had PDF to TXT, JPG to TXT, TIFF to TXT, and a few other transforms in place for some time. I wanted a mechanism where, if someone uploads a scanned PDF (without a text layer), the document is automatically versioned and its content replaced with the OCRd PDF returned by the OCR server.

I came up with the following JS code that basically works:


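     // Make sure the document keeps a version history.
     if (!document.isVersioned) {
         document.addAspect("cm:versionable");
     }

     // Send the document through the OCR transform; the result lands in a temp space.
     var tempDir = companyhome.childByNamePath("tmp");
     var updatedVersion = document.transformDocument("application/vnd.dac", tempDir);

     // Copy the OCRd content into a working copy and check it in as a new minor version.
     var workingCopy = document.checkout();
     workingCopy.properties.content.write(updatedVersion.properties.content);
     workingCopy.properties.content.mimetype = "application/pdf";
     workingCopy.save();
     workingCopy.checkin("OCRd by ABBYY", false);

     // Clean up the temporary transform result.
     updatedVersion.remove();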

As a brief explanation, I invented a mimetype, "application/vnd.dac", so that I could have a transform in the system that sends the file to the OCR server and gets back an OCRd PDF. The content of that file is then written into the new version and its mimetype set back to "application/pdf". So the made-up mimetype is only temporary.

Right now I'm running this on a 3.1.2 machine. It will eventually be moved to a 4.2 box, but my 4.2 box isn't quite ready for prime time.

If I manually run the script on an existing file, it works fine. However, I want it to run automatically, so I set it up as a rule on a space, and that's where I get some odd behavior. If I don't run the rule in the background, the web interface makes the user wait on the OCR server. That could be fast, or it could take several minutes. Not ideal for the user, but expected. If I set the rule to run in the background instead, I get some strange behavior. Specifically, the document gets two versions, but version 1.0 is the OCRd document and the current version, 1.1, is the original upload (see attached image).

[img]https://forums.alfresco.com/sites/forums/files/Screen%20Shot%202014-05-19%20at%204.57.04%20PM.png[/img]

So I figure the rule/script is getting ahead of the upload new file process.  My first thought was to come up with a delay and some kind of test to see if the upload process is done before triggering the OCR process, however the Rhino JS engine doesn't seem to implement setInterval() and I'm not sure how else to get that effect.  I'm open to suggestions on why this is happening, and how I might be able to get around this.
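For reference, the kind of delay-and-test loop I have in mind looks roughly like the sketch below. It is only a sketch under assumptions: `waitUntil` is a made-up helper name, and the readiness test and the sleep function are passed in by the caller because I don't know what the right "upload is done" check is. Without setInterval(), the only delay I can think of in Rhino is a blocking one such as `java.lang.Thread.sleep`, and I'm assuming (not sure) that the script is allowed to reach Java classes at all.

```javascript
// Poll a readiness test up to maxAttempts times, pausing between checks.
// isReady: function returning true once the document is safe to process (hypothetical test).
// sleep:   function(ms) that blocks for ms milliseconds (supplied by the caller).
// Returns the number of checks it took, or -1 if it gave up.
function waitUntil(isReady, sleep, delayMs, maxAttempts) {
    for (var attempt = 1; attempt <= maxAttempts; attempt++) {
        if (isReady()) {
            return attempt;
        }
        if (attempt < maxAttempts) {
            sleep(delayMs);   // no need to pause after the final failed check
        }
    }
    return -1;
}

// Possible use inside the rule script (assumptions: Java access is permitted,
// and a non-zero size is a good enough sign that the upload has finished):
//
// var attempts = waitUntil(
//     function () { return document.size > 0; },
//     function (ms) { java.lang.Thread.sleep(ms); },
//     2000, 30);
// if (attempts === -1) { logger.log("Gave up waiting for upload to finish"); }
```

A blocking sleep ties up whatever thread runs the background rule for the whole wait, so the attempt limit matters.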

thanks,

Geof
