OCR Scanned PDF for Search Indexing

Question asked by pjaromin on May 27, 2012
Latest reply on Aug 1, 2012 by wmay
I'm relatively new to Alfresco and have recently set up an environment where scanned bitmaps are run through a transformer to text/plain using Tesseract OCR. This works brilliantly for single-page documents scanned into PNG, JPEG, TIFF, etc.

For multi-page documents my scanner creates a PDF. However, the standard PDF transformer to text/plain obviously doesn't do OCR. For documents in this "scanner" space I'd always want to run them through OCR (probably with a custom transformer, which I have no trouble coding). However, I don't wish to remove or override the PDFBox transformer for the majority of PDFs that already contain extractable text.

So what's the best solution here? I'm thinking I could extend the PDFBox transformer to include OCR and merge the results, but this seems a bit messy. Is there a way to chain multiple transformers together for a given MIME type? Or is there a way to select a specific transformer based on the space?
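
To make the "extend and merge" idea concrete, here is a rough, language-neutral sketch of the simplest variant: use the embedded text when there is enough of it, otherwise fall back to OCR. None of this is Alfresco, PDFBox, or Tesseract API. `extract_text` and `run_ocr` are hypothetical callables standing in for the existing PDF text extraction and the OCR step, and the `min_chars` threshold is an arbitrary assumption that would need tuning.

```python
from typing import Callable

# Hypothetical extractor type: takes raw PDF bytes, returns plain text.
Extractor = Callable[[bytes], str]


def extract_with_ocr_fallback(
    pdf_bytes: bytes,
    extract_text: Extractor,  # stand-in for the normal PDF text extraction
    run_ocr: Extractor,       # stand-in for the OCR pipeline
    min_chars: int = 20,      # assumed threshold, not taken from any library
) -> str:
    """Return embedded text if the PDF has enough of it, else OCR output."""
    text = extract_text(pdf_bytes)
    # Count only non-whitespace characters, so a "text layer" made of
    # blank lines or stray spaces does not count as real content.
    visible_chars = len("".join(text.split()))
    if visible_chars >= min_chars:
        return text
    return run_ocr(pdf_bytes)
```

With a check like this, ordinary PDFs would never hit the OCR step, and only scans with an empty or near-empty text layer would pay the OCR cost. Whether such logic is better placed in a custom transformer or in a rule on the space is exactly the open question.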

Or should I create a rule that runs on PDFs in this space to OCR them and place the text in a specific property that's set to searchable? Or perhaps something else I'm completely oblivious to.

Suggestions?

Thanks!

-Patrick
