
OCRed PDF in Alfresco loses the spaces between words

Question asked by djarty on Feb 8, 2018

Hello!

Earlier I figured out that I need to use tesseract 4.00.00alpha to get correct text selection offsets in Alfresco Community 201707.

Now there is just one last problem: when I select text in an OCRed PDF, the spaces between words are lost.

 

In all other PDF viewers the spaces are selectable.

For automatic OCR (via a folder rule) I use ocrmypdf 5.4.2 or pdfsandwich 0.1.6 with tesseract 4.00.00alpha under Ubuntu 16.04 LTS.

 

This is a real problem if I need to use the OCRed text from a PDF viewed in Alfresco (all the words run together like one big word).

I also think it is bad for the search engine:

 

In the text "Test PDF from LibreOffice" I can search for single words such as "from", "Test", etc.

But I can't search for the phrases "PDF from" or "Test PDF" (although I can search for "PDFfrom" and "TestPDF"), which is not good.
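
To narrow down which stage drops the spaces, the text extracted at each stage can be compared with the known content. Below is a minimal Python sketch of such a check. It assumes the extracted text is already available as a string (how it is obtained is not shown), the helper name `spaces_lost` is hypothetical, and it does not model how Alfresco actually indexes or renders text.

```python
def spaces_lost(extracted: str, expected: str) -> bool:
    """Return True if `extracted` equals `expected` only when whitespace is ignored.

    Hypothetical helper for illustration: it flags text whose characters are
    all present but whose word-separating whitespace differs or is missing.
    """
    def squash(s: str) -> str:
        # Remove every whitespace character (spaces, tabs, newlines).
        return "".join(s.split())

    return extracted != expected and squash(extracted) == squash(expected)


expected = "Test PDF from LibreOffice"

# Text as it appears when selected in the Alfresco viewer (spaces missing).
print(spaces_lost("TestPDFfromLibreOffice", expected))

# Text as it appears in other PDF viewers (spaces present).
print(spaces_lost("Test PDF from LibreOffice", expected))
```

Running the same check on the text produced by the OCR tool and on the text copied from the Alfresco viewer should show whether the spaces are already missing in the PDF text layer or only disappear on the Alfresco side.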

 

What can I do to fix it? (Maybe there are options on the Alfresco side, or on the OCR engine side.)

 

An OCRed sample is attached.

 

P.S. This bug report on GitHub may be useful for developers:

PDF output is missing spaces in some cases, while TXT output contains them · Issue #1235 · tesseract-ocr/tesseract · Git…  

 
