I previously figured out that tesseract 4.00.00alpha is needed to get a correct text selection offset in Alfresco Community 201707.
Now only one problem is left: when I select text in an OCRed PDF, the spaces between words are lost.
In all other PDF viewers the spaces are selectable.
For automatic OCR (a folder rule) I use ocrmypdf 5.4.2 or pdfsandwich 0.1.6 with tesseract 4.00.00alpha under Ubuntu 16.04 LTS.
This is a real problem when I need to use the OCRed text from a PDF viewed in Alfresco, because all the words run together like one big word.
I also think it is bad for the search engine:
In the text "Test PDF from LibreOffice" I can search by single words such as "from", "Test", etc.
But I can't search for the phrases "PDF from" or "Test PDF" (although I can find "PDFfrom" and "TestPDF"), which is not good.
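
To make the symptom easier to reproduce, here is a minimal Python sketch that checks whether a phrase has lost its spaces in a piece of extracted text. It only does plain substring comparison and does not model how Alfresco or its search engine actually tokenizes text; the function name `spaces_lost` and the sample strings are assumptions made up for illustration.

```python
def spaces_lost(extracted: str, phrase: str) -> bool:
    """Return True if `phrase` is missing from `extracted` as written,
    but is present once its spaces are removed.

    Hypothetical helper: `extracted` is assumed to be the text layer of the
    OCRed PDF, obtained by whatever tool is at hand.
    """
    joined = "".join(phrase.split())  # "PDF from" -> "PDFfrom"
    return phrase not in extracted and joined in extracted


# Sample strings (assumed, modelled on the behaviour described above).
good_text = "Test PDF from LibreOffice"
bad_text = "TestPDFfrom LibreOffice"

print(spaces_lost(good_text, "PDF from"))  # False: the phrase is intact
print(spaces_lost(bad_text, "PDF from"))   # True: only "PDFfrom" is present
```

Running the same check on text copied from another PDF viewer and on text copied from the Alfresco viewer should show on which side the spaces disappear.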
What can I do to fix it? (The fix may be on the Alfresco side, or it may be an option on the OCR engine side.)
An OCRed sample is attached.
P.S. This bug report from GitHub may be useful for developers.