Hello,
I want to retrieve some information from the text of PDF files (scanned documents).
I started by running pdfsandwich (OCR) to extract the text from the images; it adds the text to each page invisibly "behind" the image. What I want to do now is search that text for the information I need. How can I do that? Is it done with a Lucene search? I'm new to this and don't know where to start, so an example would be a big help.
Thank you.
So, if you are considering writing scripts that run inside the Alfresco Repository application, you may want to look into the documentation of the JavaScript API, especially the part about accessing content-related attributes. With JavaScript, though, you will generally be limited to reading textual content files directly, i.e. not PDF files, which are more or less binary even when a text layer has been added.
However, if the text layer has been added by OCR, Alfresco will be able to index the document using SOLR. You can then use JavaScript to execute a search query against the content and pick up the matching documents for further processing in the script. You just may not be able to read the content of the PDF itself from the script, only search it indirectly via its indexed text in SOLR.
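As a starting point, here is a minimal sketch of such a search from a repository-side script. It assumes the root-scope `search` object of the JavaScript API and its `query()` method with the `fts-alfresco` language, where the `TEXT` field matches the indexed content. The helper names `buildContentQuery` and `findDocumentsContaining` and the default limit of 50 are made up for this illustration, so please check the details against the documentation for your Alfresco version.

```javascript
// Hypothetical helper: builds an FTS query that matches a phrase in the
// indexed content (the TEXT field) of cm:content nodes.
// Backslashes and double quotes in the phrase are escaped.
function buildContentQuery(phrase) {
    var escaped = String(phrase).replace(/\\/g, "\\\\").replace(/"/g, '\\"');
    return 'TYPE:"cm:content" AND TEXT:"' + escaped + '"';
}

// Hypothetical helper: runs the query and returns name/nodeRef pairs.
// searchService is assumed to be the repository's root-scope `search` object.
function findDocumentsContaining(searchService, phrase, maxItems) {
    var nodes = searchService.query({
        query: buildContentQuery(phrase),
        language: "fts-alfresco",
        page: { maxItems: maxItems || 50 } // 50 is an arbitrary default
    });
    var results = [];
    for (var i = 0; i < nodes.length; i++) {
        results.push({
            name: nodes[i].name,
            nodeRef: String(nodes[i].nodeRef)
        });
    }
    return results;
}
```

Inside a repository script you would call it as `findDocumentsContaining(search, "invoice number", 20)`, where "invoice number" is only a placeholder phrase. Keep in mind that a document will only be found once SOLR has indexed its OCR text layer.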