Hi,
I have a requirement to perform content search on INDD files in Alfresco. By content search I mean this: I have an INDD file that contains images with text superimposed on them, and any user should be able to find that INDD file by searching for the text shown in the images. Is there any feature in Alfresco that can meet this requirement?
Thanks,
S
Why not export your INDD file to PDF in Adobe InDesign and save the PDF to Alfresco?
Or you can create both formats and save them in Alfresco. The PDF can be stored as a rendition of the INDD file.
Then integrate an OCR converter to OCR the PDF.
Roughly speaking, the way Alfresco indexes content is by first transforming a document to text. It then indexes that text.
I suspect Adobe InDesign (INDD) files are not plain text. If you have a Java class that knows how to extract the text from an InDesign file, you can write a custom content transformer that transforms from INDD to TXT using that Java API. Once you do that, Alfresco will be able to start indexing INDD files.
If there is no such API available in Java, then you'll have to use a workaround such as the one suggested above.
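To make the transformer idea a bit more concrete, here is a minimal plain-Java sketch of its shape. It deliberately does not use the Alfresco API, so it compiles on its own. `InddToTextSketch` and `InddTextExtractor` are hypothetical names, and the `application/x-indesign` MIME type is an assumption you should check against your repository's mimetype configuration. In a real AMP the same two decisions ("can I handle this source/target pair?" and "write the extracted text to the target") would live in whatever transformer extension point your Alfresco version provides.

```java
import java.nio.charset.StandardCharsets;

/** Plain-Java stand-in for the "INDD to text" transformer described above (not the Alfresco API). */
final class InddToTextSketch {

    // Assumed MIME type for InDesign documents; verify it in your own installation.
    static final String SOURCE_MIMETYPE = "application/x-indesign";
    static final String TARGET_MIMETYPE = "text/plain";

    /** Hypothetical hook for whatever library can pull text out of an INDD file. */
    interface InddTextExtractor {
        String extractText(byte[] inddBytes);
    }

    /** The transformer should only claim the INDD -> plain text pair. */
    static boolean isTransformable(String sourceMimetype, String targetMimetype) {
        return SOURCE_MIMETYPE.equals(sourceMimetype) && TARGET_MIMETYPE.equals(targetMimetype);
    }

    /** Produces the UTF-8 text that would be handed to the indexer. */
    static byte[] transform(byte[] inddBytes, InddTextExtractor extractor) {
        return extractor.extractText(inddBytes).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Dummy extractor standing in for a real INDD parser or an OCR pipeline.
        InddTextExtractor dummy = bytes -> "text found in the document";
        byte[] text = transform(new byte[0], dummy);
        System.out.println(new String(text, StandardCharsets.UTF_8));
    }
}
```

The extractor is kept behind an interface so that the same transformer shape works whether the text comes from a native INDD parser or from an OCR step.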
Thanks for your input.
Because of certain constraints, we can't use Adobe InDesign to convert INDD to PDF. Instead, we have used exiftool and ImageMagick to convert INDD to PDF. But in this PDF the text is part of the image, so it's not searchable.
So I'm using OCR to extract the text from the PDF. But now I need to add this text to the metadata of the file to make it searchable. Can you please advise on how to do that?
In my opinion:
1. A better solution is to use a library to combine your original PDF with the extracted OCR text into a searchable PDF, and save that PDF into the repository. Then it can be searched directly.
2. If you store the extracted text file and the original PDF separately, I think you can reimplement the webscript /api/solr/textContent, which is used to get the content of a node property as text during indexing. In your implementation, for this kind of PDF document, return your extracted text file directly.
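For option 1, one possible route is to let Tesseract write the searchable PDF itself: when the `pdf` config is passed on the command line, it produces a PDF containing the page image plus a hidden text layer. The sketch below only builds and runs that command from Java. It assumes the page has already been rasterized to an image such as a PNG (for example with ImageMagick), that `tesseract` is on the PATH, and that the `eng` language data is installed. The class name `TesseractSketch` and the file names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper that builds and runs a Tesseract CLI call for one page image. */
final class TesseractSketch {

    /**
     * Builds: tesseract <image> <outputBase> -l <lang> [pdf]
     * With "pdf", Tesseract writes <outputBase>.pdf with a hidden text layer;
     * without it, it writes <outputBase>.txt.
     */
    static List<String> buildCommand(Path image, Path outputBase, String lang, boolean searchablePdf) {
        List<String> cmd = new ArrayList<>();
        cmd.add("tesseract"); // assumed to be on the PATH
        cmd.add(image.toString());
        cmd.add(outputBase.toString());
        cmd.add("-l");
        cmd.add(lang);
        if (searchablePdf) {
            cmd.add("pdf");
        }
        return cmd;
    }

    /** Runs the command, forwarding its output to this process, and returns the exit code. */
    static int run(List<String> cmd, long timeoutSeconds) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(cmd).inheritIO().start();
        if (!p.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            p.destroyForcibly();
            throw new IOException("Timed out: " + String.join(" ", cmd));
        }
        return p.exitValue();
    }

    public static void main(String[] args) {
        // Only prints the command; call run(...) once Tesseract is installed.
        List<String> cmd = buildCommand(Paths.get("page-1.png"), Paths.get("page-1"), "eng", true);
        System.out.println(String.join(" ", cmd));
    }
}
```

Passing `false` for `searchablePdf` gives the plain text file instead, which is what option 2 would return from the reimplemented webscript.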
Thank you, kayne zhang!
I was able to extract the text and add it to the metadata using the webscript.
Until now, the whole processing of the INDD file has been a standalone Java program that connects to Alfresco to update the metadata. But now I need to deploy it as an AMP on Alfresco. So my question is: how do I integrate exiftool and Tesseract with Alfresco? I see that ImageMagick is already present in Alfresco.
Thanks
You can use the following as a reference:
Alfresco Simple OCR Action | Alfresco Add-ons - Alfresco Customizations
Or even this one: https://github.com/keensoft/alfresco-simple-ocr
which is a more complete and more recent add-on.
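If you would rather keep your own code than adopt an add-on, the external tools can be called from Java inside the AMP in much the same way as from the standalone program. The sketch below is a minimal, stdlib-only runner that starts a command and captures its output. Here it is only used to check that the tools are installed, via `exiftool -ver` and `tesseract --version`. The class name `ExternalToolSketch` is hypothetical, and in a real deployment the executable paths would more likely come from configuration than from the PATH.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper for calling command-line tools such as exiftool or tesseract. */
final class ExternalToolSketch {

    /**
     * Runs a command and returns its combined stdout/stderr.
     * Returns empty if the command cannot be started, times out, or exits non-zero.
     * Note: the read loop blocks until the tool closes its output, so the timeout
     * only guards the final wait for the process to exit.
     */
    static Optional<String> runAndCapture(long timeoutSeconds, String... command) {
        try {
            Process p = new ProcessBuilder(command).redirectErrorStream(true).start();
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            try (InputStream in = p.getInputStream()) {
                byte[] chunk = new byte[4096];
                int n;
                while ((n = in.read(chunk)) != -1) {
                    buf.write(chunk, 0, n);
                }
            }
            if (!p.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                p.destroyForcibly();
                return Optional.empty();
            }
            if (p.exitValue() != 0) {
                return Optional.empty();
            }
            return Optional.of(new String(buf.toByteArray(), StandardCharsets.UTF_8).trim());
        } catch (IOException e) {
            // Typically means the executable was not found.
            return Optional.empty();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println("exiftool available: " + runAndCapture(10, "exiftool", "-ver").isPresent());
        System.out.println("tesseract available: " + runAndCapture(10, "tesseract", "--version").isPresent());
    }
}
```

Running a check like this at startup makes a missing binary show up as a clear log message instead of a failed transformation later on.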