
How to get content from a PDF document

Question asked by joncmuniz on Apr 24, 2014
Hi all. I'm using Alfresco 4.2e and CMIS 0.10.0.
I have PDF documents whose content I want to extract, for example to present a summary, and also to show the passage containing the text the user searched for.
To do this I currently have to create plain-text copies of the documents (converting the PDF stream to text on the fly was very slow). I get the content as follows.
Load the documents under the target folder:

ItemIterable<QueryResult> documentsResultSet = sessionCopia.query(
        "SELECT * FROM cmis:document WHERE IN_FOLDER('" + parentFolder + "')"
        + " AND cmis:name = '" + fileName + "'", false).getPage();
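
A side note on the query above: building it by plain concatenation breaks as soon as the file name contains a single quote. Below is a minimal sketch of a quoting helper, assuming the repository accepts CMIS 1.1-style backslash escaping inside string literals and that parentFolder holds the folder's object ID. The class and method names (CmisQueryBuilder, quote, documentByName) are hypothetical and not part of any CMIS library.

```java
/** Hypothetical helper for building the query string; not part of OpenCMIS or Alfresco. */
final class CmisQueryBuilder {

    private CmisQueryBuilder() {}

    /**
     * Wraps a value in single quotes, escaping embedded single quotes and
     * backslashes with a backslash (assumed CMIS 1.1 escaping rules).
     */
    static String quote(String value) {
        StringBuilder sb = new StringBuilder(value.length() + 2);
        sb.append('\'');
        for (char c : value.toCharArray()) {
            if (c == '\'' || c == '\\') {
                sb.append('\\');
            }
            sb.append(c);
        }
        sb.append('\'');
        return sb.toString();
    }

    /** Builds the same query as in the post, with both values quoted safely. */
    static String documentByName(String parentFolderId, String fileName) {
        return "SELECT * FROM cmis:document WHERE IN_FOLDER(" + quote(parentFolderId) + ")"
                + " AND cmis:name = " + quote(fileName);
    }
}
```

The resulting string could then be passed to sessionCopia.query(..., false) in place of the concatenated literal.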


Then I fetch the document by its ID:

CmisObject object = sessionCopia
        .getObject(documentSearchResult.getId());
Document document = (Document) object;


Here I read the whole stream, convert it to a string, and look for searchParam in the plain text, using the Java API:


return TransformAndExtractInputStreamForStringCmis.
getInputStreamToText(document.getContentStream().getStream(), searchParam);
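
For readers following along, here is a minimal sketch of what a helper like the one called above might do: read the whole plain-text stream into a string, then return the text surrounding the first match of the search term. It is not the actual TransformAndExtractInputStreamForStringCmis code, which is not shown in the post. The class name SnippetExtractor, its methods, and the context parameter are hypothetical, and the stream is assumed to be UTF-8 plain text.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper; not part of OpenCMIS or Alfresco. */
final class SnippetExtractor {

    private SnippetExtractor() {}

    /** Reads the whole stream into a String, assuming UTF-8 plain text. */
    static String readAll(InputStream in) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /**
     * Returns the text around the first case-insensitive match of searchParam,
     * with up to `context` characters on each side, or null if there is no match.
     */
    static String snippet(String text, String searchParam, int context) {
        int len = searchParam.length();
        if (len == 0) {
            return null;
        }
        // regionMatches avoids lower-casing the whole text, which could shift indexes.
        for (int i = 0; i + len <= text.length(); i++) {
            if (text.regionMatches(true, i, searchParam, 0, len)) {
                int start = Math.max(0, i - context);
                int end = Math.min(text.length(), i + len + context);
                return text.substring(start, end);
            }
        }
        return null;
    }
}
```

With the document from the post, this could be called as SnippetExtractor.snippet(SnippetExtractor.readAll(document.getContentStream().getStream()), searchParam, 80). Reading the full stream for every hit is exactly the cost the question is trying to avoid, so this only illustrates the current approach.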


Could someone point me to a better way of doing this? I thought I could run the search within the document content and extract the matching text using something that is already indexed.
Finding the indexed document was easy.
But what would finding the matching content inside it, and extracting it with the CMIS API, look like?

Thank you.
