Some DOCX files not indexed correctly - Outdated Tika version?

Question asked by chkk on Aug 4, 2016
Hello all,

I recently noticed that some Word 2013 (DOCX) content is not searchable through the SOLR full-text index. This is on Alfresco Community 5.1, tested with 2016-05 and 2016-06-AE.

The problem is that wherever a Word Content Control is used in a document, only the label of the control is put in the index, not its value.

Knowing that Apache Tika is used to transform documents into text streams for the full-text index, I checked the same document with the Tika standalone app. Tika versions 1.5 (the version of the Tika jars in the SOLR directories) and 1.6 (the version in the Alfresco directories) indeed do not include the values in the extracted text. I repeated the test with the current Tika version, 1.13, and it works fine.
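
The comparison described above can be scripted. What follows is only a minimal sketch, not part of the original test. It assumes the content controls live in the main document part (word/document.xml) as w:sdt elements and that the Tika output has been saved to a plain-text file. The function names content_control_values and missing_values are made up for this illustration.

```python
import sys
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def content_control_values(docx_path):
    """Return the text inside each content control (w:sdt) of a DOCX file."""
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    values = []
    for sdt in root.iter(f"{{{W}}}sdt"):
        content = sdt.find(f"{{{W}}}sdtContent")
        if content is None:
            continue
        # Concatenate every text run (w:t) found inside the control's content.
        values.append("".join(t.text or "" for t in content.iter(f"{{{W}}}t")))
    return values


def missing_values(docx_path, extracted_text):
    """List content-control values that do not occur in the extracted text."""
    return [
        value
        for value in content_control_values(docx_path)
        if value.strip() and value not in extracted_text
    ]


# Hypothetical usage: python check_sdt.py document.docx tika_output.txt
if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[2], encoding="utf-8") as handle:
        for value in missing_values(sys.argv[1], handle.read()):
            print("not in extracted text:", value)
```

The text file could be produced with the standalone app's --text option, for example java -jar tika-app-1.13.jar --text document.docx > tika_output.txt (the jar file name depends on the download). A value that spans several paragraphs is joined without separators here, so it may be reported as missing even when Tika did extract it.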

As a test, I tried copying the current tika-core-1.13 and tika-parsers-1.13 jar files to tomcat/lib, hoping to override the existing versions, but this does not seem to work. Is there any way to make Alfresco use a more recent Tika version? Or can the full-text indexing be intercepted to call an external Tika parser to provide the text stream?
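
To see which Tika jars end up side by side after such a copy, a small scan of the installation can help. This is only a sketch: it assumes the jars follow the usual tika-<artifact>-<version>.jar naming, and the function name find_tika_jars and the root directory passed to it are placeholders, not anything Alfresco provides.

```python
import re
import sys
from pathlib import Path

# Matches names such as tika-core-1.13.jar or tika-parsers-1.6.jar.
JAR_PATTERN = re.compile(r"^(tika-[a-z-]+?)-(\d+(?:\.\d+)*)\.jar$")


def find_tika_jars(root):
    """Map each Tika artifact name to the (version, path) pairs found under root."""
    found = {}
    for jar in Path(root).rglob("tika-*.jar"):
        match = JAR_PATTERN.match(jar.name)
        if match:
            artifact, version = match.groups()
            found.setdefault(artifact, []).append((version, str(jar)))
    return found


# Hypothetical usage: python find_tika.py /path/to/installation
if __name__ == "__main__" and len(sys.argv) == 2:
    for artifact, hits in sorted(find_tika_jars(sys.argv[1]).items()):
        for version, path in sorted(hits):
            print(artifact, version, path)
```

More than one version listed for the same artifact means several copies are present. The scan only shows where the jars sit, not which one a given web application actually loads.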

Thanks,
Chris
