
Lucene Index Erosion

Question asked by dbachem on Oct 22, 2009
Latest reply on Jul 9, 2010 by roberto.negrete
I have worked with Alfresco Labs 3.0 for one year and have repeatedly had problems with the results of the Lucene search engine. The problems came to light most clearly in an extension built around a new store. Some content usage information, which was kept in a separate store, was no longer found properly after restarting the system. It turned out that searching by TYPE, ASPECT and properties of type d:boolean and d:category worked as expected. Only the search for a d:text property failed after the restart (with index.recovery.mode = AUTO). So I started an index rebuild (restart with index.recovery.mode = FULL), which made the problem disappear temporarily. But shortly afterwards, having worked a little with the system, the problem returned, probably caused by arbitrary indexing processes, e.g. after changing any node, or even after uploading some JS WebScripts via WebDAV.

After importing 4,000 PDF documents, the problem has taken on a new dimension. The documents were indexed, but only the basic node properties (Title, Summary, Categories) could be found fairly reliably by full-text search. Searching the content of the PDFs, however, yielded at most a few lucky hits. After several tests it was clear that it was neither a problem of protected PDFs nor a problem of language (mainly German). A debug breakpoint in PdfBoxContentTransformer.transformInternal() finally proved that asynchronous indexing was performed after a PDF was uploaded (after the end of the transaction). But not all terms arrived in the index. Only about 30% of the words I selected (all meaningful nouns) could be found by searching; after restarting Alfresco this fell to about 10%.

Moreover, it turned out that even simple properties such as creator and modifier do not work reliably. Of 11 documents that were created and modified exclusively by the admin user, a search for @cm\:creator:"admin" returned only 4. So I re-indexed once again. After that, 7 of the 11 documents were found with Creator == admin. After I had worked with the system for a short time and uploaded another PDF via the web client, the number dropped again from 7 to 4 hits. So it seems that some Alfresco indexing processes are destroying parts of our (intact) index. Overall I would sum up the whole problem as index erosion.
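For anyone who wants to reproduce the creator test from code, here is a minimal sketch of how the query string above could be built in Java. The class `CreatorQuery` and the helper `propertyQuery` are hypothetical names made up for illustration and are not part of Alfresco; the only thing taken from the post is the query syntax itself. Note that the backslash before the colon has to be doubled inside a Java string literal.

```java
public class CreatorQuery {

    // Hypothetical helper (not an Alfresco API): builds an exact-match
    // Lucene property query of the form @prefix\:localName:"value".
    static String propertyQuery(String prefix, String localName, String value) {
        // Escape backslashes first, then double quotes, so the value
        // stays a single quoted phrase.
        String escaped = value.replace("\\", "\\\\").replace("\"", "\\\"");
        // The colon between prefix and local name is escaped with a backslash.
        return "@" + prefix + "\\:" + localName + ":\"" + escaped + "\"";
    }

    public static void main(String[] args) {
        // Prints: @cm\:creator:"admin"
        System.out.println(propertyQuery("cm", "creator", "admin"));
    }
}
```

Running the same generated query before and after a restart, and comparing the hit counts (here 4, then 7, then 4 again out of 11), gives a repeatable way to observe the erosion.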

Installation details
  • Alfresco Labs 3.0 Stable (Tomcat + MySQL) on Windows and Linux
  • Content inventory that has evolved historically, partly with its own model extensions
  • Additional stores since August / September
  • In September the entire code was reviewed again with regard to ResultSet.close() (since then, ResultSets have been consistently closed in the finally block, as described in the Alfresco wiki)
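
To make the last point concrete, the close-in-finally pattern can be sketched roughly as follows. This is not the actual code from the installation described here. The nested `ResultSet` and `Searcher` interfaces are simplified, hypothetical stand-ins so the example runs without Alfresco on the classpath; in a real repository the corresponding types would be `ResultSet` and `SearchService` from `org.alfresco.service.cmr.search`. `FakeResultSet` exists only to show that `close()` is reached.

```java
public class ClosingSearch {

    // Simplified stand-in for Alfresco's ResultSet (only the methods used here).
    interface ResultSet {
        int length();
        void close();
    }

    // Hypothetical stand-in for the search call; a real SearchService query
    // also takes a store reference and a query language.
    interface Searcher {
        ResultSet query(String luceneQuery);
    }

    // In-memory fake that records whether close() was called.
    static class FakeResultSet implements ResultSet {
        final int hits;
        boolean closed = false;

        FakeResultSet(int hits) { this.hits = hits; }

        public int length() { return hits; }
        public void close() { closed = true; }
    }

    // Runs the query and always closes the result set, even if reading
    // the results throws.
    static int countHits(Searcher searcher, String luceneQuery) {
        ResultSet results = null;
        try {
            results = searcher.query(luceneQuery);
            return results.length();
        } finally {
            if (results != null) {
                results.close();
            }
        }
    }

    public static void main(String[] args) {
        FakeResultSet fake = new FakeResultSet(4);
        int hits = countHits(q -> fake, "@cm\\:creator:\"admin\"");
        System.out.println(hits + " hits, closed=" + fake.closed);
    }
}
```

The null check in the finally block matters because the query call itself may throw before a result set has been assigned.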
Has anyone had similar experiences with Alfresco + Lucene, or perhaps any idea what the problem could be?