
Tips on troubleshooting individual file indexing with Alfresco 5.2 / Solr 6

Question asked by dbiggins on Apr 24, 2018
Latest reply on Apr 25, 2018 by dbiggins

I have an Alfresco Community (201707) installation that I am using to compare the default Solr 4 against Solr 6 from the alfresco-search-services-1.1.0 install.


After a full index with Solr 4, I get the following info from the solr4 admin page:

Num Docs: 163458

Max Docs: 163458


Deleted Docs: 0


Master (Searching) 1524504594659 159 6.5 GB


Nodes in Index: 70921
Transactions in Index: 80844
Approx transactions remaining: 0


Unindexed Nodes: 11441
Error Nodes in Index: 0

In the solr4 SUMMARY report, I can see that it's done:

Node count with FTSStatus Clean    69165
Node count with FTSStatus Dirty    0
Node count with FTSStatus New    0

When I test the solr 6 setup, I stop the alfresco app, make the changes to the alfresco install for Solr 6, start the solr server and the alfresco server, and let it re-index.  It plugs along for a few hours, and then completes with the following stats:


Num Docs: 164357

Max Doc: 164357


Deleted Docs: 0


Master (Searching)     1524581958240 586 2.48 GB

And in the SUMMARY report:

Alfresco Nodes in Index    70937
Alfresco Transactions in Index    81470
Alfresco Unindexed Nodes    11698
Alfresco Error Nodes in Index    0

Node count with FTSStatus Clean    69181
Node count with FTSStatus Dirty    0
Node count with FTSStatus New    0
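To make the comparison between the two SUMMARY reports easier to read, here is a minimal sketch that diffs the counts. The numbers are copied from the reports quoted in this post; the dictionary keys and the `diff_counts` helper are my own names, not part of any Alfresco API.

```python
# Counts copied from the Solr 4 and Solr 6 reports quoted in this post.
solr4 = {
    "Nodes in Index": 70921,
    "Transactions in Index": 80844,
    "Unindexed Nodes": 11441,
    "Error Nodes in Index": 0,
    "FTSStatus Clean": 69165,
}
solr6 = {
    "Nodes in Index": 70937,
    "Transactions in Index": 81470,
    "Unindexed Nodes": 11698,
    "Error Nodes in Index": 0,
    "FTSStatus Clean": 69181,
}


def diff_counts(before, after):
    """Return after - before for every key present in `before`."""
    return {key: after[key] - before[key] for key in before}


deltas = diff_counts(solr4, solr6)
for name, delta in deltas.items():
    print(f"{name}: {solr4[name]} -> {solr6[name]} ({delta:+d})")
```

The differences are small relative to the totals, which is why the two indexes look comparable by count even though their sizes on disk differ so much.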

When I run the ERROR query, I get nothing.


So the indexer looks done, and the volume is comparable to the solr4 setup.


What first concerned me was the significantly smaller index size after a complete reindex: 6.5 GB for Solr 4 vs about 2.5 GB for Solr 6, when I was expecting a 15% size increase with the introduction of fingerprints.
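To put numbers on that gap, here is the arithmetic as a short sketch. The sizes come from the two admin pages quoted in this post; the 15% figure is only my expectation, not a measured value.

```python
# Index sizes reported by the two admin pages, in GB.
solr4_size_gb = 6.5
solr6_size_gb = 2.48

# Assumption: fingerprints were expected to add roughly 15% to the index.
expected_growth = 0.15

expected_solr6_gb = solr4_size_gb * (1 + expected_growth)
actual_ratio = solr6_size_gb / solr4_size_gb

print(f"Expected Solr 6 size: {expected_solr6_gb:.2f} GB")
print(f"Actual Solr 6 size:   {solr6_size_gb:.2f} GB")
print(f"Solr 6 is {actual_ratio:.0%} of the Solr 4 size")
```

So instead of growing, the Solr 6 index is well under half the size of the Solr 4 one.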


There are some docs that never appear in a full-text search result set, even though the docs have the index aspect attached. I have tried to reindex one of those docs, but with no luck.
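For reference, this is roughly how a per-node check and reindex request could be built. It is a sketch under several assumptions: Solr listening on plain HTTP at `http://localhost:8983/solr`, the Alfresco admin handler accepting `action=NODEREPORT` and `action=REINDEX` with a `nodeid` parameter (the node's database id, not its UUID), and a made-up node id of 12345. Please check the action names and parameters against the Search Services documentation for your version before relying on them.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: local Solr 6 over plain HTTP; adjust host, port and scheme.
SOLR_BASE = "http://localhost:8983/solr"


def admin_url(action, node_dbid, base=SOLR_BASE):
    """Build an admin URL for one node (action names are assumptions)."""
    query = urlencode({"action": action, "nodeid": node_dbid})
    return f"{base}/admin/cores?{query}"


def call_admin(action, node_dbid):
    """Send the request and return the raw response body as text."""
    with urlopen(admin_url(action, node_dbid), timeout=30) as response:
        return response.read().decode("utf-8")


# Hypothetical node database id, for illustration only.
example_node = 12345
print(admin_url("NODEREPORT", example_node))
print(admin_url("REINDEX", example_node))
```

`call_admin` is not invoked here, so the snippet only prints the URLs. Comparing the node report before and after the reindex request should show whether the document's content status changes at all.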



At reindex time I saw a few messages like:

"FlateFilter: stop reading corrupt stream due to a DataFormatException"


"An error occured when reading table hmtx"

But no more than I saw on the solr4 setup.


Any thoughts on how best to troubleshoot the inconsistencies? 


Also, I know I can't upgrade to pdfbox 2.0.X in 5.2, but has anyone been able to replace the pdfbox-1.8.10.jar and pdfbox-1.8.10.jar with pdfbox-1.8.13.jar and pdfbox-1.8.13.jar to get past the pdfbox problems?