Hi,
I am using Alfresco 5.1 and I have XML files to index. My XML contains tags such as
<paragraph eId="id-00000967-2e30-ecab-ad49-685fecd94436">
<content>
<p>Some text</p>
</content>
</paragraph>
I would like to discard XML attributes such as eId during indexing. At the moment, if I search for eca (a substring of the eId value), I get results.
I've seen that I could use <charFilter class="solr.HTMLStripCharFilterFactory"/> in the Solr schema.xml, but so far this does not seem to have any effect.
Does anyone know how to achieve this?
Thanks!
Answering my own question:
The issue does not actually come from the indexing but from the text extraction. It seems that the text/xml mimetype is handled by a String extractor that outputs its input unchanged. As a result, the whole XML, attributes included, is sent to the index.
The solution was to create a custom extractor that strips out the XML syntax (similar to HTML extraction) and to use a custom application/xml mimetype to trigger it.
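For anyone landing here later, the stripping step itself can be sketched in plain Java as below. This is only an illustration of the idea, not the extractor I actually deployed: the class name XmlTextStripper and the method strip are made up for this example, and the Alfresco-side wiring (registering the extractor and mapping it to the custom mimetype) is not shown. It uses the JDK's SAX parser and keeps only character data, so element names and attribute values such as eId never reach the output.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: names are illustrative, not part of Alfresco.
public class XmlTextStripper {

    /** Returns only the text content of an XML document; tags and attributes are dropped. */
    public static String strip(String xml) {
        final StringBuilder text = new StringBuilder();

        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // Called for text nodes only; attribute values never arrive here.
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                // Separate the text of adjacent elements so words do not run together.
                text.append(' ');
            }
        };

        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            // Optional hardening (an assumption about your inputs): reject DOCTYPE
            // declarations to avoid external-entity attacks. Remove this line if
            // your documents legitimately carry a DOCTYPE.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            SAXParser parser = factory.newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }

        // Trim and collapse runs of whitespace into single spaces.
        return text.toString().trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        String xml = "<paragraph eId=\"id-00000967-2e30-ecab-ad49-685fecd94436\">"
                   + "<content><p>Some text</p></content></paragraph>";
        System.out.println(strip(xml)); // prints "Some text"
    }
}
```

With the sample paragraph from the question, only "Some text" survives, so a search for eca no longer matches the eId value.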