Text 101


andy1

This blog describes Alfresco Content Services' support for indexing and querying text. There is a little about text analysis, how you could and should integrate NLP, and topics such as entity extraction and disambiguation. Mostly, it's about the basics.

Text

Alfresco has 3 data types related to text that can be used to define textual properties and content.

  • Text metadata - d:text
  • Multi-lingual text metadata - d:mltext
  • Content - d:content

These properties can be single or multi-valued. Each property has a locale associated with the text in some way. For text properties, the locale is defined on the node (sys:locale). All text metadata properties on the node have this locale. Multi-lingual text is designed to store text in several locales. Each multi-lingual string value goes hand-in-hand with an explicit locale. Each content property also carries an explicit locale: this may differ from the node locale and therefore differ from the locale associated with text properties.

Text is used for three things during search and discovery:

  • to query;
  • to order; and
  • to facet.

Text may be treated in many ways.

  • As the whole text, exactly as entered. This is good for text that forms an identifier: a non-numeric key, a product code, etc. The text can be found as an exact match or based on a pattern that would match the whole text. The match is case sensitive. This is almost the same as SQL = and LIKE but without any options to control collation.

  • As the bits for full text search. Text is broken up into words, or tokenised. Each word is processed in a locale-dependent way that may include the removal of stop words, stemming, lemmatisation, adding synonyms, handling diacritics and other character folding, and many other alternatives. Text is found based on individual terms or phrases that (usually) go through the same processing.

  • Something in between that is not locale sensitive. This supports matching regardless of the locale but with more flexibility than just the text "as is". For example:
    • The text is split on whitespace with no further processing.
    • The text is split on whitespace and the tokens are then processed in a fixed way (e.g. SOLR's WordDelimiterFilterFactory). This option is used for the Alfresco cross-locale search support.
    • As an identifier that is not case sensitive.
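
The treatments above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Search Services analyses text: the function names and the tiny English stop word list are made up for the example, and real locale-specific analysis would also do stemming, synonyms and so on.

```python
import re

# Hypothetical, deliberately tiny stop word list for the example.
STOP_WORDS_EN = {"the", "a", "of"}


def as_identifier(text):
    """Whole text, exactly as entered: a single case-sensitive token."""
    return [text]


def as_case_insensitive_identifier(text):
    """Whole text as one token, folded to lower case."""
    return [text.lower()]


def split_on_whitespace(text):
    """Locale-independent: split on whitespace, no further processing."""
    return text.split()


def full_text_en(text):
    """Crude stand-in for locale-specific analysis: lower-case, split into
    words and drop stop words."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS_EN]
```

For the value "The Quick-Fox", the identifier form keeps the one original string, the whitespace split gives "The" and "Quick-Fox", and the full text form gives "quick" and "fox". A query only matches if it is processed the same way as the index.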

Each choice affects query, ordering and faceting behaviour. In many cases, text may have to be indexed in several ways to get the desired user experience at discovery time. The behaviour at query time could require any or all of the options described above. Ordering and faceting generally require the full text (an identifier), which ideally maps to a DocValue in the Lucene index. Tokenised and multi-valued fields generally lead to ambiguous ordering and double counting for faceting. While there are some use cases for this, they are less common. Query time ordering of single-valued properties that are indexed as an identifier is reliable and is locale sensitive. The localised ordering is done when processing the results. For single-valued properties this is straightforward. For multi-lingual properties the locale value that best matches the sort locale can be used. This follows the same fallback sequence as Java localised property files.
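
To make the fallback idea concrete, here is a minimal Python sketch of picking the multi-lingual value that best matches a sort locale. It assumes a simplified chain (language_country, then language, then a default stored under the empty string); the function names and the dictionary representation of a multi-lingual property are hypothetical, and the real Java lookup has more steps.

```python
def fallback_chain(locale):
    """Most specific locale first, ending with '' for the default.

    'en_GB' -> ['en_GB', 'en', '']
    """
    parts = locale.split("_") if locale else []
    chain = ["_".join(parts[:n]) for n in range(len(parts), 0, -1)]
    return chain + [""]


def best_value(ml_values, sort_locale):
    """Return the value whose locale best matches sort_locale, or None.

    ml_values maps a locale string to the text stored for that locale.
    """
    for candidate in fallback_chain(sort_locale):
        if candidate in ml_values:
            return ml_values[candidate]
    return None
```

With values stored for "en" and "fr", sorting in "en_GB" would use the "en" value; sorting in "de" finds nothing in this sketch unless a default value is present.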

In general, if a field is only tokenised, ordering is indeterminate. Faceting will consider all tokens as if the field were multi-valued. This is probably not what you want but produces some interesting results for shingles.

Locale Expansion

The public API supports a list of locales to use for query expansion. The terms in the query will be localised using all the locales provided. In the SOLR configuration, in the solrcore.properties file, it is possible to define a fixed set of locales to add for all queries, described here in the documentation. (Changes do not require a reindex.)

There is currently no index time locale expansion. Text is indexed in one locale.

Search Services generates tokens with language-specific prefixes, all held in a single field.
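
A rough Python sketch of how prefixed tokens and query time locale expansion fit together is shown here. The "{en}term" prefix format and the function names are assumptions made for the illustration, not the actual token format, and the per-locale analysis (stemming and so on) is reduced to lower-casing.

```python
def index_tokens(text, locale):
    """Index time: text is tokenised in ONE locale, each token prefixed."""
    return ["{%s}%s" % (locale, t) for t in text.lower().split()]


def expand_query_term(term, locales):
    """Query time: the term is localised once per requested locale."""
    return ["{%s}%s" % (loc, term.lower()) for loc in locales]


def matches(term, locales, indexed):
    """True if any localised form of the term is among the indexed tokens."""
    return any(t in indexed for t in expand_query_term(term, locales))
```

Text indexed as French is only found when "fr" is among the query locales, which is why adding locales to the query expansion can restore matches without a reindex.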

Configuration

Configuration comes in two parts: the property definition in the data model and the configuration of Alfresco Search Services.

The property definition has two parts that affect indexing and query behaviour that are outlined in the documentation. Here is an example.

<type name="cm:content">
   <title>Content</title>
   <parent>cm:cmobject</parent>
   <properties>
      <property name="cm:content">
         <type>d:content</type>
         <mandatory>false</mandatory>
         <index enabled="true">
            <facetable>true</facetable>
            <atomic>false</atomic>
            <tokenised>true</tokenised>
         </index>
      </property>
   </properties>
</type>

Currently, at index time, any given string may be indexed in one of four ways: tokenised using a single specific locale, as a case sensitive identifier, as a case sensitive identifier stored as a DocValue and/or tokenised in a single specific way regardless of locale.

If the <index> element is not present it defaults to

<index enabled="true">
   <atomic>true</atomic>
   <stored>false</stored>
   <tokenised>true</tokenised>
</index>

If enabled is set to true then the property is indexed. The atomic element is historic and is no longer used. The facetable element, if present and set to true, means the value will be indexed "as is" for faceting using DocValues. If set to false it indicates the field will never be used for faceting.

The tokenised element supports three values:

  • true - the field will be tokenised in a locale-specific way for full text search. There is no identifier search. Ordering and faceting support will depend on the tokens generated.
  • false - the field will be indexed "as is" for case sensitive identifier search. There is no support for locale-specific full text search. This supports non-DocValue based faceting and ordering (using the Lucene field cache).
  • both - combines the two

If facetable is true and tokenised is false or both then DocValues are used in preference to the field value cache.
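
The rules above can be summarised as a small decision function. This Python sketch is just my reading of the text for string properties, not code from the product; the function names and the labels they return are invented for the example. facetable is None when the element is absent.

```python
def index_forms(tokenised="true", facetable=None):
    """Which index forms a text property gets for the given settings."""
    forms = set()
    if tokenised in ("true", "both"):
        forms.add("locale-specific full text")
    if tokenised in ("false", "both"):
        forms.add("case-sensitive identifier")
    if facetable is True:
        forms.add("identifier as DocValue")
    return forms


def facet_source(tokenised="true", facetable=None):
    """What faceting and ordering would be based on."""
    if facetable is False:
        return "none"          # declared as never used for faceting
    if facetable is True:
        return "DocValues"     # preferred over the field cache
    if tokenised in ("false", "both"):
        return "field cache"   # identifier without DocValues
    return "tokens"            # only tokens exist: usually not what you want
```

For example, tokenised set to both with facetable set to true gives full text search, identifier search and DocValue-based faceting for the same property.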
 

The SOLR shared.properties file can be used to override some aspects of this behaviour. Properties can be forced to be indexed and queried as an identifier. The support for cross-language fixed tokenisation is only defined in this file. Cross-language search can be enabled by property type or for a specific property. This is in addition to the index and query behaviour defined on the property in the data model.

Non text data types are much simpler. The tokenised element is ignored. Only the facetable element determines if DocValues are used.

SOLR 4 vs SOLR 6

SOLR 4 in Alfresco Content Services 5.2 has two query templates: vanilla and rerank. Alfresco Search Services 1.0/1.1 with SOLR 6 has only the rerank template. The two differ in the way text is handled by default.

For SOLR 4 the default behaviour was to support cross-language search for all d:text, d:mltext and d:content fields. This resulted in considerable duplication in the index and more complex queries. The shared.properties file could be added to change this for scalability reasons. In Alfresco Search Services 1.0/1.1 the default behaviour is to offer cross-language support only for cm:name. This has led to a number of support issues when cross-language search is lost for cm:title, cm:description and some cm:person attributes. These issues can be fixed by reverting to the default options for SOLR 4 and reindexing; adding selected field and type options for cross-language search and reindexing; or adding explicit locale expansion in the configuration without reindexing. Resolving and reconsidering these defaults is part of the next Alfresco Search Services release.

The vanilla index template and the rerank template differ in how phrase queries are handled. Using the vanilla template, phrases are always treated as phrases. With the rerank template, phrases are first executed as conjunction queries and then reordered based on the phrase matches. The rerank core configuration can include results that do not match the phrase but contain all the terms in the phrase in any order and any position. However, the documents that match the phrases will be at the top. This additional recall may cause issues for some customers but overall it scales and performs better.
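
The difference can be shown with a toy Python sketch. It is a simplification under stated assumptions: documents are plain whitespace-tokenised strings, the function names are made up, and every conjunction hit is reordered, whereas a real rerank configuration would typically only rescore the top results.

```python
def contains_phrase(doc_tokens, phrase_tokens):
    """True if phrase_tokens occur contiguously and in order."""
    n = len(phrase_tokens)
    return any(doc_tokens[i:i + n] == phrase_tokens
               for i in range(len(doc_tokens) - n + 1))


def vanilla_phrase_query(docs, phrase):
    """Vanilla behaviour: only true phrase matches are returned."""
    p = phrase.lower().split()
    return [name for name, text in docs.items()
            if contains_phrase(text.lower().split(), p)]


def rerank_phrase_query(docs, phrase):
    """Rerank behaviour: AND all terms, then put phrase matches first."""
    p = phrase.lower().split()
    hits = [(name, text.lower().split()) for name, text in docs.items()]
    hits = [(name, toks) for name, toks in hits
            if all(term in toks for term in p)]
    # Stable sort: phrase matches (key False) come before the rest.
    hits.sort(key=lambda hit: not contains_phrase(hit[1], p))
    return [name for name, _ in hits]
```

Given a document containing "software from alfresco" and another containing "alfresco software inc", the vanilla query for "alfresco software" returns only the second, while the rerank query returns both with the phrase match first.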

102

Moving on, let's consider some more advanced ideas. Extracting more meaning from text, such as sentiment, entities, etc., can be thought of as a processing step that generates more complex tokens. These tokens still have a position in the text and some value. Taking "Alfresco Software" as an example, it could generate "Alfresco" at position 1 and "Software" at position 2. It could also be thought of as "ORG:ALFRESCO_SOFTWARE_INC" at 1-2. The metadata extractor framework or behaviours in the repository can be used to provide integration points with external services such as Amazon Comprehend. This has been discussed elsewhere.
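
As a sketch of what "more complex tokens" could look like, the Python below emits ordinary word tokens plus an entity token spanning several positions. The Token shape, the function name and the hard-coded lookup table are all hypothetical; the table simply stands in for whatever an external NLP service would return.

```python
from collections import namedtuple

# A token has a value and a start/end position (1-based, inclusive).
Token = namedtuple("Token", "value start end")

# Hypothetical stand-in for an entity extraction service.
KNOWN_ENTITIES = {("alfresco", "software"): "ORG:ALFRESCO_SOFTWARE_INC"}


def tokens_with_entities(text):
    """Word tokens, followed by entity tokens covering two-word spans."""
    words = text.split()
    tokens = [Token(w, i, i) for i, w in enumerate(words, start=1)]
    for i in range(len(words) - 1):
        key = (words[i].lower(), words[i + 1].lower())
        if key in KNOWN_ENTITIES:
            tokens.append(Token(KNOWN_ENTITIES[key], i + 1, i + 2))
    return tokens
```

For "Alfresco Software" this yields "Alfresco" at 1, "Software" at 2 and the organisation token at 1-2, matching the example in the text.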

Cross-locale and multi-lingual search issues are outlined well by Martin White in Searching for Information in the Tower of Babel.  They are also covered in chapter 14 of Solr In Action.

Creating relationships between entities and embeddings and then using these for disambiguation, building ontologies, etc. is beyond the scope of what we can consider here.

In the future Alfresco Search Services may allow more indexing behaviour to be overridden as part of the SOLR configuration. It is possible, but not recommended, to make specific changes to tokenise any property in a custom way by adding a specific field entry to the schema.xml rather than relying on the dynamic field entries.

Any change to index configuration will (most likely) require a reindex. This is why some model changes may be ignored until you accept this impact.

Multi-valued fields lead to issues with ordering, multiple counts for faceting and other oddities. For example, consider a multi-valued text field that contains values "one" and "two". A query that facets on this field but has a predicate that looks for the value "one" will still see the values "one" and "two" in the facet results.
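
The facet oddity is easy to reproduce in a few lines of Python. This is a toy model, assuming documents are dictionaries with a list-valued field and a made-up facet_counts function: facets count every value on each matching document, not just the value the predicate asked for.

```python
from collections import Counter


def facet_counts(docs, field, predicate_value):
    """Facet on field over the documents where field contains the value."""
    matching = [d for d in docs if predicate_value in d[field]]
    counts = Counter()
    for d in matching:
        # Every value of a matching document is counted.
        counts.update(d[field])
    return counts
```

With three documents holding ["one", "two"], ["one"] and ["two"], a query for "one" matches the first two documents, so the facet still reports "two" with a count of 1 alongside "one" with a count of 2.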

Summary

That was a brief outline of the basic options for text indexing and how they affect index and query behaviour.