Text 101

Posted by andy1 Employee Mar 21, 2018

This blog is intended to describe Alfresco Content Services' support for indexing and querying text. There is a little bit about text analysis, how you should and could integrate NLP, and some stuff like entity extraction and disambiguation. Mostly, it's about the basics.

 

Text

 

Alfresco has 3 data types related to text that can be used to define textual properties and content.

  • Text metadata - d:text
  • Multi-lingual text metadata - d:mltext
  • Content - d:content

 

These properties can be single or multi-valued. Each property has a locale associated with the text in some way. For text properties, the locale is defined on the node (sys:locale). All text metadata properties on the node have this locale. Multi-lingual text is designed to store text in several locales. Each multi-lingual string value goes hand-in-hand with an explicit locale. Each content property also carries an explicit locale: this may differ from the node locale and therefore differ from the locale associated with text properties.

 

Text is used for three things during search and discovery:

  • to query;
  • to order; and
  • to facet.

 

Text may be treated in many ways.

  • As the whole text; exactly as entered. This is good for text that forms an identifier: a non-numeric key, a product code, etc. The text can be found as an exact match or based on a pattern that would match the whole text. The match is case sensitive. This is almost the same as SQL = and LIKE but without any options to control collation.

  • As the bits for full text search. Text is broken up into words or tokenised. Each word is processed in a locale-dependent way that may include: the removal of stop words, stemming, lemmatisation, adding synonyms, handling diacritics and other character folding; and many other alternatives. Text is found based on individual terms or phrases that (usually) go through the same processing.

  • Something in between that is not locale sensitive. This supports matching regardless of the locale but with more flexibility than just the text "as is". For example:
    • The text is split on whitespace with no further processing
    • The text is split on whitespace and the tokens then processed in a fixed way (e.g. SOLR's WordDelimiterFactory). This option is used for the Alfresco cross-locale search support
    • As an identifier that is not case sensitive.

 

Each choice affects query, ordering and faceting behaviour. In many cases, text may have to be indexed in several ways to get the desired user experience at discovery time. The behaviour at query time could require any or all of the options described above. Ordering and faceting generally require the full text (an identifier) that ideally maps to a DocValue in the lucene index. Tokenised and multi-valued fields generally lead to ambiguous ordering and double counting for faceting. While there are some use cases for this, they are less common. Query time ordering of single-valued properties that are indexed as an identifier is reliable and is locale sensitive. The localised ordering is done when processing the results. For single-valued properties this is straightforward. For multi-lingual properties the locale value that best matches the sort locale can be used. This follows the same fallback sequence as Java localised property files.
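
For example, the AFTS "=" prefix asks for an exact, untokenised match (which needs the property indexed as an identifier), while a plain field query goes through the locale-aware tokenised form. Here is a minimal sketch using the search public API; the property values are made up for illustration.

{
  "query": {
    "language": "afts",
    "query": "=cm:name:\"PRD-00042\" OR cm:title:(quarterly AND report)"
  },
  "paging": {
    "maxItems": 10
  }
}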

 

In general, if a field is only tokenised, ordering is indeterminate. Faceting will consider all tokens as if the field were multi-valued. This is probably not what you want but produces some interesting results for shingles.

 

 

Locale Expansion

 

The public API supports a list of locales to use for query expansion. The terms in the query will be localised using all the locales provided. In the SOLR configuration, in the solrcore.properties file, it is possible to define a fixed set of locales to add for all queries, described here in the documentation. (Changes do not require a reindex.)
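
For example, a query can pass the locales to use for expansion through the localization element of the search public API (a minimal sketch; the query term is illustrative):

{
  "query": {
    "language": "afts",
    "query": "cm:title:voiture"
  },
  "localization": {
    "locales": ["fr", "en"]
  }
}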

 

There is currently no index time locale expansion. Text is indexed in one locale.

 

Search Services generates tokens with language-specific prefixes, held in a single field.

 

 

Configuration

 

Configuration comes in two parts: the property definition in the data model and the configuration of Alfresco Search Services.

 

The property definition has two parts that affect indexing and query behaviour, as outlined in the documentation. Here is an example.

 

<type name="cm:content">
   <title>Content</title>
   <parent>cm:cmobject</parent>
   <properties>
      <property name="cm:content">
         <type>d:content</type>
         <mandatory>false</mandatory>
         <index enabled="true">
            <facetable>true</facetable>
            <atomic>false</atomic>
            <tokenised>true</tokenised>
         </index>
      </property>
   </properties>
</type>

 

Currently, at index time, any given string may be indexed in one of four ways: tokenised using a single specific locale, as a case sensitive identifier, as a case sensitive identifier stored as a DocValue and/or tokenised in a single specific way regardless of locale.

 

If the <index> element is not present it defaults to

 

<index enabled="true">
   <atomic>true</atomic>
   <stored>false</stored>
   <tokenised>true</tokenised>
</index>

 

If enabled is set to true then the property is indexed. The atomic element is historic and is no longer used. The facetable element, if present and set to true, means the value will be indexed "as is" for faceting using DocValues. If set to false it indicates the field will never be used for faceting.

 

The tokenised element supports three values:

  • true - the field will be tokenised in a locale-specific way for full text search. There is no identifier search. Ordering and faceting support will depend on the tokens generated.
  • false - the field will be indexed "as is" for case sensitive identifier search. There is no support for locale-specific full text search. This supports non-DocValue-based faceting and ordering (using the lucene field cache).
  • both - combines the two.

 

If facetable is true and tokenised is false or both, then DocValues are used in preference to the field value cache.
 

The SOLR shared.properties file can be used to override some aspects of this behaviour. Properties can be forced to be indexed and queried as an identifier. The support for cross-language fixed tokenisation is only defined in this file. Cross-language search can be enabled by property type or for a specific property. This is in addition to the index and query behaviour defined on the property in the data model.

 

Non text data types are much simpler. The tokenised element is ignored. Only the facetable element determines if DocValues are used.

 

 

SOLR 4 vs SOLR 6

 

SOLR 4 in Alfresco Content Services 5.2 has two query templates: vanilla and rerank. Alfresco Search Services 1.0/1.1 with SOLR 6 has only the rerank template. The two differ in the way text is handled by default.

 

For SOLR 4 the default behaviour was to support cross-language search for all d:text, d:mltext and d:content fields. This resulted in considerable duplication in the index and more complex queries. The shared.properties file could be added to change this for scalability reasons. In Alfresco Search Services 1.0/1.1 the default behaviour is to offer cross-language support only for cm:name. This has led to a number of support issues when cross-language search is lost for cm:title, cm:description and some cm:person attributes. These issues can be fixed by reverting to the default options for SOLR 4 and reindexing; adding selected field and type options for cross-language search and reindexing; or adding explicit locale expansion in the configuration without reindexing. Resolving and reconsidering these defaults is part of the next Alfresco Search Services release.

 

The vanilla index template and the rerank template differ in how phrase queries are handled. Using the vanilla template, phrases are always treated as phrases. With the rerank template, phrases are first executed as conjunction queries and then reordered based on the phrase matches. The rerank core configuration can include results that do not match the phrase but contain all the terms in the phrase in any order and any position. However, the documents that match the phrases will be at the top. This additional recall may cause issues for some customers but overall it scales and performs better.

 

102

 

Moving on to consider more advanced ideas: extracting more meaning from text, such as sentiment, entities, etc., can be thought of as a processing step that generates more complex tokens. These tokens still have a position in the text and some value. Taking "Alfresco Software" as an example, it could generate "Alfresco" at position 1 and "Software" at position 2. It could also be thought of as "ORG:ALFRESCO_SOFTWARE_INC" at 1-2. The metadata extractor framework or behaviours in the repository can be used to provide integration points with external services such as Amazon Comprehend. This has been discussed elsewhere.

 

Cross-locale and multi-lingual search issues are outlined well by Martin White in Searching for Information in the Tower of Babel.  They are also covered in chapter 14 of Solr In Action.

 

Creating relationships between entities and embeddings and then using these for disambiguation, building ontologies, etc is  beyond the scope of what we can consider here.

 

In the future, Alfresco Search Services may allow more indexing behaviour to be overridden as part of the SOLR configuration. It is possible, but not recommended, to tokenise any property in a custom way by adding a specific field entry to the schema.xml rather than relying on the dynamic field entries.

 

Any change to index configuration will (most likely) require a reindex. This is why some model changes may be ignored until you accept this impact.

 

Multi-valued fields lead to issues with ordering, multiple-counts for faceting and other oddities. For example, consider a multi-valued text field that contains values "one" and "two". A query that facets on this field but has a predicate that looks for the value "one" will still see the values "one" and "two" in the facet results.
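
Here is a minimal sketch of that situation using the search public API; the multi-valued property x:keywords is hypothetical. The facet will report buckets for both "one" and "two" even though the query only asked for "one".

{
  "query": {
    "language": "afts",
    "query": "=x:keywords:\"one\""
  },
  "facetFields": {
    "facets": [
      { "field": "x:keywords", "label": "keywords" }
    ]
  }
}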

 

Summary

 

That was a brief outline of the basic options for text indexing and how they affect index and query behaviour.

This post is a short technical overview of Alfresco Search and Discovery. It accompanies the related Architect/Developer Whiteboard Video. In a nutshell, the purpose of Search & Discovery is to enable users to find and analyse content quickly regardless of scale.

 

Below is a visual representation of the key software components that make up Alfresco Search and Discovery.

 

 

 

Search and Discovery can be split into five parts:

  • a shared data model - components shown in green;
  • the index and the underlying data that is created by indexing and used for query - components shown in orange;
  • querying those indexes  - components shown in red;
  • building indexes - components shown in yellow; and
  • some enterprise only components shown in blue.

 

What is an index?

 

An index is a collection of document and folder states. Each state represents a document or folder at some instance in time. This state includes where the document or folder is filed, its metadata, its content, who has access rights to find it, etc.

 

The data model that defines how information is stored in the repository also defines how it is indexed and queried. Any changes made to a model in the repository, like adding a property, are reflected in the index and, in this case, the new property is available to query in a seamless and predictable way.

 

We create three indexes for three different states:

  • one for all the live states of folders and content;
  • one for all the explicitly versioned states of folders and content; and
  • one for all the archived states of folders and content.

 

 It is possible to query any combination of these states.
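
With the search public API, the index (or combination of indexes) to query is selected through the scope element. A sketch, assuming the locations values nodes, versions and deleted-nodes:

{
  "query": {
    "language": "afts",
    "query": "cm:name:contract"
  },
  "scope": {
    "locations": ["nodes", "deleted-nodes"]
  }
}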

 

Each of these indexes can exist as a whole or be broken up into parts, with one or more copies of each part, to meet scalability and resilience requirements. These parts are often referred to as shards and replicas. There are several approaches to breaking up a large index into smaller parts. This is usually driven by some specific customer requirement or use case. The options include: random assignment of folders and documents to a shard, assignment by access control, assignment by a date property, assignment by creation, assignment by a property value, and more. Another blog covers index sharding options in detail.

 

For example, sequential assignment to a shard at document creation time gives a simple auto-scalable configuration. Use one shard until it is full and then add the next shard when required. Alternatively,  if all your queries specify a customer by id it would make sense to split up your data by customer id. All the information for each customer would then be held together in a single shard.

 

Indexes typically exist as a single shard up to around 50M files and folders. Combining all the shards gives an effective overall index. The data, memory, IO load, etc. can then be distributed in a way that can be scaled. It is common to have more than one replica for each shard to provide scalability, redundancy and resilience.

 

Search Public API

 

The search public REST API is a self-describing API that queries the information held in the indexes outlined above. Any single index or combination of indexes can be queried. It supports two query languages: a SQL-like query language defined by the CMIS standard and a Google-like query language we refer to as Alfresco Full Text Search (AFTS). Both reflect the data model defined in the repository. The API supports aggregation of query results for facet driven navigation, reporting and analysis.
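
The same endpoint accepts either language; for example, the two illustrative requests below differ only in the language element:

{
  "query": {
    "language": "afts",
    "query": "TYPE:\"cm:content\" AND cm:title:report"
  }
}

{
  "query": {
    "language": "cmis",
    "query": "SELECT * FROM cmis:document WHERE cmis:name LIKE 'report%'"
  }
}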

 

The results of any query and related aggregation always respect the access rights of the user executing the query. This is also true when using Information Governance where security markings are also enforced at query and aggregation time.

 

The search public API and examples are covered in several existing blogs. See:

 

Introducing Solr 6.3 and Alfresco Search Services 

v1 REST API - Part 9 - Queries & Search 

Structure, Tags, Categories and Query in the public API 

Basic Content Reporting using the 5.2.1 Search API 

 

For enterprise customers, we also support JDBC access using a subset of SQL via a thin JDBC driver. This allows integration with BI and reporting tools such as Apache Zeppelin.

 

The Content Model

 

All indexes on an instance of Alfresco Search Services share the same content model. This content model is a replica of the content model held in Alfresco Content Services. If you add a type, aspect, or property to the content model any data added will be indexed according to the model and be available to query.

 

The Alfresco full text search query language and CMIS query language are backed by the content model. Any property in the model can be used in either query language. The details of query implementation are hidden behind this schema. We may change the implementation and ”real” index schema in the future but this virtual schema, defined by the data model and the query syntax, will remain the same.

 

The data model in the repository defines types, aspects and properties. A property type defines much of its index and query behaviour.  Text properties in particular require some additional thought when they are defined. It is important to consider how a property will be used.

 

  • … as an identifier?
  • … for full text search?
  • … for ordering?
  • … for faceting and aggregation?
  • … for localised search?
  • … for locale independent search?

 

… or any combination of the above.

 

A model tracker maintains an up to date replica of the repository model on each search services instance.

 

Building Indexes

 

When any document is created, updated or deleted the repository keeps a log of when the last change was made in the database. Indexing can use this log to follow state changes in the order those changes were made. Each index, shard or replica records its own log information describing what it has added. This can be compared with the log in the database. The indexing process can replay the changes that have happened on the repository to create an index that represents the current state of the repository and resume this process at any point. The database is the source of truth: the index is a read view optimised for search and discovery.

 

Trackers compare various aspects of the index log with the database log and work out what information to pull from the database and add to the index state. The ACL tracker fetches read access control information. The metadata for nodes is pulled in batches from the repository in the order in which they were changed by the metadata tracker. If a node includes content, the content tracker adds that to the existing metadata information sometime after the metadata has been indexed. The cascade tracker asynchronously updates any information on descendant nodes when their ancestors change. This cascade is caused by operations such as rename and move, or when ancestors are linked to other nodes creating new structure and new paths to content. The commit tracker ensures that transactional updates made to the database are also transactionally applied to the index. No partial transactions are exposed by search and transactions are applied in the order expected. The commit tracker also coordinates how information is added to the index and when and how often it goes live. The index state always reflects a consistent state that existed at some time in the database.

 

As part of the tracking process, information about each index and shard is sent back to the Digital Business Platform. This information is used to dynamically group shards and replicas into whole indexes for query. Each node in the Digital Business Platform can determine the best overall index to use for queries.

 

All shards and replicas are built independently based on their own configuration. There is no lead shard that has to coordinate synchronous updates to replicas. Everything is asynchronous. Nothing ever waits for all the replicas of a shard to reach the same state. The available shards and replicas are intelligently assembled into a whole index.

 

Replicas of shards are allowed to be unbalanced - each shard does not have to have the same number of replicas. Each replica of a shard does not have to be in the same state. It is simple to deal with a hot shard - one that gets significantly more query load than the others - by creating more copies of that shard. For example, your content may be date sensitive with most queries covering recent information. In this case you could shard by date and have more instances of recent shards.

 

 

Query Execution

 

Queries are executed via the search endpoint of the Alfresco REST API or via JDBC. These APIs support anything from simple queries to complex multi-select faceting and reporting use cases. Via the public API each query is first analysed. Some queries can be executed against the database. If this is possible and requested that is what happens. This provides transactional query support. All queries can be executed against one or more Alfresco Search Services instances.  Here there is eventual consistency as the index state may not yet have caught up with the database. The index state however always reflects some real consistent state that existed in the database.

 

When a query reaches an Alfresco Search Services instance it may just execute a query locally to a single index or coordinate the response over the shards that make up an index. This coordination collates the results from many parts to produce an overall result set with the correct ranking, sorting, facet counts, etc.

 

JDBC based queries always go to the search index and not the database.

 

Open Source Search

 

Alfresco Search Services is based on Apache Solr 6. Alfresco is an active member of the Apache SOLR community. For example, we have Joel Bernstein on staff, who is a SOLR committer. He has led the development of the SOLR streaming API and has been involved with adding support for JDBC. Other Alfresco developers have made contributions to SOLR related to document similarity, bug fixes and patches.

 

Highly Scalable and Resilient Content Repository

 

These features combine to give a search solution that can scale up and out according to customer needs and is proven to manage one billion documents and beyond. Customers can find content quickly and easily regardless of the scale of the repository.

Introduction

 

Alfresco Search Services 1.1.0 and Alfresco Content Services 5.2.1 added more support for query aggregation. ACS 5.2 included facet fields and facet queries. ACS 5.2.1 adds:

  • Stats
  • Ranges
  • Intervals
  • Pivots
  • Pivots with stats or ranges

 

These breakdowns work using SOLR 4 in ACS 5.2.1 as well as Alfresco Search Services 1.1.0 - with some minor limitations when using SOLR 4.

 

The examples presented here can be found as part of a postman collection here. Some example reports can be found here. Remember to set your hostname and port. There are more examples and additional details in the postman collections than those presented here, including other changes and additions to the search public API.

 

Stats

 

Stats faceting generates summary statistics for a field and can be applied to numeric, date and text types. The metrics available depend on both the data type and search technology (SOLR 4 vs Alfresco Search Services). Stats supports the following metrics:

 

Metric           Default   Numeric   Date   Text
min              Yes       Yes       Yes    Yes
max              Yes       Yes       Yes    Yes
sum              Yes       Yes       Yes    No
countValues      Yes       Yes       Yes    Yes
missing          Yes       Yes       Yes    Yes
mean             Yes       Yes       Yes    No
stddev           Yes       Yes       Yes    No
sumOfSquares     Yes       Yes       Yes    No
distinctValues   No        Yes       Yes    Yes
countDistinct    No        Yes       Yes    Yes
cardinality      No        Yes       Yes    Yes
percentiles      No        Yes       No     No

 

Stats can be used to find summary statistics for everything that matches any query. For example, the statistics about content size in a folder.

 

{
  "query": {
    "query": "NPATH:'1/Company Home/Sites'"
  },
    "filterQueries": [
    {
      "query": "TYPE:content AND -TYPE:\"cm:thumbnail\" AND -TYPE:\"cm:failedThumbnail\" AND -TYPE:\"cm:rating\" AND -TYPE:\"app:filelink\" AND -TYPE:\"fm:post\""
    }
  ],
  "paging": {
      "maxItems": "1"
  },
  "stats": [{
   "field" : "content.size",
   "label" : "Content Size"
  }]
}

This will return some default statistics. Note that countValues is used to distinguish the stats metric from the normal facet metric count. If stats are nested in a pivot then there will be both count and countValues metrics.

 

Ranges

 

 

Range in its simplest form defines a start point, an end point and a gap. The data is split into buckets of size gap, starting at the start point and ending at, or after, the end point. Adding an option for a hardened end point forces a smaller final bucket to finish exactly at the end point.

 

 

For each bucket you can control which edge bounds are included in the bucket. The lower option includes the lower bound of all buckets. The upper option includes the upper bound of all buckets. The edge option includes the lower bound of the first bucket and the upper bound of the last bucket.

 

The default is lower + edge.
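
A sketch of how this might look in a request, assuming the range element exposes hardend and include options mirroring the SOLR facet.range parameters (lower, upper, edge):

{
  "query": {
    "query": "name:*"
  },
  "ranges": [{
    "field": "content.size",
    "start": "0",
    "end": "100000",
    "gap": "20000",
    "hardend": true,
    "include": ["lower", "edge"],
    "label": "Content Size"
  }]
}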

 

 

Ranges can be used on date and numeric types. When using date types, if the bounds are defined using SOLR math expressions then the computations are sensitive to timezone.

 

For example:

 

{
    "query": {
        "query": "name:*"
    },
    "ranges":[ {
        "field": "created",
        "start": "NOW/YEAR-4YEARS",
        "end": "NOW/YEAR+1YEAR",
        "gap": "+1YEAR"
    }],
    "localization":  
    {
       "timezone": "GMT+6",
       "locales" : [ "fr", "en" ]                
    }
}

 

Produces entries like:

 

 "facets": [
                {
                    "type": "range",
                    "label": "created",
                    "buckets": [
                        {
                            "label": "[2012-12-31T18:00:00Z - 2013-12-31T18:00:00Z)",
                            "filterQuery": "created:[\"2012-12-31T18:00:00Z\" TO \"2013-12-31T18:00:00Z\">",
                            "metrics": [
                                {
                                    "type": "count",
                                    "value": {
                                        "count": "0"
                                    }
                                }
                            ],
                            "bucketInfo": {
                                "startInclusive": "true",
                                "start": "2012-12-31T18:00:00Z",
                                "end": "2013-12-31T18:00:00Z",
                                "endInclusive": "false"
                            }
                        },

....

 

 

The response includes a metric for the count and also how each bucket was defined - including the behaviour of the bounds.

 

 

 

Intervals

 

Intervals at first look like they do the same thing as ranges. However, the buckets can be defined any way you like. They can overlap, leave gaps, include other buckets, etc. You can also apply intervals to text. The more general use of intervals is limited as they can not be nested in pivot facets.

 

For intervals, you define any number of sets. Each set has a start and an end with options for the bounds - startInclusive and endInclusive. Again for dates, using date math the computation respects timezone.

 

Here is one example from the collection:

 

{
  "query": {
    "query": "name:*",
    "language": "afts"
  },
  "filterQueries": [{"query": "cm:created:[* TO 2016>"}],
   "localization":  
    {
       "timezone": "GMT+6",
       "locales" : [ "fr", "en" ]                
    },
  "facetIntervals": {
    "intervals": [
      {
        "label" : "TheCreated",
        "field": "cm:created",
          "sets": [
            {
              "label": "lastYear",
              "start": "2016",
              "end": "2017",
              "endInclusive" : false
            },
            {
              "label": "currentYear",
              "start": "NOW/YEAR",
              "end": "NOW/YEAR+1YEAR"
            },
          {
            "label": "earlier",
            "start": "*",
            "end": "2016",
            "endInclusive" : false
          }
        ]
      }
    ]
  }
}

 

 

Pivots

 

Pivots are nested field facets. For example, a field facet on SITE with a nested field facet for creator. Pivots will provide counts for each grouping along with sub-totals. The public API takes advantage of this nesting by defining all the groupings individually and then how to nest them. It is then easy to change the nesting. In the future it also allows for reusing groupings and defining default groupings.

 

Pivots by default only do counting - we will get to that later.

 

A pivot today can only support one nested breakdown at each level. You can however define as many pivots as you like. Some of the facet field options are ignored when used in a pivot as they are not supported.

 

So here is the pivot example we just described:

 

{
   "query": {
      "query": "name:*"
   },
   "facetFields": {
      "facets": [
         {"field": "SITE", "label": "Site"},
         {"field": "creator", "label": "Creator"}
      ]
   },
   "pivots": [
      {
         "key": "Site",
         "pivots": [
            {
               "key": "Creator"
            }
         ]
      }
   ]
}

 

The pivot key needs to refer to a facet label (or the label of a stats or range facet).

 

Pivots with stats or ranges

 

Pivots can include reference to a single stats or range facet as the last key. These stats or range references can not include any further nested facets. So you can nest as many field facets as you like with a stats or range facet at the bottom.

 

Here is an example: content size statistics broken down by site.

 

{
   "query": {
      "query": "name:*"
   },
   "facetFields": {
      "facets": [
         {"field": "SITE", "label": "site"}
      ]
    },
    "stats": [
        {
            "field": "content.size",
              "label": "size",
            "min": true,
            "max": true,
            "stddev": true,
            "missing": true,
            "sum": true,
            "countValues": true,
            "sumOfSquares": true,
            "percentiles": ["1","12.5","25","50","75","99"],
            "distinctValues": false,
            "countDistinct": true,
            "cardinality": true,
            "cardinalityAccuracy": 0.1
      }
      ],
    "pivots" : [
        {
            "key": "site",
            "pivots": [
                {
                    "key": "size"
                }
                ]
        }
        ]
}

 

Each pivot facet level will get all the stats - so sub-totals etc are included if you nest pivots.

 

Limitations

 

Pivots with nested stats and ranges do not allow some options. Range and stats have to come at the bottom of the pivot. With ranges at the bottom you just get a count metric - so there is no way to do date vs general stats. Look out for more news on this in the next month or so.

 

It is possible to split breakdowns out table style and reorder them - it's just a bit of a pain to do and to make sure you have the right sub-totals, if required. Some stats you can reshape (counts, sum); some, like percentiles, will leave you scratching your head or require some manipulation we are not going to cover here.

 

So for the easy stuff, you can map the results to a table.

 

Site    Creator   Doc Count   Content Size
One     Bob       2           400
One     Charlie   1           700
One               3           1100
Two     Alice     9           900
Two     Bob       5           1000
Two               14          1900
Three   Alice     7           21
Three   Bob       9           4000
Three   Charlie   21          5
Three             37          4026

 

You can reshape the data by sorting on the grouping fields and build any sub-aggregations. Many tools can help you with the reshape and just need the groupings and metrics. Date ranges can be treated as groupings -  but you will just have the count metric.

 

Site    Creator   Doc Count   Content Size
One               3           1100
Three             37          4026
Two               14          1900
Three   Alice     7           21
Two     Alice     9           900
        Alice     16          921
One     Bob       2           400
Three   Bob       9           4000
Two     Bob       5           1000
        Bob       16          5400
One     Charlie   1           700
Three   Charlie   21          5
        Charlie   22          705

 

 

 

Other

 

The collection and example reports contain more samples than I can cover here. They contain some additional description, cover more data types and describe some typical options you may want to use. The examples also include other functionality added in 5.2.1, such as: time zone, locales, templates, grouping facet queries, building filter queries for multi-select facets, etc.

 

Summary

 

So there are a few examples here and many more in the postman collections. Go explore the options with your data. 


Index Sharding

Posted by andy1 Employee Sep 18, 2017

 

Introduction

 

At some point, depending on your use case, a SOLR index reaches some maximum size. This could be related to any number of metrics, for example: the total document count, metadata index rate, content index rate, index size on disk, memory requirements, query throughput, cache sizes, query response time, IO load, facet refinement, warming time, etc. When you hit one of those constraints sharding and/or replication may be the solution. There are other reasons to introduce sharding and replicas, for example, fault tolerance, resilience and index administration. There are other configuration changes that could help too. Some other time! This is about sharding.

 

 

Sharding is not free. It is more complicated to set up and administer than a single, unsharded index. It also requires some planning to determine the number of shards. Executing a query against a sharded index is only as fast as the slowest shard response. Faceting often requires a second refinement step to get the numbers exactly right. At worst, adding a facet may double the query cost. Some facets are better than others in this respect. In particular, field facets will generally cause this second step on high cardinality fields. Mostly, performance scales with index size, so splitting your index in half will halve the response time from each shard. There are overheads for coordination and particularly for facet refinement. As your shard count grows there are more bits that have to be there. However, at some point sharding is the way to go.

 

Sharding is there to split your index into parts to scale the size of your index. When you reach your limiting constraint for index size you need to shard. Replication is there to provide multiple copies of the same shard for resilience or throughput. Some issues may be addressed by both. Query execution times can be reduced by reducing the size of the index - by splitting it into two halves. You may get to the same point splitting the load over two replicas.

 

This blog post describes the sharding methods available and when they may be appropriate.

 

Getting started

 

First, you will need some ideas about the size of your repository, how it will grow, how your data is made up and how you want to query it.

 

For Alfresco Content Services 5.1 we carried out a 1 billion+ node scale test in AWS, some aspects of which I will describe next.

 

But first: this information is indicative and the numbers will vary depending on your use case, sharding and replication strategy, data, configuration, faceting, ordering and a whole host of other things that will not be the same for you. For example, your content tracking will be slower as we did not need to transform content. Ideally you would benchmark the performance of a single shard for your particular use case and environment. Our AWS environment is not yours! AWS marches on at a pace, and so do we, with indexing and permission performance improvements in Alfresco Search Service 1.0 and on. Please take a look at our Scalability Blueprint and refer to the documentation here.

 

If you did not read the disclaimer above then do so. It's important! For the 1 billion node benchmark against Alfresco Content Services 5.1 using Share, we aimed for a single shard size of around 50M nodes with 20 shards. In fact we got 40-80M nodes because of how ACLs are generated. (ACL ids are now more evenly distributed with the second version of ACL id based sharding.) We wanted to get to size as fast as possible: it did not have to be real persisted content. Most of the content was repeatably generated on the fly, but nevertheless representative of real content with respect to index and query. No transformation of content was required. For our data and configuration, 50M nodes was the point where metadata indexing started to fall behind our ingestion rate. Between 10-20M nodes content tracking started to fall behind. We could have rebalanced the metadata/content load but there was no configuration to do this easily at the time. There will be soon. Our indexes were 300-600G on ephemeral disk, SOLR 4 was given 16G and the OS had around 45G to cache parts of the index in memory. IO load initially limited phrase query performance for content so we developed the rerank core. This executes phrase queries as a conjunction query followed by a "re-rank by phrase" operation. We also configured out facets that required refinement and left those based on facet queries. The rerank core had a bunch of additional improvements for scaling which are now the default in Alfresco Search Services 1.0 and on. Again, the details are in the Scalability Blueprint.

 

Most sharding options are only available using Alfresco Search Services 1.0 and later.

 

 

Sharding Methods

 

With any of the following, if the shard where a node lives changes, it will be removed from the old shard and added to the new one. As is the nature of eventual consistency, until this completes the node may be seen in the wrong shard, in neither shard, or in both shards. Eventually it will only be seen in the right shard.

 

By ACL ID v1 (Alfresco Content Services 5.1, Alfresco Search Services 1.0 & 1.1)

 

Nodes and access control lists are grouped by their ACL id. This places nodes in the same shard as all the access control information required to determine access to them. Shards may be more uneven in size than with other methods. Both nodes and access control information are sharded. The overall index size will be smaller than with other methods as the ACL index information is not duplicated in every shard. However, the ACL count is usually much smaller than the node count. This method is great if you have lots of ACLs and documents evenly distributed over those ACLs. Heavy users of Share sites and use cases where lots of ACLs are set on content or lots of folders should benefit from this option. If your documents are not well distributed over ACLs, perhaps many millions have the same access rights, there are likely to be better options. So the metrics to consider are the number of nodes per ACL and the number of ACLs with respect to the number of shards. You want many more ACLs than shards and an even distribution of documents over ACLs.


Nodes are assigned to a shard by the modulus of the ACL id. There is better distribution of ACLs over shards using version 2 of ACL id based sharding which uses the murmur hash of the ACL id.

 

At query time all shards are used. If the query is constrained to one Share site using the default permission configuration, the query may only have data in one shard. All but one shard should be fast and no query refinement will be required, as all but one shard will give zero results. Ideally we want to hit one shard or a sub-set of shards.

 

shard.method=MOD_ACL_ID

 

By DBID (Alfresco Search Services 1.0 & 1.1)

 

Nodes are distributed over shards at random based on the murmur hash of the DBID. The access control information is duplicated in each shard. The distribution of nodes over each shard is very even and shards grow at the same rate. This is the default sharding option in 5.1. It is also the fallback method if the information required by any other sharding method is unavailable.

 

At query time all shards would be expected to find nodes. No query would target a specific shard.

shard.method=DB_ID

 

By Date (Alfresco Search Services 1.0 & 1.1)

 

Sharding by date is not quite the same as the rest. The shards are assigned sequentially to buckets with wrap around. It is not random. Some specific age range goes in each bucket. The size of this range can be controlled by configuration.

 

A typical use case is to manage information governance destruction. If you have to store things for a maximum of 10 years you could shard based on destruction date. However, not everything has a destruction date and things with no shard key are assigned randomly to shards as "by DBID". So unless everything really does have a destruction date, it is best to stick with the creation date or modification date - everything has one of those.

 

At query time all shards are used. Date constrained queries may produce results from only a sub set of shards.

 

To shard over 10 live years 12 shards is a good choice. Using 12 shards has more flexibility to group and distribute shards. You need more than 10 shards and you may have financial years that start in funny places. Of these 12 shards 10 (or 11) will be live - one ready to drop and one ready to be used. When the year moves on the next shard will come into play. One shard will contain all recent nodes. Any shard may see a delete or update. Older shards are likely to change less and have more long lived caching. Most indexing activity is seen by one shard.

 

The smallest range unit is one month with any number of months grouped together. You could have a shard for every month, pair of months,  quarter, half year, year, 2 years, etc.

 

Some use cases of date sharding may have hot shards. Perhaps you mostly query the last year and much less frequently the rest. This is where dynamic shard registration helps. There is nothing to say that each shard should have the same number of replicas so you can replicate the hot shard and assume that the other shards come back with nothing quickly.

shard.key=cm:created
shard.method=DATE
shard.date.grouping=3

 

 

By Metadata (Alfresco Search Services 1.0 & 1.1)

 

In this method the value of some property is hashed and this hash is used to assign the node to a random shard. All nodes with the same property value will be assigned to the same shard. Customer id, case id, or something similar, would be a great metadata key. Again, each shard will duplicate all the ACL information.

 

At query time all shards are used. Queries using the property used as the partition key may produce results from only a sub set of shards. If you partition by customer id and specify one customer id in the query (ideally as a filter query) only one shard will have any matches.

 

It is possible to extract a sub-key to hash using a regular expression; to use a fragment of the property value.

 

Only properties of type d:text, d:date and d:datetime can be used.

shard.key=cm:creator
shard.method=PROPERTY
shard.regex=^\d{4}

 

By ACL ID v2 (Alfresco Search Services 1.0 & 1.1)

 

This is the same as ACL ID V1 but the murmur hash of the ACL ID is used in preference to its modulus.

This gives better distribution of ACLs over shards. The distribution of documents over ACLs is not affected and can still be skewed.

 

 

shard.method=ACL_ID

 

 

By DBID range (Alfresco Search Services 1.1)

 

This is the only option to offer any kind of auto-scaling, as opposed to defining your shard count exactly at the start. The other sharding methods require repartitioning in some way. For each shard you specify the range of DBIDs to be included. As your repository grows you can add shards. So you may aim for shards of 20M nodes in size and expect to get to 100M over five years. You could create the first shard for nodes 0-20M, set the max number of shards to 10 and be happy with one shard for a year. As you approach node 20M you can create the next shard for nodes 20M-40M, and so on. Again, each shard has all the access control information.

 

There should be reasonable distribution unless you delete nodes in some age related pattern. Date based queries may produce results from only a sub set of shards as DBID increases monotonically over time.

 

shard.method=DB_ID_RANGE
shard.range=0-20000000

 

Explicit sharding (Alfresco Search Services 1.2)

 

This is similar to sharding by metadata. Rather than hashing the property value, it explicitly defines the shard where the node should go. If the property is absent or an invalid number, sharding falls back to using the murmur hash of the DBID. Only text fields are supported. If the field identifies a shard that does not exist the node will not be indexed anywhere. Nodes are allowed to move shards. You can add, remove or change the property that defines the shard.

 

shard.method=EXPLICIT_ID
shard.key=

 

Availability matrix

 

Index Engine                            ACL v1   DBID   Date/time   Metadata   ACL v2   DBID Range   Explicit
5.0
5.1 + SOLR 4
5.2.0 + SOLR 4
5.2.0 + Alfresco Search Services 1.0
5.2.1 + Alfresco Search Services 1.1
5.2.1 + Alfresco Search Services 1.2

 

Comparison Overview

 

Feature                              ACL v1   DBID   Date/time   Metadata   ACL v2   DBID Range   Explicit
All shards required
ACLs replicated on all shards
Can add shards as the index grows
Even shards                          ✔✔✔✔✔✔✔✔✔✔✔✔✔
Falls back to DBID sharding
One shard gets new content           Possible, Possible
Query may use one shard              Possible, Possible, Possible
Has Admin advantages                 Possible, Possible
Nodes can move shard

 

 

Summary

 

A bit of planning is required to determine the evolution of your repository over time and determine your sharding requirements. There are many factors that may affect your choice. Benchmarking a representative shard will tell you lots of interesting stuff and help with the choices for your data and use case. Getting it wrong is not the end of the world. You can always rebuild with a different number of shards or change your sharding strategy.

Introduction

 

Some of this week's questions have been related to query templates. Specifically: how can I use my custom properties in Share live search and standard search to find stuff? To do this you need to change some query templates.

 

What is a query template?

 

A query template is a way of taking a user query and generating a more complex query. Somewhat like the dismax parsers in SOLR. The Share template for live search looks like this:

 

%(cm:name cm:title cm:description TEXT TAG)

 

The % identifies something to replace and is followed by a field or, in this case, a group of fields to use for the replacement.  Whatever query a user enters for the template is applied to those fields. For groups of fields they are ORed together.

 

So for example, if you search for alfresco in the live search it will generate:

 

(cm:name:alfresco OR cm:title:alfresco OR cm:description:alfresco OR TEXT:alfresco OR TAG:alfresco)

 

If you search for =Alfresco in the live search it will generate:

 

(=cm:name:Alfresco OR =cm:title:Alfresco OR =cm:description:Alfresco OR =TEXT:Alfresco OR =TAG:Alfresco)

 

For multiple words they are ANDed together (by default). So for one two you get:

 

(cm:name:(one AND two) OR  cm:title:(one AND two) OR cm:description:(one AND two) OR TEXT:(one AND two) OR TAG:(one AND two))

 

If you search for the phrase "alfresco is great":

 

(cm:name:"alfresco is great" OR cm:title:"alfresco is great" OR cm:description:"alfresco is great" OR TEXT:"alfresco is great" OR TAG:"alfresco is great")

 

Here, the template is simply defining the fields used for search. You can also add different importance to each term if you specify each field rather than a replacement group. For example:

 

(%cm:name^10 OR  %cm:title^2  OR %cm:description^1.5 OR %TEXT OR %TAG)

 

Here we are ranking name matches higher than title, title over description and all over TEXT (content) and TAGs.

 

Query templates can contain any query element - so we could limit the results to certain types, as in the template below - but this would be better as a filter query (see the sketch that follows the template):

(%cm:name^10 OR  %cm:title^2  OR %cm:description^1.5 OR %TEXT OR %TAG) AND TYPE:content
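
Using the search public API (covered below), the same restriction can be supplied as a filter query alongside the template rather than inside it. A minimal sketch; the template name WOOF is just an example:

{
  "query": {
    "language": "afts",
    "query": "WOOF:alfresco"
  },
  "filterQueries": [
    { "query": "TYPE:content" }
  ],
  "templates": [
    {
      "name": "WOOF",
      "template": "(%cm:name^10 OR %cm:title^2 OR %cm:description^1.5 OR %TEXT OR %TAG)"
    }
  ]
}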

 

You could split your template into two parts - one for content and one for folders  - if you want to change the balance of relevance between them.

 

Customizing Share Templates

 

There are two templates for Share: one for live search and one for standard search. The configuration of advanced search is discussed elsewhere. In

<tomcat>\shared\classes\alfresco\extension\templates\webscripts\org\alfresco\slingshot\search

you will need two files with the following default content:

  • search.get.config.xml

    <search>
       <default-operator>AND</default-operator>
       <default-query-template>%(cm:name cm:title cm:description ia:whatEvent ia:descriptionEvent lnk:title lnk:description TEXT TAG)</default-query-template>
    </search>

  • live-search-docs.get.config.xml

    <search>
      <default-operator>AND</default-operator>
      <default-query-template>%(cm:name cm:title cm:description TEXT TAG)</default-query-template>
    </search>

You will probably want to change them so they do not use field groups and you can add boosting as described above.

It is then easy to add your own properties and boosting to one or both of the templates.

 

Search public API

 

The search public API in ACS 5.2 and later supports templates (as the Java API has for some time). Each template is mapped to a field.

{
    "query": {
        "language": "afts",
        "query": "WOOF:alfresco"
    },
    "include": ["properties"],
    "templates": [
        {
            "name": "WOOF",
            "template": "(%cm:name OR %cm:content^200 OR %cm:title OR %cm:description) AND TYPE:content"
        }
    ]
}

 

A template is assigned to a name which can be used in the query language just like any other field. If the name of a template is set as the default field, any part of the query that does not specify a field will go to the template. That is how Share maps the user query to the template and exposes a default Google-like query, while still allowing advanced users to execute any AFTS query. Most of the time it uses the default field and the template above.
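
A sketch of that wiring through the public API, assuming the defaults element accepts a defaultFieldName; bare terms in the user query are then routed through the template:

{
  "query": {
    "language": "afts",
    "query": "alfresco"
  },
  "defaults": {
    "defaultFieldName": "WOOF"
  },
  "templates": [
    {
      "name": "WOOF",
      "template": "%cm:name OR %cm:title OR %cm:description OR %TEXT"
    }
  ]
}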

 

As of 5.2.1, templates can also be used in the CMIS QL CONTAINS() expression:

{
    "query": {
      "language": "cmis",
      "query": "select * from cmis:document where CONTAINS('alfresco')"
  },
  "include": ["properties"],
   "templates": [
    {
      "name": "TEXT",
      "template": "%cmis:name OR %cmis:description^200"
    }
  ]
}

CMIS CONTAINS() defines TEXT as the default field. This normally goes to cm:content but can be redefined, as we do here.

 

Summary

 

Query templates are a great way to hide query mapping from the end user. The Share templates allow you to add your own properties to Share queries and tweak the weight given to each field - perhaps giving name matches greater prominence.

Introduction

 

This blog compares query support provided by transactional metadata query (TMDQ) and the Index Engine. The two differ in a number of respects which are described here. This blog is an evolution of material previously presented at the Alfresco Summit in 2013.

 

TMDQ delivers support for transactional queries that used to be provided by Lucene in Alfresco Content Services (ACS) prior to version 5.0. In ACS 4.1 SOLR was introduced with an eventually consistent query model. In 5.0, Lucene support was removed in favour of TMDQ. As TMDQ replaced Lucene, some restrictions on its use are similar. For example, both post-process results for permissions in a similar way and, as a result, there are restrictions on the scale of the result sets they can cope with. The Index Engine has no such restrictions. In practice, if a query can be run against the database, its scope is usually narrow enough that the number of results returned is not an issue.

 

 

Overview

 

Some queries can be executed either transactionally against the database or with eventual consistency against the Index Engine. Only queries using the AFTS or CMIS query languages can be executed against the database. The Lucene query language cannot be used against the database, while selectNodes (XPATH) on the Java API always goes against the database, walking and fetching nodes as required.

 

In general, TMDQ does not support: structural queries, full text search, special fields like SITE that are derived from structure and long strings (> 1024 characters). Text fields support exact(ish) and pattern based matching subject to the database collation. Filter queries are rewritten along with the main query to create one large query. Ordering is fine, but again subject to database collation for text.

 

TMDQ does not support faceting. It does not support any aggregation: this includes counting the total number of matches for the query. FINGERPRINT support is only on the Index Server.

 

AFTS and CMIS queries are parsed to an abstract form. This is then sent to an execution engine. Today, there are two execution engines: the database and the Index Engine. The default is to try the DB first and fall back to the Index Engine - if the query is not supported against the DB. This is configurable for a search sub-system and per query using the Java API. Requesting consistency should appear in the public API "some time soon".

 

Migrations from Alfresco Content Services versions prior to 5.0 will require two optional patches to be applied to support TMDQ: migrations to 5.0 require one patch; 5.0 to 5.1 requires a second. New installations will support TMDQ by default. The patches add supporting indexes that make the database ~25% larger.

 

 

Public API and TMDQ

 

From the public API, anything that is not a simple query, a filter query, an option that affects these, or an option that affects what is returned for each node in the results, is not supported by TMDQ. The next two sections consider what each query language supports.

 

Explicitly, TMDQ supports: 

  • query
  • paging
  • include
  • includeRequest
  • fields
  • sort
  • defaults
  • filterQueries
  • scope (single)
  • limits for permission evaluation

 

The default limits for permission evaluation will restrict the results returned from TMDQ based on both the number of results processed and time taken. These can be increased if required.
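
A sketch of raising those limits per query, assuming the limits element exposes permissionEvaluationCount and permissionEvaluationTime (the values are illustrative):

{
  "query": {
    "language": "afts",
    "query": "=cm:name:\"contract.pdf\""
  },
  "limits": {
    "permissionEvaluationCount": 5000,
    "permissionEvaluationTime": 20000
  }
}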

 

The public API does not support TMDQ for:

  • templates
  • localisation and timezone
  • facetQueries
  • facetFields
  • facetIntervals
  • pivots
  • stats
  • spellcheck
  • highlight
  • ranges facets
  • SOLR date math

 

Some of these will be ignored and still produce transactional results; others will force the query to the Index Engine and the results will be eventually consistent.

 

The public API will ignore the SQL select part of a CMIS query and decorate the results as it would do for AFTS.

 

 

CMIS QL & TMDQ

 

For the CMIS query language all expressions except for CONTAINS(), SCORE() and IN_TREE() can now be executed against the database. Most data types are supported except for the CMIS uri and html types. Strings are supported but only if 1024 characters or less in length. In Alfresco Content Services 5.0, OR, decimal and boolean types were not supported; they are from 5.1 on. Primary and secondary types are supported and require inner joins to link them together - they can be somewhat tedious to write and use.
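
For example, here is a join from a primary type to the cm:titled aspect so its properties can be used in the predicate - a sketch of the usual Alfresco CMIS join pattern, with an illustrative title value:

{
  "query": {
    "language": "cmis",
    "query": "SELECT d.*, t.* FROM cmis:document AS d JOIN cm:titled AS t ON d.cmis:objectId = t.cmis:objectId WHERE t.cm:title = 'Project Contract'"
  }
}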

 

You can skip joins to secondary types from the fetch in CMIS using the public API. You would need an explicit SELECT list and supporting joins from a CMIS client. You still need joins to secondary types for predicates and ordering. As CMIS SQL supports ordering as part of the query language you have to do it there and not via the public API sort.  

 

Post 5.2, left outer join from primary and secondary types to secondary types will also be supported. This covers queries to find documents that do not have an aspect applied - which is currently best implemented using something like

CONTAINS('-ASPECT:hidden')

today.

For multi-valued properties, the CMIS query language supports ANY semantics from SQL 92. A query against a multi-lingual property like title or description is treated as multi-valued and may match in any language. In the results you will see the best value for your locale - which may not match the query. Ordering will consider any value.
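
For instance, a CMIS ANY predicate on a multi-valued property looks like this (the type x:doc and property x:keywords are hypothetical):

{
  "query": {
    "language": "cmis",
    "query": "SELECT * FROM x:doc WHERE 'alfresco' = ANY x:keywords"
  }
}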

 

UPPER() and LOWER()

 

UPPER() and LOWER() functions were in early drafts of the CMIS 1.0 specification and were subsequently dropped. These are not part of the CMIS 1.0 or 1.1 specifications. They are currently supported in the CMIS query language for TMDQ only, as a way to address specific database collation issues and case sensitivity. Only equality is supported; LIKE is not currently supported. For example:

 

{
   "query": {
       "language": "cmis",
       "query" : "select * from cmis:document where LOWER(cmis:name) = 'project contract.pdf'"
   }
}

 

Alfresco FTS QL & TMDQ

 

It is more difficult to write AFTS queries that use TMDQ as the default behaviour is to use full text queries for text: these can not go against the database. Again, special fields like SITE and TAG that are derived from structure will not go to the database. TYPE, ASPECT and the related exact matches are OK. All property data types are fine but strings again have to be less than 1024 characters in length. Text queries have to be prefixed with = to avoid full text search. PARENT is supported. OR is supported in 5.1 and later.
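
As a sketch, the following AFTS query keeps to the database-friendly subset: an exact text match (the = prefix), a TYPE restriction and PARENT, with no full text or structural parts (the node reference is illustrative):

{
  "query": {
    "language": "afts",
    "query": "=cm:name:\"budget-2017.xlsx\" AND TYPE:\"cm:content\" AND PARENT:\"workspace://SpacesStore/8f2105b4-daaf-4874-9e8a-2152569d109b\""
  }
}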

 

Ranges are not currently supported - there is no good reason for this - it needs to generate a composite constraint which we have not done. PATH is not supported nor is ANCESTOR.

 

Subtle differences

 

  1. The database has fixed collation as defined by the database schema. SOLR can use any collation. The two engines can produce different results for lexical comparison, case sensitivity, ordering, when using mltext properties, etc

  2. The database results include hidden nodes. You can exclude them in the query. The Index Engine will never include hidden nodes and respects the index control aspect.

  3. The database post-filters the results to apply permissions. TMDQ is not intended to scale to more than tens of thousands of nodes. It will not perform well for users who can read 1 node in a million. It cannot and will not tell you how many results matched the query: to do so could require an inordinate number of permission checks. It does just enough work to return the page requested and to determine whether there is more. The Index Engine can apply permissions at query and facet time to billions of nodes.
    For the same reason, do not expect any aggregation support in TMDQ: there is currently no plan to push access restriction into the database at query time.

  4. CONTAINS() support is actually more complicated. The pure CMIS part of the query and CONTAINS() part are melded together into a single abstract query representation. If the overall query, both parts, can go against the database that is fine. You have to follow the rules for AFTS & TMDQ. By default, in CMIS the CONTAINS() expression implies full text search so queries will go to the Index Server.

  5. The database does not score. It will return results in some order that depends on the query plan, unless you ask for specific ordering. For a three-part OR query, documents that match more than one constraint are all treated as equal. In the Index Engine, the more parts of an OR that match, the higher the score: documents that match more optional parts of the query will rank higher.

  6. Queries from Share will not use TMDQ, as they will most likely include a full text part and ask for facets.

 

Summary

 

Transactional Metadata Query and the Index Engine are intended to support different use cases. They differ in the queries and options that they support, and subtly in the results with respect to collation and scoring. We default to trying transactional support first for historical reasons, and it seems to be what most people prefer if they can have it.

andy1

Explaining Eventual Consistency

Posted by andy1 Employee Jun 19, 2017

Introduction

 

Last week, eventual consistency cropped up more than usual: what it means, how to understand it and how to deal with it. This is not about the pros and cons of eventual consistency, when you may want transactional behaviour, and so on. This post describes what eventual consistency is and its foibles in the context of the Alfresco Index Engine. So here are the answers to last week's questions ....

 

Background

 

Back in the day, Alfresco 3.x supported a transactional index of metadata using Apache Lucene. Alfresco 4.0 introduced an eventually consistent index based on Apache SOLR 1.4. Alfresco 5.0 moved to SOLR 4 and also introduced transactional metadata query (TMDQ). TMDQ was added specifically to support the transactional use cases that used to be addressed by the Lucene index in previous versions. TMDQ uses the database and adds a set of required indexes as optional patches. Alfresco 5.1 supports a later version of SOLR 4 and made improvements to TMDQ. Alfresco Content Services 5.2 supports SOLR 4, SOLR 6 and TMDQ.

 

When changes are made to the repository they are picked up by SOLR via a polling mechanism. The required updates are made to the Index Engine to keep the two in sync. This takes some time. The Index Engine may well be in a state that reflects some previous version of the repository. It will eventually catch up and be consistent with the repository - assuming it is not forever changing.

 

When a query is executed, it can happen in one of two ways. By default, if the query can be executed against the database it is; if not, it goes to the Index Engine. There are some subtle differences between the results: for example, collation and how permissions are applied. Some queries are simply not supported by TMDQ - for example, facets, full text, "in tree" and structural queries. If a query is not supported by TMDQ it can only go to the Index Engine.

 

What does eventual consistency mean?

 

If the Index Engine is up to date, a query against the database or the Index Engine will see the same state. The results may still be subtly different - this will be the next topic! If the index engine is behind the repository then a query may produce results that do not, as yet, reflect all the changes that have been made to the repository.

 

Nodes may have been deleted

  • Nodes are present in the index but deleted from the repository
    • Deleted nodes are filtered from the results when they are returned from the query
      • As a result you may see a "short page" of results even though there are more results
      • (we used to leave in a "this node has been deleted" placeholder, but this annoyed people more)
    • The result count may be lower than the facet counts
    • Faceting will include the "to be deleted nodes" in the counts
      • There is no sensible post-query fix for this other than re-querying to filter things out - and by then someone could have deleted more....

 

Nodes may have been added

  • Nodes have been added to the repository but are not yet in the index at all
    • These new nodes will not be found in the results or included in faceting
  • Nodes have been added to the repository but only the metadata is present in the index
    • These nodes cannot be found by content

 

Node metadata has changed

  • The index reflects out of date metadata
    • Some out of date nodes may be in the results when they should not be 
    • Some out of date nodes may be missing from the results when they should not be
    • Some nodes may be counted in the wrong facets due to out of date metadata
    • Some nodes may be ordered using out of date metadata

 

Node Content has changed

  • The index reflects out of date content but the metadata is up to date
    • Some out of date nodes may be in the results when they should not be 
    • Some out of date nodes may be missing from the results when they should not be

 

Node Content and metadata have changed

  • The index reflects the out of date metadata and content
  • The index reflects out of date content (the metadata is updated first)
    • Some out of date nodes may be in the results when they should not be 
    • Some out of date nodes may be missing from the results when they should not be
    • Some nodes may be counted in the wrong facets due to out of date metadata

 

An update has been made to an ACL (adding an access control entry to a node)

  • The old ACL is reflected in queries
    • Some out of date nodes may be in the results when they should not be
    • Some out of date nodes may be missing from the results when they should not be
    • The ACLs that are enforced may be out of date, but they are consistent with the repository state at the time the node was added to the index. To be clear: the node and its ACL may be out of date, but the permissions applied to the content and metadata are consistent with that prior state. Nodes in the version index are assigned the ACL of the "live" node at the time the version was added to the index.

 

A node may be continually updated

  • It is possible that such a node may never appear in the index.
  • By default, when the Index Engine tracks the repository it only picks up changes that are more than one second old. This is configurable. If we are indexing node 27 in state 120, we only add information for node 27 if it is still in that state. If it has moved on to state 236, say, we skip node 27 until we are indexing state 236 - assuming it has not moved on again. This avoids pulling "later" information into the index, which may contain an updated ACE or present an overall view inconsistent with a single repository state. Any out-of-dateness means we have older information in the index - never newer information.

 

How do I deal with eventual consistency?

 

To a large extent this depends on your use case. If you do need a transactional answer, the default behaviour will give you one if it can. For some queries it is not possible to get a transactional answer. You can force this in the Java API and it will be coming soon in the public API.

 

If you are using SOLR 6, the response from the search public API will return some information to help. It will report the index state consistent with the query.

 

...

"context": {
    "consistency": {
        "lastTxId": 18
    }
},

...

 

This can be compared with the last transaction on the repository. If they are equal the query was consistent.

 

In fact, we know the repository state for each node at the time we added it to the index. In the future we may check whether the index state for a node reflects the current repository state for the same node - we could mark nodes as potentially out of date - but only for the page of results. Faceting and aggregation are much more of a pain. Marking potentially out of date nodes and providing other indicators of consistency are on the backlog for the public API.

 

If your query goes to the Index Server and it is not up to date, you could see any of the issues described above under what eventual consistency means.

 

Using the Index Engine based on SOLR 6 gives better consistency for metadata updates. Some update operations that infrequently require many nodes to be updated are now done in the background - these are mostly move and rename operations that affect structure. So a node is now renamed quickly, and any structural information that consequently changes on all of its children is updated afterwards. Alfresco Search Services 1.0.0 also includes improved commit coordination and concurrency improvements. These both reduce the time it takes for changes to be reflected in the index. Some of the delay also comes from the work that SOLR does before an index goes live. This can be reduced by tuning, but the cost is usually a query performance hit later.

 

Hybrid Query?

 

Surely we can take the results from the Index Engine for transactions 1-1076 and add 1077-2012 from TMDQ?

 

It's not quite that simple. TMDQ does not support all queries, it does not currently support faceting and aggregation, scoring does not really exist, and collation is neither as flexible nor the same. You would also have to reinvent the query coordination that already exists in SOLR in order to combine the two result sets. It turns out to be a difficult, but not forgotten, problem.

 

Summary

 

For most use cases eventual consistency is perfectly fine. For transactional use cases, TMDQ is the only solution unless the index and repository are in sync. The foibles of eventual consistency are well known and hopefully now clearer, particularly in the context of the Alfresco Index Server.

 

 

 

Introduction

 

Alfresco Content Services has supported structural queries for a long time. You can find documents by how they are filed in a folder structure, how they are categorised and how they have been tagged. You can add new types of category, add new categories of your own to existing hierarchies, decorate the base category object using aspects, and these categories are all discoverable. Under the hood, categories and tags are implemented the same way. Tags are a flat category. Categories are treated as an additional path to a document in a category hierarchy. A document is linked to a category by setting a property to the node ref of one or more categories.

 

In Alfresco Search Services we added support to roll up and facet on some things that are effectively encoded into paths - the SITE and TAG fields. Via the 5.2 public API for search, there is now support for more advanced path-based drill down, navigation and roll-up.

 

Just to be clear, TAG and SITE roll ups are available in Share; path and taxonomy faceting is not. You can roll up the node ref in the category property in Share but this is probably not what you want.

 

This post is about the public API for search.

 

So what's new?

 

There are six new bits of information in the index, exposed as fields that can be used in queries, filter queries and facets. These are:

  • TAG - used to index all the lowercase tags that have been assigned to a node.
  • SITE - used to index the site short name for a node, in any site. Remember a node can be in more than one site, although this is unusual. If a node is not in any site it is assigned a value of "_REPOSITORY_" to reflect this.
  • NPATH - the "name path" to a node. See the example below for how it is indexed.
  • PNAME - the path from the node up through its parents - again, see the example below.
  • APATH - as NPATH but using UUID - the UUID can be used as the key for internationalisation (SOLR 6 only).
  • ANAME - as PNAME but using UUID - the UUID can be used as the key for internationalisation (SOLR 6 only).

 

The search public API in Alfresco Content Services 5.2 exposes filter queries and faceting by field with prefix restrictions. These, in combination with the additional data, support new ways to drill in and roll up data.

 

Example of what is in the index

 

Let's say we have uploaded the file CMIS-v1.1-cs01.pdf into the site woof; we would have the following in the index:

"PATH":["/{http://www.alfresco.org/model/application/1.0}company_home/{http://www.alfresco.org/model/site/1.0}sites/{http://www.alfresco.org/model/content/1.0}woof/{http://www.alfresco.org/model/content/1.0}documentLibrary/{http://www.alfresco.org/model/content/1.0}CMIS-v1.1-cs01.pdf"]      

"SITE":["woof"]        

"APATH":[
          "0/264ed642-b527-488a-9139-ecde3673e4de",          

          "1/264ed642-b527-488a-9139-ecde3673e4de/e4c94340-8e40-4612,a4e4-354d10f3217e",

          "2/264ed642-b527-488a-9139-ecde3673e4de/e4c94340-8e40-4612-a4e4-354d10f3217e/b9f14a0f-cffb-4409-b8d0-d77e89eca0e2",

          "3/264ed642-b527-488a-9139-ecde3673e4de/e4c94340-8e40-4612-a4e4-354d10f3217e/b9f14a0f-cffb-4409-b8d0-d77e89eca0e2/4f3c4bcd-2ee1-462d-9462-24b7a72acc19",

          "4/264ed642-b527-488a-9139-ecde3673e4de/e4c94340-8e40-4612-a4e4-354d10f3217e/b9f14a0f-cffb-4409-b8d0-d77e89eca0e2/4f3c4bcd-2ee1-462d-9462-24b7a72acc19/340d5e93-89bf-4cc2-9ae5-9f4ffbad2675",

         "F/264ed642-b527-488a-9139-ecde3673e4de/e4c94340-8e40-4612-a4e4-354d10f3217e/b9f14a0f-cffb-4409-b8d0-d77e89eca0e2/4f3c4bcd-2ee1-462d-9462-24b7a72acc19/340d5e93-89bf-4cc2-9ae5-9f4ffbad2675"]

       

"ANAME":[
          "0/340d5e93-89bf-4cc2-9ae5-9f4ffbad2675",          

          "1/4f3c4bcd-2ee1-462d-9462-24b7a72acc19/340d5e93-89bf-4cc2-9ae5-9f4ffbad2675",      

          "2/b9f14a0f-cffb-4409-b8d0-d77e89eca0e2/4f3c4bcd-2ee1-462d-9462-24b7a72acc19/340d5e93-89bf-4cc2-9ae5-9f4ffbad2675",          

          "3/e4c94340-8e40-4612-a4e4-354d10f3217e/b9f14a0f-cffb-4409-b8d0-d77e89eca0e2/4f3c4bcd-2ee1-462d-9462-24b7a72acc19/340d5e93-89bf-4cc2-9ae5-9f4ffbad2675",          

          "4/264ed642-b527-488a-9139-ecde3673e4de/e4c94340-8e40-4612-a4e4-354d10f3217e/b9f14a0f-cffb-4409-b8d0-d77e89eca0e2/4f3c4bcd-2ee1-462d-9462-24b7a72acc19/340d5e93-89bf-4cc2-9ae5-9f4ffbad2675",

          "F/264ed642-b527-488a-9139-ecde3673e4de/e4c94340-8e40-4612-a4e4-354d10f3217e/b9f14a0f-cffb-4409-b8d0-d77e89eca0e2/4f3c4bcd-2ee1-462d-9462-24b7a72acc19/340d5e93-89bf-4cc2-9ae5-9f4ffbad2675"]

"NPATH":[
          "0/Company Home",          

          "1/Company Home/Sites",          

          "2/Company Home/Sites/woof",          

          "3/Company Home/Sites/woof/documentLibrary",          

          "4/Company Home/Sites/woof/documentLibrary/CMIS-v1.1-cs01.pdf",          

          "F/Company Home/Sites/woof/documentLibrary/CMIS-v1.1-cs01.pdf"]

"PNAME":[
          "0/documentLibrary",          

          "1/woof/documentLibrary",          

          "2/Sites/woof/documentLibrary",          

          "3/Company Home/Sites/woof/documentLibrary",          

          "F/Company Home/Sites/woof/documentLibrary"]

Queries

 

To find things by SITE you can just use

SITE:"woof"

It is great to do this in a filter query in the search API as filter queries are cached, reused and warmed.

 

Similarly for TAGs

  TAG:tag

SITE and TAG also support faceting.
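
Putting these together, a request along the following lines restricts results to a site with a cached filter query and facets on SITE and TAG (the query term and site name are only examples):

{
  "query": {
    "query": "budget"
  },
  "filterQueries": [{"query": "SITE:\"woof\""}],
  "facetFields": {
    "facets": [
      {"field": "SITE"},
      {"field": "TAG"}
    ]
  }
}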

 

The fun starts with drill down. NPATH can be used for navigation. You can get top level names by asking for a facet on NPATH starting with the prefix "0/". You can then remove the "0/" from the facets returned to get the names of the top level things. Here is the JSON body of the request:

 

{
  "query": {
    "query": "*"
  },
  "facetFields": {
    "facets": [
      {"field": "NPATH", "prefix": "0/"}
    ]
  }
}

The response contains, amongst many others, "0/categories". So let's drill into that one more layer. We need the prefix "1/categories", and we filter based on where we want to drill in, "0/categories", as below.

 

{
  "query": {
    "query": "*"
  },

  "filterQueries": [{"query": "NPATH:\"0/categories\""}],
  "facetFields": {
    "facets": [
      {"field": "NPATH", "prefix": "1/categories"}
    ]
  }
}

 

This gives us "1/categories/General" and "1/categories/Tags". So let's skip a step or two and count the things in the General/Languages category. I have not added the prefix here, to show how you can get counts for all nodes in the hierarchy.

 

 

{
  "query": {
    "query": "*"
  },
  "filterQueries": [{"query": "NPATH:\"2/categories/General/Languages\""}],
  "facetFields": {
    "facets": [
      {"field": "NPATH" }
    ]
  }
}

In a clean repository, this will show the structure of the Languages category and how many sub-categories exist.

 

With UUIDs it is pretty much the same.
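
For example, something like the following drills one layer below Company Home using the UUIDs from the example index content above:

{
  "query": {
    "query": "*"
  },
  "filterQueries": [{"query": "APATH:\"0/264ed642-b527-488a-9139-ecde3673e4de\""}],
  "facetFields": {
    "facets": [
      {"field": "APATH", "prefix": "1/264ed642-b527-488a-9139-ecde3673e4de"}
    ]
  }
}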

 

PNAME

 

PNAME helps with some odd use cases. "I am sure I put it in a folder called .....". It gives faceting based on ancestry. It can highlight common structures for storing data or departure from such a structure: things in odd locations. If you have used folders to encode state you can roll up on this state. (Yes you should probably have used metadata or workflow).

 

PNAME can also be used to count the direct members of a category, rather than everything in the category and below. So NPATH can count everything below and PNAME everything direct; the design uses the same prefix for each. To get the next category layer with total and direct counts you would use something like:

 

{
  "query": {
    "query": "*"
  },

 

  "filterQueries": [{"query": "NPATH:\"2/categories/General/Regions\""}],
  "facetFields": {
    "facets": [
      {"field": "NPATH", "prefix": "3/categories/General/Regions", "label": "clade"},
      {"field": "PNAME", "prefix": "3/categories/General/Regions", "label": "direct"}
    ]
  }
}

 

Summary

 

SITE and TAG provide easy query time access to concepts in Share. NPATH and PNAME support queries that will progressively drill into a folder or category structure and support faceting to count the documents and folders in each part of the next layer. APATH and ANAME do the same job with a UUID key to aid internationalisation and a bridge to other public APIs where UUID is ubiquitous.

 

Notes

 

Clade is a term from biological classification that defines a grouping with a common ancestor.

 

The public API cannot currently apply a field facet to the same field with different configurations, e.g. NPATH with a prefix and NPATH without a prefix. You have to choose one or the other.

 

We used something inspired by PathHierarchyTokenizer to support this functionality. In order to use doc values, we parse internally and use a multi-valued string field. We pre-analyse because doc values and tokenisation do not play well together.

 

If you want to exclude the category nodes themselves from the counts, add a filter query:

-TYPE:category

 

Don't forget the workhorse support for structure provided by querying PATH, ANCESTOR and PARENT. You can also use PARENT and ANCESTOR for faceting on NodeRefs.
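
For example, something along these lines facets content by its immediate parent NodeRef (the TYPE restriction is just to keep the example small):

{
  "query": {
    "query": "TYPE:\"cm:content\""
  },
  "facetFields": {
    "facets": [
      {"field": "PARENT"}
    ]
  }
}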

 

In SOLR 6, fields of type d:category are now indexed "as is" and you can facet on the NodeRef.

andy1

Document Fingerprints

Posted by andy1 Employee May 12, 2017

Introduction

 

 

Support to find related documents by content fingerprint, given a source document, was added in Alfresco Content Services 5.2 and is exposed as part of the Alfresco Full Text Search Query Language. Document fingerprints can be used to find similar content in general or biased towards containment. The language adds a new FINGERPRINT keyword:

 

FINGERPRINT:<DBID | NODEREF | UUID>

 

By default, this will find documents that have any overlap with the source document. The UUID option is likely to be the most common as UUID is ubiquitous in the public API. To specify a minimum amount of overlap use

 

FINGERPRINT:<DBID | NODEREF | UUID>_<%overlap>

FINGERPRINT:<DBID | NODEREF | UUID>_<%overlap>_<%probability>

 

FINGERPRINT:1234_20

This will find documents that have at least 20% overlap with document 1234.

 

FINGERPRINT:1234_20_80

This will execute a faster query that is 80% confident that anything brought back overlaps by at least 20%.
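
Through the search public API, the same sort of query would look something like the following (1234 is the DBID used above; in practice the UUID form is more likely):

{
  "query": {
    "language": "afts",
    "query": "FINGERPRINT:1234_20_80"
  }
}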

 

Additional information is added to the SOLR 4 or 6 indexes using the rerank template to support fingerprint queries. It makes the indexes ~15% bigger.

 

The basis of the approach taken here and a much wider discussion is presented in Mining of Massive Datasets, Chapter 3. Support to generate text fingerprints was contributed by Alfresco to Apache Lucene via LUCENE-6968. The issue includes some more related material.

 

Similarity and Containment

 

Document similarity covers duplicate detection, near duplicate detection, finding different renditions of the same content, etc. This is important to find and reduce redundant information. Fingerprints can provide a distance measure to other documents, often based on the Jaccard similarity, to support "more like this" and clustering. The distance can also be used as a basis for graph traversal. For two documents A and B, the Jaccard similarity is defined as:

J(A, B) = |A ∩ B| / |A ∪ B|

the ratio of the amount of common content to the total content of the two documents. This measure can be used to compare the similarity of any two documents with any other pair of documents.

 

Containment is related but is more about inclusion. For example, many email threads include parts or all of previous messages. Containment is not symmetrical like the measure of similarity above, and is defined as:

C(A, B) = |A ∩ B| / |A|

It represents how much of the content of a given document A is common to another document B. This measure can be used to compare a single document (A) to any other document. This is the main use case we are going to consider.

 

Minhashing

 

Minhashing is an example of a text processing pipeline. First, the text is split into a stream of words; these words are then combined into five-word sequences, known as shingles, to produce a stream of shingles. So far, all standard stuff available in SOLR. The 5-word shingles are then hashed, for example, in 512 different ways, keeping the lowest hash value for each hash. This results in 512 repeatably random samples of 5-word sequences from the text, each represented by the hash of the shingle. The same text will generate the same set of 512 minhashes. Similar text will generate many of the same hashes. It turns out that if 10% of all the minhashes from two documents overlap, then that is a great estimator that J(A,B) = 0.1.

 

Why 5-word sequences? A good question. Recent work on word embeddings suggests that 5 or more words are enough to describe the context and meaning of a central word. Looking at the frequency distributions of words, 2-word shingles, 3-word shingles, 4-word shingles and 5-word shingles found on the web, at 5-word shingles the distribution flattens and broadens compared with the trend seen for 1-, 2-, 3- and 4-word shingles.

 

Why 512 hashes? With a well distributed hash function this should give good hash coverage for 2,500 words and around 10% for 25,000, or something like 100 pages of text.

 

We used a 128-bit hash to encode both the hash set position (see later) and the hash value, to minimise collisions compared with a 64-bit encoding that includes the bucket/set position.

 

An example

 

Here are two summaries of the 1.0 and 1.1 CMIS specifications. They demonstrate, amongst other things, how sensitive the measure is to small changes: adding a single word affects 5 shingles.

 

 

The content overlap of the full 1.0 CMIS specification found in the 1.1 CMIS specification, C(1.0, 1.1) ~52%.

 

The MinHashFilter in Lucene

 

The MinHashFilterFactory has four options:

  • hashCount - the number of real hash functions computed for each input token (default: 1)
  • bucketCount - the number of buckets each hash is split into (default: 512)
  • hashSetSize - for each bucket, the size of the minimum set (default:1)
  • withRotation - empty buckets take values from the next with wrap-around (default: true if bucketCount > 1 )

 

Computing the hash is expensive. It is cheaper to split a single hash into sub-ranges and treat each as supplying an independent minhash. Using individual hash functions, each will have a minimum value - even if it is a duplicate. Splitting the hash into ranges means this duplication can be controlled using the withRotation option. If this option is true you will always get 512 hashes, even for very short documents. Without rotation, there may be fewer than 512 hashes, depending on the text length. You can get similar, but biased, sampling using a minimum set of 512.

 

Supporting a min set is in there for fun: it was used in early minhash work. It leads to a biased estimate but may have some interesting features. I have not seen anyone consider what the best combination of options may be. The defaults are my personal opinion from reading the literature linked on LUCENE-6968. There is no support for rehashing based on a single hash as I believe the bucket option is more flexible.  

 

Here are some snippets from our schema.xml:

....

<fieldType name="text_min_hash" class="solr.TextField" positionIncrementGap="100">

   <analyzer type="index">

      <tokenizer class="solr.ICUTokenizerFactory"/>
      <filter class="solr.ICUNormalizer2FilterFactory" name="nfkc_cf" mode="compose" />
      <filter class="solr.ShingleFilterFactory" minShingleSize="5" maxShingleSize="5" outputUnigrams="false" outputUnigramsIfNoShingles="false" tokenSeparator=" " />
      <filter class="org.apache.lucene.analysis.minhash.MinHashFilterFactory" hashCount="1" hashSetSize="1" bucketCount="512" />
   </analyzer>
   <analyzer type="query">
      <tokenizer class="solr.KeywordTokenizerFactory" />
   </analyzer>
</fieldType>

.....

<field name="MINHASH"           type="identifier"  indexed="true"  omitNorms="true"  stored="false" multiValued="true"  required="false"  docValues="false"/>

.....

 <field name="min_hash"               type="text_min_hash"    indexed="true"  omitNorms="true"  omitPositions="true" stored="false"  multiValued="true" />

 

This is slightly misleading, as we use the min_hash field only as a way of finding an analyser to use internally to fill the MINHASH field. We also have a method to get the MINHASH field values without storing them in the index; others would most likely have to store them in the index.

 

Query Time

 

At query time, the first step is to get hold of the MINHASH tokens for the source document - we have a custom search component to do that in a sharded environment, from information we persist alongside the index. Next you need to build a query. The easy way is to just OR together queries for each hash in a big boolean query. Remember to use setDisableCoord(true) to get the right measure of Jaccard similarity. Applying a cut-off simply requires you to call setMinimumNumberShouldMatch() based on the number of hashes you found.

 

Building a banded query is more complicated and is derived from locality sensitive hashing, described in Mining of Massive Datasets, Chapter 3. Compute the band size as they suggest, based on the similarity and expected true positive rate required. For each band, build a boolean query that must match all the hashes in a band selected from the source doc's fingerprint. Then OR all of these band queries together. Avoid the trap of making a small band at the end by wrapping around to the start. This gives better query speed at the expense of accuracy.

andy1

What's in a date?

Posted by andy1 Employee Feb 1, 2011
A quick tour of Alfresco's support for querying date and datetime properties.



By default, Alfresco treats date and datetime properties the same. Both are indexed and queried to a resolution of one day. The index actually stores the date string as yyyy-MM-dd, for example 2011-02-01, to support both querying and ordering down to the day. This approach has several limitations. While the extra resolution for datetimes is currently required to be included at query time, it is ignored. Date properties can be ordered at query time. Datetime properties are ordered after the query execution in Java, requiring a DB property fetch to get the missing time.



However, datetime properties can be configured to use the alternative DateTimeAnalyser, which supports full time resolution, down to milliseconds. This configuration also supports variable resolution of dates. If you just include the year in a query against a datetime type, only the year will be considered in the match. For example, the following Alfresco FTS queries will match with increasing date resolution when using the DateTimeAnalyser.



@cm:created:'2011'

@cm:created:'2011-02'

@cm:created:'2010-02-01'

@cm:created:'2010-02-01T11'

@cm:created:'2010-02-01T11:04'

@cm:created:'2010-02-01T11:04:31'

@cm:created:'2010-02-01T11:04:31.000'



Similarly, if only years are used in a datetime range, the finer-resolution fields are ignored. (Take care to make sure that the resolutions of the start and end dates match, as mixed resolutions are not currently supported.)



@cm:created:['2010' TO '2011']

@cm:created:['2011-02-01T11:03' TO '2011-02-01T11:04']



The DateTimeAnalyser is not used by default, as it would require all Alfresco users to rebuild their Lucene indexes. It can be configured (after stopping Alfresco) by either:



1) changing the setting in alfresco/model/dataTypeAnalyzers.properties to:

d_dictionary.datatype.d_datetime.analyzer=org.alfresco.repo.search.impl.lucene.analysis.DateTimeAnalyser




or, 2) copying alfresco/model/dataTypeAnalyzers.properties and related files to a new location and changing the definition of the bean that loads this file - 'dictionaryBootstrap' - currently defined in core-services-context.xml.





Once configured to use the DateTimeAnalyser, delete your existing indexes and restart Alfresco with the index.recovery.mode property set to FULL.



index.recovery.mode=FULL



The datetime tokeniser stores the dates in parts - as a crude trie of year, month, day, hour (24H), minutes, seconds and milliseconds. As this only supports querying, there is an additional field in the index to support ordering.



Varying resolution date time queries and range queries are supported in Alfresco 3.4.0E and later. The DateTimeAnalyser has been around since Alfresco 2.1.
