Scaling Search with DB_ID_RANGE

Blog Post created by blong Employee on Nov 29, 2017

Alfresco Content Services v5.2.2 introduced a new sharding method called DB_ID_RANGE.  This is the first shard-scalable sharding method available to Alfresco.  Before going into the details of what I mean by shard-scalable, it might be good to go through a little Alfresco Search history.

 

Before v4.0, Alfresco used Apache Lucene as its search engine.  Each Alfresco platform server indexed all properties and contents.  In clustered environments, each server had to duplicate the same indexing operations, building its own index.  Since the engine was embedded in the Alfresco platform, it scaled horizontally with the platform.  By default, indexing was performed synchronously (in-transaction), which kept searches consistent.  Slow indexing operations would cause latency on uploads and updates.  Asynchronous indexing had its own set of problems that we don't need to go into.  Under this architecture, it was possible to scale horizontally for search performance, but only with the number of requests.  A full search still had to go over the full index.  It was not possible to scale the indexing operations.

 

In Alfresco v4.0, Apache Solr was introduced.  Apache Solr is a web application wrapped around the Apache Lucene search engine.  The key capability here is the independent scalability of the search engine.  Although this brought horizontal scalability to the search performance, it did little to help indexing performance.

 

The concept of sharding was first introduced in Alfresco v5.1.  Sharding allows for an index to be divided among instances.  This means that indexing performance is scalable with respect to the number of shards.  This does introduce the need to search across multiple servers for a single result set, which may have a negative impact on search performance.

 

The only sharding method supported in Alfresco v5.1 was ACL_ID.  This means shards were based on permissions.  If all nodes have the same permissions, then they are all indexed by the same shard.  This is very good for search performance, but only good for index performance if the repository has a diverse set of permissions.

 

To support other use cases, especially those without diverse sets of permissions, several sharding methods were introduced in Alfresco Content Services v5.2.  These include DB_ID, DATE, and any custom text field.  DB_ID and DATE allow for well-balanced shards: they almost always result in roughly the same number of nodes in each shard.  This resolves the key issue around node balance with ACL_ID.

 

With sharding and all these sharding methods available, most scalability issues have a solution.  However, there is still one minor issue.  A sharding group must have a predefined number of shards.  This means that each shard will grow indefinitely.  Since it is best to hold the full index in memory, scalability is best when you can limit the size of each shard to something feasible given your hardware.

 

Search Engine: Pros and Cons

Apache Lucene (Alfresco v3.x to v4.x)
Pros: Index consistent; Scale with search requests; Embedded: no HTTP layer
Cons: No scale independence from platform; No scale with single search request; No scale with index load; No control of index size

Apache/Alfresco Solr 1 (Alfresco v4.0 to v5.1)
Pros: Scale independence from platform; Scale with search requests
Cons: Index eventually consistent; No scale with single search request; No scale with index load; No control of index size

Back-end Database (Alfresco v4.2+)
Pros: Index consistent; Used alongside Solr engines; Scale with back-end database
Cons: DBA skills needed to maintain; Only available for simple queries

Apache/Alfresco Solr 4 (Alfresco v5.0+)
Pros: See Solr 1
Cons: See Solr 1

Alfresco Search Service v1.x (Alfresco v5.2+)
Pros: See Solr 4; Embedded web container
Cons: See Solr 4

Shard Method: ACL_ID (Alfresco v5.1+) (Solr4 or SS v1.0+)
Pros: See SS v1.x; Scale with single search request; Scale with index load; Some control of index size
Cons: No scale for number of shards; Possible search result merge across shards; Balance depends on node permission diversity

Shard Method: DATE (Alfresco v5.2+) (SS v1.0+)
Pros: See ACL_ID; Date period search one shard
Cons: No scale for number of shards; Likely search result merge across shards; Index load on one shard at a time

Shard Method: custom
Pros: See ACL_ID; Custom field search one shard
Cons: No scale for number of shards; Likely search result merge across shards

Shard Method: DB_ID
Pros: See ACL_ID; Shard load always balanced
Cons: No scale for number of shards; Always search result merge across shards

Shard Method: DB_ID_RANGE
Pros: See DB_ID; Scale for number of shards; Full control of index size
Cons: Proactive administration required; Always search result merge across shards

 

You can see similar comparison information in Alfresco's documentation here: Solr 6 sharding methods | Alfresco Documentation.

 

In Alfresco Content Services v5.2.2 and Alfresco Search Services v1.1.0, the sharding method DB_ID_RANGE is now available.  It lets an administrator define the range of node DBIDs, and therefore a set number of nodes, indexed by each shard.  Because each range is explicit, additional shards can be added at any time.  Although it has always been possible to add shards at any time, with the other methods those shards would have a new hash, which would inevitably lead to duplicate indexing work.  With DB_ID_RANGE, you can create a new group of shards that hold indexes mutually exclusive from those of another group of shards.
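
To make the difference concrete, here is a minimal sketch of why range-based assignment tolerates new shards while hash-based assignment does not.  It is an illustration only: plain modulo stands in for whatever hash the DB_ID method really uses, and the function names and the tiny DBID ranges are made up for the example.

```python
from __future__ import annotations

# Simplified stand-in: the real DB_ID method hashes the DBID. Plain modulo is
# used here only to show how assignments shift when the shard count changes.
def modulo_shard(dbid: int, num_shards: int) -> int:
    return dbid % num_shards

def range_shard(dbid: int, ranges: list[tuple[int, int]]) -> int | None:
    """Return the index of the inclusive (start, end) range holding dbid."""
    for shard_id, (start, end) in enumerate(ranges):
        if start <= dbid <= end:
            return shard_id
    return None  # no shard has been defined for this DBID yet

dbids = range(0, 12)

# Growing a modulo-based group from 2 to 3 shards reassigns most nodes.
moved = [d for d in dbids if modulo_shard(d, 2) != modulo_shard(d, 3)]

# Appending a new range leaves every existing assignment untouched.
before = [(0, 5), (6, 11)]
after = before + [(12, 17)]
unchanged = all(range_shard(d, before) == range_shard(d, after) for d in dbids)

print(len(moved), unchanged)  # 8 True
```

In this toy case, 8 of the 12 nodes would land on a different shard after growing the modulo-based group, which is the duplicate indexing work mentioned above.  With ranges, nothing that is already indexed has to move.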

 

Let's start with a fresh index.  Follow the instructions provided here: Installing and configuring Solr 6 without SSL | Alfresco Documentation.  However, ignore the initialization of the alfresco/archive core.  If you did this anyway, stop the Alfresco SS server, remove the alfresco/archive directories, and start it back up.  We basically want to start it without any indexing cores.

 

To properly enable sharding, follow the instructions here: Dynamic shard registration | Alfresco Documentation.  Although that is under Solr 4 configuration, it holds for Alfresco SS too.  I also recommend you change the solrhome/templates/rerank/conf/solrcore.properties file to meet your environment.

 

To start using DB_ID_RANGE, we are going to define the shards using simple GET requests through the browser.  In this example, we are going to start with a shard size of 100,000 nodes each.  So the 1st shard will have the 1st 100,000 nodes, the 2nd will have the next 100,000.  We will define it with just 2 shards to start.  When we need to go beyond 200,000 nodes, we will create a new shard group, starting at 200,000.  We will also change the shard size to 1,000,000 per shard.
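
Before creating any cores, it can help to write the plan down and check it.  The sketch below lists the DBID ranges used in this walkthrough and looks up which shard a given DBID would fall into.  It assumes both ends of each range are inclusive, which is how the ranges in this post are written; the helper names are hypothetical and not part of Alfresco.

```python
from bisect import bisect_right

# DBID ranges from this walkthrough, assumed inclusive on both ends:
# two 100,000-node shards, one shard up to 1,000,000, one 1,000,000-node shard.
SHARD_RANGES = [
    (0, 99_999),
    (100_000, 199_999),
    (200_000, 999_999),
    (1_000_000, 1_999_999),
]

def check_contiguous(ranges):
    """Raise ValueError if consecutive ranges overlap or leave a gap."""
    for (_, prev_end), (start, _) in zip(ranges, ranges[1:]):
        if start != prev_end + 1:
            raise ValueError(f"range starting at {start} does not follow {prev_end}")

def shard_for_dbid(dbid, ranges=SHARD_RANGES):
    """Return the index of the shard whose range contains dbid, or None."""
    starts = [start for start, _ in ranges]
    i = bisect_right(starts, dbid) - 1
    if i >= 0 and dbid <= ranges[i][1]:
        return i
    return None

check_contiguous(SHARD_RANGES)
print(shard_for_dbid(150_000))    # 1
print(shard_for_dbid(2_000_000))  # None: no shard defined for this DBID yet
```

A `None` result is the signal that a new shard needs to be defined before the repository reaches that DBID.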

 

We are going to start with 3 server instances and grow to use 5 instances.

 

Create your 1st shard group and the 1st shard on the 1st and 2nd instances of your servers:

http://<instance1>:8983/solr/admin/cores?action=newCore&storeRef=workspace://SpacesStore&coreName=alfresco-shards-0-7&numShards=2&numNodes=2&nodeInstance=1&template=rerank&shardIds=0&property.shard.method=DB_ID_RANGE&property.shard.range=0-99999
http://<instance2>:8983/solr/admin/cores?action=newCore&storeRef=workspace://SpacesStore&coreName=alfresco-shards-0-7&numShards=2&numNodes=2&nodeInstance=2&template=rerank&shardIds=0&property.shard.method=DB_ID_RANGE&property.shard.range=0-99999

Create the 2nd shard on the 1st and 3rd instances of your servers:

http://<instance1>:8983/solr/admin/cores?action=newCore&storeRef=workspace://SpacesStore&coreName=alfresco-shards-0-7&numShards=2&numNodes=2&nodeInstance=1&template=rerank&shardIds=1&property.shard.method=DB_ID_RANGE&property.shard.range=100000-199999
http://<instance3>:8983/solr/admin/cores?action=newCore&storeRef=workspace://SpacesStore&coreName=alfresco-shards-0-7&numShards=2&numNodes=2&nodeInstance=2&template=rerank&shardIds=1&property.shard.method=DB_ID_RANGE&property.shard.range=100000-199999

For about the first 200,000 nodes added to the system, this will work for your search engine.  In this configuration, instance1 hosts both shards, so it carries twice as much load as each of the other two instances.  It is not a particularly great setup, but this is just an example for learning purposes.
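
If you would rather generate these requests than type them, a small script keeps the parameters consistent.  This is only a sketch: the function name and host names are placeholders, and it simply reproduces the query parameters of the four URLs above without sending anything.

```python
from urllib.parse import urlencode

def new_core_url(host, shard_id, node_instance, db_range,
                 num_shards=2, num_nodes=2,
                 core_name="alfresco-shards-0-7", port=8983):
    """Build a newCore request URL with the same parameters as the examples."""
    params = [
        ("action", "newCore"),
        ("storeRef", "workspace://SpacesStore"),
        ("coreName", core_name),
        ("numShards", num_shards),
        ("numNodes", num_nodes),
        ("nodeInstance", node_instance),
        ("template", "rerank"),
        ("shardIds", shard_id),
        ("property.shard.method", "DB_ID_RANGE"),
        ("property.shard.range", db_range),
    ]
    # safe=":/" keeps the storeRef readable instead of percent-encoding it.
    return f"http://{host}:{port}/solr/admin/cores?" + urlencode(params, safe=":/")

# (host, shard id, node instance, DBID range) for the four requests above.
# The host names are placeholders for your own servers.
layout = [
    ("instance1", 0, 1, "0-99999"),
    ("instance2", 0, 2, "0-99999"),
    ("instance1", 1, 1, "100000-199999"),
    ("instance3", 1, 2, "100000-199999"),
]
for host, shard_id, node_instance, db_range in layout:
    print(new_core_url(host, shard_id, node_instance, db_range))
```

Keeping the layout as data also gives you a record of which instance holds which range.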

 

Now let's suppose we are at 150,000 nodes and we want to get ready for the future.  Let's add some more shards.

http://<instance2>:8983/solr/admin/cores?action=newCore&storeRef=workspace://SpacesStore&coreName=alfresco-shards-0-7&numShards=4&numNodes=2&nodeInstance=1&template=rerank&shardIds=2&property.shard.method=DB_ID_RANGE&property.shard.range=200000-999999
http://<instance3>:8983/solr/admin/cores?action=newCore&storeRef=workspace://SpacesStore&coreName=alfresco-shards-0-7&numShards=4&numNodes=2&nodeInstance=2&template=rerank&shardIds=2&property.shard.method=DB_ID_RANGE&property.shard.range=200000-999999

Now we are ready for 800,000 more nodes and room to add 3 more shards.  Let's suppose we are now approaching 1,000,000 nodes, so let's add another 1,000,000 node chunk.

http://<instance4>:8983/solr/admin/cores?action=newCore&storeRef=workspace://SpacesStore&coreName=alfresco-shards-0-7&numShards=4&numNodes=2&nodeInstance=1&template=rerank&shardIds=3&property.shard.method=DB_ID_RANGE&property.shard.range=1000000-1999999
http://<instance5>:8983/solr/admin/cores?action=newCore&storeRef=workspace://SpacesStore&coreName=alfresco-shards-0-7&numShards=4&numNodes=2&nodeInstance=2&template=rerank&shardIds=3&property.shard.method=DB_ID_RANGE&property.shard.range=1000000-1999999

Suppose search is not performing as well as you would like and you have already scaled vertically as much as you can.  To better distribute the search load on the new shard, you want to add another instance to the latest shard.

http://<instance1>:8983/solr/admin/cores?action=newCore&storeRef=workspace://SpacesStore&coreName=alfresco-shards-0-7&numShards=4&numNodes=2&nodeInstance=3&template=rerank&shardIds=3&property.shard.method=DB_ID_RANGE&property.shard.range=1000000-1999999
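
At this point it is worth tallying what has been created.  The sketch below takes the (instance, shard) pairs from the requests in this walkthrough and summarizes them in both directions; the variable names are only illustrative.

```python
from collections import defaultdict

# (instance, shard id) pairs taken from the newCore requests in this walkthrough.
cores = [
    ("instance1", 0), ("instance2", 0),
    ("instance1", 1), ("instance3", 1),
    ("instance2", 2), ("instance3", 2),
    ("instance4", 3), ("instance5", 3),
    ("instance1", 3),  # the extra instance added to the latest shard
]

shards_by_instance = defaultdict(list)
instances_by_shard = defaultdict(list)
for instance, shard_id in cores:
    shards_by_instance[instance].append(shard_id)
    instances_by_shard[shard_id].append(instance)

for instance in sorted(shards_by_instance):
    print(instance, "hosts shards", shards_by_instance[instance])
for shard_id in sorted(instances_by_shard):
    print("shard", shard_id, "has", len(instances_by_shard[shard_id]), "instance(s)")
```

The summary shows instance1 hosting three shards while instance4 and instance5 host one each, which is the kind of imbalance to watch for as the layout grows.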

When you move beyond 2,000,000 nodes and you want to downscale the shard above, you can try the following command to remove the core from that instance.  Notice that the coreName here is the coreName used to create the shard, followed by a dash and the shardId.

http://<instance1>:8983/solr/admin/cores?action=removeCore&storeRef=workspace://SpacesStore&coreName=alfresco-shards-0-7-3
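
The naming rule is easy to get wrong by hand, so here is a minimal sketch that derives the removal URL from the values used at creation time.  The function name and host name are placeholders, and the script only builds the URL shown above; it does not send it.

```python
from urllib.parse import urlencode

def remove_core_url(host, core_name, shard_id, port=8983):
    """Build a removeCore URL; the target core is '<core_name>-<shard_id>'."""
    params = [
        ("action", "removeCore"),
        ("storeRef", "workspace://SpacesStore"),
        ("coreName", f"{core_name}-{shard_id}"),
    ]
    # safe=":/" keeps the storeRef readable instead of percent-encoding it.
    return f"http://{host}:{port}/solr/admin/cores?" + urlencode(params, safe=":/")

print(remove_core_url("instance1", "alfresco-shards-0-7", 3))
```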

I recommend that you keep the commands you used to create the shards.  Record them in your documentation so you know which shards were defined for which DBID ranges.  The current administration console does not help much with that.  I would expect to see that improve in future versions of Alfresco CS.
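
If you keep the creation URLs in a text file, a few lines of code can turn them back into a shard-to-range listing.  This is a sketch under the assumption that each saved URL defines a single shard, as in this post; the function name and the host name in the sample URL are hypothetical.

```python
from urllib.parse import urlsplit, parse_qs

def describe_core(url):
    """Extract host, shard id, instance number, and DBID range from a saved newCore URL."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    return {
        "host": parts.hostname,
        # Assumes one shard id per URL, as in the examples in this post.
        "shard": int(query["shardIds"][0]),
        "nodeInstance": int(query["nodeInstance"][0]),
        "range": query["property.shard.range"][0],
    }

# One saved command; the host name is a placeholder.
saved = [
    "http://instance4:8983/solr/admin/cores?action=newCore"
    "&storeRef=workspace://SpacesStore&coreName=alfresco-shards-0-7"
    "&numShards=4&numNodes=2&nodeInstance=1&template=rerank&shardIds=3"
    "&property.shard.method=DB_ID_RANGE&property.shard.range=1000000-1999999",
]
for url in saved:
    print(describe_core(url))
```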
