
Scaling Search with DB_ID_RANGE

Posted by blong Employee Nov 29, 2017

Alfresco Content Services v5.2.2 introduced a new sharding method called DB_ID_RANGE.  This is the first sharding method available to Alfresco that scales with the number of shards.  Before going into the details of what I mean by shard-scalable, it might be good to go through a little Alfresco Search history.

 

Before v4.0, Alfresco used Apache Lucene as its search engine.  Each Alfresco platform server contained embedded Lucene libraries, which the platform used to index all properties and content.  In clustered environments, each server performed the same indexing operations, effectively building its own index in parallel with the other instances.  Since Lucene was embedded into the Alfresco platform, it naturally scaled horizontally with the platform.

 

By default, indexing was performed synchronously, or in-transaction, which made searches consistent.  In other words, if you searched Alfresco immediately after adding or updating a node, that node would be in the result set.  The downside to this approach is that the index operation is performed in series, giving the user a slower experience.  There was an option to do asynchronous indexing; however, that architecture had its own set of problems we don't need to go into for this blog post.

 

Under the embedded Lucene architecture, it was possible to scale horizontally for search performance, but only with respect to the number of search requests.  A single search must still execute completely on one instance and cover the full index.  On top of that, it is not possible to scale the indexing operations.  This solution was fine for smaller repositories, but when a repository grows into the millions of nodes, it becomes untenable.

 

Along comes Alfresco v4.0, and Apache Solr is integrated with Alfresco.  Apache Solr is a web application wrapped around the Apache Lucene search engine, so it is not a major technological shift.  The key capability here is the independent scalability of the search engine.  This was implemented using a polling model: the Solr application polls Alfresco for new transactions since the last index, Alfresco provides the data, and Solr indexes those transactions.  Since it uses polling, searches become eventually consistent.  This means there is a lag between updates and search results.  It also means different Solr servers can return different results, depending on their current lag.  By default, the polling interval is 15 seconds, so users are rarely impacted.

 

Under the Solr1 architecture, scalability was similar to Lucene.  The only difference was the ability to scale independently of the platform/back-end application.  A single search still needed to execute on one Solr instance and cover the full index, and each Solr instance still had to index the full repository.

 

None of these scalability issues change with Solr4 or Solr6, apart from the addition of sharding and index replication.  Index replication is for a different blog post, so we will stick with sharding here.

 

The concept of Solr sharding was first introduced in Alfresco v5.1.  Sharding allows an index to be divided among instances: one complete index can be distributed across two or more servers, which makes indexing scalable by a factor of the number of shards.  5 shards across 5 servers will index roughly 5 times faster.  On top of that, a single search is also distributed in a similar manner, making a search up to 5 times faster.  However, a search across shards must merge search results, possibly making the search slower in the end.  Under loads where there are more searches than shards, you actually lose some search performance in most cases.

 

Solr shards are defined by a hash algorithm called a sharding method.  The only sharding method supported in Alfresco v5.1 was ACL_ID.  This means permissions (ACLs) were hashed to determine which shard holds each node's index entries.  So when a search is performed, the relevant ACLs are determined, the shards containing them are selected, and the search is performed, merged, and returned to the user.  The optimal case is when only one shard is selected: a smaller index is searched and no result-set merge is performed.  This is only beneficial if permissions are diverse.  If millions of documents share the same ACL, the sharding is unbalanced and effectively useless.

 

To support other use cases, especially those without diverse sets of permissions, several sharding methods were introduced in Alfresco Content Services v5.2.  These include DB_ID, DATE, and any custom text field.  DB_ID produces well-balanced shards in all situations.  DATE produces well-balanced shards in most situations; that would not be the case with heavy quarter-end or year-end processing.  A well-balanced shard set provides the best scalability.  There are good and bad reasons to choose ACL_ID, DB_ID, DATE, or your own custom property; those are for another blog post.
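To make the idea concrete, here is a minimal Python sketch of hash-based shard assignment.  The hashing below is purely illustrative and is not Alfresco's actual algorithm, which is internal to the Solr trackers:

```python
import hashlib

NUM_SHARDS = 5  # fixed when the shard group is created

def shard_for(key: str) -> int:
    """Map a shard key (e.g. an ACL ID) to a shard number (illustrative only)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

# ACL_ID-style: every node with the same ACL lands on the same shard,
# so a few very common ACLs produce badly unbalanced shards.
print(shard_for("acl-42"))

# DB_ID-style: database IDs distribute evenly across shards
# (shown here as a simple modulo for illustration).
print(1_234_567 % NUM_SHARDS)
```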

 

With sharding and all these sharding methods available, most scalability issues have a solution.  However, there is still another issue.  A shard group must have a predefined number of shards, which means each shard will grow indefinitely.  So an administrator must project out the maximum repository size and create an appropriate number of shards up front.  This can be difficult, especially for repositories without a good retention policy.  Also, since it is best to hold the full index in memory, scalability is best when you can limit the size of each shard to something feasible given your hardware.
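As a quick worked example of that projection (the numbers below are assumptions for illustration, not recommendations):

```python
import math

projected_max_nodes = 100_000_000   # assumed lifetime repository size
nodes_per_shard = 20_000_000        # assumed index size that fits comfortably in memory

shards_needed = math.ceil(projected_max_nodes / nodes_per_shard)
print(shards_needed)  # 5 -- and this count is fixed once the shard group exists
```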

 

| Search Engine | Pros | Cons |
| --- | --- | --- |
| Apache Lucene (Alfresco v3.x to v4.x) | Consistent. Scale with search requests. Embedded: no HTTP layer. | No scale independence from platform. No scale with single search request. No scale with index load. Indefinite index size. |
| Apache/Alfresco Solr v1 (Alfresco v4.0 to v5.1) | Scale independence from platform. Scale with search requests. | Eventually consistent. No scale with single search request. No scale with index load. Indefinite index size. |
| Back-end Database (Alfresco v4.2+) | Consistent. Used alongside Solr engines. Scale with back-end database. | DBA skills needed to maintain. Only available for simple queries. |
| Apache/Alfresco Solr v4 (Alfresco v5.0+) | Same as Solr v1. Sharding available. | Same as Solr v1. |
| Alfresco Search Services v1.x (Alfresco v5.2+) | Same as Solr v4. Embedded web container. | Same as Solr v4. |
| Shard Method: ACL_ID (Alfresco v5.1+; Solr v4 or SS v1.0+) | Scale independence from platform. Scale with search requests. Scale with single search request. Scale with index load. Reduction of index size. | No scale for number of shards. Likely search result merge across shards. Balance depends on node permission diversity. Indefinite index size. |
| Shard Method: DATE (Alfresco v5.2+; SS v1.0+) | Same as ACL_ID. Date search performance. Reduction of index size. | No scale for number of shards. Likely search result merge across shards. Index load on one shard at a time. Indefinite index size. |
| Shard Method: custom (Alfresco v5.2+; SS v1.0+) | Same as ACL_ID. Custom field search performance. Reduction of index size. | No scale for number of shards. Likely search result merge across shards. Balance depends on custom field. Indefinite index size. |
| Shard Method: DB_ID (Alfresco v5.2+; SS v1.0+) | Same as ACL_ID. Always balanced. Reduction of index size. | No scale for number of shards. Always search result merge across shards. Indefinite index size. |
| Shard Method: DB_ID_RANGE (Alfresco v5.2.2+; SS v1.1+) | Same as DB_ID. Scale for number of shards. Full control of index size. | Proactive administration required. Always search result merge across shards. |

 

You can see similar comparison information in Alfresco's documentation here: Solr 6 sharding methods | Alfresco Documentation.

 

In Alfresco Content Services v5.2.2 and Alfresco Search Services v1.1.0, the sharding method DB_ID_RANGE is now available.  It allows an administrator to define the range of nodes indexed by each shard, which means additional shards can be added at any time.  Although it has always been (theoretically) possible to add shards at any time, those shards would be assigned a new hash distribution, inevitably duplicating indexing work that had already been performed.
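Conceptually, DB_ID_RANGE swaps the hash for a simple range lookup.  Here is a minimal sketch; the ranges mirror the examples later in this post, but the lookup code is illustrative and not Alfresco's implementation:

```python
import bisect

# (start, end-exclusive) DBID ranges, one per shard, in ascending order.
# Note: Alfresco's shard.range=0-99999 corresponds to (0, 100_000) here.
SHARD_RANGES = [(0, 100_000), (100_000, 200_000), (200_000, 1_000_000)]
STARTS = [start for start, _ in SHARD_RANGES]

def shard_for_dbid(dbid: int) -> int:
    """Return the index of the shard whose range contains this DBID."""
    i = bisect.bisect_right(STARTS, dbid) - 1
    if i < 0 or dbid >= SHARD_RANGES[i][1]:
        raise ValueError(f"No shard covers DBID {dbid}; time to add one")
    return i

print(shard_for_dbid(150_000))  # -> 1 (the second shard)
```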

 

Let's start with a fresh index.  Follow the instructions provided here: Installing and configuring Solr 6 without SSL | Alfresco Documentation.  However, ignore the initialization of the alfresco/archive core.  If you did this anyway, stop the Alfresco SS server, remove the alfresco/archive directories, and start it back up.  We basically want to start it without any indexing cores.

 

To properly enable sharding, follow the instructions here: Dynamic shard registration | Alfresco Documentation.  Although that is under Solr 4 configuration, it holds for Alfresco SS too.  I also recommend you change the solrhome/templates/rerank/conf/solrcore.properties file to meet your environment.

 

To start using DB_ID_RANGE, we are going to define the shards using simple GET requests through the browser.  In this example, we are going to start with a shard size of 100,000 nodes each.  So the 1st shard will have the 1st 100,000 nodes, and the 2nd will have the next 100,000.  We will define just 2 shards to start.  When we need to go beyond 200,000 nodes, it would be logical to create a new shard group starting at 200,000.  However, that does not work yet in Alfresco v5.2.  You must define a maximum number of shards that is as large as is feasible for your environment.

 

We are going to start with 3 server instances and grow to use 5 instances.

 

Create your 1st shard group and the 1st shard on the 1st and 2nd instances of your servers:

http://<instance1>:8983/solr/admin/cores?action=newCore&storeRef=workspace://SpacesStore&coreName=alfresco-shards-0-7&numShards=4&numNodes=2&nodeInstance=1&template=rerank&shardIds=0&property.shard.method=DB_ID_RANGE&property.shard.range=0-99999
http://<instance2>:8983/solr/admin/cores?action=newCore&storeRef=workspace://SpacesStore&coreName=alfresco-shards-0-7&numShards=4&numNodes=2&nodeInstance=2&template=rerank&shardIds=0&property.shard.method=DB_ID_RANGE&property.shard.range=0-99999

Create the 2nd shard on the 1st and 3rd instances of your servers:

http://<instance1>:8983/solr/admin/cores?action=newCore&storeRef=workspace://SpacesStore&coreName=alfresco-shards-0-7&numShards=4&numNodes=2&nodeInstance=1&template=rerank&shardIds=1&property.shard.method=DB_ID_RANGE&property.shard.range=100000-199999
http://<instance3>:8983/solr/admin/cores?action=newCore&storeRef=workspace://SpacesStore&coreName=alfresco-shards-0-7&numShards=4&numNodes=2&nodeInstance=2&template=rerank&shardIds=1&property.shard.method=DB_ID_RANGE&property.shard.range=100000-199999

For about the first 200,000 nodes added to the system, this will work for your search engine.  In this configuration, twice as much load is placed on instance1 as on the other two instances, so it is not a particularly great setup, but this is just an example for learning purposes.

 

Now let's suppose we are at 150,000 nodes and we want to get ready for the future.  Let's add some more shards.

http://<instance2>:8983/solr/admin/cores?action=newCore&storeRef=workspace://SpacesStore&coreName=alfresco-shards-0-7&numShards=4&numNodes=2&nodeInstance=1&template=rerank&shardIds=2&property.shard.method=DB_ID_RANGE&property.shard.range=200000-999999
http://<instance3>:8983/solr/admin/cores?action=newCore&storeRef=workspace://SpacesStore&coreName=alfresco-shards-0-7&numShards=4&numNodes=2&nodeInstance=2&template=rerank&shardIds=2&property.shard.method=DB_ID_RANGE&property.shard.range=200000-999999

Now we are ready for 800,000 more nodes and room to add 3 more shards.  Let's suppose we are now approaching 1,000,000 nodes, so let's add another 1,000,000 node chunk.

http://<instance4>:8983/solr/admin/cores?action=newCore&storeRef=workspace://SpacesStore&coreName=alfresco-shards-0-7&numShards=4&numNodes=2&nodeInstance=1&template=rerank&shardIds=3&property.shard.method=DB_ID_RANGE&property.shard.range=1000000-1999999
http://<instance5>:8983/solr/admin/cores?action=newCore&storeRef=workspace://SpacesStore&coreName=alfresco-shards-0-7&numShards=4&numNodes=2&nodeInstance=2&template=rerank&shardIds=3&property.shard.method=DB_ID_RANGE&property.shard.range=1000000-1999999

Suppose the search is not performing as well as you would like and you have scaled vertically as much as you can.  To better distribute the search load on the new shard, you want to add another instance to the latest shard.

http://<instance1>:8983/solr/admin/cores?action=newCore&storeRef=workspace://SpacesStore&coreName=alfresco-shards-0-7&numShards=4&numNodes=2&nodeInstance=3&template=rerank&shardIds=3&property.shard.method=DB_ID_RANGE&property.shard.range=1000000-1999999

When you move beyond 2,000,000 nodes and you want to downscale the shard above, you can use the following command to remove the shard.  Notice that the coreName here is the coreName used to create the shard, followed by a dash and the shardId.

http://<instance1>:8983/solr/admin/cores?action=removeCore&storeRef=workspace://SpacesStore&coreName=alfresco-shards-0-7-3

It is recommended that you keep the commands you used to create the shards in your documentation, so you know which shards were defined for which DBID ranges.  The current administration console interface does not help you much with that.  I would expect that to improve in future versions of Alfresco CS.
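If you would rather script the shard creation than paste URLs into a browser, a minimal sketch using Python's requests library might look like the following.  The host names and the shard-ranges.json record file are assumptions for illustration; the URL parameters are exactly the ones used in the examples above.

```python
import json
import requests  # pip install requests

def new_shard(host, core_name, shard_id, node_instance, db_id_range,
              num_shards=4, num_nodes=2):
    """Create one shard instance via the Solr admin cores API, as shown above."""
    params = {
        "action": "newCore",
        "storeRef": "workspace://SpacesStore",
        "coreName": core_name,
        "numShards": num_shards,
        "numNodes": num_nodes,
        "nodeInstance": node_instance,
        "template": "rerank",
        "shardIds": shard_id,
        "property.shard.method": "DB_ID_RANGE",
        "property.shard.range": db_id_range,
    }
    resp = requests.get(f"http://{host}:8983/solr/admin/cores", params=params)
    resp.raise_for_status()
    # Keep a record of which shard covers which DBID range (hypothetical file).
    with open("shard-ranges.json", "a") as log:
        log.write(json.dumps({"host": host, "shard": shard_id,
                              "range": db_id_range}) + "\n")

# The first shard group from the example above, on two instances:
new_shard("instance1", "alfresco-shards-0-7", 0, 1, "0-99999")
new_shard("instance2", "alfresco-shards-0-7", 0, 2, "0-99999")
```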

Hi,

 

Just a short post to announce that the "live sandbox" (*) of the Alfresco Content Services REST API Explorer has been upgraded to Alfresco Content Services (ACS) 5.2.2.

 

This means you can now explore and "try out" in a shared playground (*) the new API features that have been added to Community 201711 (6.0.0-ea) and ACS 5.2.2. These include new Audit and Avatar APIs.

 

For more details on the changes, please refer to:

 

Alfresco Community Edition 201711 GA Release 

 

... or you can also diff the API specs (between 5.2.1 & 5.2.2).

 

If you're just getting started with the Alfresco public APIs and would like more background information or detailed tutorials, then please refer to:

 

Alfresco public REST APIs 

 

If you prefer to download and/or trial your own dedicated instance, see Download and Install Alfresco.

 

It should be noted that if you install Governance Services (GS) then you will also have access to the new GS API Explorer & GS Core REST APIs (part of Records Management Community 2.6.a).

 

Regards,

Alex

 

(*) Please be aware that (as always) this is a publicly accessible and shared demo server. Please do not upload private, sensitive or valuable data. The server is not backed up and may be reset or taken offline without warning.

As promised in our last release post, RM 2.6.c follows hot on the heels of 2.6.b.  RM 2.6.c is compatible with Alfresco Platform 5.2.g and Share 5.2.f.

What's in 2.6.c?

Due to the short period and a focus on enterprise features, there isn't a huge amount that's new in this release. We have made a lot of updates to improve strings in our ten supported localisations.  We have also acted on feedback we received to improve one of our English UI strings (RM-5793).

We have included a few security fixes by updating RM to use the latest release of Aikau.

Our developer documentation has had some attention too, and we've started work on a new format of technical documentation.  We've also formalised our contribution guidelines.

What's coming up next?

As noted above, RM 2.6.c has been tested with Alfresco 5.2.x, but we don't know of any compatibility issues with the latest community release of ACS. If you try it and find issues, we'd love to hear about them.  Compatibility with 6.0.x is on our backlog, but we don't expect to address it until RM 3.0.a (n.b. our next community release is currently expected to be 2.7.a).

If you have any feedback about 2.6.c, or about what you'd like to see in the next release, then please let us know. You can use the comments below, or feel free to message me or one of the team.


*Alfresco Content Services (ACS) is no longer using AWS Quick Start for deployments on AWS. We recommend Amazon EKS as the new reference deployment for ACS on AWS - read the following blog post from our Product Manager Harry Peek to learn more: https://www.alfresco.com/blogs/digital-transformation/alfresco-embraces-amazon-eks*

 

During my time at Alfresco I have worked on several projects. The most exciting ones have been any that have involved the usage of AWS and the wealth of services they have to offer.

 

AWS has a set of products called Quickstarts. These are a collection of reference architectures that allow any business with an AWS account to "quickly" start with a particular product. These offerings are contributed by the Quickstart community. For example, there are Quickstarts for Chef, Docker, Git and others.

 

Alfresco currently has a Quickstart for Alfresco Content Services. It allows the end user to deploy a CloudFormation template (the AWS technology used to describe and provision an ACS environment in your own AWS account) in under 45 minutes. Once complete, and license permitting, there will be an ACS cluster ready to use. A deployment that might typically have taken days or weeks to complete manually now takes only minutes, all because of the power of AWS.

 

Having recently worked on both the Alfresco Content Services Quickstart and another Quickstart we are developing internally (watch this space) I'd like to share my thoughts about the process in general and describe a day in the life of a Quickstarter.

 

Firstly, understanding what you want to build is key. Obviously. But what I mean by this is making sure that you ask yourselves the right questions: What are the underlying resources needed to complete your deployment? Do you need to offer a secure network? Secure access to any resources deployed? If the answers are a resounding yes, then AWS offers some free templates that you can use to help "layer" your infrastructure, so you can concentrate on coding the product and the higher-level infrastructure you may want to deploy.

We at Alfresco always try to ensure that any Quickstart we offer now and in the future is secure and "production" ready. There's no reason why the Alfresco Content Services Quickstart shouldn't be used in production. The template can be downloaded, extended and deployed by anyone.

 

Once you have identified what you want to build, next is the proposal. The AWS Quickstart team are ready to hear about what you want to build. Following their guidelines, we:

 

  1. Propose a Quickstart. The AWS team reviews and either agrees or pushes back for more information.
  2. Start development. This happens in parallel with the proposal; removing bottlenecks is key.
  3. Test internally. We need to ensure that the Quickstart we are building is reliable and repeatable: it builds in the same way each and every time we deploy it in our own AWS accounts.
  4. Submit a code review. The AWS Quickstart team will create a git repo ready for the code to be reviewed.

 

Some simplified steps there, but some of these steps can take weeks. What I have learnt and discovered with my most recent project is:

 

  • Networking. I don't mean CIDRs, I mean people. It's been a great opportunity to network internally and externally with individuals I might never have met unless I was working on a Quickstart. Talking with the AWS team and learning from them has been invaluable and has helped me grow as a developer.
  • Product development. This is the most important point I want to impress on you, the reader. When developing your own Quickstart, try to make sure you utilise as many AWS services as possible. For example: rather than installing MySQL or an equivalent on an EC2 instance, use RDS or Dynamo; instead of storing files locally, store them on S3; rather than storing your session state locally, use RDS or ElastiCache. If your product can't talk to these services or has hard-coded limitations, then it's time to take a look at your product to make it more "cloud-aware".
  • Communication. Make sure at every step you talk to the wider audiences within your organisation to keep them in the loop. Do not assume people know what you are doing and why. The quicker you can receive feedback from technical and non-technical people, the better the first iteration of your Quickstart will be.
  • Documentation. We can't forget this, can we? A Deployment Guide is required for each and every Quickstart. It needs to be thorough, tested and well written. It's just as important, if not more so, than the Quickstart itself; it's the enabler for your users. Potentially, each and every deployment of your Quickstart is a new customer. The guide needs to hit the ground running.

 

Thanks for reading!


Moving from SMB to WebDAV

Posted by resplin Employee Nov 3, 2017

This long blog post explains how we are determining the future of shared network drive support in Alfresco Content Services. Specifically, we are considering dropping support for using the SMB protocol and instead increasing our investment in WebDAV for these use cases. We want to explain our reasoning and seek your feedback.

 

Analysis of the Problem

Alfresco has long worked to make the power of enterprise applications available to users who don't want to understand all the details of an ECM or BPM system. Early in the product's life, Alfresco gave the Content Repository the ability to be accessed as a shared network drive, so that knowledge workers could receive the benefits of ECM while continuing their habit of "throwing everything in the Z: drive". We ended up implementing this capability three separate times: the CIFS dialect of SMBv1 in our JLAN module, standards-compliant WebDAV, and the Windows-specific WebDAV implemented in our AOS module. But the broad adoption of this capability has made it worthwhile.

 

Microsoft recently announced the end-of-life of SMBv1, including the CIFS dialect implemented by Alfresco. This protocol was the main exploit vector for the recent round of cryptolocker attacks. We cannot recommend our customers use this protocol, so we have been evaluating other options. You can see some of our conversations on this topic in the thread Re: SMB2 / SMB3 server support.

 

That conversation mentions a number of ways to implement SMBv3 which we investigated: upgrading our current implementation, using 3rd-party open source libraries (there aren't any mature server implementations), implementing a storage back-end for Samba, and using proprietary libraries. Recognizing that the effort involved in implementing SMBv3 would slow down our progress on the other priorities described in our Content Repository Roadmap 2017, we also looked at alternative ways to meet the same use cases.

 

WebDAV is an obvious choice to replace SMB, as our implementation is mature and it is widely used by our customers. It is also more robust than SMB on high-latency networks, such as when deploying the Content Repository in a cloud environment like AWS, which is an increasingly common use case. In many ways, WebDAV is a better fit for ECM use cases than SMB, which is intended for high-performance filesystems. Customers who attempt to use ACS as a file server are sometimes disappointed, as a content repository makes a different set of trade-offs from a file server; it has many more capabilities but lower total throughput. Specifically, a file server uses SMB to allow mid-file access and high-performance operations by exposing raw file handles to client applications, but this is not possible when the content is encrypted, is stored in an object store like S3 or Centera, or is stored in a cheap, high-latency, infrequent-access storage tier.

 

Many of the use cases where customers have expressed a preference for SMB over WebDAV require high-frequency mid-file access. These use cases are not suitable for pulling directly from a content repository because they don't allow it to perform ECM functions. Instead, customers should synchronize the desired content to the client machine and back to the repository when they are finished with the file. Our proprietary Desktop Sync Server offers this capability, and is one of many solutions provided by both Alfresco partners and the open source community that can be used for this purpose.

 

Though our analysis suggests that WebDAV is an adequate replacement for SMBv1 in most use cases, we wanted to hear from a larger set of customers.

 

Research

We sent a survey to 150 customers who have previously indicated that they use either the CIFS or WebDAV shared network drives, and 52 responded. Important findings included:

 

  1. WebDAV is already used more widely than CIFS.
  2. The ACS Windows Explorer shortcuts available through CIFS are not as widely used as expected.
  3. Concerns with SSO access to shared network drives: NTLMv1 is also insecure, and the survey showed that Kerberos is much more widely used.

 

We specifically asked customers why they don't use WebDAV in every circumstance, and a few key reasons surfaced:

 

  1. Concerns about performance: WebDAV does not perform as well as CIFS on a local network. Part of this performance gap is due to the mid-file access that CIFS can provide, but we believe there is room for optimization in our implementation, which will help address this concern.
  2. Concerns about compatibility: Some applications, such as Adobe products, struggle when accessing large files over WebDAV. One reason why CIFS performs better for these applications is the direct file handles for mid-file access that we discussed earlier. The second reason is that our CIFS implementation intelligently handles the file shuffling these applications do during write operations. We plan to port this shuffling from our CIFS implementation to our WebDAV libraries.
  3. Multiple customers raised a concern that the Windows 255 character limit impacts WebDAV folders. Our plan is to use repository shortcuts to make it easy to mount deep folders on short paths.
  4. The largest file that can be shared with WebDAV is 4GB. Desktop Sync is a better way to work with such files, as working with a 4GB file over WebDAV would require many round-trips of the full file.

 

For those who are interested, here are the detailed results from the survey. Note that the questions are usually multi-select, so results do not add to 100%. Also, there was an "Other" option where respondents could enter additional text, which accounts for the last few results in many questions. I apologize for the truncation in the answers.

 

Truncated options: Shared network drive, Custom application, An official Alfresco Connector, Publication through a web portal or public web site, Other: From Jive or Liferay Portlets.

Truncated options: Engineering Designs, Graphic Design, Other: all types of research files.

 

The Future

As a result of this analysis and research, we intend to take the following actions:

  • It is expected that Alfresco Content Services 5.2 will be the last release with a CIFS implementation. Along with retiring CIFS, we will be retiring NTLM and the ACS Windows Explorer shortcuts. Instead we will recommend the use of WebDAV and Kerberos.
  • We will compensate for the identified shortcomings in WebDAV by:
    • Implementing smart file shuffling with WebDAV to increase compatibility with commonly used applications.
    • Making it easy to deep-link into the Alfresco Content Repository over WebDAV to avoid issues with path length.
    • Focusing on improving WebDAV performance at scale.
    • Continuing to improve the performance of Desktop Sync when used with very large files.
  • We are also considering SAML support for shared network drives, though that work is not currently scheduled and won't be available in the next release.
  • If there is sufficient customer demand for an SMBv3 implementation, we will reconsider that development effort. In order to lower the cost of development, it is likely that we would leverage a proprietary 3rd party library. As such, any future SMBv3 functionality is not expected to be part of our open source offerings.

 

I look forward to discussing the implications of this change in the comments below.
