1. Overview
Alfresco Search Enterprise 3.2 consists of Alfresco Content Services, Elasticsearch Server and the Elasticsearch connectors. According to the official documentation, there are a number of further prerequisites, such as ActiveMQ, a PostgreSQL database and the Transform Service. Note that the Transform Service does not need to be running to extract general metadata.
In this post I will cover how we can scale ES during re-indexing and live indexing, and when to use the different ES connector jars.
2. Alfresco Search Enterprise (ASE)
Alfresco Content Services supports the Elasticsearch platform for searching within the repository using Alfresco Search Enterprise 3.2. The Alfresco Search Enterprise module consists of 6 jar files.
[Figure: ASE Jar List]
2.1. Re-Indexing
alfresco-elasticsearch-reindexing-3.2.0-app.jar: This is an all-in-one jar file that indexes content, metadata and path for the existing content store.
However, this particular jar comes with 3 parameters that we can configure according to business requirements.
# Reindexing services execution
alfresco.reindex.metadataIndexingEnabled = true
alfresco.reindex.contentIndexingEnabled = true
alfresco.reindex.pathIndexingEnabled = true
Therefore, to reindex metadata only, pass these parameters when starting the jar, with metadataIndexingEnabled set to true and the other two set to false.
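As a minimal sketch of a metadata-only run, the function below only prints the command instead of executing it. The jar name and the three `alfresco.reindex.*` flags come from the list above; the function name `reindex_metadata_only_cmd` is hypothetical, and the database and Elasticsearch connection settings a real run needs are deliberately left out.

```shell
# Sketch only: prints (does not run) a metadata-only reindex command.
# Connection settings for the database and Elasticsearch are omitted
# and must be added before using the printed command for real.
reindex_metadata_only_cmd() {
  jar="alfresco-elasticsearch-reindexing-3.2.0-app.jar"
  # echo joins its arguments with single spaces into one command line.
  echo "java -jar $jar" \
    "--alfresco.reindex.metadataIndexingEnabled=true" \
    "--alfresco.reindex.contentIndexingEnabled=false" \
    "--alfresco.reindex.pathIndexingEnabled=false"
}

reindex_metadata_only_cmd
```

Printing the command first makes it easy to review the flags before running it against a real repository.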
Sample search queries to try out:
For Metadata Search: cm:name:'test', cm:author:admin, cm:title:'test'
For Path Search: PATH:"/app:company_home/st:sites/cm:test/cm:documentLibrary/*"
For Content Search: cm:content:'test'
2.2. Live-Indexing
There are 5 live-indexing jars available in the ES connector distribution zip.
alfresco-elasticsearch-live-indexing-3.2.0-app.jar: This is an all-in-one jar file that indexes content, metadata and path for real-time data. It bundles all 4 of the live-indexing jar files specific to mediation, metadata, content, and path. Unlike the all-in-one reindexing jar, it gives us no control over what is indexed.
When to use other live indexing jars?
When the business does not require full-text (content) indexing, and when deploying at scale.
alfresco-elasticsearch-live-indexing-mediation-3.2.0-app.jar: The mediation application. Like the other jars, it is started with java -jar.
alfresco-elasticsearch-live-indexing-metadata-3.2.0-app.jar: Indexes metadata only.
alfresco-elasticsearch-live-indexing-path-3.2.0-app.jar: Indexes path only.
alfresco-elasticsearch-live-indexing-content-3.2.0-app.jar: Indexes content only.
3. Deploying at Scale
3.1. Live-Indexing
Deploying at scale is essential when designing highly available systems. The diagram below shows a recommended way of laying out a highly available live-indexing architecture.
[Figure: Live-Indexing: Deploying at Scale]
The mediation component is a single point of failure because it cannot be scaled. We therefore need to monitor the mediation component and, if it fails, run the reindexing app for the period during which it was down.
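The catch-up run after a mediation outage could look roughly like the sketch below. It assumes the reindexing app offers a `reindexByDate` job with `alfresco.reindex.fromTime` and `alfresco.reindex.toTime` in `yyyyMMddHHmm` format, which is how I understand the Alfresco docs; please verify this against your connector version. The function name `reindex_outage_cmd` and the example timestamps are hypothetical, and the command is only printed, not executed.

```shell
# Sketch only: prints a reindex command covering an outage window.
# ASSUMPTION: the reindexByDate job and the fromTime/toTime properties
# (format yyyyMMddHHmm) exist in your connector version.
reindex_outage_cmd() {
  from=$1; to=$2
  for stamp in "$from" "$to"; do
    # Reject empty values and anything that is not purely digits.
    case "$stamp" in
      ""|*[!0-9]*) echo "expected yyyyMMddHHmm, got: $stamp" >&2; return 1 ;;
    esac
    # yyyyMMddHHmm is exactly 12 digits long.
    [ "${#stamp}" -eq 12 ] || { echo "expected 12 digits, got: $stamp" >&2; return 1; }
  done
  echo "java -jar alfresco-elasticsearch-reindexing-3.2.0-app.jar" \
    "--alfresco.reindex.jobName=reindexByDate" \
    "--alfresco.reindex.fromTime=$from" \
    "--alfresco.reindex.toTime=$to"
}

# Example window: outage from 12:00 to 14:30 on 1 January 2023.
reindex_outage_cmd 202301011200 202301011430
```

Choosing a window slightly wider than the observed outage is the safer option, since indexing a node again is assumed to simply overwrite the existing entry.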
3.2. Re-Indexing
Re-indexing a large repository with a single re-index process can take a long time. The two approaches below scale the reindexing process vertically as well as horizontally.
3.2.1. Approach 1
In this approach we use multiple EC2 instances for horizontal scaling, and inside each instance we run multiple reindexing threads.
[Figure: Re-Indexing: Approach 1]
Setting Up Re-Indexer Instance
- Copy alfresco-elasticsearch-connector-distribution-3.2 onto each instance.
- We ran 6 threads on one instance and 5 threads on a second instance. This can be changed as needed.
- Start the reindexing jar once per thread, each with a unique port number and its own alfresco.reindex.fromId and alfresco.reindex.toId range, to run as many threads as needed on an instance.
- To fetch by IDs, set alfresco.reindex.jobName=reindexByIds, which indexes nodes in an interval of the database ALF_NODE.id column.
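The steps above can be sketched as a small helper that splits an ID range into equal chunks and prints one command per thread, each with its own port. The function name `print_reindex_cmds`, the ID range 0 to 60000 and the base port 9090 are illustrative assumptions; in practice the range would come from the minimum and maximum of ALF_NODE.id, and each command would also need the usual connection settings. Nothing is executed here.

```shell
# Sketch only: print one reindexByIds command per thread.
# Arguments: min_id max_id threads base_port (all illustrative values).
print_reindex_cmds() {
  min_id=$1; max_id=$2; threads=$3; base_port=$4
  jar="alfresco-elasticsearch-reindexing-3.2.0-app.jar"
  total=$((max_id - min_id))
  # Ceiling division so the chunks together cover the whole range.
  chunk=$(( (total + threads - 1) / threads ))
  i=0
  while [ "$i" -lt "$threads" ]; do
    from=$((min_id + i * chunk))
    # Stop early if the range is too small to give every thread a chunk.
    [ "$from" -ge "$max_id" ] && break
    to=$((from + chunk))
    [ "$to" -gt "$max_id" ] && to=$max_id
    # Adjacent ranges share their boundary ID because this sketch does not
    # assume whether toId is inclusive; indexing a node twice is assumed
    # to be harmless.
    echo "java -jar $jar" \
      "--alfresco.reindex.jobName=reindexByIds" \
      "--alfresco.reindex.fromId=$from" \
      "--alfresco.reindex.toId=$to" \
      "--server.port=$((base_port + i))"
    i=$((i + 1))
  done
}

# Six threads on one instance, ports 9090 to 9095.
print_reindex_cmds 0 60000 6 9090
```

On a second instance the same helper can be called with the next ID range, for example `print_reindex_cmds 60000 110000 5 9090` for five threads.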
3.2.2. Approach 2
This approach re-indexes using remote partitioning. More details can be found in the Alfresco docs: https://docs.alfresco.com/search-enterprise/latest/admin/#alfresco-elasticsearch-connector
To start the Manager, execute the command below.
java -jar alfresco-elasticsearch-reindexing-3.2.0-app.jar \
  --alfresco.reindex.jobName=reindexByIds \
  --alfresco.reindex.partitioning.type=manager \
  --alfresco.reindex.pagesize=100 \
  --alfresco.reindex.batchSize=100 \
  --alfresco.reindex.fromId=0 \
  --alfresco.reindex.toId=10000 \
  --spring.batch.datasource.url=jdbc:postgresql://localhost:5432/alfresco \
  --spring.batch.datasource.username=alfresco \
  --spring.batch.datasource.password=alfresco \
  --spring.batch.datasource.driver-class-name=org.postgresql.Driver \
  --spring.datasource.url=jdbc:postgresql://localhost:5432/alfresco \
  --spring.datasource.username=alfresco \
  --spring.datasource.password=alfresco \
  --alfresco.reindex.partitioning.grid-size=20 \
  --spring.batch.drop.script=classpath:/org/springframework/batch/core/schema-drop-postgresql.sql \
  --spring.batch.schema.script=classpath:/org/springframework/batch/core/schema-postgresql.sql
To start a Worker, execute the command below.
java -jar alfresco-elasticsearch-reindexing-3.2.0-app.jar \
  --alfresco.reindex.partitioning.type=worker \
  --alfresco.reindex.pagesize=100 \
  --alfresco.reindex.batchSize=100 \
  --alfresco.reindex.concurrentProcessors=2 \
  --spring.batch.datasource.url=jdbc:postgresql://localhost:5432/alfresco \
  --spring.batch.datasource.username=alfresco \
  --spring.batch.datasource.password=alfresco \
  --spring.batch.datasource.driver-class-name=org.postgresql.Driver \
  --spring.datasource.url=jdbc:postgresql://localhost:5432/alfresco \
  --spring.datasource.username=alfresco \
  --spring.datasource.password=alfresco \
  --spring.batch.drop.script=classpath:/org/springframework/batch/core/schema-drop-postgresql.sql \
  --spring.batch.schema.script=classpath:/org/springframework/batch/core/schema-postgresql.sql \
  --server.port=9091
Note: If you are re-indexing only metadata and/or path with the remote partitioning approach, make sure to set the related indexing properties (metadataIndexingEnabled, contentIndexingEnabled, pathIndexingEnabled) when executing the Worker command.
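One way to keep those properties consistent across Workers is a small helper that turns a list of index types into the three flags, which can then be appended to the Worker command. This is a sketch under the assumption that the Worker accepts the same three `alfresco.reindex.*IndexingEnabled` properties listed in section 2.1; the function name `indexing_flags` is hypothetical.

```shell
# Sketch only: emit the three indexing flags for the chosen index types.
# Every type not named on the command line is switched off.
indexing_flags() {
  metadata=false; content=false; path=false
  for kind in "$@"; do
    case "$kind" in
      metadata) metadata=true ;;
      content)  content=true ;;
      path)     path=true ;;
      *) echo "unknown index type: $kind" >&2; return 1 ;;
    esac
  done
  printf '%s %s %s\n' \
    "--alfresco.reindex.metadataIndexingEnabled=$metadata" \
    "--alfresco.reindex.contentIndexingEnabled=$content" \
    "--alfresco.reindex.pathIndexingEnabled=$path"
}

# Metadata and path only, content indexing switched off.
indexing_flags metadata path
```

Using the same helper output on every Worker avoids one Worker indexing content while the others do not.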
4. Comparison of re-indexing approaches
| Approach | Pros | Cons |
|---|---|---|
| Approach 1: Multi-threading | Less time-consuming; best suited to customers with larger repositories. | Considerable manual work is involved in setting up the threads. However, as re-indexing is a one-time process, this can largely be disregarded. |
| Approach 2: Remote Partitioning | Easy to manage. The number of workers/partitions can be easily managed by setting alfresco.reindex.partitioning.grid-size. The Manager automatically assigns fromId and toId values to the worker nodes. | Slower, and therefore better suited to customers with smaller repositories. |
5. Reference: