Offline/parallel re-indexing with ElasticSearch

cancel
Showing results for 
Search instead for 
Did you mean: 

Offline/parallel re-indexing with ElasticSearch

asirika
Partner
1 0 1,143

Overview

Alfresco Content Services supports the Elasticsearch(ES) platform for searching within the repository using Alfresco Search Enterprise 3.X. While Elasticsearch is a robust technology, it has number of limitations whereas in some situations new index creation and re-indexing is required.

  • ES does not allow dynamic sharding therefore adding new shards to existing index when required is not possible.
  • ES does not allow modifying the existing field mappings, therefore modifications to existing contentModel fields will not be picked up from ES mapping. For Example, tokenisation FALSE to TRUE or field type changes.

However, considering the re-indexing speed this can be achieved within hours without downtime.

This documentation describes the steps which need to follow to build the indexes offline by connecting to same metadas store(Database).

Steps.

Elasticsearch

  1. Create the Elasticsearch index with new name (Example ‘alfresco-new’) with desired number of shards, replicas, total fields etc.

curl -XPUT 'http://ESHOST:PORT/alfresco-new?pretty' -H 'Content-Type: application/json' -d'
{
    "settings" :{
        "number_of_shards":5,
         "number_of_replicas":0,
          "index.mapping.total_fields.limit":2000
   }
}'

Alfresco Content Service (ACS)

  1. Start up new ACS instance in read only mode connecting to same database pointing to the new index created in previous step.
        elasticsearch.indexName=alfresco-new
        server.allowWrite=false

Note: If contentModel changes being made, then make sure to start the repo with new changes deployed.

2. Perform a search to start up the search subsystem, which will then automatically create the relevant mappings in newly created index.

Verification steps

a. Use curl command to Elasticsearch index and validate few of the property mappings and "dynamic" : "false",

curl -XGET 'http://ESHOST:ESPORT/alfresco-new?pretty'

b. Below loggers will appear in catalina.out

2023-06-23 12:21:47,403 INFO [elasticsearch.contentmodelsync.ContentModelSynchronizer] [elasticsearch-initializer] Successfully loaded analysers.

2023-06-23 12:21:47,543 INFO [elasticsearch.contentmodelsync.ContentModelSynchronizer] [elasticsearch-initializer] Successfully loaded basic mappings.

2023-06-23 12:21:47,553 INFO [elasticsearch.contentmodelsync.ElasticsearchInitialiser] [elasticsearch-initializer] Successfully connected to Elasticsearch index.

3. Generate the reindex.prefixes-file.json

4. Execute the re-indexing commands by passing the additional parameter ‘elasticsearch.indexName’ set to the new index. By default, this is set to alfresco.

java -Xmx4G -jar alfresco-elasticsearch-reindexing-3.3.0.1-app.jar \
     --server.port=9090 \
     --alfresco.reindex.jobName=reindexByIds \
     --spring.elasticsearch.rest.uris=http://localhost:9200/alfresco-new \
     --spring.elasticsearch.rest.username=username \
     --spring.elasticsearch.rest.password=password \
     --alfresco.accepted-content-media-types-cache.enabled=false \
     --spring.activemq.broker-url=nio://localhost:61616 \
     --alfresco.reindex.fromId=0 \
     --alfresco.reindex.toId=5000000 \
     --alfresco.reindex.multithreadedStepEnabled=true \
     --alfresco.reindex.concurrentProcessors=30 \
     --alfresco.reindex.metadataIndexingEnabled=true \
     --alfresco.reindex.contentIndexingEnabled=false \
     --alfresco.reindex.pathIndexingEnabled=false \
     --alfresco.reindex.prefixes-file=file:reindex.prefixes-file.json \
     --alfresco.reindex.pagesize=10000 \
     --alfresco.reindex.batchSize=1000\
     --elasticsearch.indexName=alfresco-new
  > reindexing.log &

 5. Validate the document count.

curl -XGET 'http://ESHOST:ESPORT/alfresco-new/_count?pretty'

6. Point ACS cluster nodes to new index: alfresco-new

7. Optional: Destroy the newly created ACS instance and old ElasticSearch index.

8. Happy searching.