AnsweredAssumed Answered

Improving the document retrieval performance?

Question asked by sepgs2004 on Oct 15, 2018
Latest reply on Oct 25, 2018 by afaust

We are using Alfresco Community 5.2. We did not do anything in addition to the configuration, indexing or whatever. It is the default repository as-is. We have a J2EE application that uses Apache Chemistry CMIS to access the documents from the repository.

 

We have a Site called abcd in there. In the document library of the site, we have several folders for each of our category of documents. There are about 10 categories. Under each category folder, there will be folders for each entity (that belongs to that category).

For example, there is a Person category, and underneath, there could be 1000s of folders, each representing a person entity record. Now inside each person entity folder, we store documents that belong to that specific person entity.

For a person entity, typically there will be less than 10 document objects under the person entity folder. The folder name is the person id. For example, the documents for person whose id is 234 would be in this path: /Sites/abcd/documentLibrary/Person/234/

 

From Share UI, the path to access the Person category folder: /Sites/abcd/documentLibrary/Person

Similarly we have a category called Bond

From Share UI, this is the path to access the category folder: /Sites/abcd/documentLibrary/Bond

 

Performance Issue

 

Now when we retrieve a person entity documents, the result comes in .02 or .03 seconds.

However, when we retrieve the a bond entity documents, the result takes 20 seconds to appear.

 

Above behavior (that is the time cost) is same when I access the documents through our J2EE application, and when I access it through a CMIS query in CMIS workbench.

 

In CMIS workbench, I tried running the provided groovy script to count the #of documents for each category.

For Person, there are about 30,000 documents.

For Bond category, the groovy script fails after counting about 100,000 objects. I am sure there are many more.

 

From this, one thing is sure. When we access the documents of a bond entity, we are dealing with millions of documents.

 

The CMIS query (that we use in our application and when I tested outside) is like this.

SELECT doc.*, slo.*

FROM cmis:document AS doc

    JOIN slo:documentProperties AS SLOALIAS

         ON doc.cmis:objectId = SLOALIAS.cmis:objectId 

WHERE 

    IN_FOLDER(doc, 'workspace://SpacesStore/ed9866d4-c99e-4f3c-a12b-aa3f757d4198')

    AND (slo.SLOALIAS:lo_link = 356310)

    AND (slo.SLOALIAS:lo_category = 'Bond')

 

What is in the IN_FOLDER clause?

workspace://SpacesStore/ed9866d4-c99e-4f3c-a12b-aa3f757d4198

This is the folder id of the folder that correspond to one bond entity.

This folder will be underneath /Sites/abcd/documentLibrary/Bond/

 

Note: even if I remove the IN_FOLDER clause totally, it has the same 20 seconds time cost.

 

How can I go about increasing the performance, or debugging what is the cause of this 20 seconds time for accessing a bond entity documents?

 

Outcomes