
Large Folders in the Content Repository

Discussion created by resplin on Jul 18, 2017
Latest reply on Aug 4, 2017 by jpotts

When putting a lot of content into the Alfresco Content Repository, it is necessary to decide how to manage large folders. In this discussion, I collect information on this topic in a public location where people can easily find it.


Background

An Alfresco folder (or, more generally, any container) can hold a large number of items, typically files and/or sub-folders. A user-visible folder containing thousands of items can be accessed via any of the official Alfresco user interfaces, other applications, APIs, or protocols. The repository does not place any limit on how many items can be stored in a single container.


As a rule of thumb, a folder that contains more than 5,000 to 10,000 items should be considered for restructuring, possibly by splitting it into a set of sub-folders. What makes a folder "too large" depends on how those items are accessed, especially when trying to browse (i.e. "list") the items within the folder. Operations that list the contents of a container, such as "getChildren" on a folder, can tax the system. The size of the child items does not matter, only their number.


Listing large folders is an inherently resource-intensive process. Over time, we have improved the system's behavior with large folders, and we continue to look for further improvements. However, we do not expect major changes to this aspect of the system's behavior in the near future.


Use Cases

A large folder is either intended to be browseable by users, or it is non-browseable.


There is no immediate performance problem with a non-browseable container that has many children, but there is a risk that someone will accidentally trigger a call to "getChildren" that degrades system performance until the query completes. The easiest way to guard against this risk is to hash the folder contents across sub-folders. This should not impact the user experience: because the folder is not intended to be browsed, users will never see the sub-folders. If a hashing mechanism is not implemented, then steps should be taken to prevent "getChildren" from being executed.
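
To make the hashing idea concrete, here is a minimal sketch (illustrative only, not Alfresco API code) of one way to derive a stable bucket path from an item's name; the function name, bucket widths, and use of MD5 are all assumptions chosen for the example:

```python
import hashlib

def bucket_path(node_name: str, levels: int = 2, width: int = 2) -> str:
    """Derive a stable sub-folder path for an item by hashing its name.

    With levels=2 and width=2 (hex characters), content spreads across
    256 * 256 = 65,536 buckets, keeping each bucket small. The same name
    always maps to the same path, so items can be located without browsing.
    """
    digest = hashlib.md5(node_name.encode("utf-8")).hexdigest()
    buckets = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return "/".join(buckets + [node_name])
```

Because the mapping is deterministic, an application that stores an item under `bucket_path(name)` can compute the same path later to retrieve it directly, without ever listing the parent folder.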


If a folder is intended to be browsed, then the information architect should think carefully about the use case. A folder with tens of thousands of items is unlikely to be usefully browseable, and there is likely a system of categorization that will serve users better than keeping everything in the same container.


Guidelines

When a folder's contents approach 5,000 to 10,000 items, we recommend using a hashing mechanism to spread the content through sub-folders so as not to tax the system. Adding folders to hash content has minimal overhead: it does not impact any search queries, and it generates only 0.1% to 0.2% additional nodes. The management of a hashed file plan can be automated using content policies, rules and actions, or a scheduled job.
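
As a rough sanity check on that overhead figure, the extra folder nodes created by a hash structure can be counted directly; the item and bucket counts below are illustrative assumptions, not numbers from the original discussion:

```python
def hashing_overhead(num_items: int, buckets_per_level: int, levels: int) -> float:
    """Fraction of additional (folder) nodes created by the hash structure."""
    extra_folders = sum(buckets_per_level ** level for level in range(1, levels + 1))
    return extra_folders / num_items

# 100,000 items spread over a single level of 128 sub-folders adds
# 128 folder nodes, i.e. roughly 0.13% overhead -- the same order of
# magnitude as the 0.1% to 0.2% quoted above.
```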


In addition:

  • Containers with more than tens of thousands of items can be stored within the repository. These items can be accessed directly by their NodeRef, ObjectID, or a qname-path. The content can be located by search, categories, or tags.

  • When using the APIs (e.g. CMIS or the REST API) to get/list children, the results should be paged using the skipCount and maxItems query parameters. For example, the client may choose to list X items at a time. If maxItems is not specified, then a default page size is used; this is 100 items for both OpenCMIS (Apache Chemistry) and the V1 REST API "list children".

  • For file protocols such as WebDAV and FTP, only the first 5,000 items will be returned unless the system administrator has increased the "system.filefolderservice.defaultListMaxResults" property.

  • In the rare case that getChildren must be used on a large folder, running the command as the System user will avoid expensive permission checks.

  • You should be cautious about browsing "User Homes" as an admin. This could be a slow operation on systems with a large number of users. It may make sense to configure a "home folder provider" to split the user home directories across a set of sub-folders.
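
For the WebDAV/FTP cap, the property is set in alfresco-global.properties; the property name comes from the discussion above, while the value shown is an illustrative choice:

```properties
# Raise the maximum number of items returned by file-protocol listings
system.filefolderservice.defaultListMaxResults=10000
```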
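
The skipCount/maxItems contract mentioned above can be illustrated with a small local simulation; this is a sketch of the paging semantics only (in a real client the two parameters travel on the HTTP request, and the function names here are invented for the example):

```python
def list_children(all_children, skip_count=0, max_items=100):
    """Simulate a paged "list children" call: return one page of results
    plus a flag indicating whether more pages remain. The default page
    size of 100 mirrors the defaults noted above."""
    page = all_children[skip_count:skip_count + max_items]
    has_more = skip_count + max_items < len(all_children)
    return page, has_more

def iter_all(all_children, page_size=100):
    """Walk every page the way a well-behaved client would, advancing
    skip_count by the page size until no more results remain."""
    skip = 0
    while True:
        page, more = list_children(all_children, skip, page_size)
        yield from page
        if not more:
            break
        skip += page_size
```

The key point is that the client never asks the server to materialize the whole folder listing at once; each request touches only one page of children.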


Notes

  • Thank you to Rich McKnight for his help in collecting this information.
