
Large Folders in the Content Repository

Discussion created by resplin on Jul 18, 2017
Latest reply on Aug 4, 2017 by jpotts

When putting a lot of content into the Alfresco Content Repository, it is necessary to decide how to manage large folders. In this discussion, I collect information on this topic in a public location where people can easily find it.


Background

An Alfresco folder (or, more generally, any container) can hold a large number of items, typically files and/or sub-folders. A user-visible folder containing thousands of items can be accessed via any of the official Alfresco user interfaces, other applications, APIs, or protocols. The repository does not place any limit on how many items can be stored in a single container.


As a rule of thumb, a folder that contains more than 5,000 to 10,000 items should be considered for restructuring, possibly by splitting it into a set of sub-folders. What makes a folder "too large" depends on how those items are accessed, especially when trying to browse (i.e. "list") the items within the folder. Operations that list the contents of a container, such as "getChildren" on a folder, can tax the system. The size of the child items does not matter, only their number.


Listing large folders is an inherently resource-intensive process. Over time, we have improved the system's behavior with large folders, and we continue to look for further improvements. However, we do not expect major changes to this aspect of the system's behavior in the near future.


Use Cases

A large folder is either intended to be browseable by users, or it is non-browseable.


There is no immediate performance problem with a non-browseable container that has many children, but there is a risk that someone will accidentally trigger a call to "getChildren" that degrades system performance until the query completes. The easiest way to guard against this risk is to hash the folder contents across sub-folders. This should not impact the user experience: because the folder is not intended to be browsed, users will never see the sub-folders. If a hashing mechanism is not implemented, then steps should be taken to prevent "getChildren" from being executed.
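
To make the hashing idea concrete, here is a minimal sketch (illustrative only, not Alfresco API code) of one way to derive a stable bucket path from an item's name; the function name, bucket widths, and use of MD5 are all assumptions chosen for the example:

```python
import hashlib

def bucket_path(node_name: str, levels: int = 2, width: int = 2) -> str:
    """Derive a stable sub-folder path for an item by hashing its name.

    With levels=2 and width=2 (hex characters), content spreads across
    256 * 256 = 65,536 buckets, keeping each bucket small. The same name
    always maps to the same path, so items can be located without browsing.
    """
    digest = hashlib.md5(node_name.encode("utf-8")).hexdigest()
    buckets = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return "/".join(buckets + [node_name])
```

Because the mapping is deterministic, an application that stores an item under `bucket_path(name)` can compute the same path later to retrieve it directly, without ever listing the parent folder.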


If a folder is intended to be browsed, then the information architect should think carefully about the use case. A folder with tens of thousands of items is unlikely to be usefully browseable, and there is likely a system of categorization that will serve users better than keeping everything in the same container.


Guidelines

When a folder's contents approach 5,000 to 10,000 items, we recommend using a hashing mechanism to spread the content through sub-folders so as not to tax the system. Adding folders to hash content has minimal overhead: it does not impact any search queries, and it generates only 0.1% to 0.2% additional nodes. The management of a hashed file plan can be automated using content policies, rules and actions, or a scheduled job.
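
As a rough sanity check on that overhead figure, the extra folder nodes created by a hash structure can be counted directly; the item and bucket counts below are illustrative assumptions, not numbers from the original discussion:

```python
def hashing_overhead(num_items: int, buckets_per_level: int, levels: int) -> float:
    """Fraction of additional (folder) nodes created by the hash structure."""
    extra_folders = sum(buckets_per_level ** level for level in range(1, levels + 1))
    return extra_folders / num_items

# 100,000 items spread over a single level of 128 sub-folders adds
# 128 folder nodes, i.e. roughly 0.13% overhead -- the same order of
# magnitude as the 0.1% to 0.2% quoted above.
```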


In addition:

  • Containers with more than tens of thousands of items can be stored within the repository. These items can be accessed directly by their NodeRef, ObjectID, or a qname-path. The content can be located by search, categories, or tags.

  • When using the APIs (e.g. CMIS or the REST API) to get/list children, the results should be paged using the skipCount and maxItems query parameters. For example, the client may choose to list X items at a time. If maxItems is not specified, then a default page size is used; this is 100 items for both OpenCMIS (Apache Chemistry) and the V1 REST API "list children".

  • For file protocols such as WebDAV and FTP, only the first 5,000 items will be returned unless the system administrator has increased the "system.filefolderservice.defaultListMaxResults" property.

  • In the rare case that getChildren must be used on a large folder, running the command as the System user will avoid expensive permission checks.

  • You should be cautious about browsing "User Homes" as an admin. This could be a slow operation on systems with a large number of users. It may make sense to configure a "home folder provider" to split the user home directories across a set of sub-folders.
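
For the WebDAV/FTP cap, the property is set in alfresco-global.properties; the property name comes from the discussion above, while the value shown is an illustrative choice:

```properties
# Raise the maximum number of items returned by file-protocol listings
system.filefolderservice.defaultListMaxResults=10000
```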
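
The skipCount/maxItems contract mentioned above can be illustrated with a small local simulation; this is a sketch of the paging semantics only (in a real client the two parameters travel on the HTTP request, and the function names here are invented for the example):

```python
def list_children(all_children, skip_count=0, max_items=100):
    """Simulate a paged "list children" call: return one page of results
    plus a flag indicating whether more pages remain. The default page
    size of 100 mirrors the defaults noted above."""
    page = all_children[skip_count:skip_count + max_items]
    has_more = skip_count + max_items < len(all_children)
    return page, has_more

def iter_all(all_children, page_size=100):
    """Walk every page the way a well-behaved client would, advancing
    skip_count by the page size until no more results remain."""
    skip = 0
    while True:
        page, more = list_children(all_children, skip, page_size)
        yield from page
        if not more:
            break
        skip += page_size
```

The key point is that the client never asks the server to materialize the whole folder listing at once; each request touches only one page of children.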


Notes

  • Thank you to Rich McKnight for his help in collecting this information.
