
Bulk Import from a Filesystem

Blog Post created by pmonks2 on Oct 22, 2009

The Use Case



In any CMS implementation an almost ubiquitous requirement is to load existing content into the new system.  That content may reside in a legacy CMS, on a shared network drive, on individual users' hard drives or in email, but the requirement is almost always there - to inventory the content that's out there and bring some or all of it into the CMS with a minimum of effort.



Alfresco provides several mechanisms that can be used to import content, including:


Alfresco is also fortunate to have SI partners such as Technology Services Group who provide specialised content migration services and tools (their open source OpenMigrate tool has proven to be popular amongst Alfresco implementers).



That said, most of these approaches suffer from one or more of the following limitations:



  • They require the content to be massaged into some other format prior to ingestion


  • Orchestration of the ingestion process is performed externally to Alfresco (i.e. out-of-process), resulting in excessive chattiness between the orchestrator and Alfresco


  • They require development or configuration work


  • They're more general in nature, and so aren't as performant as a specialised solution



An Opinionated (but High Performance!) Alternative



For that reason I recently set about implementing a bulk filesystem import tool that focuses on satisfying a single, highly specific use case in the most performant manner possible: to take a set of folders and files on local disk and load them into the repository as quickly and efficiently as possible.



The key assumption that allows this process to be efficient is that the source folders and files must be on disk that is locally accessible to the Alfresco server - typically this will mean a filesystem that is located on a hard drive physically housed in the server Alfresco is running on.  This allows the code to directly stream from disk into the repository, which basically devolves into disk-to-disk streaming - far more efficient than any kind of mechanism that requires network I/O.
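To make "disk-to-disk streaming" concrete, here is a minimal sketch in Python of copying a file in fixed-size chunks without loading it into memory.  It is an illustration only, not the tool's code: the function name stream_file and the 1 MB chunk size are my own assumptions, and the real tool writes into the Alfresco content store rather than to a plain target path.

```python
import shutil

CHUNK_BYTES = 1024 * 1024  # illustrative buffer size (assumption, not taken from the tool)


def stream_file(source_path: str, target_path: str, chunk_bytes: int = CHUNK_BYTES) -> int:
    """Copy source to target in fixed-size chunks and return the bytes written.

    Hypothetical helper: it only shows that a local-disk source can be
    streamed straight to a local-disk target with no network hop in between.
    """
    with open(source_path, "rb") as src, open(target_path, "wb") as dst:
        # copyfileobj reads and writes chunk_bytes at a time, so memory use
        # stays flat regardless of file size.
        shutil.copyfileobj(src, dst, chunk_bytes)
        return dst.tell()
```

Because both ends are local files, the cost is bounded by disk throughput rather than by network round trips.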



How those folders and files got onto the local disk is left as an exercise for the reader, but most OSes provide efficient mechanisms for transferring files across a network (rsync and robocopy, for example).  Alternatively it's also possible to mount a remote filesystem using an OS-native mechanism (CIFS, NFS, GFS and the like), although doing so reintroduces network I/O overhead.



Another key differentiator of this solution is that all of the logic for ingestion executes in-process within Alfresco.  This completely eliminates expensive network RPCs while ingestion is occurring, and also provides fine-grained control of various expensive operations (such as transaction commits / rollbacks).



Which leads into another advantage of this solution: like most transactional systems, there are some general strategies that should be followed when writing large amounts of data into the Alfresco repository:



  1. Break up large volumes of writes into multiple batches - long running transactions are problematic for most transactional systems (including Alfresco).


  2. Avoid updating the same objects from different concurrent transactions.  In the case of Alfresco, this is particularly noticeable when writing content into the same folder, as those writes cause updates to the parent folder's modification timestamp. [EDIT] In recent versions of Alfresco, the automatic update of a folder's modification timestamp (cm:modified property) has been disabled by default.  It can be turned back on (by setting the property 'system.enableTimestampPropagation' to true), but the default is false, so this is likely to have less of an impact on bulk ingestion than I'd originally thought.



The bulk filesystem import tool implements both of these strategies (something that is not easily accomplished when ingestion is coordinated by a separate process).  It batches the source content by folder, using a separate transaction per folder, and it also breaks up any folder containing more than a specific number of files (1,000 by default) into multiple transactions.  It also creates all of the children of a given folder (both files and sub-folders) as part of the same transaction, so that indirect updates to the parent folder occur from that single transaction.
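The batching scheme described above can be sketched roughly as follows.  This is a Python illustration of the idea, not the tool's actual implementation: plan_batches and BATCH_SIZE are names I've made up, each yielded pair stands in for one repository transaction, and for simplicity the sketch counts all children of a folder (files and sub-folders together) against the threshold.

```python
import os
from typing import Iterator, List, Tuple

BATCH_SIZE = 1000  # mirrors the default threshold mentioned in the post


def plan_batches(source_root: str,
                 batch_size: int = BATCH_SIZE) -> Iterator[Tuple[str, List[str]]]:
    """Yield (folder, children) pairs; each pair represents one transaction.

    Hypothetical helper: children are the sub-folders and files directly
    inside `folder`, so writes into a given parent folder only ever come
    from that folder's own batches.
    """
    for folder, subfolders, files in os.walk(source_root):
        children = sorted(subfolders) + sorted(files)
        # A folder with more children than batch_size is split into
        # several consecutive batches; an empty folder yields none.
        for start in range(0, len(children), batch_size):
            yield folder, children[start:start + batch_size]
```

An orchestrator running outside the repository would need a network round trip per item to achieve the same grouping, whereas in-process code can simply wrap each yielded batch in its own transaction.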

But What Does this Mean in Real Life?



The benefit of this approach was demonstrated recently when an Alfresco implementation had a bulk ingestion process that regularly loaded large numbers (1,000s) of large image files (several MBs per file) into the repository via CIFS.  In one test, it took approximately an hour to load 1,500 files into the repository via CIFS.  In contrast, the bulk filesystem import tool took less than 5 minutes to ingest the same content set.
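As a back-of-the-envelope check on those figures, the short Python sketch below converts them into throughput.  It treats "approximately an hour" as 60 minutes and "less than 5 minutes" as a 5 minute upper bound, so the resulting speedup is a lower bound; the function name is my own.

```python
def files_per_minute(file_count: int, minutes: float) -> float:
    """Average ingestion throughput in files per minute."""
    return file_count / minutes


FILES = 1500
CIFS_MINUTES = 60.0  # "approximately an hour" (rounded assumption)
BULK_MINUTES = 5.0   # upper bound taken from "less than 5 minutes"

cifs_rate = files_per_minute(FILES, CIFS_MINUTES)  # 25 files per minute
bulk_rate = files_per_minute(FILES, BULK_MINUTES)  # at least 300 files per minute
speedup = bulk_rate / cifs_rate                    # at least 12x
```

That is roughly 25 files per minute via CIFS against at least 300 files per minute with the bulk import tool, i.e. a speedup of 12x or better for the ingestion step itself.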



Now clearly this ignores the time it took to copy the 1,500 files onto the Alfresco server's hard drive prior to running the bulk filesystem import tool, but in this case it was possible to modify the sourcing process so that it dropped the content directly onto the Alfresco server's hard drive, providing a substantial (order of magnitude) overall saving.

What Doesn't it Do (Yet)?



Despite already being in use in production, this tool is not what I would consider complete.  The issue tracker in the Google Code project has details on the functionality that's currently missing; the most notable gap being the lack of support for population of metadata (folders are created as cm:folder and files are created as cm:content). [EDIT] v0.5 adds a first cut at metadata import functionality.  The 'user experience' (I hesitate to call it that) is also very rough and could easily be substantially improved. [EDIT] v0.4 added several UI Web Scripts that significantly improve the usability of the tool (at least for the target audience: Alfresco developers and administrators).



That said, the core logic is sound, and has been in production use for some time.  You may find that it's worth investigating even in its currently rough state.



[POST EDIT] This tool seems to have attracted quite a bit of interest amongst the Alfresco implementer community.  I'm chuffed that that's the case and would request that any questions or comments you have be raised on the mailing list.  If you believe you've found a bug, or wish to request an enhancement to the tool, the issue tracker is the best place.  Thanks!
