
Super Sizing your Alfresco Repository

Blog Post created by lcabaceira Employee on Sep 30, 2014

Hi everyone, I'm back to share a very interesting tool that can help you with your tests and benchmarks.

'Have you ever wanted to test or benchmark your Alfresco project implementation with millions of documents before you deliver it for its go-live stage?'


I'm sure you have. It's normally not that easy to create a large number of dummy documents and corresponding meta-data fields that emulate what will be present in production. To solve this problem we've created a tool (open source, as always) that helps you do just that. Many thanks to Alex Strachan from Alfresco support, who wrote the user interface for this tool.

The tool is named 'SuperSizeMyRepo' and it's available at https://github.com/lcabaceira/supersizemyrepo. It's a multi-threaded tool that enables you to create a huge amount (millions) of bulk-import-ready content and meta-data for your Alfresco repository.

Types of documents created

    • MS Word documents (.doc) with an average size of 1024 KB
    • MS Excel documents (.xls) with an average size of 800 KB
    • PDF documents (.pdf) with an average size of 10MB
    • MS PowerPoint presentation documents (.ppt) with an average size of 5MB
    • JPEG images (.jpg) with an average size of 2MB


All documents are created with their corresponding meta-data XML properties file.

[Screenshot: the SuperSizeMyRepo UI]

Configuring the Documents meta-data


As you can see from the UI screenshot above, you can manually configure the values of the meta-data fields that will be created for the documents. Even more interesting is the ability to inject aspects directly into the document creation.

Injecting Aspects


You can also edit the field names, meaning that if you specify custom aspects, you can configure the remaining fields to use the property names defined on your custom aspects.
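To make the idea of a meta-data file with injected aspects more concrete, here is a minimal Python sketch. It assumes the shadow-file convention used by the Alfresco bulk importer (a Java XML properties file named after the content file, with "type" and "aspects" entries); it is not the tool's own code. The acme:reviewable aspect and its acme:reviewer property are made up for illustration.

```python
from xml.sax.saxutils import escape, quoteattr


def metadata_file_name(content_file_name):
    """Name of the shadow meta-data file that sits next to a content file."""
    return f"{content_file_name}.metadata.properties.xml"


def build_metadata_xml(content_type, aspects, properties):
    """Return the text of a bulk-import meta-data properties file."""
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">',
        "<properties>",
        f'  <entry key="type">{escape(content_type)}</entry>',
        # Aspects are listed as one comma-separated value.
        f'  <entry key="aspects">{escape(",".join(aspects))}</entry>',
    ]
    for name, value in properties.items():
        # quoteattr() adds the surrounding quotes and escapes the key.
        lines.append(f"  <entry key={quoteattr(name)}>{escape(value)}</entry>")
    lines.append("</properties>")
    return "\n".join(lines) + "\n"


# Hypothetical custom aspect "acme:reviewable" with one property.
example = build_metadata_xml(
    "cm:content",
    ["cm:titled", "acme:reviewable"],
    {"cm:title": "Dummy report", "acme:reviewer": "jdoe"},
)
```

With this convention, the meta-data for report.doc would live in report.doc.metadata.properties.xml in the same folder.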

How about Indexing ?


We've also thought about testing the search subsystem (Solr or Lucene) with large amounts of data. For this reason the documents are created with lots of random words that get indexed into Solr or Lucene. This way you can test both the repository and the search layer.
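If you want a feel for what "lots of random words" means, the sketch below builds a body of random text in Python. The word list and the function name are assumptions for illustration only; the tool has its own way of producing the text.

```python
import random

# Hypothetical vocabulary; a real run would use a much larger word list.
WORDS = [
    "alfresco", "repository", "benchmark", "document",
    "index", "search", "metadata", "content",
]


def random_text(num_words, seed=None):
    """Return num_words randomly chosen words separated by spaces."""
    rng = random.Random(seed)  # a seed makes the output reproducible
    return " ".join(rng.choice(WORDS) for _ in range(num_words))
```

Passing a seed is handy when you want two benchmark runs to index exactly the same text.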

What are the images for ?


To create documents with the sizes announced above, we also needed to include some random images. We provide a set of images that you can download and use as your local library for document creation. This set of random images is available here. You can also use your own set of images, as long as they are all JPGs and sit in the root of the images folder.

What is the deployment folder ?


The deployment folder is where your documents will be created. Normally this is a place inside your contentStore, so that you can perform an in-place bulk import, one of the fastest ways to inject lots of content into your Alfresco repository. You can specify any folder for document creation.
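Before generating millions of files, it can be worth checking that the deployment folder really sits inside the contentStore. Here is a small Python sketch of that check. The paths in the usage note are hypothetical, and the function assumes absolute, already normalized POSIX paths (it does not resolve ".." or symlinks).

```python
from pathlib import PurePosixPath


def is_inside(content_store, deployment_location):
    """True if deployment_location is the content store or a folder below it."""
    store = PurePosixPath(content_store)
    target = PurePosixPath(deployment_location)
    return target == store or store in target.parents
```

For example, is_inside("/opt/alfresco/alf_data/contentstore", "/opt/alfresco/alf_data/contentstore/ssmr") is true, while a folder such as /tmp/ssmr would only be usable for a streaming import.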

Maximum number of files per folder


When you import the documents, the folder structure (if any) is also imported. According to Alfresco best practices, having a huge number of documents in the same folder can lead to performance degradation, mainly because of the ACL permission checking that happens when a user browses a folder. Alfresco needs to determine which documents can be shown to the user, and for that it needs to verify the permissions of each item in that folder. To reduce this overhead, we've introduced the option to specify a maximum number of documents that the tool can create in a single folder. When this number is reached, the tool creates new folders and the new documents are created in those folders.
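The folder limit boils down to simple integer arithmetic. The Python sketch below shows one way to spread files over folders; it illustrates the idea only, and the tool's actual folder naming and rollover strategy may differ.

```python
def folder_for(file_index, max_files_per_folder):
    """Zero-based folder number for the file at zero-based file_index."""
    if max_files_per_folder < 1:
        raise ValueError("max_files_per_folder must be at least 1")
    return file_index // max_files_per_folder


def plan_layout(total_files, max_files_per_folder):
    """Return a list with the number of files that lands in each folder."""
    if max_files_per_folder < 1:
        raise ValueError("max_files_per_folder must be at least 1")
    full_folders, remainder = divmod(total_files, max_files_per_folder)
    layout = [max_files_per_folder] * full_folders
    if remainder:
        layout.append(remainder)  # last, partially filled folder
    return layout
```

With 2500 documents and a limit of 1000 per folder, plan_layout(2500, 1000) gives three folders holding 1000, 1000 and 500 documents.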

JumpStart with the compiled version


If you wish to run the compiled version (available in the uiJars folder), the only prerequisite is having Java installed on your server so that you can execute a jar file.

Download the jar file for your OS. The UI is currently released for 3 different operating systems.


Note for Mac OS X users: to execute the jar, open a terminal and run it with the -XstartOnFirstThread option, as in the example below:

java -XstartOnFirstThread -jar ./ssmr-ui-1.0.3-osx-jar-with-dependencies.jar
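If you launch the UI from a script, the platform-specific flag can be added automatically. The following Python sketch is only a convenience wrapper around the command shown above; the function name is an assumption, and it relies on sys.platform reporting "darwin" on Mac OS X.

```python
import sys


def build_launch_command(jar_path, platform=sys.platform):
    """Return the java command line as a list of arguments."""
    command = ["java"]
    if platform == "darwin":
        # The UI must start on the first thread on Mac OS X.
        command.append("-XstartOnFirstThread")
    command += ["-jar", jar_path]
    return command
```

The resulting list can be handed to subprocess.run() as-is.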


Want to take the Deep dive approach ?


Great, I would like to take this opportunity to invite you to participate in this project and to contribute new features and your own ideas. This section guides you through downloading the source code and building it yourself.

1 - Software requirements

    • JDK 1.7
    • Apache Maven 3.0.4+


2 - Configuration requirements

During the installation of Maven, a new file named settings.xml was created. This file is the entry point to your local Maven configuration, including the remote Maven repositories. Edit your settings.xml file and update the servers section to include the Alfresco server ids and your credentials.

Note that the root pom.xml references 2 different repositories: alfresco-public and alfresco-public-snapshots. The id of each repository must match a server id in your settings.xml (where you specify your credentials for that server).

Section from configuration settings.xml


        <server>
            <id>alfresco-public</id>
            <username>YOUR_USERNAME</username>
            <password>YOUR_PASSWORD</password>
        </server>
        <server>
            <id>alfresco-public-snapshots</id>
            <username>YOUR_USERNAME</username>
            <password>YOUR_PASSWORD</password>
        </server>

Section from pom.xml

        <repository>
            <id>alfresco-public</id>
            <url>https://artifacts.alfresco.com/nexus/content/groups/public</url>
        </repository>
        <repository>
            <id>alfresco-public-snapshots</id>
            <url>https://artifacts.alfresco.com/nexus/content/groups/public-snapshots</url>
        </repository>


3 - Location/path where the files are created

Edit the src/main/java/super-size-my-repo.properties file and configure your deployment location and the images location.

files_deployment_location : Should be a place inside your contentStore. This will be the root for the in-place bulk import.

images_location : The tool randomly chooses from a folder of local images to include in the various document types. You need to set images_location to a folder that contains jpg images. You can use the sample images by pointing images_location to your /images. The bigger your images are, the bigger your target documents will be. For the document sizes listed above we expect jpg images of approximately 1.5MB.

Tool Configuration files and options

 

You can find the tool configuration file under src/main/java/super-size-my-repo.properties. This configuration file contains the following self-explanatory properties.

files_deployment_location=<PATH_WHERE_THE_FILES_WILL_BE_CREATED>
images_location=<DEFAULT_LOCATION_FOR_BASE_IMAGES>
num_Threads=<NUMBER_OF_THREADS_TO_EXECUTE>
threadPoolSize=<SIZE_OF_THE_THREAD_POOL>
max_files_per_folder=<NUMBER_OF_MAX_FILES_IN_A_SINGLE_FOLDER>



The only 2 properties that you must adjust are files_deployment_location and images_location. All of the other properties have working default values.
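A quick sanity check of the properties file before a long run can save some time. The Python sketch below reads key=value lines and reports which mandatory properties are still missing or still hold a <PLACEHOLDER> value. It is a simplified reader (no line continuations or ":" separators, which real Java properties files allow), and the function names are assumptions.

```python
MANDATORY = ("files_deployment_location", "images_location")


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and comments."""
    props = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, separator, value = line.partition("=")
        if separator:
            props[key.strip()] = value.strip()
    return props


def missing_mandatory(props):
    """Return mandatory keys that are absent, empty, or still a <PLACEHOLDER>."""
    return [
        key for key in MANDATORY
        if not props.get(key) or props[key].startswith("<")
    ]
```

An empty list from missing_mandatory() means both mandatory properties have been filled in.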

How to run with Maven ?


Issue the following Maven command from the project root to generate the targets (executable jar).

P.S. - Don't forget to configure your properties file.

# mvn clean install

This will build and generate the executable jar in the target directory.

To run this jar, just type :

java -jar super-size-my-repo-<YOUR_VERSION>-SNAPSHOT-jar-with-dependencies.jar

Next Steps ?


After running the tool, you will have lots of documents to import using the Alfresco bulk importer. To perform an in-place import, you need to set files_deployment_location to a location inside your contentstore.

Now you can execute the in-place bulk import action to add all the documents and corresponding meta-data to a target Alfresco repository.

The streaming bulk import URL on your Alfresco is : http://localhost:8080/alfresco/service/bulkfsimport

The in-place bulk import URL on your Alfresco is : http://localhost:8080/alfresco/service/bulkfsimport/inplace

Note that you may need to replace localhost and the 8080 port with your server details if you are not running Alfresco locally or not running it on the default 8080 port.
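If you script your benchmarks against several servers, the two URLs can be built from the host and port. This is a small Python sketch under the assumption that only the host, port and scheme change; the paths are the ones listed above, and the function name is made up.

```python
def bulk_import_urls(host="localhost", port=8080, scheme="http"):
    """Return the streaming and in-place bulk import URLs for a server."""
    base = f"{scheme}://{host}:{port}/alfresco/service/bulkfsimport"
    return {"streaming": base, "in_place": f"{base}/inplace"}
```

Calling bulk_import_urls() without arguments gives back the two localhost URLs shown above.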

Check http://wiki.alfresco.com/wiki/Bulk_Importer for more details.

And that is it folks. If you would like to contribute to the evolution of this tool, send me an email and I will add you as a contributor with commit rights to the GitHub repository.

I hope you enjoyed this article as much as I enjoyed writing it, and that you can make good use of this tool. Stay tuned for more Alfresco related articles and don't forget to support open-source projects.

OpenSource - Together we are stronger, One Love

Luis




