lcabaceira

Sizing and tuning your Alfresco Database

Blog Post created by lcabaceira Employee on Sep 16, 2014
In a human body the heart and the brain are the 2 most important organs, if those are not performing well, nothing else is. The Alfresco database and filesystem where the content store resides are the brain and heart of Alfresco. Those are the 2 most important layers of every Alfresco architecture.

 

Get to know the Alfresco Database throughput


If your project will have lots of concurrent users and operations or the number/estimate number of documents is very big (> 1M)) you need to be informed about your database throughput.


database_performance_ui


Most common throughput factor of the database is transactions per second.

DB performance in a transactional system are usually the underlying database files and the log file. Both are factors because they require disk I/O, which is slow relative to other system resources such as CPU.

In the worst-case scenario,  in big alfresco databases with a big number of documents ):

Database access is truly random and the database is too large for any significant percentage of it to fit into the cache, resulting in a single I/O per requested key/data pair.

Both the database and the log are on a single disk. This means that for each transaction, the AlfrescoDB is potentially performing several filesystem operations:

    • Disk seek to database file

 

    • Database file read

 

    • Disk seek to log file

 

    • Log file write

 

    • Flush log file information to disk

 

    • Disk seek to update log file metadata (for example, inode information)

 

    • Log metadata write

 

    • Flush log file metadata to disk


Faster Disks normally can help on such sittuations but there are lots of ways (scale up, scale out) to increase transactional throughput.

In Alfresco the default RDBMS configurations are normally not suitable for large repositories deployments and may result into:

    • I/O bottlenecks in the RDBMS throughput

 

    • Excessive queue for transactions due to overload of connections

 

    • On active-active cluster configurations, excessive latency


Alfresco Database treads pool.

Most Java application servers have higher default settings for concurrent access, and this, coupled with other threads in Alfresco (non-HTTP protocol threads, background jobs, etc.) can quickly result in excessive contention for database connections within Alfresco, manifesting as poor performance for users.

If tomcat is being considerer this value is normally 275. The setting is called db.pool.max and should be added to your alfresco-global.properties (db.pool.max=275).

http://docs.oracle.com/cd/E17076_04/html/programmer_reference/transapp_throughput.html


How to calculate the size of your Alfresco database


All operations in Alfresco require a database connection, the database performance plays a crucial role on your Alfresco environment. It’s vital to have the database properly sized and tuned for your specific use case.

To size your alfresco database in terms of space we’ve done a series of tests by creating content (and metadata) on an empty repository and analyzing the database growth.

Be aware that:

    • Content is not stored in the database but is directly stored on the disk

 

    • Database size is un-affected by size of the documents or the document's content

 

    • Database size is affected by the number/type of metadata fields of the document


Hi all, back with another Alfresco related post, this time to show you how to size your alfresco database in terms of space.

The following factors are relevant to calculate the approximate size for an Alfresco database

    • Number of meta data fields
    • Permissions
    • Number of folders
    • Number of documents
    • Number of versions
    • Number of users


I've made a series of tests where i could verify how the Alfresco database grows.

I've made a bulk import with the following data.

Document creation methodIn-place bulk upload
Number of Documents Ingested148
Total Size of Documents929.14 MB
Number of metadata fields per document13 fields
Total number of metadata fields1924


The table below shows the types of documents and its average sizes

Document TypeExtensionAverage Size (KB)
MS Word Document.doc1024
Excell Sheet.xls800
Pdf document.pdf10240
PowerPoint presentation.ppt5120
Jpg image.jpg2048


Checking the diagram below we can see that the database indexes grow more than the data itself. By observing the growth of the database size we’ve concluded that the average metadata field occupation on the Alfresco database is approximately 5.5 K per metadata field


Screen Shot 2014-09-16 at 21.54.17

Also interesting is to verify the tables that grow in size(KB) after the content ingestion. Note that we are not applying any permission.

Screen Shot 2014-09-16 at 21.55.08

To size your database appropriately you must ask the right questions whose answers will help you to determine the database sizing.

  1. Estimated number of users in Alfresco
  2. Estimated number of groups in Alfresco
  3. Estimated number of documents on the first year
  4. Documents growth rate
  5. Average number of versions per document
  6. Average number of meta-data fields per document
  7. Estimated number of folders
  8. Average number of meta-data fields per folder
  9. Estimated number of concurrent users
  10. Folder based permissions (inherited to child documents)?

 

Database sizing formulas


Consider to following figures to determine your approximate database size.

-       DV = Average number of document versions

-       F = Estimated number of folders

-       FA = Estimated number of folder metadata fields (standard + custom)

-       D = Number of Documents * DV - Estimated number of documents including the versions

-       DA =Estimated number of documents metadata fields (standard + custom)

The number of records on specific alfresco tables is calculated as follows:


-       Number of records on alf_node (TN = F + D * DV)

-       Number of records records on node_properties (TNP = F * FA + D * DA)

-       Number of records records on node_status = (TNS = F + D)

-       Number of records records on alf_acl_member= (TP = D) assuming permission will be set at the folder level and inherited

The approx. number of records in the database will be TRDB = TN + TNP + TNS + TP

The following formula is based on the number of database records. On our benchmarks we’ve observed that each database record takes about 4.5k of db space.

Formula #1 Database size = TRDB * 4.5K

 

Alternatively, we can base our calculations on the number of metadata fields of the documents,  considering 5.5k for each metadata field and use the following formula.

Formula #2 Database size = (D * DA + F * FA) * 5.5K

 

The 2 formulas provided are only approximations on the size that your database will need and are based on benchmarks executed against a vanilla Alfresco version 4.2.2.

 

If we wish to consider users and groups add consider 2k for each user and 5k for each group.

 

Note that the formulas are not taking in consideration additional space for logging, rollback, redolog, etc.

Tuning your Alfresco database

 

In Alfresco the default RDBMS configurations are normally not suitable for large repositories deployments and may result into:

 

•       Wrong or improper support for ACID transaction properties

 

•       I/O bottlenecks in the RDBMS throughput

 

•       Excessive queue for transactions due to overload of connections

 

•       On active-active cluster configurations, excessive latency

 

Considering that your database layer will be used under concurrent load I’ve come up with a set of hints that will contribute to maximize your Alfresco database performance.

 

Database Thread pool configuration

 

A default Alfresco instance is configured to use up to a maximum of forty (40) database connections.  Because all operations in Alfresco require a database connection, this places a hard upper limit on the amount of concurrent requests a single Alfresco instance can service (i.e. 40), from all protocols.

 

Most Java application servers have higher default settings for concurrent access, and this, coupled with other threads in Alfresco (non-HTTP protocol threads, background jobs, etc.) can quickly result in excessive contention for database connections within Alfresco, manifesting as poor performance for users.

It’s recommended to increase the maximum size of the database connection pool to at least [number of application server worker threads] + 75.  If tomcat is being considerer this value is normally 275. The setting is called db.pool.max and should be added to your alfresco-global.properties (db.pool.max=275).

 

After increasing the size of the Alfresco database connection pool, you must also increase the number of concurrent connections your database can handle, to at least the size of the Alfresco connection pool. Alfresco recommends configuring at least 10 more connections to the database than is configured into the Alfresco connection pool, to ensure that you can still connect to the database even if Alfresco saturates its connection pool.

 

Database Validation query


By default Alfresco does not periodically validate each database connection retrieved from the database connection pool.  Validating connections is, however, very important for long running Alfresco servers, since there are various ways database connections can unexpectedly be closed (for example by transient network glitches and database server timeouts). Enabling periodic validation of database connections involves adding the db.pool.validate.query property to alfresco-global.properties and the query is specific for your database type.

DatabaseValue for db.pool.validate.query
MySQL[1]SELECT 1
PostgreSQLSELECT VERSION()
OracleSELECT 1 FROM DUAL

 

Database Scaling


Alfresco relies largely on a fast and highly transactional interaction with the RDBMS, so the health of the underlying system is vital. Considering our existing customers, the biggest running repositories are under Oracle (most of them RAC).

If your project will have lots of concurrent users and operations, consider an active-active database cluster with at least 2 machines. This can be achieved using Oracle RAC or a Mysql based solution with haproxy[1] (opensource solution) or a commercial solution like MariaDB[2] or Percona[3].

[1]http://haproxy.1wt.eu

[2]https://mariadb.org

[3]http://www.percona.com

You can use either solution, depending also you the knowledge that you have in-house. The golden rule is that the response time from the DB in general, should be around 4ms or lower. At this layer i don’t recommend to use any virtualization technology. The database servers should be physical servers.

 

The high availability and scalability for database is vendor dependent and should be addressed with the chosen vendor to achieve the maximum performance possible.

Database Monitoring


Monitoring your database performance is very important as it can detect some possible performance problems or scaling needs.

I’ve identified the following targets that should be monitored and analysed on a regular base.

    • Transactions

 

    • Number of Connections

 

    • Slow Queries

 

    • Query Plans

 

    • Critical DM database queries ( # documents of each mime type, … )

 

    • Database server health (cpu, memory, IO, Network)

 

    • Database sizing statistics (growth, etc)

 

    • Peak Period of resource usage

 

    • Indexes Size and Health

 

I hope this post can help you to understand the importance of the Alfresco database and that you can make use of it on your sizing exercise. Stay tuned for more Alfresco related posts.

 

All the best, One love,

 

Luis

Outcomes