Being able to edit a file concurrently with multiple users is a need we come across more and more when discussing with customers in the field.
At the time of writing, the out-of-the-box solution to deliver this kind of feature within Alfresco is to use the GoogleDocs module.
This module allows content stored in Alfresco to be collaboratively edited using Google’s online application suite (text editor, spreadsheet editor, presentations) and saved back in the repository.
However, some customers may not want to use Google services for various reasons (e.g. cost or data sensitivity), in which case there are far fewer options.
If you’re concerned about your data being sent to a public cloud and would prefer to have it securely stored on-premise in Alfresco instead, there may be a solution to help you.

LibreOffice OnLine (LOOL)

Alfresco has been using LibreOffice (and formerly OpenOffice) for a very long time. It is one of the components providing our out-of-the-box transformation service (either through OODirect or jodconverter).
After delivering an open-source productivity suite on the desktop, the LibreOffice team started working on a similar feature set with a more SaaS-like approach: LibreOffice OnLine (LOOL).
Don’t get me wrong here: LOOL is not a SaaS solution you have to subscribe to, and that’s the interesting thing about it. LOOL is a service you can install on-premise in order to provide editing tools for office documents. And of course, as that’s our main interest here, it provides collaborative editing capabilities. LOOL thus already offers collaborative editing while keeping your data in compliance with your company’s information security policy!
Here I’ll detail how to integrate LOOL with your preferred on-premise content management system, thus bringing the collaborative editing feature inside Alfresco!

Alfresco integration

LibreOffice OnLine is actually a WOPI client and needs to talk to a WOPI server.

If you want to know more about the WOPI protocol you can check its definition here

The WOPI server role will be played by Alfresco itself using two modules (one for the repository and one for Share). Those AMPs have been created by Magenta. All credit goes to them; here I’m just giving guidance on how to install and configure them for Alfresco Content Services:

In terms of network flows, the following diagram shows which connections are used:

network flows Alfresco LibreOffice Online

In this document we’ll use Alfresco Content Services 5.2.4.

Installation

LOOL installation

Fortunately it is now very simple to install LOOL (using the CODE distribution). The simple commands below should work for a Debian-based Linux distribution.
Throughout this document we’ll use Debian 9.

$ echo 'deb https://www.collaboraoffice.com/repos/CollaboraOnline/CODE-debian9 ./' | sudo tee /etc/apt/sources.list.d/code.list
$ sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 0C54D189F4BA284D
$ sudo apt update
$ sudo apt install loolwsd code-brand

If you’re just testing, you’ll probably be interested in using the Docker image available on Docker Hub.
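A minimal sketch of running it, assuming the collabora/code image from Docker Hub (check the image documentation for the exact options; the domain value is a placeholder and should be the dot-escaped hostname of your WOPI host, i.e. your Alfresco server):

$ docker pull collabora/code
$ # domain is the dot-escaped name of the WOPI host (placeholder value below)
$ docker run -t -d -p 9980:9980 -e 'domain=alfrescohost\.example\.com' --restart always --cap-add MKNOD collabora/code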

LOOL configuration

By default the office online suite is configured to use SSL but the certificates are not provided. We therefore have to create them (or disable SSL if not targeting production).

$ sudo mkdir /etc/loolwsd/ssl

Copy the following files to this newly created folder (see the example commands after this list):

  • the certificate private key: /etc/loolwsd/ssl/loolwsd.key (make sure it’s only readable by the user running LOOL: lool)
  • the certificate itself: /etc/loolwsd/ssl/loolwsd.crt
  • the public CA certificate: /etc/loolwsd/ssl/cacert.pem
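A minimal sketch of those copy steps, assuming your key, certificate and CA chain already exist under /path/to/ (placeholder paths):

$ sudo cp /path/to/private.key /etc/loolwsd/ssl/loolwsd.key
$ sudo cp /path/to/certificate.crt /etc/loolwsd/ssl/loolwsd.crt
$ sudo cp /path/to/ca-chain.pem /etc/loolwsd/ssl/cacert.pem
$ sudo chown lool /etc/loolwsd/ssl/loolwsd.key && sudo chmod 400 /etc/loolwsd/ssl/loolwsd.key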

If you want to use self-signed certificates, this is a bit trickier but doable. Start with the commands below to generate the self-signed certificate:

$ sudo openssl genrsa -out /etc/loolwsd/ssl/loolwsd.key
$ sudo chown lool /etc/loolwsd/ssl/loolwsd.key
$ sudo chmod 400 /etc/loolwsd/ssl/loolwsd.key
$ cp /etc/ssl/openssl.cnf /tmp/loolwsd_ssl.cnf
$ cat >> /tmp/loolwsd_ssl.cnf <<EOT
> [ san ]
> subjectAltName = @alt_names
>
> [alt_names]
> IP.1 = 192.168.0.185
> EOT
$ sudo openssl req -new -x509 -sha256 -nodes -key /etc/loolwsd/ssl/loolwsd.key -days 9999 -out /etc/loolwsd/ssl/loolwsd.crt -config /tmp/loolwsd_ssl.cnf -extensions san

Additionally, if you’re using a self-signed certificate, the Alfresco JVM must trust this certificate.

$ keytool -importcert -alias lool -file /etc/loolwsd/ssl/loolwsd.crt -keystore $JAVA_HOME/lib/security/cacerts -storepass changeit -storetype JKS

Of course, the keystore path, type and passwords must match your environment.

If LOOL traffic is wrapped in SSL, you’ll also need to have Alfresco protected by SSL. This is because most browsers today will prevent pages with mixed content (http & https) from being displayed.
It means you have to configure Alfresco for SSL. Please refer to the official documentation in order to do that: https://docs.alfresco.com/5.2/concepts/configure-ssl-intro.html.
Again, if you use a self-signed certificate (or a certificate from a private PKI) for Alfresco, LOOL must trust that certificate. The way to do it depends on the distribution the service is running on. On Debian-like systems you can do:

$ keytool -exportcert -alias ssl.alfresco.ca -keystore alf_data/keystore/ssl.keystore -storetype JCEKS -storepass kT9X6oe68t | openssl x509 -inform DER -outform PEM | sudo tee /usr/local/share/ca-certificates/alfresco.crt
$ sudo update-ca-certificates

The example command above uses the default Alfresco keystore path and password, which you should have changed. Make sure to use the correct ones for your environment.

Now open the service configuration file /etc/loolwsd/loolwsd.xml and edit the ssl section as shown below:

...
<ssl desc="SSL settings">
 <enable type="bool" desc="Controls whether SSL encryption is enable (do not disable for production deployment). If default is false, must first be compiled with SSL support to enable." default="true">true</enable>
 <termination desc="Connection via proxy where loolwsd acts as working via https, but actually uses http." type="bool" default="true">false</termination>
 <cert_file_path desc="Path to the cert file" relative="false">/etc/loolwsd/ssl/loolwsd.crt</cert_file_path>
 <key_file_path desc="Path to the key file" relative="false">/etc/loolwsd/ssl/loolwsd.key</key_file_path>
 <ca_file_path desc="Path to the ca file" relative="false">/etc/loolwsd/ssl/cacert.pem</ca_file_path>
 <cipher_list desc="List of OpenSSL ciphers to accept" default="ALL:!ADH:!LOW:!EXP:!MD5:@STRENGTH"></cipher_list>
 <hpkp desc="Enable HTTP Public key pinning" enable="false" report_only="false">
  <max_age desc="HPKP's max-age directive - time in seconds browser should remember the pins" enable="true">1000</max_age>
<report_uri desc="HPKP's report-uri directive - pin validation failure are reported at this URL" enable="false"></report_uri>
 <pins desc="Base64 encoded SPKI fingerprints of keys to be pinned">
 <pin></pin>
 </pins>
 </hpkp>
</ssl>
...

Also edit the net section to match your needs:

...
<net desc="Network settings">
 <proto type="string" default="all" desc="Protocol to use IPv4, IPv6 or all for both">all</proto>
 <listen type="string" default="any" desc="Listen address that loolwsd binds to. Can be 'any' or 'loopback'.">any</listen>
 <service_root type="path" default="" desc="Prefix all the pages, websockets, etc. with this path."></service_root>
 <post_allow desc="Allow/deny client IP address for POST(REST)." allow="true">
 <host desc="The IPv4 private 192.168 block as plain IPv4 dotted decimal addresses.">192\.168\.[0-9]{1,3}\.[0-9]{1,3}</host>
 <host desc="Ditto, but as IPv4-mapped IPv6 addresses">::ffff:192\.168\.[0-9]{1,3}\.[0-9]{1,3}</host>
 <host desc="The IPv4 loopback (localhost) address.">127\.0\.0\.1</host>
 <host desc="Ditto, but as IPv4-mapped IPv6 address">::ffff:127\.0\.0\.1</host>
 <host desc="The IPv6 loopback (localhost) address.">::1</host>
 </post_allow>
 <frame_ancestors desc="Specify who is allowed to embed the LO Online iframe (loolwsd and WOPI host are always allowed). Separate multiple hosts by space.">192.168.0.185:7070 192.168.0.185:8080</frame_ancestors>
</net>
...

Pay attention to the post_allow element: its value has to match the IP addresses of your clients (all the browsers which may request to edit files). The default configuration allows local access and access from a 192.168.0.0/16 network.
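For instance, if your browsers sit on a 10.x.x.x network instead, you could add a host entry following the same regular-expression style (a sketch, not part of the default file):

 <host desc="The IPv4 private 10.0.0.0/8 block as plain IPv4 dotted decimal addresses.">10\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}</host>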

On the Alfresco side add the following properties to the alfresco-global.properties file:

...
lool.wopi.url=https://loolhost:9980
lool.wopi.url.discovery=https://loolhost:9980/hosting/discovery
lool.wopi.alfresco.host=https://alfrescohost:8443/alfresco/s

Where loolhost is the server name where you installed LOOL, and alfrescohost the server where Alfresco is running.
It is of course possible to install both on the same server.

Only use FQDNs (matching the certificates’ CN if using SSL); do not use localhost.

Applying AMPs

There are two AMPs available. In order to turn Alfresco into a WOPI host you’ll need the repo AMP, and to add the necessary Share pages and buttons for UI integration you’ll need the Share AMP.
We first need to get the sources and build them:

$ git clone https://github.com/magenta-aps/alfresco-repo-libreoffice-online-module.git
$ cd alfresco-repo-libreoffice-online-module
$ vim pom.xml
$ mvn package

When editing the pom.xml make sure to:

  • set alfresco.platform.version & alfresco.share.version to 5.2.4
  • set maven.alfresco.edition to enterprise

Copy the resulting .amp files located in target/ to the amps and amps_share folders of your Alfresco installation and run:

$ ./bin/apply_amps.sh

You can now restart the services:

$ sudo systemctl restart loolwsd
$ ./alfresco.sh restart tomcat

You can now test editing office documents simultaneously with different users and see how convenient LibreOffice OnLine makes it.

Below are examples of concurrent spreadsheet and presentation editing by the "Administrator" and "Alex" users:

 

Each user can see what the others are doing and who's editing.

 

calc collab

impress collab

As you can see in the screenshots above, the Share module needs some tweaking if you're not using the English locale. But that should really just be a matter of adding the right message bundle to the Share AMP.

Having a broad and efficient monitoring system is key! It allows corrective maintenance to happen as soon as possible and, if properly set up, it also enables proactive maintenance.

Miguel Rodriguez already touched on that topic in a very good post addressing it with ELK.

Here I'll focus on a different solution and scope.

 

Within Alfresco Content Services, one of the components which is often the poor cousin of monitoring is Solr. In the best-case scenarios the Solr HTTP interface is monitored along with its heap usage. Yet it needs a lot of care to make sure it works efficiently. Enhancing monitoring of this component will let you detect problems sooner and fix them before they impact users. It will also help you draw a picture of how your search service evolves: Is it becoming slower than before? Is the number of segments increasing to a critical point? Are the indexes getting bigger for some reason? And last but not least, deeper monitoring helps you with capacity planning tasks!

 

Here I'll explain why I wrote this little check for Solr and how to use it.

 

Choosing the monitoring system

As explained earlier, there is already some material on monitoring Alfresco with ELK, so there was no use re-inventing the wheel. However, some of the things monitored here are not in the ELK setup described above, so I could have enriched the ELK project...

But at the root of this plugin is a customer request. That customer wanted to know when Solr is missing some content while indexing. He also wanted to receive alerts for that kind of event (which is not the primary role of an ELK stack). And more importantly, that customer was using an open-source monitoring system called Centreon. This solution is compatible with (and originally derived from) the Nagios plugin system. That plugin system is a kind of de-facto standard with its own specifications (Development Guidelines · Nagios Plugins) and many monitoring solutions support Nagios plugins. As a consequence it made sense to work on such a plugin, as I hope it can benefit other people.

Nagios plugins offer a wide range of probes (JMX, HTTP, disk space, heap space, ...), some of which can be used out-of-the-box to monitor Alfresco Content Services and even provide basic checks for Solr. I won't focus on those plugins as the goal here is to bring more monitoring capabilities than what already exists.

Those plugins provide two features:

  • Alerts: events triggered based on defined thresholds
  • Performance data: metrics used to generate graphs

 

In this blog post I'll be showing how it all works with Centreon, as it supports both features, but any other system supporting Nagios checks should work in a similar way.

 

Plugin Installation

 

I assume the monitoring system is already up and running and won't cover how to set it up (remember, it should work with any system supporting Nagios checks).

 

Prerequisites

 

The system must have Python 2.7 or higher (it should work with Python 3) and the appropriate Python libraries.
This plugin uses a library for Nagios plugins called nagiosplugin, as well as urllib3; both need to be installed.

 

On Debian-like systems the following should work:

 

$ sudo apt install python-nagiosplugin python-urllib3

 

Use python3-urllib3 & python3-nagiosplugin for systems using Python 3 by default.


Otherwise, simply install them with pip:

 

$ sudo pip install nagiosplugin
$ sudo pip install urllib3

 

Plugin deployment

 

The plugin is available here on GitHub. You can clone the repo or just download the Python file check_alfresco_solr.py. Then simply copy the file to the Nagios plugin directory, usually something like /usr/lib/nagios/plugins, and set the execution rights.

 

$ sudo cp check_alfresco_solr.py /usr/lib/nagios/plugins && sudo chmod +x /usr/lib/nagios/plugins/check_alfresco_solr.py

 

Setting up Monitoring

First of all I'd like to explain in a little more depth what the plugin does, so we better understand what kind of metrics we're tracking here.

 

Quick description

The monitoring plugin uses information gathered from the Solr status page and the Solr summary page. Both are relatively lightweight and querying them on a regular basis should not impact performance to a noticeable extent.
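If you want to see the raw data the plugin parses, you can query those pages yourself. A quick sketch, assuming an Alfresco Solr 4 instance reachable over plain HTTP on localhost:8080 and a core named alfresco (with SSL enabled you would also need the Solr client certificate):

$ curl 'http://localhost:8080/solr4/admin/cores?action=SUMMARY&wt=json'
$ curl 'http://localhost:8080/solr4/admin/cores?action=STATUS&core=alfresco&wt=json'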

 

Some of the information we gather and monitor is documented in the existing Alfresco documentation. You can find out more by reading Unindexed Solr Transactions | Alfresco Documentation.

In addition to that, the plugin returns the total size of the index core, its number of documents and its number of segments.

Any of those metrics can trigger an alert and all of them produce performance data.

 

Some others are not that well documented but are still important.

In the summary page Solr exposes data regarding caches. We know caches are very important for Solr to perform well. Undersize your caches and your search will be slow like an old dog. Oversize them and you may end up consuming way more expensive memory than you actually need.

Cache sizes are used in the overall memory requirement calculation for Solr. See the Alfresco documentation below:

Calculate the memory needed for Solr nodes | Alfresco Documentation

The plugin will report data on cache usage such as:

  • number of lookups (incremental counter, reset periodically)
  • cache size (number of items in the cache)
  • evictions (accumulated evictions from the cache since server startup)
  • hitratio (accumulated ratio of hits vs misses)

The plugin will return warning or critical alerts based on thresholds passed as arguments. Those thresholds are applied to the hitratio only. It means you can track and graph all of these metrics, but only the hitratio can trigger alerts (e.g. an email or text message, depending on the monitoring system configuration).
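As an illustration, a manual run of the check against a cache could look like the sketch below. The option names come from the check command described later in this post, but the exact values accepted by --monitor and the positional core argument should be verified against the plugin's --help output:

$ python check_alfresco_solr.py --host solrhost --port 8443 --scheme https --monitor caches --item /queryResultCache -w 40 -c 20 alfresco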

There are a number of caches included in the Solr summary page:

  • /alfrescoPathCache: a cache used to speed up path queries (Alfresco-specific cache)
  • /alfrescoAuthorityCache: a cache used to compute permissions on search results (Alfresco-specific)
  • /queryResultCache: a generic Solr cache to store ordered sets of document IDs
  • /filterCache: a generic Solr cache to store unordered sets of document IDs

 

Handler data is also provided by Solr summary page. Handlers are HTTP endpoints Solr uses to handle requests... hence the name. Data provided for the handlers and used by the plugin are:

  • errors: number of queries which caused an error (500 http status code)
  • timeouts: number of queries which could not be fulfilled before the timeout
  • requests: overall number of requests
  • avgTimePerRequest: average time to fulfil a request
  • 75thPcRequestTime: maximum time it takes to reply to the 75% of the fastest requests
  • 99thPcRequestTime: maximum time it takes to reply to the 99% of the fastest requests

Request time data is typically useful to track how the search service evolves and to anticipate when scaling is needed, or to spot when something is running abnormally (e.g. slower disk access, or network latency). avgTimePerRequest is used to trigger alerts while percentile request times are only used for performance data.

A handler returning an error count higher than zero will always trigger a critical alert, and one returning a timeout will always trigger a warning alert.

While they are not really a problem (from an operations point of view), syntactically incorrect searches will increment the error counter and thus generate critical alerts. It can be useful to have such alerts to track improper use of the API or suspicious behaviours. However, having alerts for this kind of event may not be appropriate in some environments, so this can be changed using a command line option (`--relaxed`), in which case only time-based alerts will be triggered, using the thresholds provided as parameters.

There are three handlers exposed by Solr in the summary page; all three can be monitored:

  • /alfresco
  • /afts
  • /cmis

 

Plugin configuration

Now that we understand what the plugin can monitor, and it is deployed on the Centreon (Nagios or similar) server, let's take a look at what we need to do to set monitoring up.

In Centreon most configuration is done through a web interface, so first of all we'll log in to the web UI as admin. In Nagios the same configuration applies, but it has to be done by editing configuration files with a good old text editor (all configuration files should be located in /etc/nagios).

 

Check commands

With Nagios-like systems it all begins with adding a new command to the system. When deploying the plugin we copied it to the Nagios plugin directory (by default /usr/lib/nagios/plugins); in Centreon this directory is referred to as $CENTREONPLUGINS$.

So in the "Configuration\Commands\Checks" menu, click the "Add" button to add a new command and set it as shown below:
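The command line built here is essentially the plugin invoked with host and service macros; as a rough sketch (macro and option names as used in the explanation below):

$CENTREONPLUGINS$/check_alfresco_solr.py --relaxed --host $HOSTADDRESS$ --port $_HOSTSOLRPORT$ --scheme $_HOSTSOLRSCHEME$ --admin $_HOSTSOLRADMINURL$ --monitor $_SERVICESOLRMONITOR$ --item "$_SERVICESOLRMONITORITEM$" -w $ARG1$ -c $ARG2$ $ARG3$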

 

 

The command can be explained as follows:

  • --relaxed: do not trigger alerts if handlers reports errors or timeouts
  • --host $HOSTADDRESS$: specify the Solr hostname (will be expanded from the Solr server configuration)
  • --port $_HOSTSOLRPORT$: specify the Solr port (will be expanded from the service template)
  • --scheme $_HOSTSOLRSCHEME$: specify the Solr http scheme (will be expanded from the service template)
  • --admin $_HOSTSOLRADMINURL$: specify the Solr admin URL (will be expanded from the service template)
  • --monitor $_SERVICESOLRMONITOR$: specify the kind of element we want to monitor (handlers, caches or index values)
  • --item "$_SERVICESOLRMONITORITEM$": specify the name of the item we monitor
  • -w $ARG1$: the warning threshold triggering warning alerts (value depends on what we want to monitor)
  • -c $ARG2$: the critical threshold triggering critical alerts (value depends on what we want to monitor)
  • $ARG3$: the name of the Solr core to monitor (will be expanded from the service template)

 

Service templates

Each item we want to monitor will be defined as a service attached to a host. In the end, we will need to attach one service per Solr host, per core, and per cache, handler or index metric we want to monitor.

This can lead to numerous services and can take a long time to configure. For this reason it is not desirable to trigger alerts for every single metric we have access to. For example, triggering alerts for "Alfresco Nodes in Index" doesn't make much sense while "Alfresco Error Nodes in Index" does.

Nagios provides inheritance and template features, which are very handy to avoid duplicating configuration for each service. Concretely, it means we can define a service using templates which inherit values from each other. For instance, in order to define a service to monitor the "/queryResultCache" cache, we can define a general template for caches (e.g. setting $SOLRMONITOR$ and the Solr core name together with the warning and critical thresholds), and a more specific template for "/queryResultCache" (setting $SOLRMONITORITEM$) which inherits from the general one.

 

Below I explain how to set up monitoring of the Solr "/queryResultCache". For broader monitoring you should also set up more services and service templates for the needed items. See the help message of the plugin (or the documentation on the git repo):

 

$ python check_alfresco_solr.py --help

 

So in the admin UI, let's go to the "Configuration\Services\Templates" section and create a first template called "alfresco-solr-caches":

 

  • Warning threshold is set so that a hitratio lower than 40% will trigger a warning alert
  • Critical threshold is set so that a hitratio lower than 20% will trigger a critical alert

 

Then we create a second, more specific, template called "alfresco-solr-queryResultCache" which inherits from the general one:

 

Here we just set the specific cache we want to monitor. Also note the "Template" field which references the general template. Thanks to template inheritance we can leave the other fields blank and inherit their values.

 

Host Template

Just like we created service templates, we will create a host template. This avoids having to manually attach the service templates we created to each individual Solr server we want to monitor.

In the Centreon admin UI navigate to "Configuration\Host\Template" and create a new template called "alfresco-solr" (for instance). Then go to the "Relations" tab of this host template and add the service templates we defined earlier that are relevant for you:

 

Click Save.

 

At this point we can define the host and monitoring will be ready. You can either create a new host (if it doesn't exist already) or just apply the newly defined host template to an existing host.

 

Doing so will add Solr-specific macros which you will need to fill in when validating the host:

 

 

As an example, below is the monitoring service page of a Solr host:

 

 

And here are the graphs you can expect:

For index:

 

For FTS:

 

For a cache:

 

For a handler:

 

Of course monitoring doesn't fix anything by itself. It requires thoughtful configuration (to match your instance's workload) and needs to be watched carefully by admins who know what to do in case of alerts.

For example, in case your monitoring system reports "Alfresco Error Nodes in Index", your first action will probably be to trigger a "FIX" action in the Solr admin console.

For general Solr troubleshooting please refer to the Alfresco documentation here.

One of the recurring issues we see raised by customers concerns slow SQL queries.

Most of the time these are first witnessed as slow Share UI page loads or long-running CMIS queries.

Pinpointing those queries can be a pain, and this blog post aims at providing some help in the troubleshooting process.

We will cover:

  • Things to check in the first place
  • Different ways of getting information on the query execution time
  • Isolating where the RDBMS is spending more time
  • Presenting some tools and configuration to help proactively track this kind of issue

We hope this content will be useful in real life, but really, having a DBA who takes care of the Alfresco database is the best thing you can offer your Alfresco application!

Preliminary checks

Before blaming the database, it's always a good thing to check that the database engine has appropriate resources in order to deliver good performance. You cannot expect any DB engine to perform well with Alfresco on limited resources. For PostgreSQL there are plenty of resources on the web about sizing a database cluster (here I use cluster in the PostgreSQL sense, which is different from what we call a cluster in Alfresco).

 

Latency

Network latency can be a performance killer. Opening connections is quite an intensive process and if your network is lame, it will impact the application. Simple network tests can reassure you that the network is delivering a good enough transport layer. The ping utility is really the first thing to look at. A ping test between Alfresco and its DB server must show a latency under 1 ms on a directly connected network (Gb Ethernet), or between 1 and 5 ms if your DB server and the Alfresco server are connected through routed networks. A value around or above 10 ms is definitely not what Alfresco expects from a DB server.

alxgomz@alfresco:~$ ping -c 5 -s500 192.168.0.68
PING 192.168.0.68 (192.168.0.68) 500(528) bytes of data.
508 bytes from 192.168.0.68: icmp_seq=1 ttl=64 time=0.436 ms
508 bytes from 192.168.0.68: icmp_seq=2 ttl=64 time=0.364 ms
508 bytes from 192.168.0.68: icmp_seq=3 ttl=64 time=0.342 ms
508 bytes from 192.168.0.68: icmp_seq=9 ttl=64 time=0.273 ms
508 bytes from 192.168.0.68: icmp_seq=10 ttl=64 time=0.232 ms

--- 192.168.0.68 ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 8997ms
rtt min/avg/max/mdev = 0.232/0.329/0.436/0.066 ms

 

Some more advanced utilities allow sending TCP packets, which are more representative of the actual time spent opening a TCP session (again, you should not have values above 10 ms):

alxgomz@alfresco:~$ sudo hping3 -s 1025 -p 80 -S -c 5 192.168.0.68
HPING 192.168.0.68 (eth0 192.168.0.68): S set, 40 headers + 0 data bytes
len=44 ip=192.168.0.68 ttl=64 DF id=0 sport=80 flags=SA seq=0 win=29200 rtt=3.5 ms
len=44 ip=192.168.0.68 ttl=64 DF id=0 sport=80 flags=SA seq=1 win=29200 rtt=3.4 ms
len=44 ip=192.168.0.68 ttl=64 DF id=0 sport=80 flags=SA seq=2 win=29200 rtt=3.4 ms
len=44 ip=192.168.0.68 ttl=64 DF id=0 sport=80 flags=SA seq=3 win=29200 rtt=3.6 ms
len=44 ip=192.168.0.68 ttl=64 DF id=0 sport=80 flags=SA seq=4 win=29200 rtt=3.5 ms

--- 192.168.0.68 hping statistic ---
5 packets transmitted, 5 packets received, 0% packet loss
round-trip min/avg/max = 3.4/3.5/3.6 ms

 

Overall network quality is also something to check. If your network devices spend their time reassembling unordered packets or retransmitting them, the application performance will suffer. This can be checked by taking a network dump during query execution and opening it with Wireshark.

In order to take the dump you can use the following tcpdump command:

alxgomz@alfresco:~$ tcpdump -ni any port 5432 and host 192.168.0.68 -w /tmp/pgsql.pcap

Opening it in Wireshark should give you an idea very quickly. If you see a dump with a lot of red/black lines then it might be an issue and needs further investigation (those lines are coloured this way if Wireshark has the right colouring rules applied).

 

RDBMS configuration

PostgreSQL comes with a relatively low-end configuration. This is intended to allow it to run on a wide range of hardware (or VM) configurations. However, if you are running your Alfresco (and its database) on high-end systems, you most certainly want to tune the configuration in order to get the best out of your resources.

This wiki page presents in detail many parameters you may want to tweak: Tuning Your PostgreSQL Server - PostgreSQL wiki

The first one to look at is the shared_buffers size. It sets the size of the area PostgreSQL uses to cache data, thus improving performance. Among all those parameters, some should be "in sync" with your Alfresco configuration. For example, by default Alfresco allows 275 Tomcat threads at peak time. Each of these threads should be able to open a database connection. As a consequence PostgreSQL (when installed using the installer) sets the max_connections parameter to 300. However, we need to understand that each connection will consume resources, and in the first place: memory. The amount of memory dedicated to a PostgreSQL process (that handles a SQL query) is controlled by the work_mem parameter. By default it has a value of 4MB, meaning we can calculate the amount of physical RAM needed by the database server in order to handle peak load:

work_mem * max_connections =  4MB * 300 = 1.2GB

Add the size of the shared_buffers to this and you'll have a good estimate of the amount of RAM PostgreSQL needs to handle peak loads with the default configuration. There are some other important values to fiddle with (like effective_cache_size, checkpoint_completion_target, ...) but making sure those above are aligned with both your Alfresco configuration and the hardware resources of your database host is really where to start (refer to the link above).
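To quickly check what your server is currently using for these parameters, you can query pg_settings (run as the postgres user; adjust connection options to your environment):

$ sudo -u postgres psql -c "SELECT name, setting, unit FROM pg_settings WHERE name IN ('shared_buffers', 'work_mem', 'max_connections', 'effective_cache_size');"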

A qualified DBA should configure and maintain Alfresco's database to ensure its continuous performance and stability. If you don't have a DBA internally, there are also dozens of companies offering good services around PostgreSQL configuration and tuning.

Monitoring

This is a key part of the troubleshooting process. Although monitoring will not give you the solution to a performance issue, it will help you get on the right track. Having monitoring in place on the DB server is important. If you can correlate an increasingly slow application with a global load increase on the database server, then you've got a good suspect. There are different things to monitor on the DB server but you should at least have the bare minimum:

  • CPU
  • RAM usage
  • disk IO & disk space

Spikes in CPU and disk IO usage can be the sign of a table that grew large without appropriate indexes.

Spikes in used disk space can be explained by the RDBMS creating temporary work files due to a lack of physical memory.

Monitoring the RAM can help you anticipate disk cache memory starvation (PostgreSQL relies heavily on this kind of memory).

Alfresco has some tables that are known to potentially grow very large. A DBA should monitor their size, both in terms of number of rows and of disk size. The query below is an example of how you can do it:

SELECT table_name AS tableName,
       (total_bytes / 1024 / 1024) AS total,
       row_estimate AS rowEstimate,
       (index_bytes / 1024 / 1024) AS INDEX,
       (table_bytes / 1024 / 1024) AS TABLE
FROM (
    SELECT *,
           total_bytes - index_bytes - COALESCE(toast_bytes, 0) AS table_bytes
    FROM (
        SELECT c.oid,
               nspname AS table_schema,
               relname AS table_name,
               c.reltuples AS row_estimate,
               pg_total_relation_size(c.oid) AS total_bytes,
               pg_indexes_size(c.oid) AS index_bytes,
               pg_total_relation_size(reltoastrelid) AS toast_bytes
        FROM pg_class c LEFT JOIN pg_namespace n ON n.oid = c.relnamespace
        WHERE relkind = 'r') a) a
WHERE table_schema = 'public' ORDER BY total_bytes DESC;

Usually alf_prop_*, alf_audit_entry and possibly alf_node_properties are the tables that may appear in the result set. There is no rule of thumb which dictates when a table is too large. It is more a matter of monitoring how the tables grow over time.

Another useful thing to monitor is the creation/usage of temporary files. When your DB is not correctly tuned, or doesn't have enough memory allocated to it, it may create temporary files on disk if it needs to further work on a big result set. This is obviously not as fast as working in-memory and should be avoided. If you don't know how to monitor that with your usual monitoring system, there are some good tools that help a DBA be aware of such things happening.

pgbadger is an open-source tool which does that, among many other things. If you don't already use it, I strongly encourage you to deploy it!
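If you prefer to surface this directly in the PostgreSQL logs, the log_temp_files parameter in postgresql.conf is a simple way to see when temporary files are created and how big they are (a suggestion only, adjust to your logging policy):

log_temp_files = 0 # log every temporary file created, whatever its size (value is a threshold in kB)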

 

Debugging: what takes long and how long does it take?

Monitoring should have helped you pinpoint your DB server being overloaded, making your application slow. This could be because it is undersized for its workload (in which case you would probably see a steadily high resource usage), or it could be that some specific operations are too expensive for some reason. In the former case, there is not much you can do apart from upgrading either the resources or the DB architecture (but those are not topics I want to cover here). In the latter case, getting to know how long a query takes is really what people usually want to know. We accomplish that by using one of the debug options below.

 

At the PostgreSQL level

In my opinion, the best place to get this information is really on the DB server. That's where we get to know the actual execution time, without accounting for network round-trip time and other delays. Moreover, PostgreSQL makes it very easy to enable this debugging, and you don't even need to restart the server. There are many ways to enable debugging in PostgreSQL, but the one that's the most interesting to us is log_min_duration_statement. By default it has a value of "-1", which means nothing will be logged based on execution time. But for example, if we set in the postgresql.conf file:

log_min_duration_statement = 250

log_line_prefix = '%t [%p-%l] %q%u@%d '

Any query that takes more than 250 milliseconds to execute will be logged.

Setting the log_min_duration_statement value to zero will cause the system to log every single query. While this can be useful for debugging or for a temporary audit, it will not be very helpful here, as we really want to target slow queries only.

      If interested in profiling your DB, have a look at the great pgbadger tool from Dalibo.
The Alfresco installer sets the log_min_messages parameter to fatal by default. This prevents log_min_duration_statement from working. Make sure it is set back to its default value or to a value higher than LOG.

Then without interrupting the service, PostgreSQL can be reloaded in order for changes to take effect:

$ sudo -u postgres pg_ctl -D /data/postgres/9.4/main reload

Adapt the command above to your needs with the appropriate paths.
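Alternatively, if you have a superuser connection at hand, the configuration can be reloaded from SQL without shell access to the data directory:

$ psql -U postgres -c "SELECT pg_reload_conf();"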

                 

 

This will produce output such as the following:

2017-12-12 19:54:55 CET [5305-1] LOG: duration: 323 ms execution : select
      pv.id as prop_id,
      pv.actual_type_id as prop_actual_type_id,
      pv.persisted_type as prop_persisted_type,
      pv.long_value as prop_long_value,
      sv.string_value as prop_string_value
   from
      alf_prop_value pv
      join alf_prop_string_value sv on (sv.id = pv.long_value and pv.persisted_type = $1)
   where
      pv.actual_type_id = $2 and
      sv.string_end_lower = $3 and
      sv.string_crc = $4
DETAIL: parameters: $1 = '3', $2 = '1', $3 = '4ed-f2f1d29e8ee7', $4 = '593150149'

Here we gather important information:

  1. The date and time the query was executed at the beginning of the first line.
  2. The process ID and session line number. As PostgreSQL forks a new process for each connection, we can map process IDs to pooled connections. A single connection may contain several transactions, which in turn contain several statements. Each new statement processed increments the session line number.
  3. The execution time on the first line
  4. The execution stage. A query is executed in several steps. With Alfresco making heavy use of bind parameters, we will often see several lines for the same query (one for each step):
    1. prepare (when the query is parsed),
    2. bind (when parameters are replaced by their values and execution is planned),
    3. execution (when the query is actually executed).
  5. The query itself starting at the end of the first line and continuing on subsequent lines. Here it contains parameters and can't be executed as is.
  6. The bind parameters on the last line.

In order to get the overall execution time we have to sum up the execution times of the different steps. This seems painful but delivers a fine-grained breakdown of the query execution. However, most of the time the majority of the execution time is spent in the execute stage; to better understand what's going on at this stage, we need to dive deeper into the RDBMS (see the next chapter about explain plans).

 

At the application level

It is also possible to debug SQL queries in a very granular manner at the Alfresco level. However, it is important to note that this method is way more intrusive as it requires adding additional jar files, modifying the configuration of the application and restarting the application server. It may not be well suited for production environments where any downtime is a problem.

Also note that execution times reported with this method include network round-trip times. In normal circumstances this should be a few additional milliseconds, but it could be much more on a lame network.

To enable debugging at the application level we will use a JDBC proxy: p6spy.

The impact on application performance largely depends on the number of queries that will be logged.

 

First of all we will get the latest p6spy jar file from the GitHub repository.

Copy this file to the Tomcat lib/ directory and add a spy.properties file in the same location containing the lines below:

driverlist:org.postgresql.Driver

executionThreshold=250

This will mimic the behaviour we had previously when debugging with PostgreSQL, meaning only queries that take more than 250 milliseconds will be logged.

We then need to tweak the alfresco-global.properties file in order to make it use the p6spy driver instead of the actual driver:

 

db.driver=com.p6spy.engine.spy.P6SpyDriver
db.url=jdbc:p6spy:postgresql://${db.host}/${db.name}

 

Alfresco must now be restarted, after which a new file called spy.log should be available, containing lines like the one shown below:

1513100550017|410|statement|connection 14|update alf_lock set version = version + 1, lock_token = ?, start_ti
me = ?, expiry_time = ? where excl_resource_id = ? and lock_token = ?|update alf_lock set
version = version + 1, lock_token = 'not-locked', start_time = 0, expiry_time = 0 where excl_resource_id
= 9 and lock_token = 'f7a21222-64f9-40ea-a00a-ef95052dafe9'

We find here values similar to what we had with PostgreSQL:

  1. timestamp the query was executed on the application server.
  2. execution time in milliseconds
  3. The connection ID
  4. The query string without bind parameters
  5. The query string with evaluated bind parameters

 

Understanding the execution plan

Now that we have pinpointed the problematic query(ies), we can dive deep into PostgreSQL's logic and understand why the query is slow. RDBMSs rely on their query planner to decide how to deal with a query.

The query planner itself makes decisions based on the structure of the query, the structure of the database (e.g. presence and types of indexes) and also on statistics the system maintains during its execution. The more accurate those statistics are, the more efficient the query planner will be.

 

Explain plans

In order to know what the query planner will do for a specific query, it is possible to run it prefixed with the "EXPLAIN ANALYZE" statement.

To make this chapter more hands-on we'll proceed with an example. Let's consider a query which is issued while browsing the repository (getting node information based on the parent). Using one of the methods we've seen above, we have identified that query, and running it prefixed with "EXPLAIN ANALYZE" returns the following:

                                                                                 QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop  (cost=6614.54..13906.94 rows=441 width=676) (actual time=1268.230..1728.432 rows=421 loops=1)
   ->  Hash Left Join  (cost=6614.26..13770.40 rows=441 width=639) (actual time=1260.966..1714.577 rows=421 loops=1)
         Hash Cond: (childnode.store_id = childstore.id)
         ->  Hash Right Join  (cost=6599.76..13749.84 rows=441 width=303) (actual time=1251.427..1704.866 rows=421 loops=1)
               Hash Cond: (prop1.node_id = childnode.id)
               ->  Bitmap Heap Scan on alf_node_properties prop1  (cost=2576.73..9277.09 rows=118749 width=71) (actual time=938.409..1595.742 rows=119062 loops=1)
                     Recheck Cond: (qname_id = 26)
                     Heap Blocks: exact=5205
                     ->  Bitmap Index Scan on fk_alf_nprop_qn  (cost=0.00..2547.04 rows=118749 width=0) (actual time=934.132..934.132 rows=119178 loops=1)
                           Index Cond: (qname_id = 26)
               ->  Hash  (cost=4017.52..4017.52 rows=441 width=232) (actual time=90.488..90.488 rows=421 loops=1)
                     Buckets: 1024  Batches: 1  Memory Usage: 83kB
                     ->  Nested Loop  (cost=0.83..4017.52 rows=441 width=232) (actual time=8.228..90.239 rows=421 loops=1)
                           ->  Index Scan using idx_alf_cass_pri on alf_child_assoc assoc  (cost=0.42..736.55 rows=442 width=8) (actual time=2.633..58.377 rows=421 loops=1)
                                 Index Cond: (parent_node_id = 31890)
                                 Filter: (type_qname_id = 33)
                           ->  Index Scan using alf_node_pkey on alf_node childnode  (cost=0.42..7.41 rows=1 width=232) (actual time=0.075..0.075 rows=1 loops=421)
                                 Index Cond: (id = assoc.child_node_id)
                                 Filter: (type_qname_id = ANY ('{142,24,51,200,204,206,81,213,97,103,231,104,107}'::bigint[]))
         ->  Hash  (cost=12.00..12.00 rows=200 width=344) (actual time=9.523..9.523 rows=6 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 1kB
               ->  Seq Scan on alf_store childstore  (cost=0.00..12.00 rows=200 width=344) (actual time=9.517..9.518 rows=6 loops=1)
   ->  Index Scan using alf_transaction_pkey on alf_transaction childtxn  (cost=0.28..0.30 rows=1 width=45) (actual time=0.032..0.032 rows=1 loops=421)
         Index Cond: (id = childnode.transaction_id)
Planning time: 220.119 ms
Execution time: 1728.608 ms

Although I have never faced it with PostgreSQL (more with Oracle), there are cases where the explain plan is different depending on whether you pass the query as a complete string or use bind parameters.

In that case a parameterized query found to be slow in the SQL debug logs might appear fast when executed manually.

To get the explain plan of these slow queries, PostgreSQL has a loadable module which can log explain plans the same way we did with log_min_duration_statement. See the auto_explain documentation for more details.
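As a sketch, enabling it typically amounts to loading the module and giving it a threshold similar to the one used above (postgresql.conf; validate against the auto_explain documentation for your PostgreSQL version):

shared_preload_libraries = 'auto_explain' # requires a server restart
auto_explain.log_min_duration = 250 # log plans of queries slower than 250 ms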

    

We can indeed see that the query is taking rather long just by looking at the last two lines of the output. However, reading the full explain plan and, more importantly, understanding it can be challenging.

In the explain plan, PostgreSQL breaks the query down into "nodes". Those "nodes" represent actions the RDBMS has to run through in order to execute the query. For example, at the bottom of the plan we find the "scan nodes", which are the statements that actually return rows from tables. Higher up in the plan we have "nodes" that correspond to aggregations or ordering. In the end we have an indented/hierarchical tree of the query in which we can examine each step. Each node (line starting with "->") is shown with:

  • its type, whether the system is using an index and what kind (Index Scan) or none at all (sequential scan)
  • its estimated cost, an arbitrary representation of how costly an operation is
  • the estimated number of rows the operation would return

And many more details that make it somewhat hard to read.

And to make it more confusing, some values, like "cost" or "actual time", are made of two different numbers. To keep it short, you should mostly consider the second one.

      

The purpose of this article is not to learn how to fully understand a query plan, so instead we will use a very handy online tool which will parse the output for us and point out the problems we may have: New explain | explain.depesz.com

Just by pasting the same output and submitting the form, we get a better view of what's going on and what we need to look at.

The "exclusive" color mode gives the best representation of how efficient each individual node is.

"Inclusive" mode is cumulative (so the top row will always be dark red as it's equal to the total execution time).

"rows x" shows how accurately the query planner is able to guess the number of rows.

If using "mixed" color mode, each cell in each mode's column will have its own color (which can be a little bit harder to read).

So here we can see straight away that nodes #4 & #5 are where we spend most of the time. We can see that those nodes return a large number of rows (more than 100,000 while there are only 421 of them in the final result set), meaning that the available indexes and statistics are not good enough.

Alfresco normally provides all the necessary indexes for the database to deliver good performance in most cases, so it is very likely that queries under-performing because of indexes are actually facing missing indexes. Fortunately, Alfresco also delivers a convenient way to check the database schema for any inconsistency.

 

Alfresco schema validation

When connected to the JMX interface, in the MBeans tab, it is possible to trigger a schema validation while Alfresco is running (go to "Alfresco \ DatabaseInformation \ SchemaValidator \ Operations" and launch "validateSchema()").

This will produce output like the following in the Alfresco log file:

2017-12-18 15:21:32,194 WARN [domain.schema.SchemaBootstrap] [RMI TCP Connection(6)-10.1.2.101] Schema validation found 1 potential problems, results written to: /opt/alfresco/bench/tomcat/temp/Alfresco/Alfresco-PostgreSQLDialect-Validation-alf_-4191170423343040157.txt
2017-12-18 15:21:32,645 INFO [domain.schema.SchemaBootstrap] [RMI TCP Connection(6)-10.1.2.101] Compared database schema with reference schema (all OK): class path resource [alfresco/dbscripts/create/org.hibernate.dialect.PostgreSQLDialect/Schema-Reference-ACT.xml]

The log file points us to another file where we can see the details of the validation process. Here, for example, we can see that an index is indeed missing:

alfresco@alfresco:/opt/alfresco/bench$ cat /opt/alfresco/bench/tomcat/temp/Alfresco/Alfresco-PostgreSQLDialect-Validation-alf_-4191170423343040157.txt
Difference: missing index from database, expected at path: .alf_node_properties.alf_node_properties_pkey

The index can now be re-created, either by taking a fresh install as a model or by getting in touch with Alfresco support to know how to create it.

The resulting - more efficient - query plan is much better:

 

Database statistics

Statistics are really critical to PostgreSQL performance as they are what mainly drives the query planner's efficiency. With accurate statistics PostgreSQL will make good decisions when planning a query. And of course, inaccurate statistics lead to bad decisions and thus bad performance.

PostgreSQL has an internal process in charge of keeping statistics up to date (in addition to other house keeping tasks): the autovacuum process.

All versions of PostgreSQL that Alfresco supports have this capability and it should always be active! By default this process will try to update statistics according to configuration options set in postgresql.conf. The options below can be useful to fine-tune the autovacuum behaviour (those are the defaults):

autovacuum=true #Enable the autovacuum daemon

autovacuum_analyze_threshold=50 #Number of tuples modifications that trigger ANALYZE

autovacuum_analyze_scale_factor = 0.1 #Fraction of table modified to trigger ANALYZE

default_statistics_target = 100 # Amount of information to store in the statistics

An ANALYZE is triggered on a table once the number of modified tuples exceeds autovacuum_analyze_threshold + autovacuum_analyze_scale_factor * <number of rows in the table>. For example, with the defaults above, a 1,000,000-row table is only re-analyzed after roughly 100,050 rows have changed.

         

If you see a slow but constant degradation in query performance, it may be that some tables grew large enough to make the default parameters less efficient than they used to be. As the tables grow, the scale factor can make statistics updates very infrequent; lowering autovacuum_analyze_scale_factor will make statistics updates more frequent, thus ensuring stats are more up to date.

The distribution of the data within a table can also change during its lifetime, either because of a data model change or simply because of new use-cases. Raising default_statistics_target will make the daemon collect and process more data from the tables when generating or updating statistics, thus making the statistics more accurate.

Of course, asking for more frequent updates and more accurate statistics has an impact on the resources needed by the autovacuum process. Such tweaking should be done carefully by your DBA.

Also, it is important to note that the options above apply to every table. You may want to do something more targeted for known big tables. This is doable by changing the storage options of the specific tables:

=# ALTER TABLE alf_node_prop_string_value

-#     ALTER COLUMN string_value SET STATISTICS 1000;

The SQL statement above is really just an example and must not be used without prior investigation.

         

Introduction

Alfresco recently released a new module that allows modern SSO to be set up using the SAML protocol. SAML is a standard with a set of specifications defined by the OASIS consortium.

Like Kerberos, SAML is considered a secure approach to SSO: it involves signing messages and possibly encrypting them. But unlike Kerberos - which is more targeted at local networks or VPN-extended networks - SAML is a really good fit for internet and SaaS services. SAML mainly requires an Identity Provider (often referred to as IdP) and a Service Provider (SP) to communicate together. Many cloud services offer SAML Service Provider features, and sometimes even the IdP feature (for example Google: Set up your own custom SAML application - G Suite Administrator Help).

LemonLDAP::NG is open-source software that acts as a handler for the Apache httpd web server. LemonLDAP::NG supports a wide variety of authentication protocols (HTTP header based, CAS, OpenID Connect, OAuth, Kerberos, ...) and backends (MySQL, LDAP, flat files).

 

Pre-requisites & Context

LemonLDAP must be installed and configured with an LDAP backend.
Doing so is out of the scope of this document. Please refer to:

LemonLDAP::NG Download
LemonLDAP::NG DEB install page
LemonLDAP::NG LDAP backend configuration

If you just want to test SAML with LemonLDAP::NG and you don’t want the burden of setting up LDAP and configuring LemonLDAP::NG accordingly, you can use the “demo” backend which is used by default “out of the box”.
In this case you can use the demo user “dwho” (password “dwho”).

At the moment the Alfresco SAML module doesn’t handle the user registry part of the repository. This means that users have to exist prior to logging in using SAML.
As a consequence, either Alfresco must be set up with LDAP synchronisation enabled - synchronisation should be done against the same directory LemonLDAP::NG uses as an LDAP backend for authentication - or users must have been created by administrators (e.g. using the Share admin console, CSV import, People API…)

Both the SAML Identity Provider and the Service Provider must be time synchronized using NTP.

In the document below we assume that the ACME company set up their SSO system using LemonLDAP::NG on the acme.com domain.

Component: authentication portal (where users are redirected in order to log in)
URL: https://auth.acme.com

Component: manager (for administration purposes - used further in this document)
URL: https://manager.acme.com


On the other hand, their ECM system is hosted on a Debian-like system at alfresco.myecm.org (possibly at a SaaS provider or on their AWS instance of Alfresco). ACME wants to integrate the Share UI with their SSO system.

The Identity Provider

SAMLv2 required libraries

While Alfresco uses the OpenSAML Java bindings for its SAML module, LemonLDAP::NG uses the Perl bindings of the LASSO library. Even though LemonLDAP::NG is installed and running, the required libraries may not be installed, as they are not direct dependencies.
LASSO is a pretty active project and bugs are fixed regularly. I would therefore advise using the latest & greatest version available on their website instead of the one provided by your distribution.
For example if using a Debian based distribution:

$ cat <<EOT | sudo tee /etc/apt/sources.list.d/lasso.list
deb http://deb.entrouvert.org/ jessie main
deb-src http://deb.entrouvert.org/ jessie main
EOT
$ sudo wget -O - https://deb.entrouvert.org/entrouvert.gpg | sudo apt-key add -
$ sudo apt-get update
$ sudo apt-get install liblasso-perl

Make sure you are using the latest version of the LASSO library and its Perl bindings (2.5.1-2 fixes some important issues with SHA2).
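A quick way to check which version is installed and which one the configured repositories offer (Debian-based systems):

$ apt-cache policy liblasso-perl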

LemonLDAP::NG SAMLv2 Identity Provider

As you may know, SAML extensively uses XML DSig. As a specification, DSig provides guidance on how to hash, sign and encrypt XML content.
In SAML, signing and encrypting rely on asymmetric cryptographic keys.
We therefore need to generate such keys (here RSA) to sign and possibly encrypt SAML messages. LemonLDAP::NG offers the possibility to use different keys for signing and encrypting SAML messages. If you plan to use both signing and encryption, please use the same key for both (follow the procedure below only once, for signing; encryption will use the same key).

Log in to the LemonLDAP::NG manager (usually manager.acme.com), go to the menu “SAML 2 Service \ Security Parameters \ Signature” and click on “New keys”.
Type in a password that will be used to encrypt the private key and remember it!

LemonLDAP::NG signing keys

You’ll need the password in order to generate certificates later on!

We now need to set up the SAML metadata that every Service Provider will use (among which Alfresco Share, and possibly AOS and the Alfresco REST APIs).
In the LemonLDAP::NG manager, in the menu “SAML 2 Service \ Organization”, fill in the form with:

Display Name: the ACME company
Name: acme
URL: http://acme.com

Of course you will use values that match your environment

Next, in the “General Parameters \ Issuer modules \ SAML” menu, make sure the SAML issuer module is configured as follows:

Activation: On
Path: ^/saml/
Use rule: On

Note that it is possible to allow SAML connections only under certain conditions, by using the “Special rule” option.
You then need to define a Perl expression that returns either true or false (more information here).

And that’s it: LemonLDAP::NG is now a SAML Identity Provider!

In order to configure the Alfresco Service Provider we need to export the signing key as a certificate. To do so, copy the private key that was generated in the LemonLDAP::NG manager to a file (e.g. saml.key) and generate a self-signed certificate using this private key.

$ openssl req -new -days 365 -nodes -x509 -key saml.key -out saml.crt

use something like CN=LemonLDAP, OU=Premier Services, O=Alfresco, L=Maidenhead, ST=Berkshire, C=UK as a subject

Keep the saml.crt file somewhere you can find it for later use.

SAMLv2 Service Provider

Install SAML Alfresco module package

The Alfresco SAML module can be downloaded from the Alfresco support portal. Only Enterprise customers are entitled to this module.
So, we download alfresco-saml-1.0.x.zip and unzip it somewhere. Then, after stopping Alfresco, we copy the AMP files to the respective amps directories within the Alfresco install directory and deploy them:

$ cp alfresco-saml-repo-1.0.1.amp <ALFRESCO_HOME>/amps
$ cp alfresco-saml-share-1.0.1.amp <ALFRESCO_HOME>/amps_share
$ ./bin/apply_amps.sh
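
To verify the AMPs were actually applied, you can list the modules embedded in both WAR files with the Module Management Tool (the paths below assume the standard installer layout):

$ java -jar bin/alfresco-mmt.jar list tomcat/webapps/alfresco.war
$ java -jar bin/alfresco-mmt.jar list tomcat/webapps/share.war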

We now have to generate the certificate we will be using on the SP side:

$ keytool -genkeypair -alias my-saml-key -keypass change-me -storepass change-me -keystore my-saml.keystore -storetype JCEKS

You can use something like CN=Share, OU=Premier Services, O=Alfresco, L=Maidenhead, ST=Berkshire, C=UK as a subject

You can of course choose to use a different password and alias, just remember them for later use.

The keystore must be copied somewhere and Alfresco configured to retrieve it.

$ mv my-saml.keystore alf_data/keystore
$ cat <<EOT > alf_data/keystore/my-saml.keystore-metadata.properties
aliases=my-saml-key
keystore.password=change-me
my-saml-key.password=change-me
EOT
$ cat <<EOT >> tomcat/shared/classes/alfresco-global.properties

saml.keystore.location=\${dir.keystore}/my-saml.keystore
saml.keystore.keyMetaData.location=\${dir.keystore}/my-saml.keystore-metadata.properties
EOT

Make sure that:

  • the keystore file is readable by the Alfresco user (and only by that user).
  • the alias and passwords match the ones you used when generating the keystore with the keytool command (a quick check is shown below)
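
A simple way to confirm the alias and store type line up with the metadata properties file is to list the keystore content (adapt the path and passwords to your own values):

$ keytool -list -keystore alf_data/keystore/my-saml.keystore -storetype JCEKS -storepass change-me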

The next step is to merge the whole <filter/> element provided in the SAML distribution (in the share-config-custom.xml.sample file) into your own share-config-custom.xml (which should be located in your {extensionroot} directory).
Below is an example of the resulting CSRFPolicy section:

...
    <config evaluator="string-compare" condition="CSRFPolicy" replace="true">

    <!--
        If using https make a CSRFPolicy with replace="true" and override the properties section.
        Note, localhost is there to allow local checks to succeed.

        I.e.
        <properties>
            <token>Alfresco-CSRFToken</token>
            <referer>https://your-domain.com/.*|http://localhost:8080/.*</referer>
            <origin>https://your-domain.com|http://localhost:8080</origin>
        </properties>
    -->

        <filter>

            <!-- SAML SPECIFIC CONFIG -  START -->

            <!--
             Since we have added the CSRF filter with filter-mapping of "/*" we will catch all public GET's to avoid them
             having to pass through the remaining rules.
             -->
            <rule>
                <request>
                    <method>GET</method>
                    <path>/res/.*</path>
                </request>
            </rule>

            <!-- Incoming posts from IDPs do not require a token -->
            <rule>
                <request>
                    <method>POST</method>
                    <path>/page/saml-authnresponse|/page/saml-logoutresponse|/page/saml-logoutrequest</path>
                </request>
            </rule>

            <!-- SAML SPECIFIC CONFIG -  STOP -->

            <!-- EVERYTHING BELOW FROM HERE IS COPIED FROM share-security-config.xml -->

            <!--
             Certain webscripts shall not be allowed to be accessed directly form the browser.
             Make sure to throw an error if they are used.
             -->
            <rule>
                <request>
                    <path>/proxy/alfresco/remoteadm/.*</path>
                </request>
                <action name="throwError">
                    <param name="message">It is not allowed to access this url from your browser</param>
                </action>
            </rule>

            <!--
             Certain Repo webscripts should be allowed to pass without a token since they have no Share knowledge.
             TODO: Refactor the publishing code so that form that is posted to this URL is a Share webscript with the right tokens.
             -->
            <rule>
                <request>
                    <method>POST</method>
                    <path>/proxy/alfresco/api/publishing/channels/.+</path>
                </request>
                <action name="assertReferer">
                    <param name="referer">{referer}</param>
                </action>
                <action name="assertOrigin">
                    <param name="origin">{origin}</param>
                </action>
            </rule>

            <!--
             Certain Surf POST requests from the WebScript console must be allowed to pass without a token since
             the Surf WebScript console code can't be dependent on a Share specific filter.
             -->
            <rule>
                <request>
                    <method>POST</method>
                    <path>/page/caches/dependency/clear|/page/index|/page/surfBugStatus|/page/modules/deploy|/page/modules/module|/page/api/javascript/debugger|/page/console</path>
                </request>
                <action name="assertReferer">
                    <param name="referer">{referer}</param>
                </action>
                <action name="assertOrigin">
                    <param name="origin">{origin}</param>
                </action>
            </rule>

            <!-- Certain Share POST requests does NOT require a token -->
            <rule>
                <request>
                    <method>POST</method>
                    <path>/page/dologin(\?.+)?|/page/site/[^/]+/start-workflow|/page/start-workflow|/page/context/[^/]+/start-workflow</path>
                </request>
                <action name="assertReferer">
                    <param name="referer">{referer}</param>
                </action>
                <action name="assertOrigin">
                    <param name="origin">{origin}</param>
                </action>
            </rule>

            <!-- Assert logout is done from a valid domain, if so clear the token when logging out -->
            <rule>
                <request>
                    <method>POST</method>
                    <path>/page/dologout(\?.+)?</path>
                </request>
                <action name="assertReferer">
                    <param name="referer">{referer}</param>
                </action>
                <action name="assertOrigin">
                    <param name="origin">{origin}</param>
                </action>
                <action name="clearToken">
                    <param name="session">{token}</param>
                    <param name="cookie">{token}</param>
                </action>
            </rule>

            <!-- Make sure the first token is generated -->
            <rule>
                <request>
                    <session>
                        <attribute name="_alf_USER_ID">.+</attribute>
                        <attribute name="{token}"/>
                        <!-- empty attribute element indicates null, meaning the token has not yet been set -->
                    </session>
                </request>
                <action name="generateToken">
                    <param name="session">{token}</param>
                    <param name="cookie">{token}</param>
                </action>
            </rule>

            <!-- Refresh token on new "page" visit when a user is logged in -->
            <rule>
                <request>
                    <method>GET</method>
                    <path>/page/.*</path>
                    <session>
                        <attribute name="_alf_USER_ID">.+</attribute>
                        <attribute name="{token}">.+</attribute>
                    </session>
                </request>
                <action name="generateToken">
                    <param name="session">{token}</param>
                    <param name="cookie">{token}</param>
                </action>
            </rule>

            <!--
             Verify multipart requests from logged in users contain the token as a parameter
             and also correct referer & origin header if available
             -->
            <rule>
                <request>
                    <method>POST</method>
                    <header name="Content-Type">multipart/.+</header>
                    <session>
                        <attribute name="_alf_USER_ID">.+</attribute>
                    </session>
                </request>
                <action name="assertToken">
                    <param name="session">{token}</param>
                    <param name="parameter">{token}</param>
                </action>
                <action name="assertReferer">
                    <param name="referer">{referer}</param>
                </action>
                <action name="assertOrigin">
                    <param name="origin">{origin}</param>
                </action>
            </rule>

            <!--
             Verify that all remaining state changing requests from logged in users' requests contains a token in the
             header and correct referer & origin headers if available. We "catch" all content types since just setting it to
             "application/json.*" since a webscript that doesn't require a json request body otherwise would be
             successfully executed using i.e."text/plain".
             -->
            <rule>
                <request>
                    <method>POST|PUT|DELETE</method>
                    <session>
                        <attribute name="_alf_USER_ID">.+</attribute>
                    </session>
                </request>
                <action name="assertToken">
                    <param name="session">{token}</param>
                    <param name="header">{token}</param>
                </action>
                <action name="assertReferer">
                    <param name="referer">{referer}</param>
                </action>
                <action name="assertOrigin">
                    <param name="origin">{origin}</param>
                </action>
            </rule>
        </filter>
    </config>
...

Configure SAML Alfresco module

We can now configure the SAML service providers we need. Alfresco offers 3 different service providers that can be configured/enabled separately:

  • Share (the Alfresco collaborative UI)
  • AOS (the new Sharepoint protocol interface)
  • REST api (the Alfresco RESTful api)

Configuration can be done in several ways.

Configuring SAML SP using subsystem files:

The Alfresco SAML distribution comes with examples of the SAML configuration files. Reusing them is very convenient and allows for a quick setup.
We’ll copy the files for the required SP and configure each SP as needed.

$ cp -a ~/saml/alfresco/extension/subsystems tomcat/shared/classes/alfresco/extension

Then, to configure the Share SP for example, rename the sample files and make sure they contain the needed properties:

$ mv tomcat/shared/classes/alfresco/extension/subsystems/SAML/share/share/my-custom-share-sp.properties.sample tomcat/shared/classes/alfresco/extension/subsystems/SAML/share/share/my-custom-share-sp.properties

my-custom-share-sp.properties:

saml.sp.isEnabled=true
saml.sp.isEnforced=false
saml.sp.idp.spIssuer.namePrefix=
saml.sp.idp.description=LemonLDAP::NG
saml.sp.idp.sso.request.url=https://auth.acme.com/saml/singleSignOn
saml.sp.idp.slo.request.url=https://auth.acme.com/saml/singleLogout
saml.sp.idp.slo.response.url=https://auth.acme.com/saml/singleLogoutReturn
saml.sp.idp.spIssuer=http://alfresco.myecm.org:8080/share
saml.sp.user.mapping.id=Subject/NameID
saml.sp.idp.certificatePath=${dir.keystore}/saml.crt

Of course you should use URLs matching your domain name!

As the configuration points to the IdP certificate (which we generated earlier), we also need to copy it to the Alfresco server, in the alf_data/keystore folder (or whatever folder you use as dir.keystore).

$ cp saml.crt alf_data/keystore

Configuring SAML SP using the Alfresco admin console:

Configure the SAML service provider using the Alfresco admin console (/alfresco/s/enterprise/admin/admin-saml).
Set the same parameters as in the properties file shown above (IdP SSO/SLO URLs, SP issuer and user ID mapping).

Of course you should use URLs matching your domain name!

Below is a screenshot of what it would look like:

Leaving “Force SAML connection” unset lets the user log in either as a SAML-authenticated user or as another user, using a different authentication subsystem.

Download the metadata and certificates from the bottom of the page, and then import the certificate you generated earlier using openssl in the Alfresco admin console.
To finish with Alfresco configuration, tick the “Enable SAML authentication (SSO)” box.

Create the SAML Service provider on the Identity Provider

The identity provider must be aware of the SP and its configuration. Using the LemonLDAP::NG manager, go to the “SAML Service Provider” section and add a new service provider.
Give it a name like “Alfresco-Share”.

Upload the metadata file exported from Alfresco admin console.

Under the newly created SP, in the “Options \ Authentication response” menu set the following parameters:

Default NameID Format: Unspecified
Force NameID session key: uid

Note that you could use whatever session key is available and fits your needs. Here, uid makes sense for use with Alfresco logins and works for the “Demo” authentication backend in LemonLDAP::NG. If a real LDAP backend is available and Alfresco is syncing users from that same LDAP directory, then the session key used as the NameID value should match the ldap.synchronization.userIdAttributeName defined in Alfresco’s LDAP authentication subsystem.

Optionally, you can also send more information in the authentication response to the Share SP. To do so, there is a section called “Exported attributes” under the newly created SP. Configure it as follows:

This requires that the appropriate keys are exported as variables whose names are used as "Key name".

So here, we would have the following LemonLDAP::NG exported variables:

  • firstname
  • lastname
  • mail

Hacks ‘n tweaks

At this point, and given you met the prerequisites, you should be able to log in without any problems. However, there may still be some issues with SP-initiated logout (initiating logout from the IdP should work though), depending on the versions of the SP and IdP you use. Logouts rely on the SAML SLO profile, and the way it is currently implemented in both Alfresco and LemonLDAP::NG still has some interoperability issues.

On the Alfresco side, SAML module version 1.0.1 is impacted by MNT-18064, which prevents SLO from working properly with LemonLDAP::NG. A small patch attached to the JIRA can be used and adapted to match the NameID format used by your IdP (for the configuration described here, that would be "UNSPECIFIED").

This JIRA is to be fixed in the next alfresco-saml module release (probably 1.0.2). In the meantime you can use the patch attached to the JIRA.

The LemonLDAP::NG project crew kindly considered rewriting their sessionIndex generation algorithm in order to avoid interoperability problems and security issues. This is needed in order to work with Alfresco and should be added in 1.9.11; previous versions won’t work.

In the meantime you can use the patch attached to LEMONLDAP-1261.

If you are serious about Alfresco in your IT infrastructure you most certainly have a "User Acceptance Tests" environment. If you don't... you really should consider setting one up (don't make Common mistakes)!

When you initially set up your environments everything is quiet: users are not using the system yet and you just don't care about data freshness. However, soon after you go live, the production system will start being fed with data. Basically, it is this data, stored in Alfresco Content Services, that makes your system or application valuable.

When the moment comes to upgrade or deploy a new customization/application, you will obviously test it first on your UAT (or pre-production or test, whatever you call it) environment. Yes you will!

When you do so, having a UAT environment that doesn't have an up-to-date set of data can make the tests pointless or more difficult to interpret. This is also true if you plan to kick off performance tests: if the tests are done on a data set that is only one third the size of the production data, they're pointless.

Basically that's why you need to refresh your UAT data with production data every now and then or at least when you know it's going to be needed.

The scope of this document is not to provide you with a step-by-step guide on how to refresh your repository. Alfresco Content Services being a platform, this highly depends on what you actually do with your repository, the kind of customizations you are using and the third-party apps that may be linked to Alfresco. This document will mainly highlight things you should thoroughly check when refreshing your data set.

 

Prerequisites

Some things must be validated before going further:

  • Production & UAT environments should have the same architecture (same number of servers, same components installed, and so on and so forth...)
  • Production & UAT environments should have the same sizing (while you can forget this for functional tests only, this is a true requirement for performance tests of course)
  • Production & UAT environments should be hosted on different, clearly separated networks (especially if using clustering)

 

What is it about?

In order to refresh your UAT repository with data from production you will simply go through the normal restore process of an Alfresco repository.

Here I consider the backup strategy to be out of scope... If you don't have proper backups already set up, that's where you should start: Performing a hot backup | Alfresco Documentation

The required assets to restore are:

  • Alfresco's database
  • Filesystem repository
  • Indexes

Before you start your fresh UAT

There are a number of things you should check before starting your refreshed environment.

 

Reconfigure cluster

Even though the recommendation is to isolate environments, it is better to use a different cluster configuration for each environment. That allows for less confusing administration and log analysis, and also prevents information from leaking from one network to another in case the isolation is not that good.

When starting a refreshed UAT cluster, always make sure you set a cluster password or a cluster name that is different from the production cluster. Doing so prevents cluster communication between nodes that are not actually part of the same cluster:

Alfresco 4.2 onward:

alfresco.hazelcast.password=someotherpassword

Alfresco pre-4.2:

alfresco.cluster.name=uatCluster

On the Share side, it is possible to change more parameters in order to isolate clusters but we will still apply the same logic for the sake of simplicity. Here you would change the Hazelcast password in the custom-slingshot-application-context.xml configuration file inside the {web-extension} directory.

<hz:topic id="topic" instance-ref="webframework.cluster.slingshot" name="slingshot-topic"/>
   <hz:hazelcast id="webframework.cluster.slingshot">
     <hz:config>
       <hz:group name="slingshot" password="notthesamepsecret"/>
       <hz:network port="5801" port-auto-increment="true">
         <hz:join>
           <hz:multicast enabled="true" multicast-group="224.2.2.5" multicast-port="54327"/>
           <hz:tcp-ip enabled="false">
             <hz:members></hz:members>
           </hz:tcp-ip>
        </hz:join>
...

Email notifications

It's very unlikely that your UAT environment needs to send emails or notifications to real users. Your production system is already sending digest and other emails to users and you don't want them to get confused because they received similar emails from other systems. So you have to make sure emails are either:

  • Sent to a black hole destination
  • Sent to some other place where users can't see them

If you really don't care about emails generated by Alfresco, then you can choose the "black hole" option. There are many different ways to do that, among which configuring your local MTA to send all emails to a single local user and optionally link that user's mailbox to /dev/null (with Postfix you could use the canonical_maps directive and mbox storage). Another way to do it would be to use the Java DevNull SMTP server. It is very simple to use as it is just a jar file you can launch:

java -jar DevNull.jar -console -p 10025

On the other hand, as part of your user tests, you may be interested in knowing and analyzing what emails are generated by your Alfresco instance. In this case you could still use the previous options. Both are indeed able to store emails instead of swallowing them: Postfix by not linking the mbox storage to /dev/null, and the DevNull SMTP server by using the "-s /some/path/" option. However, storing emails on the filesystem is not really handy if you want to check their content and the way they render (for instance).

If emails are of interest, then you can use other products like MailHog or mailtrap.io. Both offer an SMTP server that stores emails for you instead of sending them to the outside world, and they also offer a neat way to visualize them, just like a webmail would.
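
For example, MailHog can be started in a matter of seconds if docker is available (1025 being the SMTP port and 8025 the web UI port, as per the image defaults):

$ docker run -d -p 1025:1025 -p 8025:8025 mailhog/mailhog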

 

Mailhog WebUI

Mailtrap.io is a service that also offers advanced features like POP3 (so you can see emails in "real-life" clients), spam score testing, content analysis and, for subscription-based users, collaboration features.

 

Whichever option you choose, and based on the chosen configuration, you'll have to switch the following properties on your UAT Alfresco nodes (an example is shown after the list):

mail.host
mail.port
mail.smtp.auth
mail.smtps.auth
mail.username
mail.password
mail.smtp.starttls.enable
mail.protocol
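
As an illustration, the properties could look like the ones below if the UAT nodes point to a local MailHog instance (all values are examples and must be adapted to the option you chose):

mail.host=localhost
mail.port=1025
mail.protocol=smtp
mail.smtp.auth=false
mail.smtps.auth=false
mail.smtp.starttls.enable=false
mail.username=
mail.password=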

Jobs & FSTR synchronisation

Alfresco allows an administrator to schedule jobs and set up replication to another, remote repository.

Scheduled jobs are carried over from production to UAT if you cloned the environments or performed a backup/restore of production data. However, you most certainly don't want the same job to run twice from two different environments.

Defining whether or not a job should run in UAT depends on a lot of factors and is very much related to what the job actually does. We cannot give a list of precise actions to take here in order to avoid problems. It is the administrator's call to review scheduled jobs and decide whether or not they should/can be disabled.

Jobs to review can be found in Spring bean definition files like ${extensionRoot}/extension/my-scheduler-context.xml.

One easy way to disable a job is to set its cron expression to a date in the far future (note that Quartz cron expressions take an optional year field, and need a question mark in either the day-of-month or day-of-week position):

 

<property name="cronExpression">
    <value>0 50 0 * * ? 2099</value>
</property>

The repository can also hold synchronization jobs, mainly those used in File Transfer Receiver setups.

In that case the administrator surely has to disable such jobs (or at least reconfigure them), as you do not want frozen UAT data to be synced to a remote location where live production data is expected!

Disabling this kind of job is pretty simple. You can do it using the Share web UI by going to "Repository \ Data Dictionary \ Transfers \ Default Target group \ Group1" and editing the properties of the "Group1" folder. In the property editor form, just untick the "Activated" checkbox.

 

Repository ID & Cloud synchronization

Alfresco repository IDs must be universally unique, and of course, if you clone environments, you create duplicate repository IDs. One of the well-known issues that can be triggered by duplicate IDs concerns hybrid cloud setups, where synchronization is enabled between the production environment and the my.alfresco.com cloud. If your UAT servers connect to the cloud with production's repository ID, you can be sure synchronization will fail at some point, and it could even trigger data loss on your production system. You really want to avoid that!

One very easy way to prevent this from happening is to simply disable cloud sync on the UAT environment.

system.serverMode=UAT

Any string other than "PRODUCTION" can be used here. Also be aware that this property can only be set in the alfresco-global.properties file.

 

Also, if you are using APIs that need to specify the repository ID when calling Alfresco (like the old CMIS endpoint used to), such API calls may stop working in UAT, as the repository ID is now the one from production (at least if the calls were written with the previous ID hard-coded rather than retrieved at runtime, which would be a poor approach in most cases anyway).

Starting with Alfresco 4.2, CMIS returns the string "-default-" as the repository ID for all new API endpoints (e.g. AtomPub /alfresco/api/-default-/public/cmis/versions/1.1/atom), while the previous endpoint (e.g. AtomPub /alfresco/cmisatom) returns a Universally Unique IDentifier.
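
If you need to check which repository ID a given environment exposes, querying the old AtomPub endpoint mentioned above is enough (hostname and credentials are obviously environment specific):

$ curl -s -u admin:admin http://localhost:8080/alfresco/cmisatom | grep -o '<cmis:repositoryId>[^<]*</cmis:repositoryId>'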

 

If you think you need to change the repository ID, please contact Alfresco support. It makes the procedure heavier (a re-index is expected) and should be thoroughly planned.

 

Carrying unwanted configuration

If you stick to the best practices for production, you probably try to have all your configuration in properties or xml files in the {extensionRoot} directory.

On the other hand, you may sometimes use the great facilities offered by Alfresco Enterprise, such as the JMX interface or the admin console. You must then remember that those tools persist configuration information to the database. This means that, when restoring a database from one environment to another, you may end up starting an Alfresco instance with the wrong parameters.

Here is a quite handy SQL query you can use *before* starting your new Alfresco UAT. It will report all the properties that are stored in the database. You can then make sure none of them is harmful or points to a production system.

SELECT APSVk.string_value AS property, APSVv.string_value AS value
  FROM alf_prop_link APL
    JOIN alf_prop_value APVv ON APL.value_prop_id=APVv.id
    JOIN alf_prop_value APVk ON APL.key_prop_id=APVk.id
    JOIN alf_prop_string_value APSVk ON APVk.long_value=APSVk.id
    JOIN alf_prop_string_value APSVv ON APVv.long_value=APSVv.id
WHERE APL.key_prop_id <> APL.value_prop_id
    AND APL.root_prop_id IN (SELECT prop1_id FROM alf_prop_unique_ctx);

                 property                 |                value
------------------------------------------+--------------------------------------
alfresco.port                            | 8084

Do not try to delete those entries from the database straight away, as this is likely to break things!

If any property is conflicting with the new environment, it should be removed.

Do it wisely! An administrator should ALWAYS prefer using the "Revert" operations available through the JMX interface!

The "revert()" method is available using jconsole, in the Mbean tab, within the appropriate section:

revert property using jconsole

"revert()" may revert more properties than just the one you target. If unsure how to get rid of a single property, please contact alfresco support.

Other typical properties to change:

When moving between environments, the properties below are likely to be different in the UAT environment (that may not be the case for you, or you may have others). As said earlier, they should be set in the ${extensionroot} folder to a value that is specific to UAT (and they should not be present in the database); an example follows the list:

ldap.authentication.java.naming.provider.url
ldap.synchronization.java.naming.security.principal
ldap.synchronization.java.naming.security.credentials
solr.host
solr.port
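
As a purely illustrative example, the UAT alfresco-global.properties could override them as shown below (all values are made up and must match your own UAT infrastructure):

ldap.authentication.java.naming.provider.url=ldaps://ldap-uat.acme.com:636
ldap.synchronization.java.naming.security.principal=cn=alfresco,ou=services,dc=uat,dc=acme,dc=com
ldap.synchronization.java.naming.security.credentials=change-me
solr.host=solr-uat.acme.com
solr.port=8443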

Load balancing a network protocol is something quite common nowadays. There are loads of ways to do that for HTTP for instance, and generally speaking all "single flow" protocols can be load-balanced quite easily. However, some protocols are not as simple as HTTP and require several connections. This is exactly the case with FTP.

 

Reminder: FTP modes

Let's take a deeper look at the FTP protocol, in order to better understand how we can load-balance it. In order for an FTP client to work properly, two connections must be opened between the client and the server:

  • A control connection
  • A data connection

The control connection is initiated by the FTP client to TCP port 21 on the server. The data connection, on the other hand, can be created in different ways. The first way is through an "active" FTP session: in this mode the client sends a "PORT" command, which opens one of its (random) network ports and instructs the server to connect to it using port 20 as the source port. This mode is usually discouraged, or even prevented by the server configuration, for security reasons (the server initiates the data connection to the client). The second FTP mode is the "passive" mode: the client sends a "PASV" command to the server; as a response, the server opens a TCP port and sends the port number and IP address as part of the PASV response, so the client knows which socket to use. Modern FTP clients usually try this mode first if the server supports it. There is a third mode, the "extended passive" mode: it is very similar to the "passive" mode, but the client sends an "EPSV" command (instead of "PASV") and the server responds with only the number of the TCP port that has been chosen for the data connection (without sending the IP address).
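
To make this more concrete, here is what simplified passive and extended passive negotiations look like on the control connection; in the PASV reply, the last two numbers encode the data port (p1*256+p2, so 78*256+33 = 20001 here):

C: PASV
S: 227 Entering Passive Mode (10,1,2,101,78,33)
C: EPSV
S: 229 Entering Extended Passive Mode (|||20001|)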

 

Load balancing concepts

So now that we know how FTP works, we also know that load-balancing FTP requires balancing both the control connections and the data connections. The load balancer must also make sure that data connections are sent to the right backend server, the one which replied to the client's command.

 

Alfresco configuration

On the ECM side, there is not much to do, but there are some prerequisites:

  • Alfresco nodes must belong to the same (working) cluster
  • Alfresco nodes must be reachable from the load balancer on the FTP ports
  • No FTP-related properties should have been persisted in the database

The Alfresco configuration presented below is valid for both load balancing methods presented later. Technically, not every bit of this Alfresco configuration is required, depending on the method you choose, but applying the config as shown will work in both cases.

First of all, you should set the FTP options in the alfresco-global.properties file: the Alfresco cluster nodes need different settings, which you cannot achieve using either the admin console or the JMX interface (settings persisted there are shared by all nodes).

If you have already set FTP parameters using JMX (or the admin console), those parameters are persisted in the database and need to be removed from there (using the "revert" action in JMX, for example).

Add the following to your alfresco-global.properties and restart Alfresco:

 

### FTP Server Configuration ###
ftp.enabled=true
ftp.port=2121
ftp.dataPortFrom=20000
ftp.dataPortTo=20009

 

The ftp.dataPortFrom and ftp.dataPortTo properties need to be different on each server. So if there were two Alfresco nodes, alf1 and alf2, the properties for alf2 could be:

ftp.dataPortFrom=20010
ftp.dataPortTo=20019
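
Once the nodes are restarted, a quick way to check that each one actually listens on its FTP control port from the load balancer is a simple port probe (the IP addresses below are the ones used in the examples later in this document):

$ nc -vz 10.1.2.101 2121
$ nc -vz 10.1.2.102 2121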

 

Load balancing with LVS/Keepalived

 

Keepalived is a Linux-based load-balancing system. It wraps the IPVS (also called LVS) software stack from the Linux-HA project and offers additional features like backend monitoring and VRRP redundancy. The schema below shows how Keepalived proceeds with FTP load-balancing. It tracks control connections on port 21 and dynamically handles the data connections using a Linux kernel module called "ip_vs_ftp", which inspects the control connection in order to be aware of the port that will be used to open the data connection.

Configuration steps are quite simple.

 

First install the software:

sudo apt-get install keepalived

Then create a configuration file using the sample:

sudo cp /usr/share/doc/keepalived/samples/keepalived.conf.sample /etc/keepalived/keepalived.conf

Edit the newly created file in order to add a new virtual server and the associated backend servers:

virtual_server 192.168.0.39 21 {
    delay_loop 6
    lb_algo rr
    lb_kind NAT
    protocol TCP

    real_server 10.1.2.101 2121 {
        weight 1
        TCP_CHECK {
            connect_port 2121
            connect_timeout 3
        }
    }

    real_server 10.1.2.102 2121 {
        weight 1
        TCP_CHECK {
            connect_port 2121
            connect_timeout 3
        }
    }
}

In a production environment you will most certainly want to use an additional VRRP instance to ensure a highly available load balancer. Please refer to the Keepalived documentation in order to set that up or just use the example given in the distribution files.

The example above defines a virtual server that listens on socket 192.168.0.39:21. Connections sent to this socket are redirected to the backend servers using the round-robin algorithm (others are available), after masquerading the source IP address. Additionally, we need to load the FTP helper in order to track FTP data connections:

 

echo 'ip_vs_ftp' | sudo tee -a /etc/modules

It is important to note that this setup leverages the FTP kernel helper, which reads the content of FTP frames. This means that it doesn't work when FTP is secured using SSL/TLS.
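
You can load the helper immediately (without rebooting) and check that IPVS picked up the virtual server using the ipvsadm tool, which may need to be installed separately:

$ sudo modprobe ip_vs_ftp
$ sudo ipvsadm -L -n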

 

Secure FTP load-balancing

 

Before you go any further:

 

This method has a huge advantage: it can handle FTPS (SSL/TLS). However, it also has a big disadvantage: it doesn't work when the load balancer behaves as a NAT gateway (which is basically what HAProxy does).
This is mainly because, at the moment, Alfresco doesn't comply with the necessary prerequisites for secure FTP to work in such a setup.

 

Some FTP clients may work even with this limitation. It may happen to work if the server is using IPv6, or for clients using the "Extended Passive Mode" on IPv4 (which is normally used for IPv6 only). To better understand how, please see FTP client and passive session behind a NAT.

 

This means that what's below will mainly work with the macOS X ftp command line and probably no other FTP client!

Don't spend time on it, and use the previous method, if you need other FTP clients or if you have no control over what FTP client your users have.

 

Load balancing with HAProxy

 

This method can also be adapted to Keepalived using iptables mangling and "fwmark" (see Keepalived secure FTP), but you should only need it if you are bound to FTPS, as plain FTP is much better handled by the previous method.

HAProxy is a modern and widely used load balancer. It provides features similar to Keepalived, and much more. Nevertheless, HAProxy is not able to track data connections as related to the overall FTP session. For this reason we have to trick the FTP protocol in order to provide connection consistency within the session. Basically, we will split the load balancing into several parts:

  • control connection load-balancing
  • data connection load balancing for each backend server

So if we have 2 backend servers - as shown in the schema below - we will create 3 load balancing connection pools (let's call them that for now).

First install the software:

sudo apt-get install haproxy

HAProxy has the notion of "frontends" and "backends". Frontends allow you to define specific sockets (or sets of sockets), each of which can be linked to a different backend. So we can use the configuration below:

frontend alfControlChannel
    bind *:21
    default_backend alfPool

frontend alf1DataChannel
    bind *:20000-20009
    default_backend alf1

frontend alf2DataChannel
    bind *:20010-20019
    default_backend alf2

backend alfPool
    server alf1 10.1.2.101:2121 check port 2121 inter 20s
    server alf2 10.1.2.102:2121 check port 2121 inter 20s

backend alf1
    server alf1 10.1.2.101:2121 check port 2121 inter 20s

backend alf2
    server alf2 10.1.2.102:2121 check port 2121 inter 20s

 

So in this case, the frontend that handles the control connection load-balancing (alfControlChannel) alternately sends requests to the servers of the backend pool (alfPool). Each server (alf1 & alf2) will negotiate a data transfer socket on a different frontend (alf1DataChannel & alf2DataChannel). Each of these frontends only forwards data connections to its corresponding backend (alf1 or alf2), thus making the load balancing sticky. And... job done!
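
A simple way to test the whole chain is a passive mode listing through the load balancer with curl (the address and credentials below are examples only, lb.acme.com standing for your load balancer):

$ curl --ftp-pasv -u admin:admin ftp://lb.acme.com/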