
What?



This blog post is about speeding up the delivery of content from Alfresco.



The example I'll discuss here might not make any noticeable difference to the end user; rather, it frees up resources on the Alfresco server so it can get on with the job of delivering information.



This is done using a cache in front of the application server running Alfresco.

Background:



Running Alfresco in the Cloud has meant we've had to invest in monitoring solutions for our application that we wouldn't normally have needed for our internal instances - in this case one of the tools we are using is called AppDynamics.



One of the immediate things it showed was that a high percentage of all calls to Alfresco were for static assets - the javascript files, css files, images, etc. that are used to build the static parts of the viewed page.



By offloading these to a caching layer, the Alfresco application can just concentrate on serving the dynamic content to the user - hopefully faster too :)

What to use:



After researching the different tools to use to cache the static assets, I opted for Varnish. https://www.varnish-cache.org/



Varnish says this about itself on its website:



'Varnish Cache is a web accelerator, sometimes referred to as a HTTP accelerator or a reverse HTTP proxy, that will significantly enhance your web performance and web content delivery. Varnish Cache speeds up a website by storing a copy of the page served by the web server the first time a user visits that page. The next time a user requests the same page, Varnish will serve the copy instead of requesting the page from the web server. This means that your web server needs to handle less traffic and your website’s performance and scalability go through the roof. In fact Varnish Cache is often the single most critical piece of software in a web based business.'



To integrate this into our service, I hooked it into the existing HAProxy configuration which you can read about here: https://www.alfresco.com/blogs/devops/2014/07/16/haproxy-for-alfresco-updated-fror-haproxy-1-5/



The interaction of these two services can be visualised as:



[Diagram: Varnish layout]



The reason I integrated the two this way is that the HAProxy service already has the knowledge of the various proxy routes needed to run our service, so it was sensible to keep that knowledge in one place and not duplicate it. The end result is that the configuration for Varnish is really simple.



Below is the Varnish config - it listens on 127.0.0.1 so it can't be accessed inappropriately from the network.



It caches all static assets, and also doclib thumbnails. It strips cookies off these to ensure that all users share one cache for best performance.



It has a health check return result so HAProxy can monitor the health of the cache and bypass it if the Varnish service has any issues.



It also removes some of the standard headers that Varnish sets - to remove the risk of an information disclosure vulnerability (see https://www.owasp.org/index.php/Information_Leakage), and then sets a custom header that can be used to determine a cache hit or miss.



Config:

#
# Varnish config
# Caches all static files (images, js, css, txt, flash)
# but fetches dynamic content from the backend.
# Note: only static asset urls should end up here anyway.
#
backend default {
  .host = "127.0.0.1";
  .port = "8000";
  .first_byte_timeout = 300s;
}

# What files to cache
sub vcl_recv {
  # Health checking
  if (req.url == "/varnishcheck") {
    error 751 "health check OK!";
  }

  # Grace period (stale content delivery while revalidating)
  set req.grace = 5s;

  # Accept-Encoding header clean-up
  if (req.http.Accept-Encoding) {
    # Use gzip when possible, otherwise use deflate
    if (req.http.Accept-Encoding ~ "gzip") {
      set req.http.Accept-Encoding = "gzip";
    } elsif (req.http.Accept-Encoding ~ "deflate") {
      set req.http.Accept-Encoding = "deflate";
    } else {
      # Unknown algorithm, remove the Accept-Encoding header
      unset req.http.Accept-Encoding;
    }

    # Microsoft Internet Explorer 6 is well known to be buggy with compression and css/js
    if (req.url ~ "\.(css|js)" && req.http.User-Agent ~ "MSIE 6") {
      remove req.http.Accept-Encoding;
    }
  }

  # Cache all the cachable stuff!
  return (lookup);
}

# Strip the cookie before the object is inserted into the cache
sub vcl_fetch {
  if (req.url ~ "\.(png|gif|jpg|swf|css|js)$") {
    unset beresp.http.set-cookie;
  }
  if (req.url ~ "/content/thumbnails/") {
    unset beresp.http.set-cookie;
  }
  if (beresp.http.content-type ~ "(text|application)") {
    set beresp.do_gzip = true;
  }
  if (beresp.status == 404) {
    set beresp.ttl = 0s;
    return (hit_for_pass);
  }
  return (deliver);
}

# Add a response header to show whether the document was served from cache
sub vcl_deliver {
  unset resp.http.via;
  unset resp.http.x-varnish;
  if (obj.hits > 0) {
    set resp.http.V-Cache = "HIT";
  } else {
    set resp.http.V-Cache = "MISS";
  }
}

sub vcl_error {
  # Health check
  if (obj.status == 751) {
    set obj.status = 200;
    return (deliver);
  }
}
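Before wiring Varnish into HAProxy, it's worth confirming the health-check endpoint responds. A minimal check, assuming Varnish is listening on its default port of 6081 (the same port the HAProxy backend below connects to):

curl -sI http://127.0.0.1:6081/varnishcheck

This should return a 200 status, which is exactly what the HAProxy health check will look for.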


To be able to use Varnish, we modified our HAProxy configuration to include a new route for static assets to pass through to Varnish:

## Add a new frontend for Varnish to connect to.
## All this does is send traffic to the share backend.
# Frontend for Varnish connections
frontend httpvarnish
  bind 127.0.0.1:8000
  acl is_share path_reg ^/share
  use_backend share if is_share

## This bit needs to go in the main frontend (the one serving port 443, for example).
# ACLs to match on static asset paths or content types
acl static_assets path_reg ^/share/-default-/res/.*
acl static_assets path_end .gif .png .jpg .css .js .swf
acl static_assets path_reg /content/thumbnails/.*

# Varnish service check
acl varnish_available nbsrv(varnish_cache) ge 1

## Route traffic to Varnish if the Varnish service check has returned positive and we are serving a static asset.
# Make sure this is the first use_backend in the list.
use_backend varnish_cache if static_assets varnish_available

## Backend for connecting to Varnish
backend varnish_cache
  option redispatch
  cookie JSESSIONID
  # Varnish must tell us it's ready to accept traffic
  option httpchk HEAD /varnishcheck
  http-check expect status 200
  # Client IP information
  option forwardfor
  server varnish-1 localhost:6081 cookie share1 check inter 2000


Once this configuration is up and running, when you access your service, take a look at the response headers using your browser's developer tools and you should see a header like this:

v-cache:HIT


This shows that the asset is now being served from Varnish and hasn't had to be served by Alfresco.
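You can do the same check from the command line. A quick sketch using curl - the hostname and asset path are placeholders, so substitute your own service's hostname and a real static asset; request it twice and the second response should be a HIT:

curl -skI https://my.yourcompany.com/share/res/themes/default/presentation.css | grep -i v-cache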



The last three days' information for one of our 3 web nodes shows:

      660140         0.00         2.24 client_req - Client requests received

      441094         0.00         1.49 cache_hit - Cache hits


So, if all our web nodes served that many cache hits over that time period we've served ~18,000 cache hits per hour (or ~300 per minute) over those last 3 days. That's quite a lot of load shifted away from the Share/Alfresco service.

Extras:



There are a load of commands that come with Varnish that can be used to monitor the cache.



Here are a few of them - the best place to read more about them is the Varnish website listed above.



  • varnishd -C -f /etc/varnish/default.vcl #check that the varnish config parses
  • varnishlog -b #show log transactions between varnish and the backend
  • varnishlog -c #show log transactions between the clients and varnish
  • varnishtop -i txurl #will show you what your backend is being asked for the most
  • varnishtop -i rxurl #will show you what URLs are being asked for by the client
  • varnishtop -i rxurl -i RxHeader #the most frequent client URLs and request headers combined
  • varnishtop -i RxHeader -I Accept-Encoding #will show the most popular Accept-Encoding headers the clients are sending you
  • varnishtop -i rxurl -i txurl -i TxStatus -i TxResponse #client URLs, backend URLs and response statuses combined
  • varnishhist #reads the varnishd(1) shared memory logs and presents a continuously updated histogram showing the distribution of the last N requests by their processing time
  • varnishsizes #does the same as varnishhist, except it shows the size of the objects rather than the time taken to complete the request
  • varnishstat #varnish has lots of counters - misses, hits, information about the storage, threads created, deleted objects, just about everything. This command will dump these counters


Notice:

Now that HAProxy 1.5.x has been released, I thought I'd update this to bring the configuration in line with some of the syntax changes.

The changes are small, but without them the HAProxy service won't start.



-----------------------------------------------------------------------------------------



For the cloud service, we (Alfresco DevOps) used to use Apache for all our load balancing and reverse proxying, but more recently we switched to HAProxy for this task.



In this article I'll list some of the settings we use, and give a final example that could be used (with a few environment-specific modifications) for a general Alfresco deployment.



The main website for HAProxy is: http://haproxy.1wt.eu/

The docs can be found here: http://cbonte.github.io/haproxy-dconv/configuration-1.5.html



I suggest that for any of the settings covered in the rest of this article, you consult the HAProxy docs to gain a deeper understanding of what they do.



The 'global' section:

global
  pidfile /var/run/haproxy.pid
  log 127.0.0.1 local2 info
  stats socket /var/run/haproxy.stat user nagios group nagios mode 600 level admin


A quick breakdown of these:



  • global - defines global settings.


  • pidfile - Writes pids of all daemons into file <pidfile>.


  • log - Adds a global syslog server. Optional


  • stats socket - Sets up a statistics output socket. Optional



The 'defaults' section:

defaults
  mode http
  log global


A quick breakdown of these:



  • defaults - defines the default settings


  • mode - sets the working mode to http (rather than tcp)


  • log - sets the log context



Now we configure some options that specify how HAProxy works; these options are very important to getting your service working properly:

  option httplog
  option dontlognull
  option forwardfor
  option http-server-close
  option redispatch
  option tcp-smart-accept
  option tcp-smart-connect


These options do the following:



  • option httplog - enables logging of HTTP requests, session state and timers.
  • option dontlognull - disables logging of null connections, as these can pollute the logs.
  • option forwardfor - enables the insertion of the X-Forwarded-For header in requests sent to servers.
  • option http-server-close - enables HTTP connection closing on the server side. See the HAProxy docs for more info on this setting.
  • option redispatch - enables session redistribution in case of connection failure, which is important in an HA environment.
  • option tcp-smart-accept - a performance tweak, saving one ACK packet during the accept sequence.
  • option tcp-smart-connect - a performance tweak, saving one ACK packet during the connect sequence.



Next we define the timeouts - these are fairly self-explanatory:

  timeout http-request 10s
  timeout queue 1m
  timeout connect 5s
  timeout client 2m
  timeout server 2m
  timeout http-keep-alive 10s
  timeout check 5s
  retries 3


We then configure gzip compression to reduce the amount of data being sent across the wire - I'm sure no configuration ever misses out on this easy performance optimisation:

  compression algo gzip
  compression type text/html text/html;charset=utf-8 text/plain text/css text/javascript application/x-javascript application/javascript application/ecmascript application/rss+xml application/atomsvc+xml application/atom+xml application/atom+xml;type=entry application/atom+xml;type=feed application/cmisquery+xml application/cmisallowableactions+xml application/cmisatom+xml application/cmistree+xml application/cmisacl+xml application/msword application/vnd.ms-excel application/vnd.ms-powerpoint
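To confirm compression is actually negotiated end to end, a quick sketch (the hostname is a placeholder) - fetch a page with an Accept-Encoding header and check the response headers for Content-Encoding:

curl -sk -H 'Accept-Encoding: gzip' -D - -o /dev/null https://my.yourcompany.com/share/ | grep -i content-encoding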


The next section is some error message housekeeping. Change these paths to wherever you want to put your error messages:

  errorfile 400 /var/www/html/errors/400.http
  errorfile 403 /var/www/html/errors/403.http
  errorfile 408 /var/www/html/errors/408.http
  errorfile 500 /var/www/html/errors/500.http
  errorfile 502 /var/www/html/errors/502.http
  errorfile 503 /var/www/html/errors/503.http
  errorfile 504 /var/www/html/errors/504.http


Now we have finished setting up all our defaults, we can start to define our front ends (listening ports).



We first define our frontend on port 80. This just does a redirect to the https frontend:

# Front end for http to https redirect
frontend http
  bind *:80
  redirect location https://my.yourcompany.com/share/
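A one-line sanity check of the redirect (the hostname is a placeholder):

curl -sI http://my.yourcompany.com/ | grep -i '^location'

This should print the https://my.yourcompany.com/share/ location we configured above.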


Next we define our https frontend which is where all traffic to Alfresco is handled:

# Main front end for all services
frontend https
  bind *:443 ssl crt /path/to/yourcert/yourcert.pem
  capture request header X-Forwarded-For len 64
  capture request header User-agent len 256
  capture request header Cookie len 64
  capture request header Accept-Language len 64


We now get into the more 'fun' part of configuring HAProxy - setting up the acls.

These acls are the mechanism used to match requests to the appropriate backend, or to deny unwanted traffic. I suggest that if you are unfamiliar with HAProxy you have a good read of the docs on acls and what they can achieve (section 7 in the docs).



We separate out all the different endpoints for Alfresco into their own sub-domains, e.g. my.alfresco.com for Share access, webdav.alfresco.com for WebDAV, sp.alfresco.com for SharePoint access.

I'll use these three endpoints in the examples below, using the following mapping:



  • Share - my.yourcompany.com


  • Webdav - webdav.yourcompany.com


  • Sharepoint - sp.yourcompany.com



We first set up some acls that match on the host name being accessed. Anything coming in that doesn't match these won't have an acl associated (and therefore won't be forwarded to any service).

  # ACL for backend mapping based on host header
  acl is_my hdr_beg(host) -i my.yourcompany.com
  acl is_webdav hdr_beg(host) -i webdav.yourcompany.com
  acl is_sp hdr_beg(host) -i sp.yourcompany.com


These are in the syntax:

acl acl_name match_expression case_insensitive(-i) what_to_match

So, acl is_my hdr_beg(host) -i my.yourcompany.com states:



  • acl - define this as an acl.


  • is_my - give the acl the name 'is_my'.


  • hdr_beg(host) - set the match expression to use the host HTTP header, checking the beginning of the value.


  • -i - set the check to be case insensitive


  • my.yourcompany.com - the value to check for.



We then do some further mapping based on url paths in the request using some standard regex patterns:

  # ACL for backend mapping based on url paths
  acl robots path_reg ^/robots.txt$
  acl alfresco_path path_reg ^/alfresco/.*
  acl share_path path_reg ^/share/.*/proxy/alfresco/api/solr/.*
  acl share_redirect path_reg ^$|^/$


These do the following:



  • acl robots - checks for a web bot harvesting the robots.txt file


  • acl alfresco_path - checks whether the request is trying to access the alfresco webapp. We deny direct access to the Alfresco Explorer webapp so you can remove this check if you want that webapp available for use.


  • acl share_path - We use this to deny direct access to the Solr API.


  • acl share_redirect - this matches requests with an empty path or a bare '/', i.e. no webapp context (such as /share) was given



We next add in some 'good practice' - a HSTS header. You can find out more about HSTS here: https://www.owasp.org/index.php/HTTP_Strict_Transport_Security

Note, my.alfresco.com is in the internal HSTS list in both Chrome and Firefox so neither of these browsers will ever try to access the service using plain http (see http://www.chromium.org/sts).

  # Changes to header responses
  rspadd Strict-Transport-Security:\ max-age=15768000


We next set up some deny rules; you can ignore these if you don't want to limit access to any service. The example below denies public access to the Alfresco Explorer webapp via the 'my.yourcompany.com' route. These rules use the acls defined earlier, and can combine multiple acls that must all be true.

  # Denied paths
  http-request deny if alfresco_path is_my
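A quick way to verify the rule once the service is up (the hostname is a placeholder) - HAProxy answers denied requests with a 403 by default:

curl -skI https://my.yourcompany.com/alfresco/ | head -1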


Now we redirect to /share/ if this wasn't in the url path used to access the service.

  # Redirects
  redirect location /share/ if share_redirect is_my


Next we set up the list of backends to use, matched against the already defined acls.

  # List of backends
  use_backend share if is_my
  use_backend webdav if is_webdav
  use_backend sharepoint if is_sp


Then we set up the default backend to use as a catch-all:

default_backend share


Now we define the backends, the first being for share:

backend share


On this backend, enable the stats page:

  # Enable the stats page on the share backend
  stats enable
  stats hide-version
  stats auth <user>:<password>
  stats uri /monitor
  stats refresh 2s
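The stats page is served through the same frontend, so as a rough check you can fetch it with the credentials configured above (the hostname, user and password are placeholders):

curl -sk -u <user>:<password> https://my.yourcompany.com/monitor | head

In a browser you get the full auto-refreshing dashboard.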


The stats page gives you a visual view on the health of your backends and is a very powerful monitoring tool.

  option httpchk GET /share
  balance leastconn
  cookie JSESSIONID prefix
  server tomcat1 server1:8080 cookie share1 check inter 5000
  server tomcat2 server2:8080 cookie share2 check inter 5000


These define the following:



  • backend share - this defines a backend called share, which is used by the use_backend config from above.
  • option httpchk GET /share - this enables http health checks, using a http GET, on the /share path. Server health checks are one of the most powerful features of HAProxy and work hand in hand with tomcat session replication to move an active session to another server if the server your active session is on fails health checks.
  • balance leastconn - this sets the balancing algorithm. leastconn selects the server with the lowest number of connections to receive the connection.
  • cookie JSESSIONID prefix - this enables cookie-based persistence in a backend. Share requires a sticky session, and this is also used in session replication.
  • server tomcat1 server1:8080 cookie share1 check inter 5000 - this breaks down into:
      • server - declares a server and its parameters
      • tomcat1 - the server name; this appears in the logs
      • server1:8080 - the server address (and port)
      • cookie share1 - this checks the cookie defined above and, if matched, routes the user to the relevant server. The 'share1' value has to match the jvmRoute set on the appserver for Share/Alfresco (for Tomcat see http://tomcat.apache.org/tomcat-7.0-doc/cluster-howto.html)
      • check inter 5000 - this enables the health check, with an inter(val) of 5000 ms



Define the webdav backend.

Here we hide the need to enter /alfresco/webdav on the url path, which gives a neater and shorter url for accessing webdav; again, we enable server health checking:

backend webdav
  option httpchk GET /alfresco
  reqrep ^([^\ ]*)\ /(.*) \1\ /alfresco/webdav/\2
  server tomcat1 server1:8080 check inter 5000
  server tomcat2 server2:8080 check inter 5000
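As a sketch of the effect of that reqrep rule (the hostname and credentials are placeholders): a WebDAV listing of the repository root can be requested without any /alfresco/webdav prefix, and the proxy rewrites the path before it reaches Tomcat:

curl -sku user:password -X PROPFIND -H 'Depth: 1' https://webdav.yourcompany.com/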


Define the SPP backend.

Here we define the backend for the sharepoint protocol, again with health checks:

backend sharepoint
  balance url_param VTISESSIONID check_post
  cookie VTISESSIONID prefix
  server tomcat1 server1:7070 cookie share1 check inter 5000
  server tomcat2 server2:7070 cookie share2 check inter 5000


Once this is all in place you should be able to start HAProxy. If there are any errors, you will be told which lines of the config they are on. Alternatively, if you have HAProxy set up as a service, you should be able to run 'service haproxy check' to check the config without starting HAProxy.
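If you'd rather check the config by calling the binary directly, the equivalent is (adjust the path to wherever your config lives):

haproxy -c -f /etc/haproxy/haproxy.cfg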



There are many more cool things you can do with HAProxy, so give it a go and don't forget to have a good read of the docs!

HAProxy for Alfresco

Posted by spoogegibbon Nov 13, 2013
Since this post was published, HAProxy 1.5(.x) has been released, so this post is now out of date.

An updated post with the changes relevant to HAProxy 1.5 can be found here: https://www.alfresco.com/blogs/devops/?p=8




Introduction



So you've installed Alfresco in Amazon AWS, and your contentstore is either on a local ephemeral disk or on EBS.

This guide is to help you migrate from these to S3 using the Alfresco S3 Connector.



The information in this guide was compiled during the contentstore migration from EBS to S3 for one of our large AWS Alfresco users.



There are a variety of reasons to migrate the contentstore to S3. The main one is to increase the resilience of the store - during most of the AWS outages it has been EBS that has been most affected, including data loss (search Google for 'ebs data loss').

With S3's 'Designed for 99.999999999% durability and 99.99% availability of objects over a given year' SLA, and Amazon S3 Server Side Encryption (SSE), putting your content on S3 means you will have secure and available content items at all times.

Set up S3



First of all, create a new S3 bucket for you to use.

Make a note of the Bucket name. This will be used in all places tagged <s3_bucket>.



It is also a good idea to secure the bucket more than the default, using IAM - see the AWS documentation for this.



Next, install a tool that will allow you to migrate your existing content to S3, such as s3cmd from the S3tools project.



If you are using RHEL6, the instructions are as follows (for other operating systems follow the instructions on the s3tools website). As root:

cd /etc/yum.repos.d
wget http://s3tools.org/repo/RHEL_6/s3tools.repo
yum install s3cmd
s3cmd --configure


Follow the instructions and enter the credentials asked for so you can connect to your bucket.



Once set up, check connectivity using:

s3cmd ls


This should list your buckets.

Copy your content to S3



Navigate to your contentstore directory:

cd /<dir_path>/alf_data/contentstore


If you want to check to see what will be uploaded to S3, perform a dry run first:

s3cmd sync --dry-run ./ s3://<s3_bucket>/contentstore/


Once you are happy that all is well, start the upload:

s3cmd sync ./ s3://<s3_bucket>/contentstore/


Navigate to your contentstore.deleted directory (these steps are only needed if you want to keep your deleted files):

cd /<dir_path>/alf_data/contentstore.deleted


If you want to check to see what will be uploaded to S3, perform a dry run first:

s3cmd sync --dry-run ./ s3://<s3_bucket>/contentstore.deleted/


Once you are happy that all is well, start the upload:

s3cmd sync ./ s3://<s3_bucket>/contentstore.deleted/-system-/


If your contentstore is not massive and you have space on your ephemeral disks, you can copy your contentstore to 'cachedcontent' - this will mean that the S3 cached content is pre-populated. It is much better to have this on the local ephemeral disk than on EBS.

cp -r contentstore cachedcontent
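Before copying, it's worth a quick sanity check that the disk holding the cache has the room - a sketch, with <dir_path> being the same placeholder used above:

du -sh /<dir_path>/alf_data/contentstore
df -h /<dir_path>/alf_data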


Alfresco S3 Connector



Download the Alfresco S3 Connector.

Once downloaded, follow the installation steps in its documentation to install the module into Alfresco.

alfresco-global.properties



There are some changes you will need to make to your 'alfresco-global.properties'. These are all documented in the Alfresco S3 Connector documentation. The changes are:

s3.accessKey=<put your account access key or IAM key here>
s3.secretKey=<put your account secret key or IAM secret here>
s3.bucketName=<s3_bucket>
s3.bucketLocation=<see http://docs.aws.amazon.com/general/latest/gr/rande.html#s3_region>
s3.flatRoot=false
s3.encryption=AES256
dir.contentstore=contentstore
dir.contentstore.deleted=contentstore.deleted





If you are using lucene, set the following if it is not already set:

index.recovery.mode=AUTO


Make sure that Alfresco is stopped before you progress any further.

DB Update



Once your content is all in S3, and your Alfresco properties are all configured to use S3 as the contentstore location, there is one final step to perform - updating the database!

One of the tables Alfresco uses has a column that links each item of content to its location. Since we have moved the content to S3, we need to update all these links in the DB. Luckily it's easy :)



First, get the details of your database configuration from 'alfresco-global.properties':

db.name=<db.name>
db.username=<db.username>
db.password=<db.password>
db.host=<db.host>


If the mysql tools are not already installed on your box, install them, e.g.

yum install mysql


Run mysqldump, connecting to your DB, and dump the table called 'alf_content_url'.

The command below does this (you will be prompted for the user's password):

mysqldump -u <db.username> -p -h <db.host> <db.name> alf_content_url > s3_migration.sql


Next, make a backup of this dump in case anything goes hideously wrong :)

cp s3_migration.sql s3_migration.sql.bak


Then we need to change every store location to point to S3.

This involves changing the values of the 'content_url' column from 'store://...' to 's3://...'.

Here's a command to do this (if you are on linux):

sed -i 's/store:\/\//s3:\/\//g' s3_migration.sql
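Before importing, a quick verification that the rewrite caught everything - the first count should be zero, and the second should match the number of content URLs in the dump:

grep -c 'store://' s3_migration.sql
grep -c 's3://' s3_migration.sql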


Once that completes successfully, you now need to re-import this table data.

Connect to your mysql db (you will be prompted for the user's password):

mysql -u <db.username> -p -h <db.host>


Switch to use the database that Alfresco uses:

use <db.name>;


Import your modified sql file:

source s3_migration.sql;


Exit mysql.
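As a final check that the links were updated, you can pull a few rows back out (using the same placeholders as above) - every content_url should now start with 's3://':

mysql -u <db.username> -p -h <db.host> <db.name> -e 'SELECT content_url FROM alf_content_url LIMIT 5;'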



So, to recap:

  • The S3 bucket has been created.
  • An S3 command line tool such as s3cmd has been installed.
  • Content has been copied to S3.
  • The 'Alfresco S3 Connector' module has been installed into your Alfresco instance.
  • alfresco-global.properties has been updated.
  • Alfresco has been stopped.
  • A dump of the 'alf_content_url' table has been made, and a backup of that made.
  • The store locations have been modified in the sql dump.
  • The modified dump file has been re-imported into your mysql db.



You are now ready to restart Alfresco...



There are a few methods to check that the S3 connector is all working:

1. Monitor the 'cachedcontent' directory - it is used as a cache for the S3 content so that Alfresco doesn't have to request frequently used content from S3 each time it is used (see the sketch after this list).

2. Upload some new content and check the S3 bucket.

3. Enable logging for jets3t as below and see what the logs say.
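For the first of these, something as simple as watching the cache directory grow while you browse content will do - a sketch, with <dir_path> being the same placeholder used earlier:

watch -n 5 'du -sh /<dir_path>/alf_data/cachedcontent'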



If things don't work, try re-syncing your content.

Pro tips



You can enable JMX Instrumentation on the S3 connector by adding the following JAVA_OPTS to your Alfresco start scripts:

'-Djets3t.mx -Djets3t.bucket.mx=true -Djets3t.object.mx=true'



Logging - The S3 connector is based on jets3t, so follow the logging information for this tool:

http://jets3t.s3.amazonaws.com/toolkit/guide.html
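For example, a hypothetical log4j.properties fragment to surface the jets3t activity (check the guide above for the authoritative logging setup):

log4j.logger.org.jets3t=DEBUG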


