
What?



This blog post is about speeding up the delivery of content from Alfresco.



The example I'll discuss here might not make a noticeable difference to the end user; rather, it frees up resources on the Alfresco server so it can get on with the job of delivering information.



This is done using a cache in front of the application server running Alfresco.

Background:



Running Alfresco in the Cloud has meant we've had to invest in monitoring solutions for our application that we wouldn't normally have needed for our internal instances - in this case one of the tools we are using is called AppDynamics.



One of the immediate things it showed was that a high percentage of all calls to Alfresco were for static assets - JavaScript files, CSS files, images, and so on - that are used to build the static parts of the viewed page.



By offloading these to a caching layer, the Alfresco application can just concentrate on serving the dynamic content to the user - hopefully faster too :)

What to use:



After researching the different tools to use to cache the static assets, I opted for Varnish. https://www.varnish-cache.org/



Varnish says this about itself on its website:



'Varnish Cache is a web accelerator, sometimes referred to as a HTTP accelerator or a reverse HTTP proxy, that will significantly enhance your web performance and web content delivery. Varnish Cache speeds up a website by storing a copy of the page served by the web server the first time a user visits that page. The next time a user requests the same page, Varnish will serve the copy instead of requesting the page from the web server. This means that your web server needs to handle less traffic and your website’s performance and scalability go through the roof. In fact Varnish Cache is often the single most critical piece of software in a web based business. '



To integrate this into our service, I hooked it into the existing HAProxy configuration which you can read about here: https://www.alfresco.com/blogs/devops/2014/07/16/haproxy-for-alfresco-updated-fror-haproxy-1-5/



The interaction of these two services can be visualised as:



[Diagram: Varnish Layout]



The reason I integrated the two this way was that the HAProxy service already has the knowledge of the various proxy routes needed to run our service, so it was sensible to keep this knowledge in one place and not duplicate it. The end result is that the configuration for Varnish is really simple.



Below is the Varnish config - it listens on 127.0.0.1 so it can't be accessed inappropriately from the network.



It caches all static assets, and also doclib thumbnails. It strips cookies off these to ensure that all users share one cache for best performance.



It returns a health-check result so that HAProxy can monitor the health of the cache and bypass it if the Varnish service has any issues.



It also removes some of the standard headers that Varnish sets - to remove the risk of an information disclosure vulnerability (see https://www.owasp.org/index.php/Information_Leakage), and then sets a custom header that can be used to determine a cache hit or miss.



Config:

#
# varnish config
# caches all static files (images, js, css, txt, flash)
# but requests dynamic content from the backend.
# Note, only static asset urls should end up here anyway.
#
backend default {
  .host = "127.0.0.1";
  .port = "8000";
  .first_byte_timeout = 300s;
}

# what files to cache
sub vcl_recv {
  # Health checking
  if (req.url == "/varnishcheck") {
    error 751 "health check OK!";
  }

  # grace period (stale content delivery while revalidating)
  set req.grace = 5s;

  # Accept-Encoding header clean-up
  if (req.http.Accept-Encoding) {
    # use gzip when possible, otherwise use deflate
    if (req.http.Accept-Encoding ~ "gzip") {
      set req.http.Accept-Encoding = "gzip";
    } elsif (req.http.Accept-Encoding ~ "deflate") {
      set req.http.Accept-Encoding = "deflate";
    } else {
      # unknown algorithm, remove accept-encoding header
      unset req.http.Accept-Encoding;
    }

    # Microsoft Internet Explorer 6 is well known to be buggy with compression and css / js
    if (req.url ~ "\.(css|js)" && req.http.User-Agent ~ "MSIE 6") {
      remove req.http.Accept-Encoding;
    }
  }

  # Cache all the cachable stuff!
  return(lookup);
}

# strip the cookie before the object is inserted into the cache
sub vcl_fetch {
  if (req.url ~ "\.(png|gif|jpg|swf|css|js)$") {
    unset beresp.http.set-cookie;
  }
  if (req.url ~ "/content/thumbnails/") {
    unset beresp.http.set-cookie;
  }
  if (beresp.http.content-type ~ "(text|application)") {
    set beresp.do_gzip = true;
  }
  if (beresp.status == 404) {
    set beresp.ttl = 0s;
    return (hit_for_pass);
  }
  return (deliver);
}

# add response header to see if document was cached
sub vcl_deliver {
  unset resp.http.via;
  unset resp.http.x-varnish;
  if (obj.hits > 0) {
    set resp.http.V-Cache = "HIT";
  } else {
    set resp.http.V-Cache = "MISS";
  }
}

sub vcl_error {
  # Health check
  if (obj.status == 751) {
    set obj.status = 200;
    return (deliver);
  }
}


To be able to use Varnish, we modified our HAProxy configuration to include a new route for static assets to pass through to Varnish:

## Add a new Frontend for Varnish to connect to
## All this does is send traffic to the share backend

# Front end for Varnish connections
frontend httpvarnish
  bind 127.0.0.1:8000
  acl is_share path_reg ^/share
  use_backend share if is_share

## This bit needs to go in the main Frontend, serving port 443 for example.

# acl to match on static asset paths, or content types
acl static_assets path_reg ^/share/-default-/res/.*
acl static_assets path_end .gif .png .jpg .css .js .swf
acl static_assets path_reg /content/thumbnails/.*

# Varnish service check
acl varnish_available nbsrv(varnish_cache) ge 1

## Route traffic to Varnish if the Varnish service check has returned positive,
## and we are serving a static asset
# Make sure this is the first use_backend in the list
use_backend varnish_cache if static_assets varnish_available

## Backend for connecting to Varnish
backend varnish_cache
  option redispatch
  cookie JSESSIONID
  # Varnish must tell us it's ready to accept traffic
  option httpchk HEAD /varnishcheck
  http-check expect status 200
  # client IP information
  option forwardfor
  server varnish-1 localhost:6081 cookie share1 check inter 2000


Once this configuration is up and running, access your service and take a look at the response headers using your browser's developer tools - you should see a header like this:

v-cache:HIT


This shows that the asset is now being served from Varnish and hasn't had to be served by Alfresco.
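As a quick sanity check from the command line, something like this should show the header (the URL here is a placeholder - substitute a static asset path from your own deployment):

```shell
# Placeholder URL - replace with a static asset on your own service.
URL="https://my.yourcompany.com/share/res/css/base.css"

# -s silent, -k allow self-signed certs, -I send a HEAD request.
# The first request for an asset should report V-Cache: MISS; repeating
# the same request should report V-Cache: HIT once Varnish has cached it.
curl -skI "$URL" | grep -i '^v-cache' || echo "no V-Cache header returned"
```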



The last three days information for one of our 3 web nodes shows:

      660140         0.00         2.24 client_req - Client requests received
      441094         0.00         1.49 cache_hit - Cache hits


So, if all our web nodes served that many cache hits over that time period we've served ~18,000 cache hits per hour (or ~300 per minute) over those last 3 days. That's quite a lot of load shifted away from the Share/Alfresco service.
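Those two counters can be turned into a hit ratio directly; a quick shell calculation using the numbers above:

```shell
# Counters from the varnishstat output above.
client_req=660140
cache_hit=441094

# Cache hit ratio as a percentage.
ratio=$(awk -v h="$cache_hit" -v r="$client_req" 'BEGIN { printf "%.1f", 100 * h / r }')
echo "cache hit ratio: ${ratio}%"        # prints: cache hit ratio: 66.8%

# Approximate hits per hour across 3 web nodes over 3 days (72 hours).
echo "hits per hour: $(( cache_hit * 3 / 72 ))"        # prints: hits per hour: 18378
```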

Extras:



There are a load of commands that come with Varnish that can be used to monitor the cache.



Here are a few of them - the best place to read more about them is the Varnish website listed above.



  • varnishd -C -f /etc/varnish/default.vcl #check that the varnish config compiles cleanly

  • varnishlog -b #show shared memory log entries for backend (varnish to web server) traffic

  • varnishlog -c #show shared memory log entries for client (browser to varnish) traffic

  • varnishtop -i txurl #will show you what your backend is being asked the most

  • varnishtop -i rxurl #will show you what URLs are being asked for by the client

  • varnishtop -i rxurl -i RxHeader #combine the client URL and request header log tags

  • varnishtop -i RxHeader -I Accept-Encoding #will show the most popular Accept-Encoding header the clients are sending you

  • varnishtop -i rxurl -i txurl -i TxStatus -i TxResponse #client URLs, backend URLs and backend response codes together

  • varnishhist #reads the varnishd(1) shared memory logs and presents a continuously updated histogram showing the distribution of the last N requests by their processing time.

  • varnishsizes #does the same as varnishhist, except it shows the size of the objects and not the time taken to complete the request.

  • varnishstat #varnish has lots of counters - misses, hits, information about the storage, threads created, deleted objects, just about everything. This command will dump these counters.


Notice:

Now that HAProxy 1.5.x has been released, I thought I'd update this to bring the configuration in line with some of the syntax changes.

The changes are small but without them the haproxy service won't start.



-----------------------------------------------------------------------------------------



For the cloud service we (Alfresco DevOps) used to use Apache for all our load balancing and reverse proxying, but more recently we switched to HAProxy for this task.



In this article I'll list some of the settings we use, and give a final example that could be used (with a few environment-specific modifications) for a general Alfresco deployment.



The main website for HAProxy is: http://haproxy.1wt.eu/

The docs can be found here: http://cbonte.github.io/haproxy-dconv/configuration-1.5.html



I suggest that for any of the settings covered in the rest of this article, the HAProxy docs are consulted to gain a deeper understanding of what they do.



The 'global' section:

global
  pidfile /var/run/haproxy.pid
  log 127.0.0.1 local2 info
  stats socket /var/run/haproxy.stat user nagios group nagios mode 600 level admin


A quick breakdown of these:



  • global - defines global settings.


  • pidfile - Writes pids of all daemons into file <pidfile>.


  • log - Adds a global syslog server. Optional


  • stats socket - Sets up a statistics output socket. Optional



The 'defaults' section:

defaults
  mode http
  log global


A quick breakdown of these:



  • defaults - defines the default settings


  • mode - sets the working mode to http (rather than tcp)


  • log - sets the log context



Now we configure some options that specify how HAProxy works; these options are very important for getting your service working properly:

  option httplog
  option dontlognull
  option forwardfor
  option http-server-close
  option redispatch
  option tcp-smart-accept
  option tcp-smart-connect


These options do the following:



  • option httplog - this enables logging of HTTP request, session state and timers.


  • option dontlognull - disable logging of null connections as these can pollute the logs.


  • option forwardfor - enables the insertion of the X-Forwarded-For header to requests sent to servers.


  • option http-server-close - enable HTTP connection closing on the server side. See the HAProxy docs for more info on this setting.


  • option redispatch - enable session redistribution in case of connection failure, which is important in a HA environment.


  • option tcp-smart-accept - this is a performance tweak, saving one ACK packet during the accept sequence.


  • option tcp-smart-connect - this is a performance tweak, saving one ACK packet during the connect sequence.



Next we define the timeouts - these are fairly self-explanatory:

  timeout http-request 10s
  timeout queue 1m
  timeout connect 5s
  timeout client 2m
  timeout server 2m
  timeout http-keep-alive 10s
  timeout check 5s
  retries 3


We then configure gzip compression to reduce the amount of data being sent across the wire - I'm sure no configuration ever misses out on this easy performance optimisation:

  compression algo gzip
  compression type text/html text/html;charset=utf-8 text/plain text/css text/javascript application/x-javascript application/javascript application/ecmascript application/rss+xml application/atomsvc+xml application/atom+xml application/atom+xml;type=entry application/atom+xml;type=feed application/cmisquery+xml application/cmisallowableactions+xml application/cmisatom+xml application/cmistree+xml application/cmisacl+xml application/msword application/vnd.ms-excel application/vnd.ms-powerpoint
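To get a feel for what compression buys you, here's a rough standalone illustration (plain gzip on a repetitive CSS-like sample, not HAProxy itself) of the kind of size reduction compressible text assets typically see:

```shell
# Build a sample of repetitive CSS-like text - stylesheets compress very well.
printf 'body { margin: 0; padding: 0; color: #333; }\n%.0s' $(seq 1 200) > /tmp/sample.css

# Compress a copy alongside the original.
gzip -c /tmp/sample.css > /tmp/sample.css.gz

# Compare sizes - the compressed copy should be a small fraction of the original.
wc -c /tmp/sample.css /tmp/sample.css.gz
```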


The next section is some error message housekeeping. Change these paths to wherever you want to put your error messages:

  errorfile 400 /var/www/html/errors/400.http
  errorfile 403 /var/www/html/errors/403.http
  errorfile 408 /var/www/html/errors/408.http
  errorfile 500 /var/www/html/errors/500.http
  errorfile 502 /var/www/html/errors/502.http
  errorfile 503 /var/www/html/errors/503.http
  errorfile 504 /var/www/html/errors/504.http
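Each of these files contains a complete raw HTTP response (status line, headers, a blank line, then the body) which HAProxy returns verbatim. As a sketch, a minimal 503.http might look like:

```
HTTP/1.0 503 Service Unavailable
Cache-Control: no-cache
Connection: close
Content-Type: text/html

<html><body><h1>503 Service Unavailable</h1>
<p>The service is temporarily unavailable - please try again shortly.</p>
</body></html>
```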


Now we have finished setting up all our defaults, we can start to define our front ends (listening ports).



We first define our frontend on port 80. This just does a redirect to the https frontend:

# Front end for http to https redirect
frontend http
  bind *:80
  redirect location https://my.yourcompany.com/share/


Next we define our https frontend which is where all traffic to Alfresco is handled:

# Main front end for all services
frontend https
  bind *:443 ssl crt /path/to/yourcert/yourcert.pem
  capture request header X-Forwarded-For len 64
  capture request header User-agent len 256
  capture request header Cookie len 64
  capture request header Accept-Language len 64


We now get into the more 'fun' part of configuring HAProxy - setting up the acls.

These acls are the mechanism used to match requests coming into the service to the appropriate backend, or to deny unwanted traffic. If you are unfamiliar with HAProxy, I suggest you have a good read of the docs on acls and what they can achieve (section 7 in the docs).



We separate out all the different endpoints for Alfresco into their own sub-domain, e.g. my.alfresco.com for Share access, webdav.alfresco.com for WebDAV, sp.alfresco.com for Sharepoint access.

I'll use these three endpoints in the examples below, using the following mapping:



  • Share - my.yourcompany.com


  • Webdav - webdav.yourcompany.com


  • Sharepoint - sp.yourcompany.com



We first set up some acls that check the host name being accessed and match on those. Anything coming in that doesn't match these won't get an acl associated (and therefore won't get forwarded to any service).

# ACL for backend mapping based on host header
acl is_my hdr_beg(host) -i my.yourcompany.com
acl is_webdav hdr_beg(host) -i webdav.yourcompany.com
acl is_sp hdr_beg(host) -i sp.yourcompany.com


These are in the syntax:

acl acl_name match_expression case_insensitive(-i) what_to_match

So, acl is_my hdr_beg(host) -i my.yourcompany.com states:



  • acl - define this as an acl.


  • is_my - give the acl the name 'is_my'.


  • hdr_beg(host) - set the match expression to use the host HTTP header, checking the beginning of the value.


  • -i - set the check to be case insensitive


  • my.yourcompany.com - the value to check for.



We then do some further mapping based on url paths in the request using some standard regex patterns:

# ACL for backend mapping based on url paths
acl robots path_reg ^/robots.txt$
acl alfresco_path path_reg ^/alfresco/.*
acl share_path path_reg ^/share/.*/proxy/alfresco/api/solr/.*
acl share_redirect path_reg ^$|^/$


These do the following:



  • acl robots - checks for a web bot harvesting the robots.txt file


  • acl alfresco_path - checks whether the request is trying to access the alfresco webapp. We deny direct access to the Alfresco Explorer webapp so you can remove this check if you want that webapp available for use.


  • acl share_path - We use this to deny direct access to the Solr API.


  • acl share_redirect - this checks whether there is any context at the end of the request (e.g. /share)
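The share_redirect pattern is worth a second look: '^$|^/$' matches either an empty path or a bare '/'. The same regex can be sanity-checked with grep - the paths below are just hypothetical examples:

```shell
# Try the share_redirect regex against a few example request paths.
for p in "" "/" "/share/" "/alfresco/faces"; do
  if printf '%s\n' "$p" | grep -qE '^$|^/$'; then
    echo "redirects: '$p'"
  else
    echo "passes through: '$p'"
  fi
done
```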



We next add in some 'good practice' - an HSTS header. You can find out more about HSTS here: https://www.owasp.org/index.php/HTTP_Strict_Transport_Security

Note, my.alfresco.com is in the internal HSTS list in both Chrome and Firefox so neither of these browsers will ever try to access the service using plain http (see http://www.chromium.org/sts).

# Changes to header responses
rspadd Strict-Transport-Security:\ max-age=15768000


We next set up some deny rules; you can ignore these if you don't want to limit access to any service. The example below denies public access to the Alfresco Explorer webapp via the 'my.yourcompany.com' route. These use the acls matched earlier, and can combine multiple acls that must all be true.

# Denied paths
http-request deny if alfresco_path is_my


Now we redirect to /share/ if this wasn't in the url path used to access the service.

# Redirects
redirect location /share/ if share_redirect is_my


Next we set up the list of backends to use, matched against the already defined acls.

# List of backends
use_backend share if is_my
use_backend webdav if is_webdav
use_backend sharepoint if is_sp


Then we set up the default backend to use as a catch-all:

default_backend share


Now we define the backends, the first being for share:

backend share


On this backend, enable the stats page:

# Enable the stats page on share backend
  stats enable
  stats hide-version
  stats auth <user>:<password>
  stats uri /monitor
  stats refresh 2s


The stats page gives you a visual view on the health of your backends and is a very powerful monitoring tool.

  option httpchk GET /share
  balance leastconn
  cookie JSESSIONID prefix
  server tomcat1 server1:8080 cookie share1 check inter 5000
  server tomcat2 server2:8080 cookie share2 check inter 5000


These define the following:



  • backend share - this defines a backend called share, which is used by the use_backend config from above.


  • option httpchk GET /share - this enables http health checks, using a http GET, on the /share path. Server health checks are one of the most powerful features of HAProxy and work hand in hand with Tomcat session replication to move an active session to another server if the server your active session is on fails its health checks.


  • balance leastconn - this sets up the balancing algorithm. leastconn selects the server with the lowest number of connections to receive the connection.


  • cookie JSESSIONID prefix - this enables cookie-based persistence in a backend. Share requires a sticky session and this also is used in session replication.


  • server tomcat1 server1:8080 cookie share1 check inter 5000 - this breaks down into:


  • server - this declares a server and its parameters


  • tomcat1 - this is the server name and appears in the logs


  • server1:8080 - this is the server address (and port)


  • cookie share1 - this checks the cookie defined above and, if matched, routes the user to the relevant server. The 'share1' value has to match the jvmRoute set on the appserver for Share/Alfresco (for Tomcat see http://tomcat.apache.org/tomcat-7.0-doc/cluster-howto.html)


  • check inter 5000 - this sets the health check, with an inter(val) of 5000 ms



Define the webdav backend.

Here we hide the need to enter /alfresco/webdav on the url path, which gives a neater and shorter url for accessing webdav, and again we enable server health checking:

backend webdav
  option httpchk GET /alfresco
  reqrep ^([^\ ]*)\ /(.*) \1\ /alfresco/webdav/\2
  server tomcat1 server1:8080 check inter 5000
  server tomcat2 server2:8080 check inter 5000
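To see what that reqrep line actually does to a request, the same substitution can be reproduced with sed: capture the HTTP method, capture everything after the leading '/', and re-root the path under /alfresco/webdav/. The request line below is just an example:

```shell
# An example request line, as HAProxy would see it from the client.
request='GET /docs/report.doc HTTP/1.1'

# Mirror of the reqrep regex: \1 is the method, \2 is everything after "/".
echo "$request" | sed -E 's|^([^ ]*) /(.*)|\1 /alfresco/webdav/\2|'
# prints: GET /alfresco/webdav/docs/report.doc HTTP/1.1
```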


Define the SPP backend.

Here we define the backend for the sharepoint protocol, again with health checks:

backend sharepoint
  balance url_param VTISESSIONID check_post
  cookie VTISESSIONID prefix
  server tomcat1 server1:7070 cookie share1 check inter 5000
  server tomcat2 server2:7070 cookie share2 check inter 5000


Once this is all in place you should be able to start HAProxy. If there are any errors, you will be told which lines of the config they are on. Alternatively, if you have HAProxy installed as a service, you should be able to run 'service haproxy check' to validate the config without starting HAProxy.



There are many more cool things you can do with HAProxy, so give it a go and don't forget to have a good read of the docs!
