
I am using Alfresco community version: 5.2.0 (r135134-b14)

Suppose you have a document with an aspect that defines an association to cm:person.

   <aspects>
      <aspect name="ad:approveDocument">
         <title>Approver for Approve document</title>
         <associations>
            <association name="ad:approver">
               <title>Assignee</title>
               <source>
                  <mandatory>false</mandatory>
                  <many>true</many>
               </source>
               <target>
                  <class>cm:person</class>
                  <mandatory>false</mandatory>
                  <many>false</many>
               </target>
            </association>

            ...

When you start a workflow on a document with this aspect, you can read the ad:approver association.
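A minimal sketch of reading it (using the same nodeService, docRef and adNamespace as in the snippet below; getTargetAssocs is the standard NodeService call for peer associations):

List<AssociationRef> approvers = nodeService.getTargetAssocs(docRef, QName.createQName(adNamespace, "approver"));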

To set the ad:approver association from workflow code, you can do the following:

private static final String adNamespace = "http://www.namespace.com/model/approveDocument";

// Look up the services from the registry that Activiti exposes to the delegate
ServiceRegistry serviceRegistry = (ServiceRegistry) registeredBeans.get(ActivitiConstants.SERVICE_REGISTRY_BEAN_KEY);
NodeService nodeService = serviceRegistry.getNodeService();
PersonService personService = serviceRegistry.getPersonService();

// The workflow package holds the documents the workflow was started on
NodeRef docRef = nodeService.getChildAssocs(packagenode).get(0).getChildRef();

// Resolve the username to its cm:person node and set the peer association
String username = "myUser";
NodeRef person = personService.getPerson(username);
nodeService.setAssociations(docRef, QName.createQName(adNamespace, "approver"), Arrays.asList(person));

blong

Large Repository Upgrades

Posted by blong Employee Jan 23, 2018

One of the most common points of contention I have experienced as an ECM consultant is the upgrade of very large repositories.  It is typically a feared task for IT departments, and it is made more complicated when some components are not as upgradeable as others.  Alfresco is not immune to this, although it has fared better than many older technologies.

 

One of the biggest points of contention with Alfresco upgrades is upgrading the search engine.  Alfresco v4.0 introduced a switch from Apache Lucene to Apache Solr v1.  Alfresco v5.0 introduced Solr v4, and Alfresco v5.2 introduced Solr v6 as Alfresco Search Services v1.  With each of these changes, the whole repository must be re-indexed.  In large repositories, the re-index process may take several weeks or longer.  Performing such an upgrade with little to no impact on the end user is very difficult.  However, there are some solutions to this problem.  To better understand these upgrade issues, let's first cover the general upgrade process.

 

When performing a major upgrade to Alfresco, you must shut down all instances of Alfresco.  A major upgrade is a change in the major or minor version number, like v5.1 to v5.2.  A minor upgrade is a change in the service pack or hot fix number, like v5.2.0 to v5.2.2.  In the latter case, you still have to shut down Alfresco, but you could theoretically perform a rolling upgrade in a clustered environment.  No end user downtime is a distinct advantage of a rolling upgrade.

 

With any upgrade, major or minor, you must back up the database.  This is especially true with major upgrades, as those upgrades will inevitably change the schema.  You cannot just downgrade the version of Alfresco with an already upgraded database; you must also restore the database.

 

If possible, back up the content store.  Any repositories should already be backed up incrementally.  Some storage mechanisms provide the ability to create a snapshot.  A snapshot in this context is a zero-time operation that creates a rollback point.  This can be a quick, cheap, and easy way to prepare for an upgrade.

 

During a hot/online backup, it is important to perform the database backup to completion prior to the content store backup or snapshot.  When relying on incremental backups, the backup should be turned off before the upgrade takes place.  It can be turned back on after the upgrade is deemed successful.

 

With upgrades that switch Apache Lucene/Solr versions, do not perform an index backup.  Indexes from embedded Lucene vs Solr1 vs Solr4 vs Alfresco Search Services (Solr6) are not interchangeable, so a backup would be worthless.  Any time you make the switch, you must re-index the whole repository.  There are multiple strategies to perform these index engine switches and they are outlined below.

 

No Index Engine Change

Among the easiest strategies is to not change the indexing engine or its major version.  In these cases, you will want to back up the index in case a rollback is required.  You can follow the instructions provided by the official Alfresco documentation.  Alfresco/Apache Solr v1, Alfresco/Apache Solr v4, and Alfresco Search Services (Solr v6) each may have their own nuances.  Whether or not you are dealing with shards, you will want to back up the core configurations too.  The location of that configuration depends on your environment.

 

Any upgrade to v4.x will support Lucene and Solr1.  Any upgrade to Alfresco v5.x will support Solr1 and Solr4.  Any upgrade to Alfresco Content Services v5.2 will also support Alfresco Search Services v1.x (Solr6).  Eventually a version will drop support for Solr1, then Solr4, and so on.

 

This upgrade strategy is not available for large gap upgrades.  This means it is not available on an upgrade from Alfresco v3.4 using Lucene to Alfresco v5.x, as the latter does not support Lucene.  It is also likely that this strategy is not available on an upgrade from any Alfresco version using Solr1 to Alfresco CS v6.x.  In those situations, you have to perform an intermediary or parallel upgrade as covered in other sections below.

 

During a hot/online backup, it is important to perform the index backup to completion prior to the database backup.  In case of a rollback, the index can simply catch up to the restored database.  This is easy even if the index is days or weeks behind the database.  However, if the index gets ahead of the restored database, it will be out of sync and never become consistent.

 

Lagging Index

The easiest strategy is to just start using the new indexing engine with a blank index.  In this case, the end user will receive degraded search results until the full re-index is complete.  This may be acceptable in some use cases but not in others.  In most repositories the indexing takes minutes or a couple of hours; however, large repositories could take days, weeks, or more.  It becomes less and less acceptable in those situations.

 

If you are using a new directory to store the index, there is no need to perform a backup of the existing one.

 

3 Stage Upgrade

When you decide to change the index engine, you can perform what I call a 3 stage upgrade.  In this case, do not perform a major upgrade of the Solr version.  Instead, upgrade the Alfresco platform and install the new indexing engine alongside the existing one.  For instance, if you are using Solr1, upgrade to Alfresco CS 5.2 with Alfresco Solr1 v5.2, and install Alfresco Search Services v1.1.  The new indexing engine will be empty and not referenced by core Alfresco, having little to no impact on the functioning system.  That is stage 1 of the upgrade.

 

Once upgraded, start a full re-index using the new indexing engine.  This just involves creating a new search core or shards.  The template configuration should point to the Alfresco instance so the new engine can track it.  The indexing could take hours, days, weeks, or more, depending on multiple factors, notably the size of your repository.  That is stage 2 of the upgrade.

 

Once the new index is complete, switch the engine from the legacy one to the new one.  This can be done in alfresco-global.properties with the index.subsystem.name property or through the Admin Console Search services dialog.  If you use the latter, the configuration will be controlled by the database instead of alfresco-global.properties.  This can lead to confusion in the future, so updating index.subsystem.name is recommended instead.  After the switch is deemed successful, remove Solr1 and the old index.  This is the completion of stage 3 of the upgrade.
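For reference, a minimal sketch of the switch in alfresco-global.properties (the host and port values are illustrative and depend on your deployment):

index.subsystem.name=solr6
solr.host=search.example.com
solr.port=8983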

 

Intermediary Upgrade

If you are upgrading from Lucene to Alfresco v5.x or later, you can do it in 5 or 6 stages.  This strategy is just the application of the 3 stage upgrade multiple times.  If you are on v3.x or earlier, you will have to upgrade to v4.2 as an intermediary stage.  In that case, or if you were still using Lucene on v4.x, you will have to switch from Lucene to Solr1 before upgrading further.  You can follow the principles outlined in the 3 stage upgrade to accomplish this task.

 

You may then do a 3 stage upgrade to Alfresco CS v5.x with Alfresco SS v1.x.  This process requires 2 full re-indexes of the repository.  One of them requires an intermediary version of Alfresco to run in production for a period of time; that time depends on how long it takes to perform the 1st full re-index.  That means the intermediary version needs to be tested and verified as much as the final version.  This strategy can be very inefficient and time consuming.  However, it is a very pragmatic way to proceed.

 

Parallel Upgrade

This strategy is the primary purpose behind this blog post.  It is a rather innovative way to avoid all the issues with the intermediary upgrade strategy while remaining transparent to the end user.  In this solution, you will be creating new server instances for Alfresco.  This is always the case with virtual servers and cloud architectures anyway.  If you intend to reuse the existing servers, this strategy is not simple and should not be used.

 

When performing an upgrade, it is best to restore the production database to a non-production environment to test and verify the schema upgrade among other things.  If you use the aggregate store where a read-only mount of the production content store is a secondary store, you don't need to restore the production content store to the non-production environment.  In this non-production environment, install and configure the new Alfresco and its new indexing engine.  At this point we have a snapshot of the production environment running on the new hardware, but with an empty search index.

 

In the Solr configuration file solrcore.properties, change alfresco.lag to a value large enough to cover the maximum amount of time it will take to test and verify the non-production environment and eventually upgrade production.  If you intend to backup/restore the production database to the non-production environment weekly, then the lag only needs to be about 8 days.  Be conservative here.  If you think it will take a maximum of 3 months to upgrade production and you won't be routinely restoring the production database, set the alfresco.lag value to 1000 x 60 x 60 x 24 x 30 x 3 = 7,776,000,000 ms.  Make sure to set this in the core templates and/or any cores already created.
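For the 3 month example above, the corresponding solrcore.properties line would be (a sketch using the value just computed, not a recommendation):

# keep this core roughly 3 months (in ms) behind the repository state
alfresco.lag=7776000000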

 

Now for the waiting game.  Let the index build to completion.  If you find you are ready to deploy to production well ahead of your prescribed maximum, you can shut down the index server application, change the lag to a smaller value, and start it back up.  Just remember that the lag needs to be larger than the time from when the production snapshot was taken to when the production deployment is scheduled to occur.  For instance, if the database snapshot was taken on Feb 1 at midnight and your worst case is upgrading production on Oct 1 at midnight, then use a value around 86400000 x 30 x 9.  If you find that everything is ready in early March and you want to reschedule the production upgrade for Apr 1 at midnight, then change the lag to 86400000 x (28+31+1).

 

This procedure will properly index the repository up to a certain time before the original snapshot.  You cannot let the index cross that lag threshold; if it does, the new Solr index will hold nodes created by the non-production startup, which is unacceptable, and you will have to start over.

 

As stated earlier, at any time you can create a new snapshot of the production database and restore it to the parallel non-production environment.  In these cases, you can continue to use the existing Solr instances that may still be building the index.  If you do this, it effectively resets the lag time starting with the most recent database snapshot.  So in the example above, if you perform another snapshot on Mar 1 at midnight, it will effectively push the lag time out to Nov 1 instead of Oct 1.  This is a good solution when you underestimate the production deployment schedule.

 

Now you are ready to upgrade the production environment.  Shut down the production and non-production instances.  Reclassify the non-production environment as your production environment.  Point the new production environment to the production database and mount the content store read/write without the aggregate store.  Change the alfresco.lag property in the solrcore.properties files back to 1000.  And finally change any DNS entries that need to be changed to point end users to the new production servers.  Once ready, start up the Alfresco components.

 

You are now upgraded with nearly a full index.  This new upgrade will only take a few minutes to perform the database schema changes.  Once up and running, the index engine will catch up to the latest data in the repository, closing the lag gap much quicker than having to do a full re-index.  The time required to catch up can be computed based on the speed of the indexing you measured while the environment was considered non-production.  Under a good strategy, it should catch up within hours.

 

To make the catch up time as short as possible, create more frequent production snapshots for the non-production environment.  Do it often enough to use a low alfresco.lag value.  For instance, set alfresco.lag to 86400000 x 1.1 and automatically create and restore snapshots nightly.  The index will then only have to catch up on 1 day of transactions.

 

It is a great idea to create a non-production environment similar to the one used for this upgrade for longer term purposes.  It gives you a real-life environment to reproduce and study production issues.  It could create more read-only load on the content store, but the content store is typically not a bottleneck of concern.

A few years ago Samuel wrote a blog post entitled “So, when is Alfresco moving to GitHub?” In it he presented a number of reasons why it was difficult to move our code from SVN to Git. Given the recent move of Share to GitHub, I thought it would be worth writing an update on the situation.

 

The vast majority of our production code is now in Git. ACS Repository, Activiti, Search Services, Records Management, Google Docs Module, Android App, iOS App, Share, ADF Components, … and that’s not including our enterprise code which is primarily within our hosted GitLab.

 

Some of this code has always been in Git, but much of it has been migrated from SVN. I migrated the Records Management codebase in 2015, and since then we’ve had a fairly continuous stream of migrations.* The obvious question is "What’s changed since Samuel’s post?"

 

The answer is that not very much has changed. The team decided not to leave older releases in SVN, for exactly the reasons mentioned by Samuel. Cherry-picking changes from SVN to Git can be done with svn diff --git and git apply, but it’s a pain, and we make a lot of service pack changes in our products. The ACS repository codebase is large, and although we’ve split out several modules into smaller Git repositories, there is still over 2 GB of history in the Git repo.
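For reference, a minimal sketch of that cherry-picking workflow (the repository URL and revision number are illustrative):

svn diff --git -c 12345 https://svn.example.com/alfresco/trunk > change.patch
git apply change.patch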

 

The primary reason for migrating to Git is its popularity - Git is seven times more popular than any other version control system. Some of the reasons for this are Git’s local and lightweight branches (leading to cleaner workflows), faster access to history (since it’s all stored locally) and smaller overall repository size (leading to faster access to remote commits). Some other reasons are historical - SVN has greatly improved its merging and has got rid of the need for a .svn directory per folder. However, since GitHub is the largest and most popular open source host in the world, we want to use Git to make it easy for users to access our code.

 

 

* We’ve had migrations in the past too, e.g. Share extras in 2012, but the last year has seen a concerted effort to migrate projects.

One of the recurring issues we see raised by customers is slow SQL queries.

Most of the time these are first noticed through Share UI page load times or long running CMIS queries.

Pinpointing those queries can be a pain, and this blog post aims to provide some help with the troubleshooting process.

We will cover:

  • Things to check in the first place
  • Different ways of getting information on query execution time
  • Isolating where the RDBMS is spending its time
  • Tools and configuration to help proactively track these kinds of issues

We hope this content will be useful in real life, but really, having a DBA who takes care of the Alfresco database is the best thing you can offer your Alfresco application!

Preliminary checks

Before blaming the database, it's always a good idea to check that the database engine has appropriate resources to deliver good performance. You cannot expect any DB engine to perform well with Alfresco on limited resources. For PostgreSQL there are plenty of resources on the web about sizing a database cluster (here I use cluster in the PostgreSQL sense, which is different from what we call a cluster in Alfresco).

 

Latency

Network latency can be a performance killer. Opening connections is a quite intensive process, and if your network is poor, it will impact the application. Simple network tests can confirm the network is delivering a good enough transport layer. The ping utility is really the first thing to look at. A ping test between Alfresco and its DB server must show a latency under 1 ms on a directly connected network (Gb Ethernet), or between 1 and 5 ms if your DB server and the Alfresco server are connected through routed networks. A value around or above 10 ms is definitely not what Alfresco expects from a DB server.

alxgomz@alfresco:~$ ping -c 5 -s500 192.168.0.68
PING 192.168.0.68 (192.168.0.68) 500(528) bytes of data.
508 bytes from 192.168.0.68: icmp_seq=1 ttl=64 time=0.436 ms
508 bytes from 192.168.0.68: icmp_seq=2 ttl=64 time=0.364 ms
508 bytes from 192.168.0.68: icmp_seq=3 ttl=64 time=0.342 ms
508 bytes from 192.168.0.68: icmp_seq=9 ttl=64 time=0.273 ms
508 bytes from 192.168.0.68: icmp_seq=10 ttl=64 time=0.232 ms

--- 192.168.0.68 ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 8997ms
rtt min/avg/max/mdev = 0.232/0.329/0.436/0.066 ms

 

Some more advanced utilities allow you to send TCP packets, which are more representative of the actual time spent opening a TCP session (again, you should not see values above 10 ms):

alxgomz@alfresco:~$ sudo hping3 -s 1025 -p 80 -S -c 5 192.168.0.68
HPING 192.168.0.68 (eth0 192.168.0.68): S set, 40 headers + 0 data bytes
len=44 ip=192.168.0.68 ttl=64 DF id=0 sport=80 flags=SA seq=0 win=29200 rtt=3.5 ms
len=44 ip=192.168.0.68 ttl=64 DF id=0 sport=80 flags=SA seq=1 win=29200 rtt=3.4 ms
len=44 ip=192.168.0.68 ttl=64 DF id=0 sport=80 flags=SA seq=2 win=29200 rtt=3.4 ms
len=44 ip=192.168.0.68 ttl=64 DF id=0 sport=80 flags=SA seq=3 win=29200 rtt=3.6 ms
len=44 ip=192.168.0.68 ttl=64 DF id=0 sport=80 flags=SA seq=4 win=29200 rtt=3.5 ms

--- 192.168.0.68 hping statistic ---
5 packets transmitted, 5 packets received, 0% packet loss
round-trip min/avg/max = 3.4/3.5/3.6 ms

 

Overall network quality is also something to check. If your network devices spend their time reassembling out-of-order packets or retransmitting them, application performance will suffer. This can be checked by taking a network dump during query execution and opening it in Wireshark.

In order to take the dump you can use the following tcpdump command:

alxgomz@alfresco:~$ tcpdump -ni any port 5432 and host 192.168.0.68 -w /tmp/pgsql.pcap

Opening it in Wireshark should give you an idea very quickly: if you see a dump with a lot of red/black lines, there might be an issue that needs further investigation (those lines are colored this way if Wireshark has the right coloring rules applied).

 

RDBMS configuration

PostgreSQL ships with a relatively low end configuration. This is intended to allow it to run on a wide range of hardware (or VM) configurations. However, if you are running your Alfresco (and its database) on high end systems, you most certainly want to tune the configuration in order to get the best out of your resources.

This wiki page presents in detail many parameters you may want to tweak: Tuning Your PostgreSQL Server - PostgreSQL wiki

The first one to look at is the shared_buffers size. It sets the size of the area PostgreSQL uses to cache data, thus improving performance. Among all those parameters, some should be "in sync" with your Alfresco configuration. For example, by default Alfresco allows 275 Tomcat threads at peak time. Each of these threads should be able to open a database connection. As a consequence, PostgreSQL (when installed using the installer) sets the max_connections parameter to 300. However, each connection will consume resources, first and foremost memory. The amount of memory dedicated to a PostgreSQL process (which handles a SQL query) is controlled by the work_mem parameter. By default it has a value of 4 MB, meaning we can calculate the amount of physical RAM needed by the database server to handle peak load:

work_mem * max_connections =  4MB * 300 = 1.2GB

Add the size of the shared_buffers to this and you'll have a good estimate of the amount of RAM PostgreSQL needs to handle peak loads with the default configuration. There are some other important values to fiddle with (like effective_cache_size, checkpoint_completion_target, ...), but making sure those above are aligned with both your Alfresco configuration and the hardware resources of your database host is really where to start (refer to the link above).

A qualified DBA should configure and maintain Alfresco's database to ensure its continuous performance and stability. If you don't have a DBA internally, there are also dozens of companies offering good services around PostgreSQL configuration and tuning.

Monitoring

Monitoring is key in the troubleshooting process. Although monitoring will not give you the solution to a performance issue, it will help you get on the right track. Having monitoring in place on the DB server is important: if you can correlate an increasingly slow application with a global load increase on the database server, then you've got a good suspect. There are different things to monitor on the DB server, but you should at least have the bare minimum:

  • CPU
  • RAM usage
  • disk IO & disk space

Spikes in CPU and disk IO usage can be the sign of a table that grew large without appropriate indexes.

Spikes in used disk space can be explained by the RDBMS creating temporary work files due to a lack of physical memory.

Monitoring RAM can help you anticipate disk cache starvation (PostgreSQL relies heavily on this kind of memory).

Alfresco has some tables that are known to potentially grow very large. A DBA should monitor their size, both in terms of number of rows and of disk size. The query below is an example of how you can do it:

SELECT table_name AS tableName,
       (total_bytes / 1024 / 1024) AS total,
       row_estimate AS rowEstimate,
       (index_bytes / 1024 / 1024) AS INDEX,
       (table_bytes / 1024 / 1024) AS TABLE
  FROM (
    SELECT *,
           total_bytes - index_bytes - COALESCE(toast_bytes, 0) AS table_bytes
      FROM (
        SELECT c.oid,
               nspname AS table_schema,
               relname AS TABLE_NAME,
               c.reltuples AS row_estimate,
               pg_total_relation_size(c.oid) AS total_bytes,
               pg_indexes_size(c.oid) AS index_bytes,
               pg_total_relation_size(reltoastrelid) AS toast_bytes
          FROM pg_class c LEFT JOIN pg_namespace n ON n.oid = c.relnamespace
         WHERE relkind = 'r') a) a
 WHERE table_schema = 'public' ORDER BY total_bytes DESC;

Usually the alf_prop_*, alf_audit_entry and possibly alf_node_properties tables are the ones that may appear in the result set. There is no rule of thumb that dictates when a table is too large; it is more a matter of monitoring how the table grows over time.

Another useful thing to monitor is the creation/usage of temporary files. When your DB is not correctly tuned, or doesn't have enough memory allocated to it, it may create temporary files on disk when it needs to further work on a big result set. This is obviously not as fast as working in-memory and should be avoided. If you don't know how to monitor that with your usual monitoring system, there are some good tools that help a DBA be aware of such things happening.

pgbadger is an open source tool which does that, among many other things. If you don't already use it, I strongly encourage you to deploy it!

 

Debugging what takes long, and how long it takes

Monitoring should have helped you pinpoint that your DB server is overloaded, making your application slow. This could be because it is undersized for its workload (in which case you would probably see steady high resource usage), or it could be that some specific operations are too expensive for some reason. In the former case, there is not much you can do apart from upgrading the resources or the DB architecture (but those are not topics I want to cover here). In the latter case, getting to know how long a query takes is really what people usually want to know. We can accomplish that using one of the debug options below.

 

At the PostgreSQL level

In my opinion, the best place to get this information is on the DB server itself. That's where we get to know the actual execution time, without accounting for network round trip time and other delays. Moreover, PostgreSQL makes it very easy to enable this debugging, and you don't even need to restart the server. There are many ways to enable debug logging in PostgreSQL, but the one that is most interesting to us is log_min_duration_statement. By default it has a value of -1, which means nothing is logged based on execution time. But if, for example, we set the following in the postgresql.conf file:

log_min_duration_statement = 250

log_line_prefix = '%t [%p-%l] %q%u@%d '

Any query that takes more than 250 milliseconds to execute will be logged.

Setting the log_min_duration_statement value to zero will cause the system to log every single query. While this can be useful for debugging or for a temporary audit, it is not very helpful here, as we really want to target slow queries only.

If you are interested in profiling your DB, have a look at the great pgbadger tool from Dalibo.

The Alfresco installer sets the log_min_messages parameter to fatal by default. This prevents log_min_duration_statement from working. Make sure it is set back to its default value, or to a level that still includes LOG messages.

Then, without interrupting the service, PostgreSQL can be reloaded in order for the changes to take effect:

$ pg_ctl -D /data/postgres/9.4/main reload

(adapt the command above to your needs, with appropriate paths)

                 

 

This will produce output such as the example below:

2017-12-12 19:54:55 CET [5305-1] LOG: duration: 323 ms execution : select
      pv.id as prop_id,
      pv.actual_type_id as prop_actual_type_id,
      pv.persisted_type as prop_persisted_type,
      pv.long_value as prop_long_value,
      sv.string_value as prop_string_value
   from
      alf_prop_value pv
      join alf_prop_string_value sv on (sv.id = pv.long_value and pv.persisted_type = $1)
   where
      pv.actual_type_id = $2 and
      sv.string_end_lower = $3 and
      sv.string_crc = $4
DETAIL: parameters: $1 = '3', $2 = '1', $3 = '4ed-f2f1d29e8ee7', $4 = '593150149'

Here we gather important information:

  1. The date and time the query was executed, at the beginning of the first line.
  2. The process ID and session line number. As PostgreSQL forks a new process for each connection, we can map process IDs to pool connections. A single connection may contain different transactions, which in turn contain several statements. Each new statement processed increments the session line number.
  3. The execution time, on the first line.
  4. The execution stage. A query is executed in several steps. With Alfresco making heavy use of bind parameters, we will often see several lines for the same query (one for each step):
    1. prepare (when the query is parsed),
    2. bind (when parameters are replaced by their values and execution is planned),
    3. execute (when the query is actually executed).
  5. The query itself, starting at the end of the first line and continuing on subsequent lines. Here it contains parameters and can't be executed as is.
  6. The bind parameters, on the last line.

In order to get the overall execution time, we have to sum up the execution times of the different steps. This seems painful, but it delivers a fine-grained breakdown of the query execution. However, most of the time the majority of the execution time is spent in the execute stage. To better understand what's going on at that stage, we need to dive deeper into the RDBMS (see the chapter below about explain plans).

 

At the application level

It is also possible to debug SQL queries in a very granular manner at the Alfresco level. However, it is important to note that this method is far more intrusive, as it requires adding additional jar files, modifying the configuration of the application, and restarting the application server. It may not be well suited for production environments where any downtime is a problem.

Also note that execution times reported with this method include network round-trip times. In normal circumstances this should add only a few milliseconds, but it could be much more on a poor network.

To allow debugging at the application level we will use a JDBC proxy: p6spy.

The impact on application performance largely depends on the number of queries that will be logged.

 

First of all, get the latest p6spy jar file from the GitHub repository.

Copy this file to the Tomcat lib/ directory and add a spy.properties file in the same location containing the lines below:

driverlist:org.postgresql.Driver

executionThreshold=250

This mimics the behaviour we had previously when debugging within PostgreSQL: only queries that take more than 250 milliseconds will be logged.

We then need to tweak the alfresco-global.properties file in order to make Alfresco use the p6spy driver instead of the actual driver:

 

db.driver=com.p6spy.engine.spy.P6SpyDriver
db.url=jdbc:p6spy:postgresql://${db.host}/${db.name}

 

Alfresco must now be restarted, after which a new file called spy.log should be available, containing lines like the one shown below:

1513100550017|410|statement|connection 14|update alf_lock set version = version + 1, lock_token = ?, start_time = ?, expiry_time = ? where excl_resource_id = ? and lock_token = ?|update alf_lock set version = version + 1, lock_token = 'not-locked', start_time = 0, expiry_time = 0 where excl_resource_id = 9 and lock_token = 'f7a21222-64f9-40ea-a00a-ef95052dafe9'

Here we find values similar to those we had with PostgreSQL:

  1. The timestamp when the query was executed on the application server
  2. The execution time in milliseconds
  3. The connection ID
  4. The query string without bind parameters
  5. The query string with evaluated bind parameters

 

Understanding the execution plan

Now that we have pinpointed the problematic query (or queries), we can dive into PostgreSQL's logic and understand why the query is slow. An RDBMS relies on its query planner to decide how to deal with a query.

The query planner makes decisions based on the structure of the query, the structure of the database (e.g. the presence and types of indexes), and also on statistics the system maintains during its execution. The more accurate those statistics are, the more efficient the query planner will be.

 

Explain plans

In order to know what the query planner would do for a specific query, you can run the query prefixed with the EXPLAIN ANALYZE statement.

To make this chapter more hands-on, we'll proceed with an example. Let's consider a query which is issued while browsing the repository (getting node information based on the parent). Using one of the methods above, we have identified that query, and running it prefixed with EXPLAIN ANALYZE returns the following:

                                                                                 QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop  (cost=6614.54..13906.94 rows=441 width=676) (actual time=1268.230..1728.432 rows=421 loops=1)
   ->  Hash Left Join  (cost=6614.26..13770.40 rows=441 width=639) (actual time=1260.966..1714.577 rows=421 loops=1)
         Hash Cond: (childnode.store_id = childstore.id)
         ->  Hash Right Join  (cost=6599.76..13749.84 rows=441 width=303) (actual time=1251.427..1704.866 rows=421 loops=1)
               Hash Cond: (prop1.node_id = childnode.id)
               ->  Bitmap Heap Scan on alf_node_properties prop1  (cost=2576.73..9277.09 rows=118749 width=71) (actual time=938.409..1595.742 rows=119062 loops=1)
                     Recheck Cond: (qname_id = 26)
                     Heap Blocks: exact=5205
                     ->  Bitmap Index Scan on fk_alf_nprop_qn  (cost=0.00..2547.04 rows=118749 width=0) (actual time=934.132..934.132 rows=119178 loops=1)
                           Index Cond: (qname_id = 26)
               ->  Hash  (cost=4017.52..4017.52 rows=441 width=232) (actual time=90.488..90.488 rows=421 loops=1)
                     Buckets: 1024  Batches: 1  Memory Usage: 83kB
                     ->  Nested Loop  (cost=0.83..4017.52 rows=441 width=232) (actual time=8.228..90.239 rows=421 loops=1)
                           ->  Index Scan using idx_alf_cass_pri on alf_child_assoc assoc  (cost=0.42..736.55 rows=442 width=8) (actual time=2.633..58.377 rows=421 loops=1)
                                 Index Cond: (parent_node_id = 31890)
                                 Filter: (type_qname_id = 33)
                           ->  Index Scan using alf_node_pkey on alf_node childnode  (cost=0.42..7.41 rows=1 width=232) (actual time=0.075..0.075 rows=1 loops=421)
                                 Index Cond: (id = assoc.child_node_id)
                                 Filter: (type_qname_id = ANY ('{142,24,51,200,204,206,81,213,97,103,231,104,107}'::bigint[]))
         ->  Hash  (cost=12.00..12.00 rows=200 width=344) (actual time=9.523..9.523 rows=6 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 1kB
               ->  Seq Scan on alf_store childstore  (cost=0.00..12.00 rows=200 width=344) (actual time=9.517..9.518 rows=6 loops=1)
   ->  Index Scan using alf_transaction_pkey on alf_transaction childtxn  (cost=0.28..0.30 rows=1 width=45) (actual time=0.032..0.032 rows=1 loops=421)
         Index Cond: (id = childnode.transaction_id)
Planning time: 220.119 ms
Execution time: 1728.608 ms

Although I have never faced it with PostgreSQL (more with Oracle), there are cases where the explain plan differs depending on whether you pass the query as a complete string or use bind parameters.

In that case, a parameterized query found to be slow in the SQL debug logs might appear fast when executed manually.

To get the explain plan of these slow queries, PostgreSQL has a loadable module which can log explain plans the same way we did with log_min_duration_statement. See the auto_explain documentation for more details.
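As a minimal sketch, enabling it in postgresql.conf could look like the following (loading via shared_preload_libraries requires a server restart; the module can also be loaded per session with LOAD 'auto_explain'):

# log the execution plan of any statement slower than 250 ms
shared_preload_libraries = 'auto_explain'
auto_explain.log_min_duration = 250
auto_explain.log_analyze = true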

    

We can indeed see that the query is taking rather long just by looking at the planning and execution times at the end of the output. However, reading the full explain plan, and more importantly understanding it, can be challenging.

In the explain plan, PostgreSQL breaks the query down into "nodes". Those "nodes" represent the actions the RDBMS has to run through in order to execute the query. For example, at the bottom of the plan we find the "scan nodes", which are the statements that actually return rows from tables. Higher up in the plan we find "nodes" that correspond to aggregations or ordering. In the end we have an indented/hierarchical tree of the query from which we can examine each step. Each node (each line starting with "->") is shown with:

  • its type: whether it uses an index, and of what kind (index scan), or no index at all (sequential scan)
  • its estimated cost, an arbitrary representation of how costly the operation is
  • the estimated number of rows the operation will return

and many more details that make it somewhat hard to read.

To make it more confusing, some fields, like "cost" or "actual time", carry two different values. To keep it short, you should mainly consider the second one.

      

The purpose of this article is not to teach how to fully understand a query plan, so instead we will use a very handy online tool which will parse the output for us and point out the problems we may have: New explain | explain.depesz.com

Simply by pasting the same output and submitting the form, we get a much better view of what's going on and what we need to look at.

The "exclusive" color mode gives the best representation of how efficient each individual node is.

"Inclusive" mode is cumulative (so the top row will always be dark red as it's equal to the total execution time).

"rows x" shows how accurately the query planner is able to guess the number of rows.

If using "mixed" color mode, each cell in each mode's column will have its own color (which can be a little bit harder to read).

Here we can see straight away that nodes #4 and #5 are where we spend the most time. Those nodes return a large number of rows (more than 100,000, while there are only 421 of them in the final result set), meaning that the available indexes and statistics are not good enough.

Alfresco normally provides all the indexes necessary for the database to deliver good performance in most cases, so it is very likely that queries under-performing because of indexes are in fact suffering from missing indexes. Fortunately, Alfresco also delivers a convenient way to check the database schema for any inconsistency.

 

Alfresco schema validation

When connected to the JMX interface, in the MBeans tab, it is possible to trigger a schema validation while Alfresco is running (go to "Alfresco \ DatabaseInformation \ SchemaValidator \ Operations" and launch "validateSchema()").

This will produce the output below in the Alfresco log file:

2017-12-18 15:21:32,194 WARN [domain.schema.SchemaBootstrap] [RMI TCP Connection(6)-10.1.2.101] Schema validation found 1 potential problems, results written to: /opt/alfresco/bench/tomcat/temp/Alfresco/Alfresco-PostgreSQLDialect-Validation-alf_-4191170423343040157.txt
2017-12-18 15:21:32,645 INFO [domain.schema.SchemaBootstrap] [RMI TCP Connection(6)-10.1.2.101] Compared database schema with reference schema (all OK): class path resource [alfresco/dbscripts/create/org.hibernate.dialect.PostgreSQLDialect/Schema-Reference-ACT.xml]

The log file points us to another file where we can see the details of the validation process. Here, for example, we can see that an index is indeed missing:

alfresco@alfresco:/opt/alfresco/bench$ cat /opt/alfresco/bench/tomcat/temp/Alfresco/Alfresco-PostgreSQLDialect-Validation-alf_-4191170423343040157.txt
Difference: missing index from database, expected at path: .alf_node_properties.alf_node_properties_pkey

The index can now be re-created, either by taking a fresh install as a model or by getting in touch with Alfresco support to find out how to create it.

The resulting, more efficient, query plan is much better.

 

Database statistics

Statistics are really critical to PostgreSQL performance, as they are what mainly gives the query planner its efficiency. With accurate statistics, PostgreSQL will make good decisions when planning a query. And of course, inaccurate statistics lead to bad decisions and thus bad performance.

PostgreSQL has an internal process in charge of keeping statistics up to date (in addition to other housekeeping tasks): the autovacuum process.

All versions of PostgreSQL that Alfresco supports have this capability, and it should always be active! By default this process will update statistics according to configuration options set in postgresql.conf. The options below can be useful to fine tune the autovacuum behaviour (those are the defaults):

autovacuum=true #Enable the autovacuum daemon

autovacuum_analyze_threshold=50 #Number of tuples modifications that trigger ANALYZE

autovacuum_analyze_scale_factor = 0.1 #Fraction of table modified to trigger ANALYZE

default_statistics_target = 100 # Amount of information to store in the statistics

autovacuum_analyze_threshold and autovacuum_analyze_scale_factor combine to determine when an ANALYZE is triggered: a table is re-analyzed once the number of modified tuples exceeds autovacuum_analyze_threshold + autovacuum_analyze_scale_factor * (number of rows in the table). For example, with the defaults above, a table of 1,000,000 rows is re-analyzed after 50 + 0.1 x 1,000,000 = 100,050 modifications.

         

If you see a slow but constant degradation in query performance, it may be that some tables have grown large enough to make the default parameters less efficient than they used to be. As a table grows, the scale factor can make statistics updates very infrequent; lowering autovacuum_analyze_scale_factor will make statistics updates more frequent, thus ensuring stats are more up to date.

The distribution of data within a database table can also change during its lifetime, either because of a data model change or simply because of new use-cases. Raising default_statistics_target will make the daemon collect and process more data from the tables when generating or updating statistics, thus making the statistics more accurate.

Of course, asking for more frequent updates and more accurate statistics has an impact on the resources needed by the autovacuum process. Such tweaking should be carefully done by your DBA.

It is also important to note that the options above apply to every table. You may want to do something more targeted for known big tables. This is doable by changing the storage options for the specific tables:

=# ALTER TABLE alf_node_prop_string_value
-#     ALTER COLUMN string_value SET STATISTICS 1000;

The SQL statement above is really just an example and must not be used without prior investigation.

         

I’m excited to announce that we’ve completed the move of Share’s entire codebase to GitHub.

 

The observant amongst you will notice that a couple of months back we moved the old Share GitHub mirror and stopped updating it - that’s because we’ve been transitioning both the code and our internal development processes to GitHub natively.

 

As of today, the new repositories are now public. The Share codebase is split across four repositories:

Share: https://github.com/Alfresco/Share

Surf: https://github.com/Alfresco/Surf

Surf Web Scripts: https://github.com/Alfresco/surf-webscripts

Aikau: https://github.com/Alfresco/aikau (this project has always been openly developed)

 

What this means for everyone who uses or develops against Share is that you’ve now got a much greater level of transparency over how we’re working, earlier visibility of what we’re doing, and an increased opportunity to have input into that process. It makes it significantly easier for us to accept contributions as pull requests, and for those of you who want to apply custom patches to your own forks. These benefits apply whether you’re a Community or Enterprise user, as the codebase is exactly the same across the two versions.

 

We’ve written contribution guidelines to help, so please take a look at those: https://github.com/Alfresco/share/blob/develop/CONTRIBUTING.md

 

Those of you who are coming to DevCon, I look forward to discussing this there and maybe even working with you on some Share PRs at the Hack-a-thon.

At the end of my last post I alluded to an improved ingestion pipeline using Step Functions. This post looks at how we can use the recently announced AWS Comprehend service to analyse text files.

 

The updated architecture is shown below.

 

Demo Architecture

 

The Lambda function that fetches the content sets a flag to indicate whether the content is an image. This flag is used by the Step Function definition to decide whether to call the ProcessImage or ProcessText Lambda function, as shown in the Step Function definition below:

 

Step Function Definition
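The actual definition is included in the demo repository; as a hand-written sketch of the flag-based branching described above (the $.isImage flag and the ARNs are illustrative assumptions), the Amazon States Language definition looks roughly like this:

{
  "StartAt": "CheckContentType",
  "States": {
    "CheckContentType": {
      "Type": "Choice",
      "Choices": [
        { "Variable": "$.isImage", "BooleanEquals": true, "Next": "ProcessImage" }
      ],
      "Default": "ProcessText"
    },
    "ProcessImage": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ProcessImage",
      "End": true
    },
    "ProcessText": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ProcessText",
      "End": true
    }
  }
}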

 

We covered the behaviour of ProcessImage in the last post. The new ProcessText Lambda function takes the text and sends it to Comprehend to detect entities and to perform sentiment analysis. The function then looks for Person, Location and Date entities in the text and compares the positive and negative values from the sentiment analysis to determine the values for the properties on the acme:insuranceClaimReport custom type.
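As a rough sketch of the Comprehend calls involved (this uses the AWS SDK for Java v1; the class name, sample text and output handling are illustrative assumptions, the actual demo code lives in the repository linked below):

import com.amazonaws.services.comprehend.AmazonComprehend;
import com.amazonaws.services.comprehend.AmazonComprehendClientBuilder;
import com.amazonaws.services.comprehend.model.*;

public class ProcessTextSketch {
    public static void main(String[] args) {
        String text = "Sample insurance claim report text...";
        AmazonComprehend comprehend = AmazonComprehendClientBuilder.defaultClient();

        // Detect named entities (PERSON, LOCATION, DATE, ...) in the text
        DetectEntitiesResult entities = comprehend.detectEntities(
                new DetectEntitiesRequest().withText(text).withLanguageCode("en"));
        for (Entity entity : entities.getEntities()) {
            System.out.println(entity.getType() + ": " + entity.getText());
        }

        // Run sentiment analysis and compare the positive vs negative scores
        DetectSentimentResult sentiment = comprehend.detectSentiment(
                new DetectSentimentRequest().withText(text).withLanguageCode("en"));
        SentimentScore score = sentiment.getSentimentScore();
        System.out.println("positive? " + (score.getPositive() > score.getNegative()));
    }
}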

 

Everything required to deploy and run the demo is available in this GitHub repository. Clone the repository to your local machine using the command below and follow the deployment instructions.

 

git clone https://github.com/gavincornwell/firehose-step-functions-demo.git .

 

Once the repository is up and running follow the detailed demo steps to upload images and text files to the system and see the metadata in Alfresco get updated automatically.

 

I'll be a doing a live demo of this at the forthcoming DevCon in Lisbon, Portugal, hope to see you there!

This post is a short technical overview of Alfresco Search and Discovery. It accompanies the related Architect/Developer Whiteboard Video. In a nutshell, the purpose of Search & Discovery is to enable users to find and analyse content quickly regardless of scale.

 

Below is a visual representation of the key software components that make up Alfresco Search and Discovery.

 

 

 

Search and Discovery can be split into five parts:

  • a shared data model - components shown in green;
  • the index and the underlying data that is created by indexing and used for query - components shown in orange;
  • querying those indexes - components shown in red;
  • building indexes - components shown in yellow; and
  • some enterprise-only components shown in blue.

 

What is an index?

 

An index is a collection of document and folder states. Each state represents a document or folder at some instant in time. This state includes where the document or folder is filed, its metadata, its content, who has access rights to find it, and so on.

 

The data model that defines how information is stored in the repository also defines how it is indexed and queried. Any changes made to a model in the repository, like adding a property, are reflected in the index and, in this case, the new property is available to query in a seamless and predictable way.

 

We create three indexes for three different states:

  • one for all the live states of folders and content;
  • one for all the explicitly versioned states of folders and content; and
  • one for all the archived states of folders and content.

 

 It is possible to query any combination of these states.

 

Each of these indexes can exist as a whole or be broken up into parts, with one or more copies of each part, to meet scalability and resilience requirements. These parts are often referred to as shards and replicas. There are several approaches to breaking up a large index into smaller parts. This is usually driven by some specific customer requirement or use case. The options include: random assignment of folders and documents to a shard, assignment by access control, assignment by a date property, assignment by creation, assignment by a property value, and more. Another blog covers index sharding options in detail.
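As a rough illustration, a core sharded by a date property might carry lines like these in its solrcore.properties (the property names follow the Alfresco Search Services sharding options; the values are illustrative):

shard.method=DATE
shard.key=cm:created
shard.count=12
shard.instance=0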

 

For example, sequential assignment to a shard at document creation time gives a simple auto-scalable configuration: use one shard until it is full, then add the next shard when required. Alternatively, if all your queries specify a customer by id, it would make sense to split up your data by customer id. All the information for each customer would then be held together in a single shard.

 

Indexes typically exist as a single shard up to around 50M files and folders. Combining all the shards gives an effective overall index. The data, memory, IO load, etc. can then be distributed in a way that can be scaled. It is common to have more than one replica of each shard to provide scalability, redundancy and resilience.

 

Search Public API

 

The search public REST API is a self-describing API that queries the information held in the indexes outlined above. Any single index or combination of indexes can be queried. It supports two query languages: a SQL-like query language defined by the CMIS standard, and a Google-like query language we refer to as Alfresco Full Text Search (AFTS). Both reflect the data model defined in the repository. The API supports aggregation of query results for facet driven navigation, reporting and analysis.
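As a small sketch (the query and facet field are made-up examples), an AFTS query with a facet posted to the v1 search endpoint looks like this:

POST /alfresco/api/-default-/public/search/versions/1/search
{
  "query": {
    "language": "afts",
    "query": "TYPE:'cm:content' AND cm:title:report"
  },
  "facetFields": {
    "facets": [ { "field": "creator" } ]
  }
}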

 

The results of any query and related aggregation always respect the access rights of the user executing the query. This is also true when using Information Governance where security markings are also enforced at query and aggregation time.

 

The search public API and examples are covered in several existing blogs. See:

 

Introducing Solr 6.3 and Alfresco Search Services 

v1 REST API - Part 9 - Queries & Search 

Structure, Tags, Categories and Query in the public API 

Basic Content Reporting using the 5.2.1 Search API 

 

For enterprise customers, we also support JDBC access using a subset of SQL via a thin JDBC driver. This allows integration with BI and reporting tools such as Apache Zeppelin.

 

The Content Model

 

All indexes on an instance of Alfresco Search Services share the same content model. This content model is a replica of the content model held in Alfresco Content Services. If you add a type, aspect, or property to the content model any data added will be indexed according to the model and be available to query.

 

The Alfresco Full Text Search query language and the CMIS query language are backed by the content model. Any property in the model can be used in either query language. The details of query implementation are hidden behind this schema. We may change the implementation and "real" index schema in the future, but this virtual schema, defined by the data model and the query syntax, will remain the same.

 

The data model in the repository defines types, aspects and properties. A property's type defines much of its index and query behaviour. Text properties in particular require some additional thought when they are defined. It is important to consider how a property will be used:

 

  • … as an identifier?
  • … for full text search?
  • … for ordering?
  • … for faceting and aggregation?
  • … for localised search?
  • … for locale independent search?

 

… or any combination of the above.

 

A model tracker maintains an up to date replica of the repository model on each search services instance.

 

Building Indexes

 

When any document is created, updated or deleted the repository keeps a log of when the last change was made in the database. Indexing can use this log to follow state changes in the order those changes were made. Each index, shard or replica records its own log information describing what it has added. This can be compared with the log in the database. The indexing process can replay the changes that have happened on the repository to create an index that represents the current state of the repository and resume this process at any point. The database is the source of truth: the index is a read view optimised for search and discovery.

 

Trackers compare various aspects of the index log with the database log and work out what information to pull from the database and add to the index state. The ACL tracker fetches read access control information. The metadata for nodes is pulled in batches from the repository, in the order in which they were changed, by the metadata tracker. If a node includes content, the content tracker adds that to the existing metadata information sometime after the metadata has been indexed. The cascade tracker asynchronously updates information on descendant nodes when their ancestors change. This cascade is caused by operations such as rename and move, or when ancestors are linked to other nodes, creating new structure and new paths to content. The commit tracker ensures that transactional updates made to the database are also transactionally applied to the index. No partial transactions are exposed by search, and transactions are applied in the order expected. The commit tracker also coordinates how information is added to the index and when and how often it goes live. The index state always reflects a consistent state that existed at some time in the database.

 

As part of the tracking process, information about each index and shard is sent back to the Digital Business Platform. This information is used to dynamically group shards and replicas into whole indexes for query. Each node in the Digital Business Platform can determine the best overall index to use for queries.

 

All shards and replicas are built independently based on their own configuration. There is no lead shard that has to coordinate synchronous updates to replicas. Everything is asynchronous. Nothing ever waits for all the replicas of a shard to reach the same state. The available shards and replicas are intelligently assembled into a whole index.

 

Replicas of shards are allowed to be unbalanced - each shard does not have to have the same number of replicas. Each replica of a shard does not have to be in the same state. It is simple to deal with a hot shard - one that gets significantly more query load than the others - by creating more copies of that shard. For example, your content may be date sensitive with most queries covering recent information. In this case you could shard by date and have more instances of recent shards.

 

 

Query Execution

 

Queries are executed via the search endpoint of the Alfresco REST API or via JDBC. These APIs support anything from simple queries to complex multi-select faceting and reporting use cases. Via the public API, each query is first analysed. Some queries can be executed against the database; if this is possible and requested, that is what happens, which provides transactional query support. All queries can be executed against one or more Alfresco Search Services instances. Here there is eventual consistency, as the index state may not yet have caught up with the database. The index state, however, always reflects some real consistent state that existed in the database.
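As an illustrative sketch, that preference for transactional (database) execution can be expressed in alfresco-global.properties (these consistency options are documented for Alfresco 5.x; adjust to your needs):

solr.query.fts.queryConsistency=TRANSACTIONAL_IF_POSSIBLE
solr.query.cmis.queryConsistency=TRANSACTIONAL_IF_POSSIBLE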

 

When a query reaches an Alfresco Search Services instance it may just execute a query locally to a single index or coordinate the response over the shards that make up an index. This coordination collates the results from many parts to produce an overall result set with the correct ranking, sorting, facet counts, etc.

 

JDBC based queries always go to the search index and not the database.

 

Open Source Search

 

Alfresco Search Services is based on Apache Solr 6, a project in which Alfresco plays a leading role. Alfresco is an active member of the Apache Solr community. For example, we have Joel Bernstein on staff, who is a Solr committer. He has led the development of the Solr streaming API and has been involved with adding support for JDBC. Other Alfresco developers have made contributions to Solr related to document similarity, bug fixes and patches.

 

Highly Scalable and Resilient Content Repository

 

These features combine to give a search solution that can scale up and out according to customer needs and is proven to manage one billion documents and beyond. Customers can find content quickly and easily regardless of the scale of the repository.

gavincornwell

Steps to Rekognition

Posted by gavincornwell Employee Jan 4, 2018

In my last post we looked at a potential out-of-process extension that analysed images using AWS Rekognition. That solution used a single large Lambda function; in this post we're going to examine an improved approach using Step Functions.

 

The architecture is shown in the diagram below.

 

 

The use case has also been expanded since the first post: the Lambda function that processes the results from Rekognition now categorises images into Cars, Motorcycles, Boats, Electronics, Jewellery, Wristwatches, Clocks, Bicycles, Sport Equipment and Furniture. Any image that cannot be categorised is set to Unknown rather than having an aspect added.
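As a rough illustration of that categorisation step, the Java sketch below calls Rekognition's DetectLabels on an image in S3 and maps the labels to a category. The category names and the first-match rule are simplified assumptions; the real logic lives in the demo repository linked later in this post.

import com.amazonaws.services.rekognition.AmazonRekognition;
import com.amazonaws.services.rekognition.AmazonRekognitionClientBuilder;
import com.amazonaws.services.rekognition.model.DetectLabelsRequest;
import com.amazonaws.services.rekognition.model.DetectLabelsResult;
import com.amazonaws.services.rekognition.model.Image;
import com.amazonaws.services.rekognition.model.Label;
import com.amazonaws.services.rekognition.model.S3Object;
import java.util.Arrays;
import java.util.List;

public class ImageCategoriser {

    // Assumed label names; Rekognition's exact label strings may differ
    private static final List<String> CATEGORIES = Arrays.asList(
            "Car", "Motorcycle", "Boat", "Electronics", "Jewelry",
            "Wristwatch", "Clock", "Bicycle", "Sports Equipment", "Furniture");

    public static String categorise(String bucket, String key) {
        AmazonRekognition rekognition = AmazonRekognitionClientBuilder.defaultClient();

        // Ask Rekognition for labels on the S3 object, ignoring low-confidence matches
        DetectLabelsRequest request = new DetectLabelsRequest()
                .withImage(new Image().withS3Object(new S3Object().withBucket(bucket).withName(key)))
                .withMinConfidence(75F);
        DetectLabelsResult result = rekognition.detectLabels(request);

        // Return the first label that matches a known category, or Unknown
        for (Label label : result.getLabels()) {
            if (CATEGORIES.contains(label.getName())) {
                return label.getName();
            }
        }
        return "Unknown";
    }
}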

 

The initial part of the solution is still the same: Camel is used to route events to Kinesis Firehose, accepted events are sent to S3, which in turn triggers a Lambda function. That Lambda function now parses the Alfresco events and executes a Step Function State Machine (shown in the diagram below) for each event.
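For reference, starting a State Machine execution from Java with the AWS SDK looks roughly like the sketch below; the state machine ARN is a placeholder, and the event JSON is whatever the Lambda parsed from the Alfresco event.

import com.amazonaws.services.stepfunctions.AWSStepFunctions;
import com.amazonaws.services.stepfunctions.AWSStepFunctionsClientBuilder;
import com.amazonaws.services.stepfunctions.model.StartExecutionRequest;
import com.amazonaws.services.stepfunctions.model.StartExecutionResult;

public class StateMachineLauncher {

    // Placeholder ARN; the real value is environment-specific
    private static final String STATE_MACHINE_ARN =
            "arn:aws:states:us-east-1:123456789012:stateMachine:ProcessImage";

    public static String launch(String eventJson) {
        AWSStepFunctions stepFunctions = AWSStepFunctionsClientBuilder.defaultClient();

        // One execution per parsed Alfresco event; the event JSON becomes the machine input
        StartExecutionRequest request = new StartExecutionRequest()
                .withStateMachineArn(STATE_MACHINE_ARN)
                .withInput(eventJson);
        StartExecutionResult result = stepFunctions.startExecution(request);
        return result.getExecutionArn();
    }
}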

 

 

The State Machine calls three separate, smaller Lambda functions; each function does one thing and one thing only and is re-usable outside of the Step Function. This is a much more scalable solution and allows the images to be processed in parallel.

 

Everything required to deploy and run the demo is available in this GitHub repository. Clone the repository to your local machine using the command below and follow the much simpler deployment instructions.

 

git clone https://github.com/gavincornwell/firehose-step-functions-demo.git .

 

Once the solution is up and running, follow the detailed demo steps to upload images to the system and see the metadata in Alfresco get updated automatically.

 

Currently the State Machine is fairly simple and serial but it lays the foundation for a more complex ingestion pipeline which is something I may investigate in a future post.

Following the successful release of Alfresco Content Services in 2017, we have been planning our next round of innovation. In this blog post, we share some of our plans so that you can prepare for the next release and provide feedback. At the end you will find a table summarizing the actions you should take.

 

Even though we refer to Alfresco Content Services, most of this information also applies to Alfresco Community Edition.

There are three overriding architectural goals for upcoming releases of Alfresco Content Services (ACS):

 

  • Improve integration across the Alfresco Digital Business Platform - encompassing products such as Alfresco Process Services and the Alfresco Application Development Framework. One of these planned improvements is a shared authentication system that supports additional modern protocols such as OpenID Connect. This improved integration will make it easier for Alfresco Content Services customers to benefit from the power of Alfresco Process Services when their use case requires it.

  • Provide a containerized deployment option that can be hosted on a range of infrastructure, both on-premises and by cloud providers such as AWS. These containers will also be portable between deployment environments such as dev, test, and production.

  • Further enhance the REST API to allow advanced customizations to be completed outside of the repository Java process, including APIs for batching requests and subscribing to system events. Integrations using the REST APIs are easier to maintain and upgrade than customizations within the repository.

 

These goals will require some significant architectural changes to the Content Repository, and so we expect the next release to be a major version, Alfresco Content Services 6.0, which we plan to release in 2018. You will see these changes begin to enter Alfresco Community Edition immediately.

 

In order to make these improvements, we need to change some features of the product that you might be using. Specifically, we want to make sure you are aware of the following plans:

 

  • Installation Bundles: Customers have asked us to reduce the amount of effort necessary to deploy Alfresco Content Services (ACS) in a production configuration. The ACS installers will be replaced with Docker containers, using Kubernetes and Helm. This deployment technology allows us to better define a standard production configuration while giving greater flexibility to our customers as they deploy into their environments.

  • Web Application Servers: As part of providing a containerized cloud-ready deployment, we will be removing the need to manage a separate web application container. Instead, configuration will be injected into the Docker container, reducing the effort required to set up, secure, and manage the application. In the next release, we will no longer support Alfresco Content Services deployed within J2EE web application servers such as JBoss, WebSphere, and WebLogic. Over the long term, we are considering embedding the web application server within the repository and making the content repository directly executable. As a result, it is likely that support for deploying into a separate Tomcat web container will be dropped in a future release.

  • Solaris and DB2: As we focus on the most widely used deployment platforms, we will be dropping support for the Solaris operating system and IBM’s DB2 database.

  • CIFS and NTLMv1: Due to security vulnerabilities in the protocols, we will be removing the ability to access Alfresco Content Services as a shared network drive using CIFS / SMBv1 and the ability to authenticate using NTLMv1. We recommend that customers needing shared network drive access use our AOS WebDAV when using Windows clients and our standards-compliant WebDAV when using non-Windows clients. Customers should also use Kerberos instead of NTLMv1 for SSO. We will continue to improve our implementations of WebDAV and Kerberos.

  • Legacy Solr: ACS 6.0 will leverage the advanced capabilities of Solr 6. Previous versions of Solr will no longer be used—Solr 1 will be removed from the product, and Solr 4 will be deprecated and remain in the product only to support upgrades. No functionality will be lost upgrading to Solr 6, but there are some different defaults affecting the way locale is handled that will require minor adjustments in customizations.

 

In addition, we plan to remove the following capabilities:

 

  • Alfresco Process Services Share Connector: Advanced content and process applications can be built with superior user experiences using process and content components from the Application Development Framework.

  • Repository Multi-Tenancy: The multi-tenant capability of the Content Repository will only be supported as part of an OEM agreement and we are likely to remove multi-tenancy from Alfresco Community Edition. The support of multi-tenancy in Alfresco Process Services remains unchanged.

  • Encrypted Node Properties: This capability provides a label for properties that are managed by client code and is used internally in modules provided by Alfresco. With the release of Alfresco Content Services 6.0, it will be considered part of the private API. Custom clients can achieve the same capability by using a Blob or Base64 String property and managing the encryption of the content within those properties (a brief sketch follows this list).

  • CIFS Shortcuts: Alfresco Content Services’ CIFS implementation provided Windows Explorer shortcuts for ECM tasks. These will be removed along with support for CIFS shared network drives. We do not currently plan to move them to the WebDAV implementation.

  • Meeting Workspace and Document Workspace: These Share site types are not supported by recent releases of Microsoft Office, and so will be removed.
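For those using encrypted node properties today, a minimal sketch of the client-side alternative follows. The key handling and cipher choice here are illustrative assumptions; a real implementation should use an authenticated mode such as AES/GCM and manage keys and IVs properly.

import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class PropertyEncryptionExample {

    // Encrypts a value client-side so it can be stored in an ordinary d:text property
    public static String encryptForProperty(String plainValue, SecretKey key) throws Exception {
        Cipher cipher = Cipher.getInstance("AES"); // demo only; prefer AES/GCM with explicit IVs
        cipher.init(Cipher.ENCRYPT_MODE, key);
        byte[] encrypted = cipher.doFinal(plainValue.getBytes(StandardCharsets.UTF_8));
        return Base64.getEncoder().encodeToString(encrypted);
    }

    public static void main(String[] args) throws Exception {
        // Key management is the caller's responsibility; a throwaway key is generated for the demo
        SecretKey key = KeyGenerator.getInstance("AES").generateKey();
        String stored = encryptForProperty("secret value", key);
        System.out.println(stored); // this Base64 string would be set as the node property value
    }
}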

 

With the release of Alfresco Content Services version 6.0, the following features will continue to be available but are deprecated, and you should expect them to be removed in a future version:

 

  • Some Share Features: We will gradually simplify Share to focus on the most commonly used capabilities by removing the following lesser-used site components and dashlets: site blogs, site calendars, site data lists, site links, and site discussion forums. These use cases are better met with dedicated interfaces, either through integration with third party applications or through custom development.

  • Web Quick Start: Web Quick Start provides an add-on to Alfresco Share that demonstrates how to build a website on top of Alfresco Content Services. Though customers are welcome to continue using Web Quick Start, we will not be enhancing this product. There are many ways to use the Alfresco Digital Business Platform to deliver content to the Web, and we would be happy to discuss your specific needs with you or point you to a partner.

  • Alfresco in the Cloud: As the market for content collaboration technologies has evolved, we are evaluating replacements for Alfresco in the Cloud (my.alfresco.com). We will offer different synchronization solutions to supersede Alfresco Cloud Sync to my.alfresco.com. As a result, we are no longer adding new functionality to that service. As our new products mature, we will reach out to the customers who are using Alfresco in the Cloud to outline the replacements and possible timelines.

 

We also make the following recommendations to help those building applications on top of Alfresco Content Services to prepare for future releases:

 

  • The versioned REST API for ACS covers a wide range of use cases, and is preferred over in-process APIs for extending Alfresco Content Services. Integrations and customizations that use the REST API are easier to integrate into your own development processes and are easier to maintain when upgrading ACS.

  • In order to make it easier to design, deploy, and maintain custom workflows, in a future release we will be providing a platform-wide workflow service using Alfresco Process Services (powered by Activiti). This will replace the use of embedded Activiti for custom workflows. Future custom workflows will be implemented external to the Content Repository and will leverage the REST APIs of Alfresco Content Services. To be easily upgradable, new custom workflows should make local REST API calls in order to avoid using the in-process APIs (see the sketch after this list).

  • ACS workflows are intended to automate the management of content items within the Content Repository, and APIs for custom workflows will continue to be available with subscriptions to Alfresco Content Services. A subscription to Alfresco Process Services (APS) is required for advanced process management use cases, such as collecting, disseminating, integrating, and coordinating information across an organization.

  • Though we continue to improve and maintain Share, we recommend that custom applications be built with the Application Development Framework (ADF). ADF components make it easier to assemble and maintain custom applications.
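As an illustration of the workflow recommendation above, the hedged Java sketch below updates a node property through the v1 REST API instead of the in-process NodeService. The host, credentials, node id, and property value are placeholders.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class WorkflowRestStep {

    // Marks a document as approved from an external workflow step
    public static void markApproved(String nodeId) throws Exception {
        // v1 nodes endpoint; host and credentials are assumptions
        URL url = new URL("http://localhost:8080/alfresco/api/-default-/public/alfresco/versions/1/nodes/" + nodeId);
        String body = "{ \"properties\": { \"cm:description\": \"Approved by workflow\" } }";

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("PUT");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setRequestProperty("Authorization", "Basic "
                + Base64.getEncoder().encodeToString("admin:admin".getBytes(StandardCharsets.UTF_8)));
        conn.setDoOutput(true);
        try (OutputStream os = conn.getOutputStream()) {
            os.write(body.getBytes(StandardCharsets.UTF_8));
        }
        if (conn.getResponseCode() != 200) {
            throw new IllegalStateException("Node update failed: HTTP " + conn.getResponseCode());
        }
    }
}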

 

Thank you for the feedback you have previously given on our products, which has informed these changes. We think you will appreciate how these changes will allow us to evolve Alfresco Content Services to meet the needs of your organization both now and in the future. If you are a customer and have any questions, please reach out to your Customer Success Manager. If you are using one of our open source products and want to engage in the discussion, feel free to comment on this post. We look forward to continuing the conversation with you.

 

Regards,

 

The Alfresco Team

 

Table Summarizing Changes and Guidance

| Architecture Change | Guidance | Timing |
| --- | --- | --- |
| Improved REST APIs | Use the REST APIs instead of the in-process APIs. | Immediately |
| An eventual move to a platform workflow service | Custom ACS workflows should use REST calls to the Content Repository when possible. Use APS for process management across the organization. | Immediately |
| Simplify the Share UI | Integrate with 3rd party applications or develop custom interfaces. | Immediately |
| Containerized deployment | Transition your deployment from the installers toward container technology. | 6.0 release |
| Executable content repository | Move away from separate web application servers. | 6.0 release |
| No support for Solaris | Migrate to a different supported OS. | 6.0 release |
| No support for DB2 | Migrate to a supported database. | 6.0 release |
| No support for CIFS or CIFS shortcuts | Use WebDAV. | 6.0 release |
| No support for NTLMv1 | Use Kerberos. | 6.0 release |
| Replace Solr 1 and Solr 4 | Upgrade to Alfresco Search Services powered by Solr 6. | 6.0 release |
| Discontinue the APS Share Connector | Leverage the Application Development Framework. | 6.0 release |
| Repository Multi-Tenancy only for OEMs | If you need multi-tenancy, talk to your Customer Care Representative about your use case. | 6.0 release |
| Encrypted Node Properties | Use a Blob or Base64 String property. | 6.0 release |
| Removal of Meeting Workspace and Document Workspace site types | Use standard collaboration sites in Share. | 6.0 release |
| Removal of some Share features: site blogs, site calendars, site data lists, site links, and site discussion forums | Develop a dedicated interface or use one provided by a third party. | Post 6.0 |
| Phasing out of Web Quick Start | Transition to another web delivery platform. | Post 6.0 |
| Phasing out of Alfresco in the Cloud | No action needed at this time. We will contact you when there is a timeline you should be aware of. | Post 6.0 |
