Alfresco Content Services (ECM)

17 Posts authored by: pmonks2
As mentioned in my last post, the CMIS TC has been working on the issue of version-independent object identity, and that work is being tracked as CMIS-731.  At this point there are 2 basic proposals floating around (each with a number of variants), and I wanted to describe, compare & contrast them, and open the discussion up to the CMIS community for feedback.



Right now the committee is actively discussing these proposals, but my concern is that, because developers of CMIS client applications have minimal representation on the TC, the requirements and proposed solutions are being presented and discussed in something of a vacuum.  While I think I have a reasonable understanding of client application requirements (from my work with the Alfresco ecosystem), I'm still an indirect source - I'd much rather hear requirements from, and have proposals validated by, the developers of CMIS client applications.  The OASIS machinery can be a little intimidating to navigate, but a blog is a relatively low pressure, easy place to provide that kind of feedback - please don't hesitate to comment here!



I'll start by providing a summary of the two basic proposals and their primary variants as I understand them, and then compare and contrast those vs the requirements that the committee has discussed to date.  I'd then encourage you to provide any and all feedback you have (even if it's a short 'Peter, you're wrong and here's why: ...!') as comments on this post.



So without further ado, the proposals:

Extend applicability of cmis:versionSeriesId



This proposal is based on the observation that in the current versions of the specification (1.0 and 1.1), the cmis:versionSeriesId property (where present, and for the services that support it) already provides a version-agnostic identifier; the only gap being that it isn't ubiquitous across all object types and services.



Based on this observation, this proposal mandates cmis:versionSeriesId for all servers, regardless of whether they support versioning or not, and all CMIS services that today accept a cmis:objectId would offer an equivalent that accepts a cmis:versionSeriesId (the semantics being 'invoke this service as if the cmis:objectId of the latest version had been provided').  This could be achieved by continuing the xxxOfLatestVersion service pattern to its ultimate conclusion, or by overloading the existing services to support either cmis:objectId or cmis:versionSeriesId as input.



This proposal also optionally renames cmis:versionSeriesId to something more descriptive (its expanded semantics no longer being limited to version series), and deprecates or removes the xxxOfLatestVersion services if the alternative of overloading the existing services to accept either cmis:objectId or cmis:versionSeriesId is selected (since they would be redundant).
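As a rough illustration of the 'overloading' option, here is a minimal sketch of a single service entry point that accepts either kind of identifier.  Everything in it (the ToyRepository class, its method and parameter names, and the identifier strings) is hypothetical and exists only for illustration; none of it is taken from the CMIS specification or from a real binding.

```python
class ToyRepository:
    """Hypothetical in-memory repository, not a real CMIS implementation."""

    def __init__(self):
        # version series id -> object ids, oldest first
        self.series = {"vs-1": ["obj-1", "obj-2", "obj-3"]}
        # object id -> properties
        self.objects = {
            "obj-1": {"cmis:name": "report.txt", "cmis:versionLabel": "1.0"},
            "obj-2": {"cmis:name": "report.txt", "cmis:versionLabel": "1.1"},
            "obj-3": {"cmis:name": "report.txt", "cmis:versionLabel": "2.0"},
        }

    def get_properties(self, object_id=None, version_series_id=None):
        """Accept exactly one identifier; a series id means 'latest version'."""
        if (object_id is None) == (version_series_id is None):
            raise ValueError("provide exactly one identifier")
        if version_series_id is not None:
            # Behave as if the object id of the latest version had been passed.
            object_id = self.series[version_series_id][-1]
        return self.objects[object_id]


repo = ToyRepository()
pinned = repo.get_properties(object_id="obj-1")          # a specific version
latest = repo.get_properties(version_series_id="vs-1")   # whatever is newest
```

With this shape the client never needs a separate xxxOfLatestVersion call: it simply chooses which identifier to pass.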

Basic Variant - extend the semantics for cmis:document types only



In this variant, cmis:versionSeriesId would only become mandatory for cmis:document and sub-types of it.  Other CMIS object types (cmis:folder, cmis:relationship, cmis:item and cmis:policy) would continue to not support this property, as is already the case in CMIS 1.0 and 1.1 (cmis:objectId would remain the only identifier for these object types).

Extended Variant - extend the semantics to all CMIS object types



In this variant, cmis:versionSeriesId would become mandatory for all object types - not just cmis:document but also cmis:folder, cmis:relationship, cmis:item and cmis:policy.

Add a new identifier



In this proposal, a new mandatory identifier would be added to the specification, tentatively called cmis:representativeCopyId at the time of writing (see CMIS-731).



All CMIS services that today accept a cmis:objectId would offer an equivalent that accepts a cmis:representativeCopyId, with the semantics being 'invoke this service as if the cmis:objectId of the latest version had been provided'.

Basic Variant - add the new identifier to cmis:document types only



In this variant, cmis:representativeCopyId would only become mandatory for cmis:document and sub-types of it.  Other CMIS object types (cmis:folder, cmis:relationship, cmis:item and cmis:policy) would not support this property (cmis:objectId would remain the only identifier for these object types).

Extended Variant - add the new identifier to all CMIS object types



In this variant, cmis:representativeCopyId would become mandatory for all object types - not just cmis:document but also cmis:folder, cmis:relationship, cmis:item and cmis:policy.

Comparison Matrix



With the basic proposals outlined, we can now compare these two proposals (and their variants) vs the requirements that the committee has identified to date (additional requirements welcome!):

Requirements (one line of the matrix each):

  1. Avoids extra round trips to the server that are required today (e.g. calls to getTypeDefinition to figure out if a type is versioned or not, calls to 'fast forward' through a version history, etc.)
  2. Provides a single identifier that can be used for cmis:document and all sub-types
  3. Provides a single identifier that can be used for all CMIS object types
  4. Eliminates conditional logic around identifier handling in CMIS client applications
  5. Avoids identifier proliferation
  6. Avoids adding a 2nd identifier to object types that don't need it (cmis:folder etc.)
  7. Avoids potential confusion around the current semantics of 'version series'

SCORE (higher is better):

  • Extend cmis:versionSeriesId, cmis:document only: 4
  • Extend cmis:versionSeriesId, all CMIS object types: 5
  • Add a new identifier, cmis:document only: 4
  • Add a new identifier, all CMIS object types: 5


What's been apparent during the committee's discussions, and is borne out by this comparison, is that none of these proposals is a clear winner.  What this exercise does do, however, is focus the conversation on the key differentiating characteristics of the proposals, which are:



  • Line 3: Is it important to have a single identifier for all objects in CMIS, or is it acceptable to require client applications to deal with 2 (one for cmis:document and sub-types, and another for everything else)?


  • Line 4: What value should be placed on keeping all client applications simpler, even at the expense of more complex server side implementations?


  • Line 5: What is the value of keeping the specification simpler, by avoiding identifier proliferation?


  • Line 6: How bad is it to add another identifier property to object types that technically don't need it?


  • Line 7: Does expanding the semantics of cmis:versionSeriesId confuse or devalue the concept of 'version series'?



While I and the CMIS TC members have our own answers to these questions, I'm much more interested in hearing directly from the developers of CMIS client applications.  Which of these proposed solutions makes your life easiest?  Which requirements do you care about, and which don't matter?  What requirements are missing from the list above?



The window is closing on identifying a preferred solution to the long-standing problems of CMIS identity, and once closed it's unlikely to be reopened for a long time (if ever), so now is your chance to have a say!



As always the CMIS mailing list is the best place to leave ad-hoc feedback, but feel free to comment here and I'll pass your feedback along to the committee.

A note on Private Working Copies (PWCs)



One topic that's come up in the TC meetings is how this new mechanism should interact with Private Working Copies (PWCs).  As a refresher, a PWC is the temporary copy of a document that gets created when the checkout service is called on it.



The complication revolves around whether a PWC, while it exists, should be the target of the proposed version-independent identifier (however it is implemented).  In the description of the checkout service, the CMIS specification states:

until it is checked in (using the checkIn service), the PWC MUST NOT be considered the latest or latest major version in the version series.


which implies it should not be resolvable via the new identifier.



However, my experience has been that it is a common requirement for a CMIS client application to want to retrieve the latest 'usable' version of an object, which is the PWC for those user(s) that have permission to access the PWC, and the latest non-PWC version otherwise. So there's dramatic tension here between the spec's definition that PWCs are not versions, and the reasonable expectation that the new identifier would resolve to a PWC where appropriate.
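The 'latest usable version' rule described above can be sketched in a few lines.  This is only an illustration under stated assumptions: the dictionary layout, the 'readers' set used to model PWC permissions, and the identifier strings are all made up for the example and are not part of the CMIS specification.

```python
def latest_usable_version(versions, pwc, user):
    """Pick the object a client most likely wants for this user.

    versions: checked-in versions, oldest first (list of dicts).
    pwc: the private working copy as a dict, or None if nothing is checked out.
    user: name of the requesting user.
    """
    if pwc is not None and user in pwc["readers"]:
        return pwc              # the user may see the PWC, so it wins
    return versions[-1]         # otherwise fall back to the latest version


versions = [{"id": "doc;1.0"}, {"id": "doc;1.1"}]
pwc = {"id": "doc;pwc", "readers": {"alice"}}

for_alice = latest_usable_version(versions, pwc, "alice")
for_bob = latest_usable_version(versions, pwc, "bob")
```

Note that the same identifier resolves to different objects for different users here, which is exactly the behaviour the committee would need to decide on.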



The upshot is that further consideration is needed around how the new identifier would interact with PWCs, if at all. Feedback from CMIS client application developers is, again, very welcome.

It's ok to store cmis:objectIds in my CMIS client application, right?



Ah such a simple question, yet hiding a plethora of probable pitfalls!



Over the last couple of years I've encountered (and held myself!) the misconception that cmis:objectIds are basically synonymous with NodeRefs, Alfresco's native form of identifier.  Unfortunately there is a subtle but significant difference that traps many an unwary CMIS client developer: an Alfresco NodeRef identifies an entire object including that object's version history (if any) - in effect documents and versions are fundamentally different types of 'thing', and versions don't have any independent notion of identity.  In contrast, in CMIS documents and versions are the same kind of thing (they're both cmis:documents), and each is uniquely identified by its own cmis:objectId.  From the versioning section of the spec (emphasis added):

Each version of a document object is itself a document object, i.e. has its own object id, property values, MAY be acted upon using all CMIS services that act upon document objects, etc.


So going back to our original question, the validity of storing a cmis:objectId for later use depends on what the CMIS client application is storing a reference to; some possibilities include:



  1. An unversioned object type (i.e. something other than cmis:document) => golden!


  2. An unversioned cmis:document => peachy!


  3. A specific version of a versioned cmis:document => good to go!


  4. The latest version of a versioned cmis:document => ruh roh raggy!!



1 problematic case out of 4 may not seem like too much of an issue, until we recall that versioning is enabled by default for all CMIS-accessible files in Alfresco (i.e. cmis:document and all sub-types).  Add to this that many CMIS client apps, regardless of the server they're connecting to, basically don't care about versioning (and when they do it's often limited to concurrency control via private working copies) - they simply want to treat the CMIS repository as a glorified file/folder store, reading and writing files as if they were flat, unversioned objects - and you start to appreciate the seriousness of the problem.

In short, cmis:objectId alone cannot satisfy the 80% use case of CMIS client applications i.e. version-agnostic file/folder CRUD!
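To see why a stored cmis:objectId goes stale, consider this minimal in-memory sketch.  The ToyRepository class, its methods and the id formats are hypothetical stand-ins invented for the example (a real client would talk to a CMIS binding); the only behaviour borrowed from the spec is that every version gets its own object id.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class VersionSeries:
    series_id: str
    versions: list = field(default_factory=list)  # (object_id, content), oldest first


class ToyRepository:
    """Hypothetical model: each new version gets its own object id."""

    def __init__(self):
        self._ids = count(1)
        self.series = {}

    def create_document(self, content):
        series = VersionSeries(series_id=f"series-{next(self._ids)}")
        self.series[series.series_id] = series
        return self.add_version(series.series_id, content)

    def add_version(self, series_id, content):
        object_id = f"obj-{next(self._ids)}"
        self.series[series_id].versions.append((object_id, content))
        return object_id

    def get_content(self, object_id):
        for series in self.series.values():
            for oid, content in series.versions:
                if oid == object_id:
                    return content
        raise KeyError(object_id)


repo = ToyRepository()
stored_id = repo.create_document("draft")   # the client stores this id
series_id = next(iter(repo.series))
repo.add_version(series_id, "final")        # someone else creates a new version

stale = repo.get_content(stored_id)         # still resolves to the first version
```

The stored id keeps working, which is what makes the problem easy to miss: the client silently reads old content rather than failing.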





So what's the alternative?



There are at least two approaches that I've come up with for working around this issue (and there may be more):



  1. Store a cmis:objectId and make additional CMIS calls to manually 'fast forward' to the latest version of the object on every subsequent CMIS call.


  2. Store a cmis:objectId for unversioned object types, and a cmis:versionSeriesId for versioned object types, and make subsequent CMIS calls appropriate to each.



Fast forward



With this approach, the CMIS client application would store the cmis:objectId as normal, but every single time it accesses the object it identifies, it would look up the cmis:objectId of the latest version of the object first, before continuing with the original operation.  In detail, this involves:



  1. Call the getObject service with the original cmis:objectId.


  2. Look for the cmis:versionSeriesId property in the response.  If the cmis:versionSeriesId property exists in the response:



    1. Call the getPropertiesOfLatestVersion service with the cmis:versionSeriesId.


    2. Pull out the cmis:objectId from the response - this is guaranteed to be the cmis:objectId of the latest version of the object, at the time of the call.


    3. Update the stored cmis:objectId with the retrieved cmis:objectId (optional).





  3. Call the desired CMIS service.



The advantage of this method is that the logic is reasonably clean and simple.  The downside is that it requires at least 2, and sometimes 3, CMIS calls for every single 'original' CMIS call the client application wishes to make (regardless of whether the object is versioned or not), and it risks a race condition between steps 2.1 and 3 (i.e. when a new version of the object gets created by some other process between those two calls).
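The fast forward steps above can be sketched as follows.  FakeCmis is a hypothetical stand-in for a CMIS binding (its method names mirror the getObject and getPropertiesOfLatestVersion services, but the class, the data and the id strings are invented for illustration); it counts calls so the overhead is visible.

```python
class FakeCmis:
    """Hypothetical stand-in for a CMIS binding; counts service calls."""

    def __init__(self):
        self.calls = 0
        # object id -> properties: two versions in one series, plus a folder
        self.objects = {
            "doc;1.0": {"cmis:objectId": "doc;1.0", "cmis:versionSeriesId": "doc"},
            "doc;1.1": {"cmis:objectId": "doc;1.1", "cmis:versionSeriesId": "doc"},
            "folder-1": {"cmis:objectId": "folder-1"},
        }
        self.latest = {"doc": "doc;1.1"}

    def get_object(self, object_id):
        self.calls += 1
        return self.objects[object_id]

    def get_properties_of_latest_version(self, version_series_id):
        self.calls += 1
        return self.objects[self.latest[version_series_id]]


def fast_forward(cmis, stored_object_id):
    """Return the object id of the latest version for a stored object id."""
    props = cmis.get_object(stored_object_id)                   # step 1
    series_id = props.get("cmis:versionSeriesId")               # step 2
    if series_id is None:
        return stored_object_id                                 # unversioned
    latest = cmis.get_properties_of_latest_version(series_id)   # step 2.1
    return latest["cmis:objectId"]                              # step 2.2
```

The caller would then invoke the desired service with the returned id (step 3); nothing in this sketch protects against a new version appearing in between.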



[UPDATE 2014-02-28] A vendor I'm working with mentioned another variation of this strategy that uses the 'cmis:isLatestVersion' property in step 2 to determine whether the cmis:objectId refers to the latest and greatest version or not.  Other than use of a different property, the logic remains much the same (the client application still needs to 'fast forward' to the latest version, using cmis:versionSeriesId).
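A sketch of that variation, assuming the two services are passed in as plain callables and that a missing cmis:isLatestVersion property (e.g. on a folder) can be treated as 'already latest'.  The function name, the toy data and the id strings are hypothetical.

```python
def fast_forward_if_stale(object_id, get_object, get_properties_of_latest_version):
    """Return the latest version's object id, skipping the lookup when possible."""
    props = get_object(object_id)
    # Assumption: a missing property means there is nothing newer to find.
    if props.get("cmis:isLatestVersion", True):
        return object_id
    latest = get_properties_of_latest_version(props["cmis:versionSeriesId"])
    return latest["cmis:objectId"]


# Toy data: two versions of one document, keyed by made-up object ids.
objects = {
    "doc;1.0": {"cmis:objectId": "doc;1.0", "cmis:versionSeriesId": "vs-1",
                "cmis:isLatestVersion": False},
    "doc;1.1": {"cmis:objectId": "doc;1.1", "cmis:versionSeriesId": "vs-1",
                "cmis:isLatestVersion": True},
}
latest = {"vs-1": objects["doc;1.1"]}

resolved = fast_forward_if_stale("doc;1.0", objects.__getitem__, latest.__getitem__)
```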

Conditionally store either cmis:objectId or cmis:versionSeriesId



This approach involves storing cmis:objectIds for object types that are not versioned, and cmis:versionSeriesIds for object types that are.  Unversioned object types include everything that isn't a cmis:document (cmis:folder, cmis:relationship, cmis:policy and cmis:item), as well as, on a case-by-case basis, cmis:document and sub-types of cmis:document (whether such object types are versioned or not can be determined by retrieving the 'versionable' property for each cmis:document object type in the system).



For unversioned objects (i.e. those that have a cmis:objectId stored in the CMIS client application), CMIS service calls can be made directly by the client application, secure in the knowledge that the results will always refer to the latest version of the object (since, by definition, there can only ever be one version of such objects).



For versioned objects (i.e. those that have a cmis:versionSeriesId stored in the CMIS client application), one of two possible call sequences is required:



  1. If the CMIS client application only requires metadata, it can call one of the 'OfLatestVersion' services (getObjectOfLatestVersion or getPropertiesOfLatestVersion).


  2. For all other use cases:



    1. Call the getPropertiesOfLatestVersion service with the cmis:versionSeriesId.


    2. Pull out the cmis:objectId from the response - this is guaranteed to be the cmis:objectId of the latest version of the object, at the time of the call.


    3. Call the desired CMIS service with the retrieved cmis:objectId.






The advantage of this method is that it optimises the number of CMIS calls needed to perform such 'version independent' operations - often only requiring a single call.  The disadvantages are that it requires some initial 'discovery' calls to figure out exactly what's versioned vs what isn't, the client application's logic is more complex due to the two different types of CMIS identifier that must be used, and there is the risk of a race condition between steps 2.1 and 2.3 in the event of a concurrent update by another process.
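Here is a minimal sketch of the conditional approach, assuming the client has already discovered whether each type is versionable.  StoredRef, make_ref and current_object_id are hypothetical names, and the dictionary of 'latest version' responses stands in for real getPropertiesOfLatestVersion calls.

```python
from typing import Callable, NamedTuple


class StoredRef(NamedTuple):
    kind: str    # "objectId" or "versionSeriesId"
    value: str


def make_ref(props: dict, type_is_versionable: bool) -> StoredRef:
    """Decide which identifier to persist for an object."""
    series_id = props.get("cmis:versionSeriesId")
    if type_is_versionable and series_id:
        return StoredRef("versionSeriesId", series_id)
    return StoredRef("objectId", props["cmis:objectId"])


def current_object_id(ref: StoredRef, latest_props: Callable[[str], dict]) -> str:
    """Resolve a stored reference to an object id usable with any service."""
    if ref.kind == "objectId":
        return ref.value                              # unversioned: no extra call
    return latest_props(ref.value)["cmis:objectId"]   # steps 2.1 and 2.2


# Toy data standing in for getPropertiesOfLatestVersion responses.
latest_by_series = {"vs-7": {"cmis:objectId": "doc;2.0"}}

doc_ref = make_ref({"cmis:objectId": "doc;1.0", "cmis:versionSeriesId": "vs-7"}, True)
folder_ref = make_ref({"cmis:objectId": "folder-3"}, False)
```

The extra 'kind' field is the price of this approach: every stored reference has to remember which sort of identifier it holds.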



You might be wondering why a CMIS client application can't simply store the cmis:versionSeriesId in all cases.  Unfortunately the specification makes cmis:versionSeriesId optional (see the cmis:versionSeriesId property definition in the spec) - a compliant CMIS repository does not have to provide this property for unversioned object types, and in my experience most don't.

This sux - surely there's something better?



I've been unable to come up with a better alternative based strictly on the CMIS 1.x specifications, but that doesn't mean others don't exist - I'd love to hear about them if you've come up with one.  That said, having worked fairly extensively with CMIS client application implementers over the last couple of years I'm reasonably certain there isn't a fundamentally better approach.



The good news is that the issue has been brought to the attention of the CMIS Technical Committee, and there is a proposal from Oracle for something called 'representative copies' that potentially has some overlap with this use case.



Speaking personally, I would like to see something along the lines of the following, minimally intrusive change:



  • Make cmis:versionSeriesId mandatory for all object types and rename it (e.g. to cmis:id) to showcase its more general utility.


  • Update all services that receive a cmis:objectId to also support the new identifier property as an alternative.  When the new identifier is provided, the semantics would be 'perform the requested service against the latest version of the object'.


  • Remove the 'OfLatestVersion' services, as they would now be redundant.



Conclusion



CMIS is a valuable addition to the content management repertoire, but as with version 1s of most products, it has its share of flaws.  This particular flaw happens to be both subtle and of significant impact, which makes it all the more important for CMIS client application developers to understand it and factor it into their designs.



More generally, it is my opinion that this also reflects the specification's focus on addressing 'hard core ECM' requirements, to the (unintended) detriment of the 80% content management case i.e. simple file/folder CRUD.  I suspect no one on the CMIS TC realised at the time that the intersection of versioning and identity would 'bleed through' the basic file/folder CRUD use case in this way.



Ultimately the best way for problems like this to be fixed (or better yet, to not surface in the first place!) is community involvement.  I've found the CMIS TC to be an open and welcoming place, and I strongly encourage all CMIS client application implementers to get involved in the committee's good work, at the very least at the level of an observer (as I have).
Anyone who's familiar with CMIS and the Apache Chemistry project will likely be familiar with the CMIS Workbench.  It's a handy developer-oriented GUI tool that provides a low level view of any CMIS-compliant repository.



With the release of the Alfresco Public API back in October, the Alfresco Cloud is now a CMIS-compliant repository, and can hence be accessed using the CMIS Workbench.  However, the CMIS Workbench does not yet directly support the OAuth2 authentication mechanism used in the Alfresco Cloud, so this post describes the steps necessary to get it working.



For starters, you'll need accounts on both the Alfresco Cloud and the Alfresco Developer Portal (signup to both of these services is free).  You'll then need to download v0.8.0 or newer of the CMIS Workbench and unzip it somewhere convenient on your hard drive.  It also helps to have some understanding of the OAuth2 authentication mechanism, how the various OAuth2 codes and tokens are obtained and used, and how OAuth2 has been implemented in the Alfresco Cloud - if you're not familiar with these concepts you may find this DevCon 2012 session helpful.



Next you need to obtain an OAuth2 access token from the Alfresco Cloud.  This is best achieved using a pre-built application, such as my Grails sample app (which conveniently dumps the access token to the log).



After that we're ready to run the CMIS Workbench.  When the 'Login' dialog appears, switch to the 'Expert' tab and paste in the following text (note: on Mac OS X you need to use the Ctrl key rather than the Command key for select and paste operations):

# Alfresco Cloud (CMIS 1.0 AtomPub)
org.apache.chemistry.opencmis.binding.spi.type=atompub
org.apache.chemistry.opencmis.binding.atompub.url=https://api.alfresco.com/cmis/versions/1.0/atom
org.apache.chemistry.opencmis.binding.auth.http.basic=false

# Please provide a valid OAuth access token in the following property
# Note that Alfresco Cloud access tokens have a limited lifetime (currently 1 hour) and the OpenCMIS Workbench does not auto-refresh the access token when it expires
org.apache.chemistry.opencmis.binding.header.0=Authorization:Bearer ####ACCESS_TOKEN####

# Other optional options - compression etc. - may be provided here


Replace the text '####ACCESS_TOKEN####' with the access token you obtained previously, ensuring there is a single space character between 'Bearer' and the access token value.  This should end up looking similar to the following:

[Screenshot: Connection Information for Alfresco Cloud]
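If you find yourself doing this substitution often (the tokens expire, after all), a tiny helper can generate the 'Expert' tab text for you.  This is just a convenience sketch: the expert_settings function and the example token are hypothetical, while the property names and URL are the ones shown above.

```python
TEMPLATE = """\
org.apache.chemistry.opencmis.binding.spi.type=atompub
org.apache.chemistry.opencmis.binding.atompub.url=https://api.alfresco.com/cmis/versions/1.0/atom
org.apache.chemistry.opencmis.binding.auth.http.basic=false
org.apache.chemistry.opencmis.binding.header.0=Authorization:Bearer {token}
"""


def expert_settings(access_token):
    """Return Workbench 'Expert' tab text for the given OAuth2 access token."""
    token = access_token.strip()   # stray whitespace from copy/paste is common
    if not token or any(ch.isspace() for ch in token):
        raise ValueError("access token must be a single non-empty word")
    return TEMPLATE.format(token=token)


settings = expert_settings("  example-token-123\n")
```

Stripping the token before formatting guarantees exactly one space between 'Bearer' and the token value.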



Click 'Load Repositories' and if the settings are correct the Alfresco Networks you are a member of will be displayed in the repositories dropdown.  Pick one (don't be surprised if you have only one - that's the normal case), click the Login button and you should be viewing your Alfresco Cloud content using the CMIS Workbench!


tl;dr



Many open source contributor agreements (including the Alfresco Contributor Agreement) do not involve any reassignment of copyright - instead they grant the project maintainer a license to use, modify and distribute the contribution.

The Full Picture



I was recently chatting with Jennifer Venables (Alfresco's awesome general counsel) and she mentioned something in passing that I hadn't realised before, and that I'm guessing many of you may not know either.  I had always been under the impression that most, if not all, contributor agreements (whether for open source projects or not) involved the individual contributor handing over copyright on their contribution to the project maintainer (for example Alfresco Software Inc., in the case of contributions to the open source Alfresco content management system).  If you're like me, the thought of handing over 'your babies' to someone else is not particularly appealing, and in general I've made a point of not becoming involved in projects that require me to give up rights to my own creations.



So Jennifer's off-hand comment took me a little by surprise, and after she patiently explained how contributor agreements typically work, I'm looking at them in a more positive light.  Specifically, the Alfresco Contributor Agreement (like those of some other open source projects) does not involve any reassignment of copyright - you as the creator of a particular contribution retain full ownership of the copyright of that contribution.  Instead, the project maintainer is simply requesting that you license your intellectual property to them, so that they can also use, modify and distribute it - rights you've probably granted to the public anyway (at least if you've chosen one of the more popular open source licenses).



It's also worth noting that the Harmony project is an attempt by the wider open source community to try to standardise and clarify contributor agreements, as there seems to be a lot of confusion around them.  Jennifer has been keeping an eye on their progress on behalf of Alfresco, as their initiative would help to clarify what is often (and definitely was for me) a confusing legal mechanism.



<disclaimer aka IANAL>Now I'm about as far from a lawyer as it's possible to get, so everything I've said here you'd be strongly advised to double check with someone who has legal expertise in this area.</disclaimer>



What I can say with certainty is that I had completely misunderstood the intent of Alfresco's Contributor Agreement and (more importantly) the legal basis upon which it operates, and that misunderstanding has prevented me in the past from contributing to other open source projects.  I guess I'll chalk this up to 'sometimes, you just don't know what you don't know'!
Recently I transitioned from my long-standing role leading Alfresco's Professional Services team to being the in-house technologist for the Business Development team, and one of my first tasks in the new role has been to work on an integration between Alfresco and Jive Engage.  This work is being done in partnership with SolutionSet (a partner of both Alfresco and Jive), and I wanted to discuss some of the technical design work that has gone into the Toolkit, ahead of its availability (which will be soon after the Jive 5.0 launch - the version of Jive that the Toolkit is targeting).

Functional Overview, aka 'What will it do?'



As announced at Gartner's Portals, Collaboration & Content Summit this week, the integration (known as the 'Jive Toolkit') is a set of pre-built components that allows Jive to store documents in Alfresco while still offering all of the same social features as 'native' Jive documents (commenting, rating, discussions, etc.).  While not yet all-encompassing - Jive's 'social' content cannot yet be stored or managed within Alfresco - the Toolkit will provide a foundational level of document-centric integration, allowing implementers to focus on more use-case specific integrations as required (hence the positioning as a 'toolkit', rather than a fully fledged solution).



More specifically, the initial version of the Toolkit will allow users of Alfresco and/or Jive to create 'managed' documents in any of the following 3 ways:



  1. By uploading a document to Alfresco, using the Jive UI.


  2. By 'publishing' an existing document from Alfresco to Jive, using Alfresco's Share UI.


  3. By 'linking' an existing document stored in Alfresco to Jive, using the Jive UI.


In all 3 cases, the result is the same: the document is visible and accessible via the Jive UI in exactly the same way as any 'native' document, but the content of the document is stored and managed in Alfresco only.  Jive will maintain some metadata about the document - for example the document's filename and a pointer to the document in Alfresco - but it will not store the binary content of the document.  This approach ensures that the document is a first class citizen in both the Alfresco and Jive worlds, while minimising the risk of synchronisation issues between the two systems.



Here are some screenshots that demonstrate uploading a document to Alfresco using the Jive UI:

[Screenshot: Step 1 - Navigating to a community in Jive]

[Screenshot: Step 2 - Managing a document]

[Screenshot: Step 3 - Select a file to upload]

[Screenshot: Step 4 - Select the target space in Alfresco]

[Screenshot: Document details (Alfresco)]

[Screenshot: Document details (Jive)]




Technical Details, aka 'Rubber, meet road'



As mentioned above, there are a variety of ways that the initial 'linkage' of a document between Alfresco and Jive can be achieved, however all 3 creation mechanisms produce the same end state: Alfresco has the document in its entirety (including the filename, content, etc.) while Jive has a 'proxy object' (a structured data-only object that has the filename and a pointer to the document in Alfresco, but does not have the actual binary content).
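To make the 'proxy object' idea concrete, here is a minimal data-structure sketch.  The class and field names are hypothetical and are not taken from the Toolkit or from Jive's API; the point is simply that the Jive side holds a filename and a pointer, and never the binary content.

```python
from dataclasses import dataclass


@dataclass
class AlfrescoDocument:
    object_id: str
    filename: str
    content: bytes           # binary content lives only here


@dataclass(frozen=True)
class JiveProxy:
    filename: str
    alfresco_object_id: str  # pointer back to the document in Alfresco


def make_proxy(doc):
    """Build the Jive-side proxy: metadata and pointer, never the content."""
    return JiveProxy(filename=doc.filename, alfresco_object_id=doc.object_id)


doc = AlfrescoDocument("obj-42", "plan.pdf", b"%PDF-1.4 ...")
proxy = make_proxy(doc)
```

Whichever of the 3 creation mechanisms is used, the end state is the same pair of objects, which is why downstream events can share one code path.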



This means that all downstream events (updates, metadata modifications, deletes) can be handled the same way, irrespective of how the content was linked between the two systems in the first place - a major simplification in the logic for those downstream events.

Integration Mechanism, aka 'CMIS, by any other name would smell as sweet...'



Another nice characteristic of this approach is that the calls from Jive to Alfresco (to create content, update and retrieve it) can be accomplished using the CMIS API.  This has several benefits, from reduced development effort in the Toolkit itself (due to the ready availability of client-side CMIS libraries), to the potential for portability to other CMIS compliant repositories in the future.



One important thing to note is that the Alfresco-to-Jive API calls are not standards-based - they make use of Jive's proprietary REST API.  Jive does not expose a standards-based API (indeed, no suitable standard exists for social business systems yet), and CMIS doesn't provide any kind of callback mechanism for clients to be notified when repository events of interest occur (i.e. a mechanism equivalent to Alfresco's Component Policies).

Tricky Bits, aka 'The Devil is in the Details'



As with any integration between complex enterprise applications, there is some trickery in some parts of the integration, and it's critical to understand these if you're evaluating the Jive Toolkit.

Deletion



The first piece of trickery involves deletion of the content, specifically deletion in Alfresco.  Because Jive maintains a pointer to the document in Alfresco (specifically, the 'cmis:objectId'), rather than the content itself, if the document is deleted in Alfresco without Jive being notified, attempts within Jive to retrieve that content will fail.  To prevent this, the Toolkit is currently designed to veto deletes in Alfresco if the document has been socialised in Jive.  To delete a document, it will first need to be deleted in Jive, at which point it can be deleted from Alfresco too.  The reason the Toolkit doesn't simply synchronise deletes between Jive and Alfresco is that there are common use cases where the document may be removed from Jive, but needs to be retained in Alfresco - replicating deletes between the two systems would have ruled out these use cases.
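The veto rule can be sketched as below.  This is an illustration of the rule only, not the Toolkit's implementation: ToyAlfresco, DeleteVetoed and on_jive_delete are hypothetical names, and the set of 'socialised' ids stands in for however the real integration tracks which documents Jive points at.

```python
class DeleteVetoed(Exception):
    """Raised when a delete in the repository must be refused."""


class ToyAlfresco:
    def __init__(self):
        self.documents = {"obj-1": "socialised.doc", "obj-2": "private.doc"}
        self.socialised = {"obj-1"}   # ids that Jive currently points at

    def delete(self, object_id):
        if object_id in self.socialised:
            raise DeleteVetoed(f"{object_id} is still referenced from Jive")
        del self.documents[object_id]

    def on_jive_delete(self, object_id):
        """Jive removed its proxy; the document itself is retained."""
        self.socialised.discard(object_id)


repo = ToyAlfresco()
try:
    repo.delete("obj-1")
    vetoed = False
except DeleteVetoed:
    vetoed = True

repo.on_jive_delete("obj-1")      # removing it from Jive keeps the document
still_there = "obj-1" in repo.documents
repo.delete("obj-1")              # now the delete is allowed
```

Because on_jive_delete only clears the marker, a document can leave Jive and still live on in Alfresco, which is the use case that ruled out synchronised deletes.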

Full Text Indexing



The second item of trickery revolves around full text indexing of the document in Jive.  To accomplish this, Jive will retain a copy of the content of the document just long enough to index it into Jive's full text index, and once indexing is complete the content of the document will be removed from Jive.  As you'd expect, Alfresco will also notify Jive of any updates to the document, so that the content can be re-indexed on the Jive side.
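A minimal sketch of that 'index, then discard' behaviour, assuming the content can be fetched on demand through a callable.  ToyJiveIndex and its methods are hypothetical and say nothing about how Jive's real full text index works; the content here is plain text purely to keep the example short.

```python
import re
from collections import defaultdict


class ToyJiveIndex:
    def __init__(self):
        self.terms = defaultdict(set)   # term -> proxy ids

    def reindex(self, proxy_id, fetch_content):
        """Fetch content, index it, and keep no copy of the content."""
        for ids in self.terms.values():
            ids.discard(proxy_id)        # drop stale postings first
        text = fetch_content()           # transient copy, local to this call
        for term in re.findall(r"\w+", text.lower()):
            self.terms[term].add(proxy_id)

    def search(self, term):
        return set(self.terms.get(term.lower(), set()))


alfresco_content = {"obj-1": "Quarterly budget forecast"}
index = ToyJiveIndex()
index.reindex("obj-1", lambda: alfresco_content["obj-1"])

alfresco_content["obj-1"] = "Annual hiring plan"      # document updated
index.reindex("obj-1", lambda: alfresco_content["obj-1"])
```

The second reindex call plays the role of the update notification from Alfresco: old terms are dropped and the new content is indexed, again without being retained.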

Access Control and Identity



Access control to the documents is also tricky, primarily because the Alfresco and Jive ACL models differ in their level of granularity.  Jive's access control is primarily Community-centric (i.e. defined and enforced at the level of the Community), while Alfresco has a fine grained, per-node (file or folder) ACL mechanism.  In this first release, the Toolkit will initially create the document in both systems in such a way that the ACLs are in sync, but modification of those ACLs in either system will not be replicated to the other system.  The upshot is that direct manipulation of the document's ACLs in Alfresco may cause errors in Jive (i.e. users who can see the document in the Jive UI, but are unable to download it).



Furthermore, in order for Alfresco and Jive to agree on the principal set, the initial version of the Toolkit assumes that both Alfresco and Jive are configured to use the same LDAP repository for user identity and authentication.  During the design sessions it was felt that this was likely to be a requirement for an integrated solution anyway and hence wouldn't be an impediment, but we're keen to have that assumption validated as broadly as possible.

In Conclusion



So there you have it - a whirlwind tour of the upcoming Jive Toolkit!  As a v1.0 there are some more sophisticated use cases that the Toolkit doesn't address yet, including multi-document / library based integration, and capture of Jive's social content (discussions, ratings, wiki pages, etc.) in Alfresco.  The intention with the Toolkit is to initially provide Alfresco+Jive Systems Integrators (such as SolutionSet) with a small but solid base on which such extensions could be built, and if/when common requirements are identified for these more sophisticated use cases they can be rolled back into the Toolkit.



We're keen to hear your feedback and look forward to your participation in the project!

Alfresco and Groovy, Baby!

Posted by pmonks2 Aug 19, 2010
For quite a few years now I've been a fan of scripted languages that run on the JVM, initially experimenting with the venerable BeanShell, then tinkering with Javascript (via Rhino) and JRuby, and finally discovering Groovy in late 2007.  A significant advantage that Groovy has over most of those other languages (with the possible exception of BeanShell) is that it is basically a superset of Java, so most valid Java code is also valid Groovy code and can therefore be executed by the Groovy 'interpreter' [1] without requiring compilation, packaging or deployment - three things that significantly drag down one's productivity with 'real' Java.



To that end I decided to see if there was a way to implement Alfresco Web Scripts using Groovy, ideally in the hope of gaining access to the powerful Alfresco Java APIs with all of the productivity benefits of working in a scripting-like interpreted environment.



It turns out that the Spring Framework (a central part of Alfresco) moved in this direction some time ago, with support for what they refer to as dynamic-language-backed beans.  Given that a Java backed Web Script is little more than a Spring bean plus a descriptor and some view templates, initially it seemed like Groovy backed Web Scripts might be possible in Alfresco already, merely by adding the Groovy runtime JAR to the Alfresco classpath and then configuring a Java-backed Web Script with a dynamic-language-backed Spring bean.

Oh behave!



Unfortunately this approach ran into one small snag: Alfresco requires that Java Web Script beans have a 'parent' of 'webscript', as follows:



<bean id='webscript.my.web.script.get'
      class='com.acme.MyWebScript'
      parent='webscript'>
  <constructor-arg index='0' ref='ServiceRegistry' />
</bean>



but Spring doesn't allow dynamic-language-backed beans to have a 'parent' clause.

It's freedom baby, yeah!



There are several ways to work around this issue, but the simplest was to implement a 'proxy' Web Script bean in Java that simply delegates to another Spring bean, which itself could be a dynamic-language-backed Spring bean implemented in any of the dynamic languages Spring supports.



This class ends up looking something like (imports and comments removed in the interest of brevity):



public class DelegatingWebScript
    extends DeclarativeWebScript
{
    // The dynamic-language-backed bean that does the real work.
    private final DynamicDeclarativeWebScript dynamicWebScript;

    public DelegatingWebScript(final DynamicDeclarativeWebScript dynamicWebScript)
    {
        this.dynamicWebScript = dynamicWebScript;
    }

    // Hand the request straight to the dynamic bean and return its model.
    @Override
    protected Map<String, Object> executeImpl(WebScriptRequest request, Status status, Cache cache)
    {
        return dynamicWebScript.execute(request, status, cache);
    }
}



While DynamicDeclarativeWebScript looks something like:



public interface DynamicDeclarativeWebScript
{
    // Returns the model that will be passed to the Web Script's view templates.
    Map<String, Object> execute(WebScriptRequest request, Status status, Cache cache);
}



This Java interface defines the API the Groovy code needs to implement in order for the DelegatingWebScript to be able to delegate to it correctly when the Web Script is invoked.



The net effect of all this is that a Web Script can now be implemented in Groovy (or any of the dynamic languages Spring supports for beans), by implementing the DynamicDeclarativeWebScript interface in a Groovy class, declaring a Spring bean with the script file containing that Groovy class and then configuring a new DelegatingWebScript instance with that dynamic bean.  This may sound complicated, but as you can see in this example, it's pretty straightforward:



<lang:groovy id='groovy.myWebScript'
             refresh-check-delay='5000'
             script-source='classpath:alfresco/extension/groovy/MyWebScript.groovy'>
  <lang:property name='serviceRegistry' ref='ServiceRegistry' />
</lang:groovy>

<bean id='webscript.groovy.myWebScript'
      class='org.alfresco.extension.webscripts.groovy.DynamicDelegatingWebScript'
      parent='webscript'>
  <constructor-arg index='0' ref='groovy.myWebScript' />
</bean>
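
For completeness, here's a rough sketch of what the class behind the 'groovy.myWebScript' bean might look like.  It's written as plain Java (which, as noted earlier, is mostly valid Groovy too).  To keep the example compilable on its own, WebScriptRequest, Status and Cache are empty stand-ins rather than the real Alfresco types, the serviceRegistry property is typed as Object, and the contents of the returned model ('greeting', 'hasServiceRegistry') are purely illustrative assumptions on my part.

```java
import java.util.HashMap;
import java.util.Map;

// Empty stand-ins for the Alfresco Web Script types, so the sketch compiles on its own.
interface WebScriptRequest {}
interface Status {}
interface Cache {}

// Same shape as the DynamicDeclarativeWebScript interface described in this post.
interface DynamicDeclarativeWebScript {
    Map<String, Object> execute(WebScriptRequest request, Status status, Cache cache);
}

// Hypothetical script class; in a real deployment this would live in MyWebScript.groovy.
class MyWebScript implements DynamicDeclarativeWebScript {
    // Populated by Spring via the <lang:property name='serviceRegistry' ... /> element.
    // Typed as Object only because the real ServiceRegistry isn't available in this sketch.
    private Object serviceRegistry;

    public void setServiceRegistry(Object serviceRegistry) {
        this.serviceRegistry = serviceRegistry;
    }

    @Override
    public Map<String, Object> execute(WebScriptRequest request, Status status, Cache cache) {
        // The returned map becomes the model for the Web Script's view templates.
        Map<String, Object> model = new HashMap<>();
        model.put("greeting", "Hello from a dynamic bean");
        model.put("hasServiceRegistry", serviceRegistry != null);
        return model;
    }
}
```

The setter matters: Spring injects properties into dynamic-language-backed beans through setters (or Groovy properties), which is why the configuration uses lang:property rather than a constructor argument.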





While a little more work than I'd expected, this approach meets all of my goals of being able to write Groovy backed Web Scripts, and in the interests of sharing I've put the code up on Google Code.

I demand the sum... ...OF 1 MILLION DOLLARS!



But wait - there's more! Not content with simply providing a framework for developing custom Web Scripts in Groovy, I decided to test out this framework by implementing a 'Groovy Shell' Web Script.  The idea here is that rather than having to develop and register a new Groovy Web Script each and every time I want to tinker with some Groovy code, instead the Web Script would receive the Groovy code as a parameter and execute whatever is passed to it.



Before we go any further, I should mention one very important thing: this opens up a massive script-injection-attack hole in Alfresco, and as a result this Web Script should NOT be used in any environment where data loss (or worse!) is unacceptable!! It is trivial to upload a script that does extremely nasty things to the machine hosting Alfresco (including, but by no means limited to, formatting all drives attached to the system) so please be extremely cautious about where this Web Script gets deployed!



Getting back on track, I accomplished this using Groovy's GroovyShell class to evaluate a form POSTed parameter to the Web Script as Groovy code (this is conceptually identical to Javascript's 'eval' function, hence the warning about injection attacks).  Effectively we have a Groovy-backed Web Script that interprets an input parameter as Groovy code, and then goes ahead and dynamically executes it!  It's turtles all the way down!



The code also transforms the output of the script into JSON format, since there are existing Java libraries for transforming arbitrary object graphs (as would be returned by an arbitrary Groovy script) into JSON format.
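
To give a feel for what 'transforming an arbitrary object graph into JSON' involves, here's a minimal hand-rolled sketch.  To be clear, this is not the library the Web Script actually uses (it relies on an existing Java library); the SimpleJson class and its method names are hypothetical, and the sketch ignores things a real library handles, such as cyclic references, non-finite numbers and arbitrary beans.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; a real implementation should use a JSON library.
final class SimpleJson {
    private SimpleJson() {}

    // Recursively converts maps, iterables, numbers, booleans, strings and null to JSON text.
    static String toJson(Object value) {
        if (value == null) return "null";
        if (value instanceof Number || value instanceof Boolean) return value.toString();
        if (value instanceof Map) {
            List<String> members = new ArrayList<>();
            for (Map.Entry<?, ?> entry : ((Map<?, ?>) value).entrySet()) {
                members.add(quote(String.valueOf(entry.getKey())) + ":" + toJson(entry.getValue()));
            }
            return "{" + String.join(",", members) + "}";
        }
        if (value instanceof Iterable) {
            List<String> items = new ArrayList<>();
            for (Object item : (Iterable<?>) value) {
                items.add(toJson(item));
            }
            return "[" + String.join(",", items) + "]";
        }
        // Anything else (including strings) is emitted as a quoted string.
        return quote(value.toString());
    }

    // Wraps text in double quotes, escaping the characters JSON requires.
    private static String quote(String text) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : text.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
            }
        }
        return sb.append('"').toString();
    }
}
```

Since a Groovy script can return pretty much anything, the fallback of calling toString() on unknown objects is a deliberate (if crude) choice: the shell should always be able to show something.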



Here's a screenshot showing the end result:



[Screenshot: Alfresco Groovy Shell - Vanilla Groovy Script]



The more observant reader will have noticed the notes in the top right corner, particularly the note referring to a 'serviceRegistry' object.  Before evaluating the script, the Web Script injects the all important Alfresco ServiceRegistry object into the execution context of the script, in a Groovy variable called 'serviceRegistry'.  The reason for doing so is obvious - this allows the script to interrogate and manipulate the Alfresco repository:



[Screenshot: Alfresco Groovy Shell - Groovy Script that Interrogates the Alfresco Repository]

Sharks with lasers strapped to their heads!



Now if you look carefully at this script, you'll notice that it (mostly) looks like Java, and this is where the value of this Groovy Shell Web Script starts to become apparent: because most valid Java code is also valid Groovy code, you can use this Web Script to prototype Java code that interacts with the Alfresco repository, without going through the usual Java rigmarole of compiling, packaging, deploying and restarting!



I recently conducted an in-depth custom code review for an Alfresco customer who had used Java extensively, and this Web Script was a godsend - not only did I eliminate the drudgery of compiling, packaging and deploying the customer's custom code (not to mention restarting Alfresco each time), I also completely avoided the time consuming (and, let's be honest, painful) task of trying to reverse engineer their build toolchain so that I could build the code in my environment.  This alone was worth the price of admission, but coupled with the rapid turnaround on changes (the mythical 'edit / test / edit / test' cycle), I was able to diagnose their issues in a much shorter time than would otherwise have been possible.

Conclusion



As always I'm keen to hear of your experiences with this project should you choose to use it, and am keen to have others join me in maintaining and enhancing the code (which is surprisingly little, once all's said and done).




[1] Technically Groovy does not have an interpreter; rather it compiles source scripts into JVM bytecode on demand.  The net effect for the developer, however, is the same: there's no need to build, package or deploy code prior to execution - a serious productivity boost.
By default, the Alfresco WCM UI allows an author to select a different workflow and even reconfigure it at submission time, as shown in the following screenshot:



[Screenshot: Configure Workflow at Submit Time]



The obvious issue is that typically authors should not have the ability to influence the approval process, which, after all, is intended to ensure that any content they submit is appropriate for display on the live site.  As the feature currently exists in Alfresco, it is possible, for example, for the author to set themselves as the approver of their change set, completely circumventing the approval process that has been put in place.



While there is an open enhancement request (ENH-466) requesting that these controls be removed, many implementers need to be able to remove them immediately, on versions of Alfresco where this enhancement request has not yet been implemented.



Luckily there's a straightforward way of doing this, albeit one that requires modification of a core Alfresco JSP.  The UI for the Submit Items dialog is rendered by a single JSP in the Alfresco Explorer UI:

/jsp/wcm/submit-dialog.jsp


At around line 104 (on Enterprise 3.2r - it may be slightly earlier or later in the file on other versions) the following two <h:panelGrid> blocks appear:



<h:panelGrid columns='1' cellpadding='2' style='padding-top:12px;padding-bottom:4px;'
             width='100%' rowClasses='wizardSectionHeading'>
  <h:outputText value=' #{msg.workflow}' escape='false' />
</h:panelGrid>

<h:panelGrid columns='1' cellpadding='2' cellspacing='2' width='100%' style='margin-left:8px'>
  <h:panelGroup rendered='#{DialogManager.bean.workflowListSize != 0}'>
    <h:outputText value='#{msg.submit_workflow_selection}' />
    <h:panelGrid columns='2' cellpadding='2' cellspacing='2'>
      <a:selectList id='workflow-list' multiSelect='false' styleClass='noBrColumn' itemStyle='padding-top: 3px;'
                    value='#{DialogManager.bean.workflowSelectedValue}'>
        <a:listItems value='#{DialogManager.bean.workflowList}' />
      </a:selectList>
      <h:commandButton value='#{msg.submit_configure_workflow}' style='margin:4px' styleClass='dialogControls'
                       action='dialog:submitConfigureWorkflow' actionListener='#{DialogManager.bean.setupConfigureWorkflow}' />
    </h:panelGrid>
  </h:panelGroup>
  <h:panelGroup rendered='#{DialogManager.bean.workflowListSize == 0}'>
    <f:verbatim><% PanelGenerator.generatePanelStart(out, request.getContextPath(), "yellowInner", "#ffffcc"); %></f:verbatim>
    <h:panelGrid columns='2' cellpadding='0' cellspacing='0'>
      <h:graphicImage url='/images/icons/warning.gif' style='padding-top:2px;padding-right:4px' width='16' height='16'/>
      <h:outputText styleClass='mainSubText' value='#{msg.submit_no_workflow_warning}' />
    </h:panelGrid>
    <f:verbatim><% PanelGenerator.generatePanelEnd(out, request.getContextPath(), "yellowInner"); %></f:verbatim>
  </h:panelGroup>
</h:panelGrid>




Removing the ability for authors to select a different workflow and/or reconfigure the selected workflow is as simple as commenting out both of these blocks, using JSP style comment tags (<%-- and --%>).  The result appears as follows:



[Screenshot: No Ability to Select or Configure Workflow]



As you can see, the entire Workflow section of the Submit Items dialog has now been removed, and the user no longer has the ability to select a different workflow or reconfigure it.

A Note about Packaging



While it may be tempting to simply modify the JSP directly in the exploded Alfresco webapp, it is critically important to understand that doing so is unsafe.  Specifically, Tomcat may choose to re-explode the alfresco.war file at any time, overwriting your changes without warning and thereby reverting the Submit Items dialog to the default behaviour.



A better approach is to package up the modified JSP file into an AMP file, and deploy it to other environments (test, production, etc.) using the apply_amps script or the Module Management Tool.  Packaging the JSP as an AMP file also allows you to 'pin' the change to a specific version of Alfresco (via the module.repo.version.min and module.repo.version.max properties, described here), which is also important to prevent someone accidentally installing an older version of the JSP into a newer version of Alfresco (which can create other, difficult-to-track-down issues in Alfresco).
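
To illustrate the 'pinning' idea, here's a small sketch of the kind of range check those two properties imply.  This is my own illustration rather than the Module Management Tool's actual logic: the ModuleVersionRange class is hypothetical, the module.id and module.version values are made up, and the comparison only understands purely numeric dotted versions (so a label like '3.2r' would need extra handling).

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper; not part of Alfresco or the Module Management Tool.
final class ModuleVersionRange {
    private ModuleVersionRange() {}

    // Compares dotted numeric versions component by component, so "3.2" < "3.10".
    static int compareVersions(String a, String b) {
        String[] left = a.split("\\.");
        String[] right = b.split("\\.");
        int length = Math.max(left.length, right.length);
        for (int i = 0; i < length; i++) {
            int l = i < left.length ? Integer.parseInt(left[i]) : 0;
            int r = i < right.length ? Integer.parseInt(right[i]) : 0;
            if (l != r) return Integer.compare(l, r);
        }
        return 0;
    }

    // True if repoVersion lies within [module.repo.version.min, module.repo.version.max].
    // A missing bound is treated as "no limit" (an assumption made for this sketch).
    static boolean isInstallable(String moduleProperties, String repoVersion) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(moduleProperties));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String min = props.getProperty("module.repo.version.min");
        String max = props.getProperty("module.repo.version.max");
        boolean aboveMin = min == null || compareVersions(repoVersion, min) >= 0;
        boolean belowMax = max == null || compareVersions(repoVersion, max) <= 0;
        return aboveMin && belowMax;
    }

    public static void main(String[] args) {
        // Example module.properties content; the id and version are invented.
        String moduleProperties =
            "module.id=com.acme.submit-dialog-patch\n" +
            "module.version=1.0\n" +
            "module.repo.version.min=3.2\n" +
            "module.repo.version.max=3.2\n";
        System.out.println(isInstallable(moduleProperties, "3.2"));
        System.out.println(isInstallable(moduleProperties, "3.3"));
    }
}
```

Setting min and max to the same value, as in the example, is the strictest form of pinning: the patched JSP can only ever be installed into the exact version it was copied from.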


 


Please note that modifying core Alfresco code (even JSPs) will technically invalidate support for the installation if you are a subscriber to Alfresco's Enterprise Network - this should not be done lightly!  In this case, however, the risk of unexpected side effects is minimal and although the change will need to be manually re-applied every time the installation is upgraded, there are ways of pro-actively managing that risk.

Timed Deployment

Posted by pmonks2 May 7, 2010
While Alfresco WCM contains a sophisticated deployment engine, the options for initiating deployment are rather more limited, comprising the manual 'Deploy Snapshot' function in the Explorer UI, and the automatic 'Auto Deploy' function that can be configured in the Web Project Settings and then requested by an author at submission time.



While these options are useful, they each have their downsides.  Manual deployment is, well, highly manual, and in practice it's usually unacceptable to dedicate a Content Manager to monitoring promotions and deploying them as they roll in.  Auto-deployment removes the manual step (once authors are trained to check the 'auto-deploy' checkbox during submission to workflow), but has the problem that a high rate of concurrent promotion can overwhelm the deployment system (since each and every promotion is auto-deployed individually, despite the deployment engine offering a far more efficient 'batch' deployment mode).

A Better Alternative



A better approach is to have deployment initiated automatically on a scheduled basis, picking up the latest snapshot at that point in time (which will automatically include all prior snapshots since the last successful deployment).



So what would be involved in developing this as a customisation?



Not very much, as it turns out.  Alfresco already includes all of the necessary components to provide this functionality:



  1. It includes the Quartz job scheduling engine.


  2. It includes an Action for initiating deployment.



All that's needed is some glue code to tie these two components together, and thanks to the generosity of a recent Alfresco Enterprise customer, that glue code has now been developed and is available as part of the alfresco-wcm-deployment project on Google Code.

Under the Hood



As it turns out there was one critical detail that made this slightly less straightforward than I'd expected.  Specifically, the action responsible for deploying a Web Site (the AVMDeployWebsite action) isn't responsible for writing deployment reports into the Web Project - that step is performed in the Explorer UI's JSF code instead (in a private class called org.alfresco.web.bean.wcm.DeployWebsiteDialog).



Given that deployment reports are a critical piece of operational reporting information, it was clear that generating the deployment reports in exactly the same fashion as the OOTB 'Deploy Snapshot' and 'Auto Deploy' functions was a high priority.  As a result the code doesn't call the AVMDeployWebsite action directly - instead I copied the relevant block of code out of DeployWebsiteDialog and added it to my own custom class.



Other than that, the code is pretty straightforward.  Following Alfresco's ubiquitous 'in-process SOA' pattern I introduced a new service interface (called WebProjectDeploymentService), wired it into my Quartz job class using Spring, then configured it (with the cron expression that controls how frequently it runs) in a separate 'trigger' bean.
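
The overall shape of that arrangement can be sketched roughly as follows.  This is a simplified illustration rather than the project's actual code: the WebProjectDeploymentService name comes from the project, but its method signature, the SnapshotSource interface and the TimedDeploymentJob class are assumptions made for the example, and a plain Runnable stands in for the Quartz job class.

```java
// Name taken from the project; the method signature is an assumption for this sketch.
interface WebProjectDeploymentService {
    void deploySnapshot(String webProject, int snapshotVersion);
}

// Hypothetical lookup for the most recent snapshot of a Web Project.
interface SnapshotSource {
    int latestSnapshotVersion(String webProject);
}

// Plays the role of the Quartz job: each run deploys the latest snapshot, if there is a new one.
final class TimedDeploymentJob implements Runnable {
    private final WebProjectDeploymentService deploymentService;
    private final SnapshotSource snapshots;
    private final String webProject;
    private int lastDeployedVersion = -1;

    TimedDeploymentJob(WebProjectDeploymentService deploymentService,
                       SnapshotSource snapshots,
                       String webProject) {
        this.deploymentService = deploymentService;
        this.snapshots = snapshots;
        this.webProject = webProject;
    }

    @Override
    public synchronized void run() {
        int latest = snapshots.latestSnapshotVersion(webProject);
        // Deploying only the latest snapshot implicitly carries every earlier,
        // not-yet-deployed snapshot along with it.
        if (latest > lastDeployedVersion) {
            deploymentService.deploySnapshot(webProject, latest);
            // Only reached if the deployment did not throw.
            lastDeployedVersion = latest;
        }
    }
}
```

Outside of Quartz, the same job could be driven by a java.util.concurrent.ScheduledExecutorService (for example via scheduleAtFixedRate), although that only gives you fixed intervals rather than the cron expressions that the trigger bean supports.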



As always, if you have any questions or comments, please feel free to reply here.  I would request that any bug reports or enhancement requests get raised in the issue tracker in the Google Code project however - they are far easier to monitor there than in the comments on this post.





Note that copying code out of DeployWebsiteDialog creates upgrade risk, since that code could change in future versions of Alfresco.  Given I work with the Alfresco code day-to-day I'm in a better position to detect when such changes have occurred, but if you're doing something like this yourself I would encourage extra diligence in monitoring changes to the original code to ensure your extension doesn't break unexpectedly following an upgrade.

The Case for Killing 'WCM'

Posted by pmonks2 Dec 17, 2009
As if the gaudy Christmas lights, crass inflatable Santas and disturbing illuminated mechanical deer weren't enough, CMS Watch have loudly proclaimed the start of the silly season with their annual prognostication on the state of CMS for the coming year.



This has generated a range of responses from the usual suspects, but the response that really caught my eye was Jon Marks' 'Visions of Jon: WCM is for losers'.



Considering myself a 'WCM guy', I took some umbrage at being called a loser (even by someone of Jon's pedigree!), but after digesting his proposal (along with a 'venti' serving of pre-season, 100-proof egg nog to help calm the nerves) the idea is beginning to grow on me.  That's the idea that WCM is a nonsense term - the jury is still out on whether I'm a loser or not! 



From one of Jon's comments:





I think the VCM and Drupal are fundamentally different, and neither are an ECM system.





This is a specific example of a general pattern I've observed for a while now.  Jon continues:





The problem we have at the moment is that both of them are called WCM systems. ... The fact that we have to put them both into the same WCM bucket kills me.





This really struck a chord with me, and had me rethinking my previous stance that WCM is a single product category with 2 major subdivisions.  Perhaps the problem is deeper than that, and CPSes and PMSes are so different that there's little justification for grouping them together into a single 'WCM' bucket?  If so we've arrived at the same conclusion as Jon: WCM is a meaningless term and deserves to be deprecated.



To start undoing the 15 years of mind share that the term 'WCM' has enjoyed, it's time to start thinking about new terminology that better describes these two functional categories.  For several years I've been throwing around the terms 'Content Production System' (CPS) and 'Presentation Management System' (PMS), and in their COPE strategy NPR uses the terms 'Content Management System' (CMS) and 'Web Publishing Tool' (WPT).



What terms do you use (or think could / should be used) to describe these two product categories?

Version Baselining

Posted by pmonks2 Nov 3, 2009
One of the great things about working with Alfresco is the vast number of extension points the system offers to developers.  Some of these stem from the pervasive use of the Spring framework, some of them to a well thought out application architecture, and many of them from a number of guiding principles that are consistently applied even when their potential uses aren't necessarily known with certainty ahead of time.



I recently had the pleasure of being reminded of this latter case when a customer asked for an extension that allowed their content contributors to control the 'baseline' version number of documents in their Alfresco installation.  The idea was to allow their contributors to (optionally) enter a version number along with each document, and have the Alfresco versioning system start with that version number instead of the default of 1.0.



Although I didn't know how this might be achieved, in less than 10 minutes I had my answer and it relied on a slight variation of a mechanism that I'd used in the past.  The customer was also gracious enough to release the IP, so I've made the initial version of the extension available on Google Code.



Here is a brief overview of its usage:



This extension works by extending Alfresco with a custom content type called 'Version Baselined Content' that includes a single property called 'Base Version'.  This property is where the content contributor can set the base version to be used if/when versioning is enabled on the document.



In order to create content of this type, 'Version Baselined Content' needs to be selected in the 'Type' dropdown of the 'Add Content Dialog':





Provided the 'Modify all properties when this page closes' checkbox is left checked (the default), the contributor will then be presented with the option to specify the base version number for this document (if/when versioning is enabled):





The default value for this field is '0.1' – if the contributor elects to skip modification of the new content's properties, this is the base version number it will be assigned automatically.



The base version number must be a valid non-negative decimal number (i.e. it must be a number greater than or equal to 0.0).  If an invalid value is entered, an error will be displayed when the user clicks the 'OK' button.



Once the version number is populated, it may be edited via the document's properties as many times as are necessary, up until the time versioning is enabled for the document:





Once versioning is enabled for the document, the initial version number will be set to the value of the 'Base Version Number' property at that time:





From this point on, any modifications to the 'Base Version Number' property will be ignored as it is not possible to renumber an existing Alfresco version history.



Other than allowing explicit control over the initial version number for a document, this extension does not change any other versioning behavior in the system.  For example creating a new minor revision of a document (via checkout and checkin) will increment the version number by 0.1.  Similarly, creating a new major revision of a document (via checkout and checkin) will increment the major component of the version number by 1, and set the minor component to 0:
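
The numbering rules described above can be sketched in a few lines.  This is an illustration only, not the extension's code: the VersionLabels class is hypothetical, and it assumes labels always have the form 'major.minor' with the minor component treated as an integer counter (so a minor revision of 0.9 would become 0.10 rather than 1.0).

```java
// Hypothetical helper illustrating the version numbering rules; not the extension's code.
final class VersionLabels {
    private VersionLabels() {}

    // Accepts labels such as "0.1" or "2.0": two non-negative integers separated by a dot.
    static boolean isValidBaseVersion(String label) {
        return label != null && label.matches("\\d+\\.\\d+");
    }

    // Minor revision: bump the minor component, leave the major component alone.
    static String nextMinor(String label) {
        int[] parts = parse(label);
        return parts[0] + "." + (parts[1] + 1);
    }

    // Major revision: bump the major component and reset the minor component to 0.
    static String nextMajor(String label) {
        int[] parts = parse(label);
        return (parts[0] + 1) + ".0";
    }

    private static int[] parse(String label) {
        if (!isValidBaseVersion(label)) {
            throw new IllegalArgumentException("Invalid version label: " + label);
        }
        String[] pieces = label.split("\\.");
        return new int[] { Integer.parseInt(pieces[0]), Integer.parseInt(pieces[1]) };
    }
}
```

Starting from the default base version of '0.1', a minor revision gives '0.2', and a subsequent major revision gives '1.0'.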

 







While the extension is quite neat and (due to the generosity of the customer) available for anyone to use, refine and extend, what really grabbed me as I developed it was how, despite having no prior experience with this particular extension point, it was familiar enough that I was able to understand it almost immediately and leverage it to achieve the desired goal.

Bulk Import from a Filesystem

Posted by pmonks2 Oct 22, 2009

The Use Case



In any CMS implementation an almost ubiquitous requirement is to load existing content into the new system.  That content may reside in a legacy CMS, on a shared network drive, on individual users' hard drives or in email, but the requirement is almost always there - to inventory the content that's out there and bring some or all of it into the CMS with a minimum of effort.



Alfresco provides several mechanisms that can be used to import content, including:


Alfresco is also fortunate to have SI partners such as Technology Services Group who provide specialised content migration services and tools (their open source OpenMigrate tool has proven to be popular amongst Alfresco implementers).



That said, most of these approaches suffer from one or more of the following limitations:



  • They require the content to be massaged into some other format prior to ingestion


  • Orchestration of the ingestion process is performed externally to Alfresco (i.e. out-of-process), resulting in excessive chattiness between the orchestrator and Alfresco


  • They require development or configuration work


  • They're more general in nature, and so aren't as performant as a specialised solution



An Opinionated (but High Performance!) Alternative



For that reason I recently set about implementing a bulk filesystem import tool that focuses on satisfying a single, highly specific use case in the most performant manner possible: to take a set of folders and files on local disk and load them into the repository as quickly and efficiently as possible.



The key assumption that allows this process to be efficient is that the source folders and files must be on disk that is locally accessible to the Alfresco server - typically this will mean a filesystem that is located on a hard drive physically housed in the server Alfresco is running on.  This allows the code to directly stream from disk into the repository, which basically devolves into disk-to-disk streaming - far more efficient than any kind of mechanism that requires network I/O.



How those folders and files got onto the local disk is left as an exercise for the reader, but most OSes provide efficient mechanisms for transferring files across a network (rsync and robocopy, for example).  Alternatively it's also possible to mount a remote filesystem using an OS-native mechanism (CIFS, NFS, GFS and the like), although doing so reintroduces network I/O overhead.



Another key differentiator of this solution is that all of the logic for ingestion executes in-process within Alfresco.  This completely eliminates expensive network RPCs while ingestion is occurring, and also provides fine grained control of various expensive operations (such as transaction commits / rollbacks).



Which leads into another advantage of this solution: like most transactional systems, there are some general strategies that should be followed when writing large amounts of data into the Alfresco repository:



  1. Break up large volumes of writes into multiple batches - long running transactions are problematic for most transactional systems (including Alfresco).


  2. Avoid updating the same objects from different concurrent transactions.  In the case of Alfresco, this is particularly noticeable when writing content into the same folder, as those writes cause updates to the parent folder's modification timestamp. [EDIT] In recent versions of Alfresco, the automatic update of a folder's modification timestamp (cm:modified property) has been disabled by default.  It can be turned back on (by setting the property 'system.enableTimestampPropagation' to true), but the default is false so this is likely to have less of an impact on bulk ingestion than I'd originally thought.



The bulk filesystem import tool implements both of these strategies (something that is not easily accomplished when ingestion is coordinated by a separate process).  It batches the source content by folder, using a separate transaction per folder, and it also breaks up any folder containing more than a specific number of files (1,000 by default) into multiple transactions.  It also creates all of the children of a given folder (both files and sub-folders) as part of the same transaction, so that indirect updates to the parent folder occur from that single transaction.
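
The batching rule is simple enough to sketch.  The following is an illustration rather than the tool's actual code: the FolderBatcher class and its method are hypothetical, and the transaction handling itself is left out; each returned batch represents the set of children that would be written in a single transaction.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating per-folder batching; not the import tool's code.
final class FolderBatcher {
    // Matches the default threshold mentioned above.
    static final int DEFAULT_BATCH_SIZE = 1000;

    private FolderBatcher() {}

    // Splits one folder's children into consecutive batches of at most batchSize entries.
    // Each batch would then be created in the repository within its own transaction.
    static <T> List<List<T>> toBatches(List<T> children, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < children.size(); start += batchSize) {
            int end = Math.min(start + batchSize, children.size());
            batches.add(new ArrayList<>(children.subList(start, end)));
        }
        return batches;
    }
}
```

With the default batch size, a folder containing 2,500 files would be split into three batches (1,000, 1,000 and 500 files), while a folder with fewer than 1,000 children is handled in a single transaction.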

But What Does this Mean in Real Life?



The benefit of this approach was demonstrated recently when an Alfresco implementation had a bulk ingestion process that regularly loaded large numbers (1,000s) of large image files (several MBs per file) into the repository via CIFS.  In one test, it took approximately an hour to load 1,500 files into the repository via CIFS.  In contrast the bulk filesystem import tool took less than 5 minutes to ingest the same content set.



Now clearly this ignores the time it took to copy the 1,500 files onto the Alfresco server's hard drive prior to running the bulk filesystem import tool, but in this case it was possible to modify the sourcing process so that it dropped the content directly onto the Alfresco server's hard drive, providing a substantial (order of magnitude) overall saving.

What Doesn't it Do (Yet)?



Despite already being in use in production, this tool is not what I would consider complete.  The issue tracker in the Google Code project has details on the functionality that's currently missing; the most notable gap being the lack of support for population of metadata (folders are created as cm:folder and files are created as cm:content). [EDIT] v0.5 adds a first cut at metadata import functionality.  The 'user experience' (I hesitate to call it that) is also very rough and could easily be substantially improved. [EDIT] v0.4 added several UI Web Scripts that significantly improve the usability of the tool (at least for the target audience: Alfresco developers and administrators).



That said, the core logic is sound, and has been in production use for some time.  You may find that it's worth investigating even in its currently rough state.



[POST EDIT] This tool seems to have attracted quite a bit of interest amongst the Alfresco implementer community.  I'm chuffed that that's the case and would request that any questions or comments you have be raised on the mailing list.  If you believe you've found a bug, or wish to request an enhancement to the tool, the issue tracker is the best place.  Thanks!
(with apologies to the Propeller Heads and Shirley Bassey)



Laurence Hart recently posted some reminiscences regarding his formative years in content management, and it got me feeling a little nostalgic about my own introduction to and history with content management.  Allow me to bore you with a rather self indulgent look back at the last decade or so...

Sun, Surf and Sandstone



For me it all started in late 1996, when I decided to update the 1991 rockclimbing guide to Sydney.  Lacking in publishing experience and having heard from more experienced souls that publishing was more than half the work in preparing such a guide, I decided to update the information, put it online and then consider a hard copy edition at a later date (the classic divide-and-conquer get-bored-and-do-something-else approach).



At the time I was working for one of the (then) Big-5 management consulting firms, and had specialised in BEA (now Oracle) Tuxedo, so the web and its technologies were pretty much foreign territory for me.  I figured this little guidebook project would be a good use case for learning about this newfangled interwebitube thingamajig.



Not having heard of content management (in part because it was a niche indication in those days!) I rolled my own 'CMS' in MS Access, and used that to publish out the new guidebook as a static HTML site.  This wasn't just a one-trick pony CMS either - the editor of a rockclimbing guide to the Glasshouse Mountains also picked it up a year or two later, and has been using it to manage his guidebook since then.  It's with mixed feelings that I admit that this is one of the longer lived CMS implementations I've worked on!



The key takeaway for me from this period was that keeping presentation and content separate is indeed a highly valuable guiding principle, but that it's also difficult to do without creating a visually repetitive site (which isn't necessarily a bad thing, but tends to rub marketing and creative folks the wrong way).

The 300lbs Gorilla



Having caught the web bug (and if the truth be known, being completely fed up with developing business applications in C & C++), in 2000 I took a leap of faith and joined Vignette, arguably at about the time the company was at the pinnacle of its success.  To the casual observer it could appear that Vignette was on a steady decline from that point on, but for me personally it was a pretty wild ride - a lot of very smart people with a dizzying array of ideas - many of them brilliant, even more of them completely outlandish and/or impractical in the extreme.



And of course all of it focused on how best to manage and deliver web content, rather than being seen as a slightly perverse hobby that detracted from the 'real work' of OLTP, N-tier client-server, data warehouses and the like!



In some ways the dotcom bust and subsequent 'dark ages' actually helped Vignette, by bringing a previously missing intensity of focus to operational matters and (mostly) putting paid to the hubris accrued during the heady closing days of the 20th century.



If I can summarise that period in one statement, it would be that relational databases make *terrible* CMSes.  So many of Vignette's technical flaws (specifically in the StoryServer and VCM product lines) stem directly or indirectly from the architectural decision to implement custom content models directly as relational data models.

Creative Interlude



After a stint in product management, I left Vignette in early 2006 and joined Avenue A | Razorfish - a Web Design Agency.  While only brief, this assignment gave me a new appreciation for the fine art of web design and the highly skilled, creative individuals who choose this profession.



It also reinforced the fact that many Web CMSes are still wrestling with basic plumbing issues (versioning, deployment, performance etc.) and have yet to really tackle some of the higher level issues of usability and productivity, all while supporting creative freedom.



On a more mundane note, this experience also gave me a marked distaste for docroot management systems - that model was antiquated last millennium and makes no sense in this day and age!

Open Source Comes Calling



While I'd always had an interest in open source (in fact the Sydney climbing guidebook has been published under an open source documentation license - the GNU FDL - since its first edition in 1997), I'd never worked for an open source company before, and when the opportunity presented itself in late 2006, I jumped at the chance to join Alfresco, where I continue to work.



While it's a little premature for me to be drawing any conclusions from my experiences at Alfresco, there are some patterns that I can clearly identify.  For starters there's no doubt that open source is a disruptive business model - having a company that spends a majority of revenue on R&D (rather than on sales commissions) is a huge win for everyone (except career sales executives!).  There's also something to be said for openly visible source code - the 'given enough eyeballs, all bugs are shallow' principle and all that.



In terms of content management, Alfresco comes closest (that I've seen) to realising the promise of a blended DM and WCM system (although as with any system there's always room for improvement).

Julian Wraith recently started a discussion entitled 'The future of content management' that has kicked off quite a few interesting responses.



Of those, the one that really grabbed my attention was Justin Cormack's great response entitled 'CMS technology choices'.  By strange coincidence it closely echoes (but far more eloquently and in a lot more detail!) a conversation Kevin Cochrane and I had on Twitter at about the same time, and while I almost entirely agree with everything Justin has written, the Twitter conversation does highlight my one fundamental disagreement with the post.  Here's the transcript of my side of that conversation:

Managing web content is about more than simply supporting the technical constructs the web uses (REST, stateless etc.).



eg. the graph of relationships between the content items making up a site can be an important source of information for authors.



But the web itself has no direct support for graph data structures (beyond humble 'pointers': <a href> tags and the like).



And perhaps as a consequence many (most?) Web CMSes don't have support for that either. ;-)



IMNSHO the future is: schemaless (ala CouchDB, MongoDB, et al), graph based (ala Neo4J), distributed version control (ala Git).


(in hindsight I should also have mentioned 'queryable (ala RDBMS, MongoDB, etc.)')



To better describe my divergence from Justin's vision of the future, I believe that management of, and visibility into, the 'content graph' (the set of links / relationships / associations / dependencies / call-them-what-you-will) is one of the more important features a CMS can provide, particularly for web content management where the link structure (including, but not limited to, the site's navigation model) is so integral to the consumer's final experience of the content.



So what 'content graph' features, specifically, should a hypothetical CMS provide?



In my opinion a CMS needs to support at least the following operations on the content graph:



  • Track all links between assets that are under management, in such a way that the content graph can be:



    • bi-directionally traversed ie. the CMS can quickly and efficiently answer questions such as 'which assets does asset X refer to?', 'which assets refer to asset X?'


    • used within queries ie. the CMS can quickly and efficiently answer questions such as 'show me all content items that are within 3 degrees of separation from asset X, are of type 'press release', and were published in the last month by 'Peter Monks''





  • Flag any content modifications that 'break' the content graph eg. deletion of an asset that is the target of one or more references



    • From a usability perspective our hypothetical CMS would provide the ability for the user requesting the breaking change to automatically 'fix' the breakages eg. by correcting the soon-to-be invalid (dangling) links in the source item(s)





  • Support arbitrary metadata on references, preferably using the same metadata modeling language that is used for 'real' content assets


  • Support basic validity checking of external links - links that point to assets that are not under management (eg. URIs that point to other web sites)
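
To make the list above a little more concrete, here's a minimal in-memory sketch of the first few operations.  Everything in it is hypothetical (the ContentGraph class, its method names and the use of plain strings as asset ids are my own inventions for illustration, not the API of any real CMS), and a real implementation would obviously need persistence, indexing and integration with the query engine.

```python
from collections import defaultdict, deque


class ContentGraph:
    """Toy content graph (hypothetical API); asset ids are plain strings."""

    def __init__(self):
        self.outgoing = defaultdict(dict)  # source -> {target: link metadata}
        self.incoming = defaultdict(set)   # target -> {sources}

    def add_link(self, source, target, **metadata):
        # References carry arbitrary metadata, stored as a plain dict here.
        self.outgoing[source][target] = metadata
        self.incoming[target].add(source)

    def refers_to(self, asset):
        """Which assets does `asset` refer to?"""
        return set(self.outgoing.get(asset, {}))

    def referred_by(self, asset):
        """Which assets refer to `asset`?"""
        return set(self.incoming.get(asset, set()))

    def within(self, asset, degrees):
        """Assets at most `degrees` hops away, following links in either direction."""
        seen = {asset}
        frontier = deque([(asset, 0)])
        while frontier:
            node, distance = frontier.popleft()
            if distance == degrees:
                continue
            for neighbour in self.refers_to(node) | self.referred_by(node):
                if neighbour not in seen:
                    seen.add(neighbour)
                    frontier.append((neighbour, distance + 1))
        seen.discard(asset)
        return seen

    def links_broken_by_delete(self, asset):
        """Sources whose links would dangle if `asset` were deleted."""
        return sorted(self.referred_by(asset))
```

A 'degrees of separation' query would then filter the result of within() by type, publish date and author, and a delete operation would call links_broken_by_delete() first so that the user can be warned (or offered a fix) before the change is committed.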



Other than linking, I think Justin's post pretty much nails it.  I'm a big fan of schemaless repositories, having worked extensively with several 'schemaed' CMSes that made seemingly simple steps (such as adding or removing a single property from a content type that happened to have instances in existence) a lengthy exercise in frustration.



I'm also a big fan of 'structural' versioning (ala SVN, Git, Mercurial etc.), as it's the only way to properly support rollback in the presence of deletions.  Trying to explain to an irate user that they just deleted not only an asset but also its entire revision history is not something I particularly relish!
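
As a rough illustration of why this matters, here's a toy sketch of snapshot-style versioning, where every commit records the state of the whole tree rather than of a single asset.  The SnapshotRepo class and its methods are hypothetical names made up for this example, and real systems such as Git or SVN are of course far more sophisticated; the only point is that a deletion becomes just another revision, so the earlier revisions survive and can be restored.

```python
class SnapshotRepo:
    """Toy 'structural' version store (hypothetical): each commit is a full tree."""

    def __init__(self):
        self.commits = [{}]  # revision 0 is the empty tree

    @property
    def head(self):
        return self.commits[-1]

    def commit(self, changes):
        """Apply `changes` (path -> content, or None to delete) as a new revision."""
        tree = dict(self.head)
        for path, content in changes.items():
            if content is None:
                tree.pop(path, None)
            else:
                tree[path] = content
        self.commits.append(tree)
        return len(self.commits) - 1

    def history(self, path):
        """(revision, content) pairs for every revision where `path` changed."""
        revisions = []
        previous = None
        for number, tree in enumerate(self.commits):
            current = tree.get(path)
            if current != previous:
                revisions.append((number, current))
            previous = current
        return revisions

    def rollback(self, revision):
        """Restore the tree as it was at `revision`, recorded as a new revision."""
        self.commits.append(dict(self.commits[revision]))
        return len(self.commits) - 1
```

In this model, deleting 'a.html' simply produces a revision in which the path is absent; history('a.html') still lists the earlier contents, and rollback() brings the asset back.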



Rich query and search facilities are a given - it's one thing to put content into a CMS, but if you can't query and search that content, it's little better than a filesystem.



Replication (as in CouchDB, Git, etc.) is also an inevitable requirement for CMSes - I regularly see requirements for a CMS that can provide efficient access to documents across locations that are widely geographically distributed (including cases where connectivity to some of those locations is low bandwidth and/or intermittent).  Replication (with automatic conflict detection and sophisticated features to assist with the inevitably manual process of conflict resolution) is the only mechanism I'm aware of that can handle these cases.



And in closing, a big thank you to Julian Wraith for initiating this discussion - it's extremely refreshing to discover other folks who are as passionate and (if I may say) as opinionated about CMS technology as I am!

Since their inception, Alfresco WCM Web Forms have supported an inclusion mechanism based on the standard XML Schema include and import constructs.  Originally this mechanism read the included assets from the Web Project where the user was creating the content, but since v2.2SP3 the preferred mechanism has been to reference a Web Script instead (in fact the legacy mechanism may be deprecated in a future release).



One question that this new approach raises is how to support inclusion of static XSDs, as Web Scripts are inherently dynamic and introduce some unnecessary overhead for the simple static case.  The good news is that Alfresco ships with a Web Script that simply reads a file from the repository and returns its contents:

/api/path/content{property}/{store_type}/{store_id}/{path}?a={attach?}





An example usage is:

/api/path/content/workspace/SpacesStore/Company Home/Data Dictionary/Presentation Templates/readme.ftl
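
If you want to call this Web Script from outside Alfresco (eg. to check that a file is reachable before referencing it), the URL can be assembled along the following lines.  This is only a sketch: the path_content_url helper is a made-up name, the base URL is an assumption about a default local installation, the optional {property} part of the template is left out, and I'm assuming that 'a=true' is the way to request the content as an attachment.

```python
from urllib.parse import quote

# Assumed location of the Web Script servlet; adjust for your installation.
BASE_URL = "http://localhost:8080/alfresco/service"


def path_content_url(path, store_type="workspace", store_id="SpacesStore",
                     attach=False, base_url=BASE_URL):
    """Build a URL for the /api/path/content Web Script (hypothetical helper)."""
    # Percent-encode each segment separately so the '/' separators survive
    # while spaces (as in 'Company Home') become %20.
    segments = path.strip("/").split("/")
    encoded = "/".join(quote(segment, safe="") for segment in segments)
    url = f"{base_url}/api/path/content/{store_type}/{store_id}/{encoded}"
    if attach:
        url += "?a=true"  # assumption: 'a' is the attach flag from the template
    return url


print(path_content_url(
    "Company Home/Data Dictionary/Presentation Templates/readme.ftl"))
```

Authentication (eg. a ticket parameter or HTTP basic auth) is deliberately left out of the sketch.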





Using the Web Script inclusion mechanism for Web Forms, we can use this Web Script to include or import any XSD file stored in the DM repository.  For example, if we have a file called 'my-include.xsd' in the 'Company Home' space that contains the following content:

<?xml version='1.0'?>
<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'
           xmlns:alf='http://www.alfresco.org/'
           targetNamespace='http://www.alfresco.org/'
           elementFormDefault='qualified'>
  <xs:complexType abstract='true' name='IncludedComplexType'>
    <xs:sequence>
      <xs:element name='Title'
                  type='xs:normalizedString'
                  minOccurs='1'
                  maxOccurs='1' />
      <xs:element name='Summary'
                  type='xs:string'
                  minOccurs='0'
                  maxOccurs='1' />
      <xs:element name='Keyword'
                  type='xs:normalizedString'
                  minOccurs='0'
                  maxOccurs='unbounded' />
    </xs:sequence>
  </xs:complexType>
</xs:schema>





We could include it into a Web Form XSD using an include statement such as the following:

<?xml version='1.0'?>
<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'
           xmlns:alf='http://www.alfresco.org/'
           targetNamespace='http://www.alfresco.org/'
           elementFormDefault='qualified'>
  <xs:include schemaLocation='webscript://api/path/content/workspace/SpacesStore/Company Home/my-include.xsd?ticket={ticket}' />
  <xs:complexType name='MyWebFormType'>
    <xs:complexContent>
      <xs:extension base='alf:IncludedComplexType'>
        <xs:sequence>
          <xs:element name='Body'
                      type='xs:string'
                      minOccurs='1'
                      maxOccurs='1' />
        </xs:sequence>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>
  <xs:element name='MyWebForm' type='alf:MyWebFormType' />
</xs:schema>





This is clearly faster and easier than developing a custom Web Script to either emit the XML Schema shown above, or to return the contents of a specific XSD file from the repository!



This approach also provides a solution to another question: how does one neatly package up a Web Form, along with all of its dependencies, ready for deployment to another Alfresco environment?



By storing included XSD files in Company Home > Data Dictionary > Web Forms, we give ourselves the option to package up the entire Web Forms space as an ACP file and deploy that ACP file to any other Alfresco environment, knowing that we've captured not only all of the Web Forms in the source environment, but all dependent XSD files as well.

Seth Gottlieb has written a great post entitled 'Code moves forward. Content moves backward.' that, by strange coincidence, echoes an Alfresco KB item authored by Alfresco's very own Ben Hagan last year.



What's interesting to me is that there is an alternative world view that asserts that code and content are two sides of the same coin and hence should be managed the same way in the same management system.  This meme seems particularly strong amongst those who are adherents of the Boiko school of thought and also those who've had significant exposure to certain Web CMS products (that shall remain nameless) that are clearly designed for the blended model, and so indoctrinate users/developers to use a blended model in all cases (whether appropriate or not).



My experience has been that blending code and content management together doesn't work well in the majority of cases, for two primary reasons:



  1. Typically very different groups are producing the code and the content - often they're in completely different divisions within the organisation (ie. IT vs business unit) and sometimes are even separate companies (ie. web agency vs client).


  2. The release cycles for code and content are vastly different - code is typically released infrequently (weekly, at best), while the content on any large site is typically changing virtually non-stop.



The net result is that shoehorning both activities together creates unnecessary procedural couplings between groups who are typically poorly structured (from a communication and coordination perspective) to efficiently manage those redundant couplings.



Anyway, it's a great post on a very interesting topic, and I'd definitely encourage anyone involved in implementing a Web CMS (whether Alfresco WCM or not) to give it a solid read.
