We all use technology standards every day.

We hop into our cars and play music from our mobile phones through the car speakers.  It doesn't matter who the device or car manufacturer is, as long as they both add support for the A2DP Bluetooth profile.

We connect our set top boxes to our televisions with a single HDMI cable, again regardless of manufacturers.

We send emails to people in various organizations that use different email server vendors, but we don't need to worry about how our server communicates with theirs; they both just support SMTP.

Specific to Digital Asset Management (DAM), we add captions and keywords to images in common desktop applications and most of us don't have to think about the fact that they're being embedded using IPTC.

The examples are numerous, and we usually take the end result for granted, but those standards take significant time and effort: the problems have to be properly defined, and the solutions have to satisfy everyone participating, some of whom have competing interests.

Disconnected silos of media are widespread in the DAM space.  A large organization can have multiple DAM systems, even tens of them, usually for different departments.  A common thread in the industry is that we need to make it easier to integrate these systems with each other and with the rest of the enterprise so that the entire lifecycle of content can be managed in an efficient manner.

What might those integrations look like?

Metadata. Say a web content management system needs to display an image and a crucial piece of metadata, its credit line for example.  There might be several web pages or even separate CMSes using the same image.  If that credit line needs to change, the CMS users shouldn't have to update it in every spot it's used.  An integration could be developed to pull the image and data from a central DAM by building a connector between the two systems, but what if the DAM is switched out for another vendor, or there are multiple DAMs from different vendors?

Renditions. Lower quality renditions of rich media files (called proxies in the video world) are often used instead of the much larger original source files for various scenarios.  Let's say we have a collaborative review mobile app which can be configured to point to a DAM system in order to use or request these renditions for playback or display.  For the app to support DAMs from multiple vendors, a connector must be developed for each, assuming the vendor even has a relevant API for the task in the first place.

Rights. Omni-channel customer experience management solutions (Mmmm, words, so buzzy) might need to assemble content and obtain rights information from a DAM before making a campaign live.  There again: a different integration needed for each DAM vendor.


Interoperability standards are an obvious answer to these problems.

If we, the DAM ecosystem, build these critical system-to-system connectors using open standards that many vendors have agreed upon and ultimately implemented, we can be assured we'll only have to do it once and will avoid vendor lock-in.

If you haven't heard, work was started on just such a standard for the DAM industry, CMIS4DAM, designed to work alongside the existing Content Management Interoperability Services (CMIS) standard.  It held great promise as an answer to the call that seemed to be coming from many DAM analysts, consultants, vendors, and customers, and initial participation came from a range of companies and individuals full of vigor to help shape what was to come.

The OASIS technical committee (TC) charter was put forth, use cases and deliverables were defined, and work started on the specification itself.

Unfortunately, participation waned throughout that process, and the monthly meetings now consistently involve only a few attendees.

Without commitment from more individuals, and perhaps more importantly vendors, OASIS will have to shut down the CMIS4DAM TC.

Consider this a call for help.

If you were on the TC and have stopped attending, was there a specific reason?

If you're a DAM vendor interested in improving the DAM technology ecosystem, would you consider joining the committee to help develop the specification?

Is there a standard other than CMIS4DAM you feel would be a better match for the DAM community?

Perhaps most importantly, if you're a purchaser of a DAM system and have interest in leveraging these concepts in your implementation, contact your DAM vendor and encourage them to participate.  Your demands (hopefully) drive their roadmap, so let's make sure an open interoperability standard is on it!

ragauss

Metadata Extraction to Tags

Posted by ragauss Employee May 31, 2013

The Alfresco Tags You Know and Love



The tagging capabilities in Alfresco and Share provide an easy way for users to associate tags with a piece of content and filter content sets by those tags. It's an excellent way to add more context for other members of a team, and is particularly useful for visual content where there may not be any associated text to describe or enable searching for what's depicted.



Some file formats may contain metadata that can aid in that description and searching, and that metadata is likely something Alfresco can already extract, but we wouldn't get the slick user experience that tags afford us.



'Wait! What if we could map extracted metadata to standard tags?' Hey, that's a great idea!

Tag Mapping



We introduced the ability to map metadata extraction to tags in 4.2.c, but it's not enabled out of the box. Let's take a quick look at how things worked, what was changed, then how you might use it.

Metadata Extraction Mapping Refresher



ContentMetadataExtracter is the action executer which gets the proper MetadataExtracter from the MetadataExtracterRegistry, then calls its extract method to fill in a properties map.



AbstractMappingMetadataExtracter, which most metadata extractors extend, allows you to map incoming metadata fields to Alfresco properties.

Tags Refresher



Alfresco's tags are stored and displayed using the cm:taggable property inside the cm:taggable aspect. A type of category node is created for each tag (or linked to if it already exists) and is associated with the tagged content as a property by the TaggingService.



In the past you couldn't just map your free-form string metadata fields to cm:taggable, as it expects a nodeRef to perform that property linking.

What's Changed



Now we catch MalformedNodeRefExceptions related to tags in AbstractMappingMetadataExtracter, and if enableStringTagging=true the raw string values are passed on as is to the next step. There may be some cases where you actually have a tag's nodeRef as a metadata field in your binary file, in which case no MalformedNodeRefException would be thrown and your content would be linked to that existing tag.



Once we've returned to ContentMetadataExtracter the properties modified by the metadata extractor are iterated and set by the NodeService. It's during that process that we look for cm:taggable and use the TaggingService to create or link the raw string tags, provided enableStringTagging=true and the TaggingService is set.



Multi-valued metadata fields are supported of course, and a tag will be created or linked for each value.
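
To make that flow a bit more concrete, here's a minimal, self-contained Java sketch of the decision described above. It is not the Alfresco source: the StringTaggingSketch class, its Resolved holder, and the rough nodeRef pattern are all hypothetical stand-ins, and trimming values and skipping blank ones is our own assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical, simplified stand-in for the tag handling described above.
// None of these names are Alfresco API.
class StringTaggingSketch {

    // Rough shape of a nodeRef string: protocol://store-id/node-id (an assumption)
    private static final Pattern NODE_REF = Pattern.compile("^[^:/]+://[^/]+/.+$");

    /** Holds the outcome for the values mapped to cm:taggable. */
    static final class Resolved {
        final List<String> existingTagRefs = new ArrayList<>();
        final List<String> tagsToCreateOrLink = new ArrayList<>();
    }

    static Resolved resolve(List<String> values, boolean enableStringTagging) {
        Resolved resolved = new Resolved();
        for (String value : values) {
            if (value == null || value.trim().isEmpty()) {
                continue; // nothing to tag
            }
            String trimmed = value.trim();
            if (NODE_REF.matcher(trimmed).matches()) {
                // Looks like a nodeRef: link to that existing tag
                resolved.existingTagRefs.add(trimmed);
            } else if (enableStringTagging) {
                // Raw string: hand it on so a tag can be created or linked
                resolved.tagsToCreateOrLink.add(trimmed);
            }
            // Otherwise the raw string is skipped in this sketch
        }
        return resolved;
    }
}
```

Each value of a multi-valued field goes through the same check, so a keywords list produces one entry per keyword.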

How to Use it



Again, to make all this magic happen you must currently set the taggingService property on ContentMetadataExtracter and set enableStringTagging=true. Your overriding bean definition might look like this:

<bean id='extract-metadata' class='org.alfresco.repo.action.executer.ContentMetadataExtracter' parent='action-executer'>
    <property name='nodeService'>
        <ref bean='NodeService' />
    </property>
    <property name='contentService'>
        <ref bean='ContentService' />
    </property>
    <property name='dictionaryService'>
        <ref bean='dictionaryService' />
    </property>
    <!-- Needed so that raw string tags can be created or linked -->
    <property name='taggingService'>
        <ref bean='TaggingService' />
    </property>
    <property name='metadataExtracterRegistry'>
        <ref bean='metadataExtracterRegistry' />
    </property>
    <property name='applicableTypes'>
        <list>
            <value>{http://www.alfresco.org/model/content/1.0}content</value>
        </list>
    </property>
    <property name='carryAspectProperties'>
        <value>true</value>
    </property>
    <!-- Turns on mapping of string metadata values to tags -->
    <property name='enableStringTagging'>
        <value>true</value>
    </property>
</bean>


Then define your metadata extractor mapping, something like:

dc\:subject=cm:taggable


IPTC Keywords Example



The Media Management module supports full IPTC extraction for images. IPTC is where the keywords used by so many photo editing and organization programs are stored, which makes them a perfect candidate for mapping to Alfresco tags:



[Image: Tag Mapping]



What other metadata fields are you thinking of mapping to tags?
ragauss

Metadata Embedding

Posted by ragauss Employee May 31, 2013

What is Metadata Embedding?



Extraction of metadata from binary files is a critical task for enterprise content and digital asset management systems. The information contained in those files can aid in searching, workflows, and user interface visualizations.



Alfresco does a fantastic job of handling metadata extraction through its concept of MetadataExtracters registering themselves in the MetadataExtracterRegistry, and the use of the Apache Tika project to power many of those extractors enables a huge number of file formats and metadata standards to be supported.



We ingest a binary file, metadata is extracted and mapped to Alfresco data model properties, and we can view and edit those properties in an interface like Alfresco Share.



In some cases it's important to get those property changes or other required fields back into the binary file as metadata. You might, for example, want to set the author metadata in a document or set copyright info in images before sending them outside of your organization.



In 4.2.c we introduced the concept of metadata embedders, which are essentially the inverse of MetadataExtracters, and are responsible for writing properties into content.

How Does it Work?



The MetadataEmbedder interface has just two methods, isEmbeddingSupported, and embed.



Rather than create an entirely separate registry for embedders, the MetadataExtracterRegistry was extended with a getEmbedder(String sourceMimetype) method. Note that currently only embedders which are also extractors can be registered, but in the future support may be added for explicitly registering embedders. You'd usually implement both in the same class anyway. Speaking of...



AbstractMappingMetadataExtracter now implements the MetadataEmbedder interface and contains:



  • A supportedEmbedMimetypes collection that's used in the isEmbeddingSupported call


  • embedMapping that defines the mapping from Alfresco properties to metadata fields


  • An embedInternal method to be overridden by extending classes


Again, just the reverse of the extraction pattern.
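
As a rough illustration of how those three pieces fit together, here's a self-contained Java sketch. It is deliberately simplified and is not the Alfresco code: EmbedderSketch and MappingEmbedderSketch are hypothetical names, and passing content around as a byte array with string property names is an assumption made to keep the example free of Alfresco dependencies.

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Simplified stand-in for the two-method embedder idea. The byte[] signature
// is an assumption for this sketch, not the Alfresco API.
interface EmbedderSketch {
    boolean isEmbeddingSupported(String mimetype);
    byte[] embed(Map<String, Serializable> properties, byte[] content);
}

// Hypothetical base class echoing the three pieces listed above.
abstract class MappingEmbedderSketch implements EmbedderSketch {
    private final Set<String> supportedEmbedMimetypes;
    // Alfresco property name -> metadata field names to write
    private final Map<String, Set<String>> embedMapping;

    protected MappingEmbedderSketch(Set<String> supportedEmbedMimetypes,
                                    Map<String, Set<String>> embedMapping) {
        this.supportedEmbedMimetypes = supportedEmbedMimetypes;
        this.embedMapping = embedMapping;
    }

    @Override
    public boolean isEmbeddingSupported(String mimetype) {
        return supportedEmbedMimetypes.contains(mimetype);
    }

    @Override
    public byte[] embed(Map<String, Serializable> properties, byte[] content) {
        // Translate properties into metadata fields using the embed mapping
        Map<String, Serializable> metadata = new HashMap<>();
        for (Map.Entry<String, Serializable> property : properties.entrySet()) {
            Set<String> fields = embedMapping.get(property.getKey());
            if (fields == null) {
                continue; // unmapped properties are not embedded
            }
            for (String field : fields) {
                metadata.put(field, property.getValue());
            }
        }
        return embedInternal(metadata, content);
    }

    // Extending classes do the format-specific writing here
    protected abstract byte[] embedInternal(Map<String, Serializable> metadata, byte[] content);
}
```

The base class owns the mimetype check and the mapping, so a concrete embedder only has to know how to write already-named metadata fields into its file format.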



For classes extending AbstractMappingMetadataExtracter, the embed mapping can be defined in a properties file in the same location as the extract mapping properties but with an embed suffix, e.g. classpath:/x/y/z/MyExtracter.embed.properties (note that the preferred location for mapping files for extractors and embedders has changed after 4.2.c, see ALF-17891). If no embed properties are found, a reverse mapping of the extract mapping is used by default. Cool, right?
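
If you're wondering what "a reverse mapping" amounts to, here's a small Java sketch of the idea under our own assumptions. ReverseMappingSketch and invert are hypothetical names, and collecting every source field when several map to the same property is a choice made for this example, not a statement of how Alfresco resolves that case.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: turns an extract mapping (metadata field -> properties)
// into an embed mapping (property -> metadata fields).
class ReverseMappingSketch {

    static Map<String, Set<String>> invert(Map<String, Set<String>> extractMapping) {
        Map<String, Set<String>> embedMapping = new LinkedHashMap<>();
        for (Map.Entry<String, Set<String>> entry : extractMapping.entrySet()) {
            String metadataField = entry.getKey();
            for (String property : entry.getValue()) {
                // A property fed by several fields keeps all of them
                embedMapping
                        .computeIfAbsent(property, key -> new LinkedHashSet<>())
                        .add(metadataField);
            }
        }
        return embedMapping;
    }
}
```

So an extract mapping that sends a description field to cm:description would, when reversed, write cm:description back into that same description field.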

What About Tika?



'But that's still sooooo... abstract. How are we going to leverage Tika? It doesn't support embedding, does it?'



Well as a matter of fact it does, as of version 1.3 (TIKA-775).



The same notion of writing metadata into a binary has been outlined with an interface and basic implementation in Tika, so of course our TikaPoweredMetadataExtracter builds on that. It overrides the embedInternal method defined in its parent AbstractMappingMetadataExtracter to convert Alfresco properties to Tika metadata fields, then passes them on to a Tika Embedder's embed method, which passes back the new binary with the metadata embedded.



[Image: Tika embedding]

How Can we Use Embedding?



Our shiny new Alfresco metadata embedder's embed method isn't very useful if we don't have an easy way to call it, so we've added a ContentMetadataEmbedder action executor which shows up as a standard 'Embed properties as metadata in content' action that can be used in a rule on a folder or executed in a workflow.  (After 4.2.c you can find this in alfresco/extension/metadata-embedding-context.xml.sample)



So what kinds of files and metadata does Tika have embed support for? Truth be told, not many at the moment, but the tika-exiftool project does!



tika-exiftool is a wrapper for calls to the ExifTool command-line tool, and it contains a Tika Parser and Embedder for image files.



The Media Management module contains an example which brings all of this together with an extension of TikaPoweredMetadataExtracter that uses the Tika Embedder defined in the tika-exiftool project to enable IPTC embedding in image files.



We can add an embed rule to a folder that fires on content update such that when we edit our caption field through Share, the new value is embedded in the file and can be seen using standard image metadata tools, like Photoshop's file info.



[Image: Embed flow]



Sit down and stop clapping, everyone is staring at you. Aw, who cares, go ahead.

What's Next?



We'll be adding embed support for more file and metadata types to Tika and Alfresco in the future including, of course, documents, but in the meantime, what other formats are you anxious to start embedding?
