Metadata Embedding

Blog Post created by ragauss Employee on May 31, 2013

What is Metadata Embedding?



Extraction of metadata from binary files is a critical task for enterprise content and digital asset management systems. The information contained in those files can aid in searching, workflows, and user interface visualizations.



Alfresco does a fantastic job of handling metadata extraction through its concept of MetadataExtracters registering themselves in the MetadataExtracterRegistry. Using the Apache Tika project to power many of those extractors lets it support a huge number of file formats and metadata standards.



We ingest a binary file, metadata is extracted and mapped to Alfresco data model properties, and we can view and edit those properties in an interface like Alfresco Share.



In some cases it's important to get those property changes or other required fields back into the binary file as metadata. You might, for example, want to set the author metadata in a document or set copyright info in images before sending them outside of your organization.



In 4.2.c we introduced the concept of metadata embedders, which are essentially the inverse of MetadataExtracters, and are responsible for writing properties into content.

How Does it Work?



The MetadataEmbedder interface has just two methods: isEmbeddingSupported and embed.
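
To make that contract a bit more concrete, here is a minimal, self-contained sketch of its shape, along with the mimetype lookup described in the next paragraph. Everything in it is a stand-in written for this post: the Extracter, Embedder, and Registry types, the use of byte arrays and string maps, and the TextExtracter and ImageExtracter classes are all hypothetical and are not the actual Alfresco signatures.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class EmbedderRegistrySketch {

    /** Stand-in for an extracter; only the part needed for this sketch. */
    interface Extracter {
        boolean isSupported(String mimetype);
    }

    /** Simplified two-method embedder contract (byte arrays are an assumption of this sketch). */
    interface Embedder {
        boolean isEmbeddingSupported(String mimetype);
        byte[] embed(Map<String, String> properties, byte[] content);
    }

    /** One registry for both roles: embedders are found among the registered extracters. */
    static class Registry {
        private final List<Extracter> extracters = new ArrayList<>();

        void register(Extracter extracter) {
            extracters.add(extracter);
        }

        /** Returns the first registered extracter that can also embed into this mimetype, or null. */
        Embedder getEmbedder(String sourceMimetype) {
            for (Extracter extracter : extracters) {
                if (extracter instanceof Embedder) {
                    Embedder embedder = (Embedder) extracter;
                    if (embedder.isEmbeddingSupported(sourceMimetype)) {
                        return embedder;
                    }
                }
            }
            return null;
        }
    }

    /** Hypothetical class that only extracts. */
    static class TextExtracter implements Extracter {
        public boolean isSupported(String mimetype) {
            return "text/plain".equals(mimetype);
        }
    }

    /** Hypothetical class that both extracts and embeds. */
    static class ImageExtracter implements Extracter, Embedder {
        public boolean isSupported(String mimetype) {
            return "image/jpeg".equals(mimetype);
        }

        public boolean isEmbeddingSupported(String mimetype) {
            return "image/jpeg".equals(mimetype);
        }

        public byte[] embed(Map<String, String> properties, byte[] content) {
            // A real embedder would rewrite the file's metadata; this toy returns the content untouched.
            return content;
        }
    }

    public static void main(String[] args) {
        Registry registry = new Registry();
        registry.register(new TextExtracter());
        registry.register(new ImageExtracter());
        System.out.println("image/jpeg embedder found: " + (registry.getEmbedder("image/jpeg") != null));
        System.out.println("text/plain embedder found: " + (registry.getEmbedder("text/plain") != null));
    }
}
```

The instanceof check is how this sketch models the "only embedders which are also extractors" restriction: nothing can be looked up as an embedder unless it was registered as an extracter first.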



Rather than create an entirely separate registry for embedders, the MetadataExtracterRegistry was extended with a getEmbedder(String sourceMimetype) method. Note that currently only embedders which are also extractors can be registered, but in the future support may be added for explicitly registering embedders. You'd usually implement both in the same class anyway. Speaking of...



AbstractMappingMetadataExtracter now implements the MetadataEmbedder interface and contains:



  • A supportedEmbedMimetypes collection that's used in the isEmbeddingSupported call


  • embedMapping that defines the mapping from Alfresco properties to metadata fields


  • An embedInternal method to be overridden by extending classes


Again, just the reverse of the extraction pattern.
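
The three pieces listed above fit together roughly like the sketch below. It is only an illustration under simplifying assumptions: AbstractMappingEmbedder and TrailerEmbedder are hypothetical names, properties and metadata are plain string maps, and content is a byte array, none of which is meant to mirror the real class exactly.

```java
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

class MappingEmbedderSketch {

    /** Hypothetical base class holding the mimetype set, the embed mapping, and the template method. */
    static abstract class AbstractMappingEmbedder {
        private final Set<String> supportedEmbedMimetypes;
        private final Map<String, String> embedMapping; // model property -> metadata field

        AbstractMappingEmbedder(Set<String> supportedEmbedMimetypes, Map<String, String> embedMapping) {
            this.supportedEmbedMimetypes = supportedEmbedMimetypes;
            this.embedMapping = embedMapping;
        }

        public boolean isEmbeddingSupported(String mimetype) {
            return supportedEmbedMimetypes.contains(mimetype);
        }

        /** Translates properties through the embed mapping, then hands off to the subclass. */
        public byte[] embed(Map<String, String> properties, byte[] content) {
            Map<String, String> metadata = new LinkedHashMap<>();
            for (Map.Entry<String, String> property : properties.entrySet()) {
                String field = embedMapping.get(property.getKey());
                if (field != null) { // unmapped properties are skipped
                    metadata.put(field, property.getValue());
                }
            }
            return embedInternal(metadata, content);
        }

        /** Format-specific writing, overridden by extending classes. */
        protected abstract byte[] embedInternal(Map<String, String> metadata, byte[] content);
    }

    /** Toy subclass: appends one "field=value" line per metadata field to text content. */
    static class TrailerEmbedder extends AbstractMappingEmbedder {
        TrailerEmbedder(Map<String, String> embedMapping) {
            super(Collections.singleton("text/plain"), embedMapping);
        }

        @Override
        protected byte[] embedInternal(Map<String, String> metadata, byte[] content) {
            StringBuilder out = new StringBuilder(new String(content, StandardCharsets.UTF_8));
            for (Map.Entry<String, String> field : metadata.entrySet()) {
                out.append('\n').append(field.getKey()).append('=').append(field.getValue());
            }
            return out.toString().getBytes(StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) {
        Map<String, String> mapping = new LinkedHashMap<>();
        mapping.put("cm:author", "Author"); // hypothetical mapping entry
        Map<String, String> properties = new LinkedHashMap<>();
        properties.put("cm:author", "ragauss");
        byte[] result = new TrailerEmbedder(mapping).embed(properties, "body".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(result, StandardCharsets.UTF_8));
    }
}
```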



For classes extending AbstractMappingMetadataExtracter, the embed mapping can be defined in a properties file in the same location as the extract mapping properties but with an embed suffix, e.g. classpath:/x/y/z/MyExtracter.embed.properties. (Note that the preferred location for mapping files for extractors and embedders has changed after 4.2.c; see ALF-17891.) If no embed properties are found, a reverse mapping of the extract mapping is used by default. Cool, right?
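
The fallback is easy to picture in code. The sketch below is an assumption-laden illustration, not the actual implementation: the method names reverse and resolveEmbedMapping are made up, mappings are simplified to one property per field, and the first-one-wins rule for collisions is just a choice made here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class EmbedMappingFallbackSketch {

    /**
     * Swaps keys and values of an extract mapping (metadata field -> model property).
     * If two fields map to the same property, the first one wins (a choice of this sketch).
     */
    static Map<String, String> reverse(Map<String, String> extractMapping) {
        Map<String, String> reversed = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : extractMapping.entrySet()) {
            reversed.putIfAbsent(entry.getValue(), entry.getKey());
        }
        return reversed;
    }

    /** Uses the explicit embed mapping when one was found, otherwise the reversed extract mapping. */
    static Map<String, String> resolveEmbedMapping(Map<String, String> explicitEmbedMapping,
                                                   Map<String, String> extractMapping) {
        return explicitEmbedMapping != null ? explicitEmbedMapping : reverse(extractMapping);
    }

    public static void main(String[] args) {
        Map<String, String> extractMapping = new LinkedHashMap<>();
        extractMapping.put("author", "cm:author");           // hypothetical entries
        extractMapping.put("description", "cm:description");

        // No embed properties found, so the reversed extract mapping is used.
        System.out.println(resolveEmbedMapping(null, extractMapping));
    }
}
```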

What About Tika?



'But that's still sooooo... abstract. How are we going to leverage Tika? It doesn't support embedding, does it?'



Well as a matter of fact it does, as of version 1.3 (TIKA-775).



The same notion of writing metadata into a binary has been outlined in Tika with an interface and a basic implementation. So of course our TikaPoweredMetadataExtracter builds on that: it overrides the embedInternal method defined in its parent, AbstractMappingMetadataExtracter, to convert Alfresco properties to Tika metadata fields and pass them on to a Tika Embedder's embed method, which then passes back the new binary with the metadata embedded.
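
Here is a rough sketch of that hand-off. To keep it runnable without Tika on the classpath, the StreamEmbedder interface is a hypothetical stand-in that is only loosely modeled on the "metadata and original stream in, new stream out" idea; it is not Tika's Embedder interface, and the HeaderEmbedder class and the "Caption" field name are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

class TikaStyleEmbedSketch {

    /** Hypothetical stand-in for a stream-based embedder. */
    interface StreamEmbedder {
        void embed(Map<String, String> metadata, InputStream original, OutputStream output) throws IOException;
    }

    /** Toy embedder: writes one "field: value" header line per field, then copies the original bytes. */
    static class HeaderEmbedder implements StreamEmbedder {
        public void embed(Map<String, String> metadata, InputStream original, OutputStream output)
                throws IOException {
            for (Map.Entry<String, String> field : metadata.entrySet()) {
                String line = field.getKey() + ": " + field.getValue() + "\n";
                output.write(line.getBytes(StandardCharsets.UTF_8));
            }
            byte[] buffer = new byte[4096];
            int read;
            while ((read = original.read(buffer)) != -1) {
                output.write(buffer, 0, read);
            }
        }
    }

    /** Plays the role of embedInternal: convert properties to fields, delegate, return the new binary. */
    static byte[] embedInternal(Map<String, String> properties, Map<String, String> embedMapping,
                                byte[] content, StreamEmbedder embedder) {
        Map<String, String> metadata = new LinkedHashMap<>();
        for (Map.Entry<String, String> property : properties.entrySet()) {
            String field = embedMapping.get(property.getKey());
            if (field != null) {
                metadata.put(field, property.getValue());
            }
        }
        ByteArrayOutputStream output = new ByteArrayOutputStream();
        try {
            embedder.embed(metadata, new ByteArrayInputStream(content), output);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return output.toByteArray();
    }

    public static void main(String[] args) {
        Map<String, String> embedMapping = new LinkedHashMap<>();
        embedMapping.put("cm:description", "Caption"); // hypothetical field name
        Map<String, String> properties = new LinkedHashMap<>();
        properties.put("cm:description", "Sunset over the bay");

        byte[] result = embedInternal(properties, embedMapping,
                "PIXELS".getBytes(StandardCharsets.UTF_8), new HeaderEmbedder());
        System.out.println(new String(result, StandardCharsets.UTF_8));
    }
}
```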



[Figure: Tika embedding]

How Can We Use Embedding?



Our shiny new Alfresco metadata embedder's embed method isn't very useful if we don't have an easy way to call it, so we've added a ContentMetadataEmbedder action executor. It shows up as a standard 'Embed properties as metadata in content' action that can be used in a rule on a folder or executed in a workflow. (After 4.2.c you can find this in alfresco/extension/metadata-embedding-context.xml.sample.)



So what kinds of files and metadata does Tika have embed support for? Truth be told, not many at the moment, but the tika-exiftool project does!



tika-exiftool is a wrapper for calls to the ExifTool command-line tool, and it contains a Tika Parser and Embedder for image files.



The Media Management module contains an example which brings all of this together with an extension of TikaPoweredMetadataExtracter that uses the Tika Embedder defined in the tika-exiftool project to enable IPTC embedding in image files.



We can add an embed rule to a folder that fires on content update such that when we edit our caption field through Share, the new value is embedded in the file and can be seen using standard image metadata tools, like Photoshop's file info.



[Figure: Embed flow]



Sit down and stop clapping, everyone is staring at you. Aw, who cares, go ahead.

What's Next?



We'll be adding embed support for more file and metadata types to Tika and Alfresco in the future including, of course, documents, but in the meantime, what other formats are you anxious to start embedding?
