
Metadata Extraction to Tags

Posted by ragauss Employee May 31, 2013

The Alfresco Tags You Know and Love



The tagging capabilities in Alfresco and Share provide an easy way for users to associate tags with a piece of content and filter content sets by those tags. It's an excellent way to add more context for other members of a team, and is particularly useful for visual content where there may not be any associated text to describe or enable searching for what's depicted.



Some file formats may contain metadata that can aid in that description and searching, and that metadata is likely something Alfresco can already extract, but we wouldn't get the slick user experience that tags afford us.



'Wait! What if we could map extracted metadata to standard tags?' Hey, that's a great idea!

Tag Mapping



We introduced the ability to map metadata extraction to tags in 4.2.c, but it's not enabled out of the box. Let's take a quick look at how things used to work, what's changed, and how you might use it.

Metadata Extraction Mapping Refresher



ContentMetadataExtracter is the action executer which does the work of getting the proper MetadataExtracter from the MetadataExtracterRegistry, then calls its extract method to fill in a properties map.



AbstractMappingMetadataExtracter, which is what most metadata extractors extend from, allows you to map incoming metadata fields to Alfresco properties.
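
For example, an extractor's extract mapping typically lives in a properties file alongside the class. A minimal sketch, assuming a hypothetical extractor (the field names on the left are whatever the extractor emits; the namespace.prefix declarations are how these files resolve the QName prefixes):

# MyExtracter.properties - hypothetical extract mapping
namespace.prefix.cm=http://www.alfresco.org/model/content/1.0
author=cm:author
title=cm:title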

Tags Refresher



Alfresco's tags are stored and displayed using the cm:taggable property defined by the cm:taggable aspect. For each tag, a category node is created (or linked to if it already exists) and associated with the tagged content through that property by the TaggingService.



In the past you couldn't just map your free-form string metadata fields to cm:taggable, as it expects a NodeRef to perform that property linking.
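
For reference, the relevant piece of the out-of-the-box content model looks roughly like this (simplified from contentModel.xml):

<aspect name="cm:taggable">
    <properties>
        <property name="cm:taggable">
            <type>d:category</type>
            <multiple>true</multiple>
        </property>
    </properties>
</aspect>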

What's Changed



Now we catch MalformedNodeRefExceptions related to tags in AbstractMappingMetadataExtracter, and if enableStringTagging=true the raw string values are passed on as-is to the next step. There may be some cases where you actually have a tag's nodeRef as a metadata field in your binary file, in which case no MalformedNodeRefException would be thrown and your content would be linked to that existing tag.



Once we've returned to ContentMetadataExtracter, the properties modified by the metadata extractor are iterated and set via the NodeService. It's during that process that we look for cm:taggable and use the TaggingService to create or link the raw string tags, provided enableStringTagging=true and the taggingService property is set.



Multi-valued metadata fields are, of course, supported, and a tag will be created or linked for each value.

How to Use it



Again, to make all this magic happen you must currently set the taggingService property on ContentMetadataExtracter and set enableStringTagging=true. Your overriding bean definition might look like this:

<bean id="extract-metadata" class="org.alfresco.repo.action.executer.ContentMetadataExtracter" parent="action-executer">
    <property name="nodeService">
        <ref bean="NodeService" />
    </property>
    <property name="contentService">
        <ref bean="ContentService" />
    </property>
    <property name="dictionaryService">
        <ref bean="dictionaryService" />
    </property>
    <property name="taggingService">
        <ref bean="TaggingService" />
    </property>
    <property name="metadataExtracterRegistry">
        <ref bean="metadataExtracterRegistry" />
    </property>
    <property name="applicableTypes">
        <list>
            <value>{http://www.alfresco.org/model/content/1.0}content</value>
        </list>
    </property>
    <property name="carryAspectProperties">
        <value>true</value>
    </property>
    <property name="enableStringTagging">
        <value>true</value>
    </property>
</bean>


Then define your metadata extractor mapping, something like:

dc\:subject=cm:taggable
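
Note that the dc prefix has to be declared in the same mapping file. In context, a sketch of the full file might be (the file name is hypothetical; the Dublin Core URI is the standard one):

# MyExtracter.properties
namespace.prefix.cm=http://www.alfresco.org/model/content/1.0
namespace.prefix.dc=http://purl.org/dc/elements/1.1/
dc\:subject=cm:taggable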


IPTC Keywords Example



The Media Management module supports full IPTC extraction for images. IPTC is where the keywords used by so many photo editing and organization programs are stored, making them a perfect candidate for mapping to Alfresco tags:



[Image: Tag Mapping]



What other metadata fields are you thinking of mapping to tags?

Metadata Embedding

Posted by ragauss Employee May 31, 2013

What is Metadata Embedding?



Extraction of metadata from binary files is a critical task for enterprise content and digital asset management systems. The information contained in those files can aid in searching, workflows, and user interface visualizations.



Alfresco does a fantastic job of handling metadata extraction through its concept of MetadataExtracters registering themselves in the MetadataExtracterRegistry, and the use of the Apache Tika project to power many of those extractors enables a huge number of file formats and metadata standards to be supported.



We ingest a binary file, metadata is extracted and mapped to Alfresco data model properties, and we can view and edit those properties in an interface like Alfresco Share.



In some cases it's important to get those property changes or other required fields back into the binary file as metadata. You might, for example, want to set the author metadata in a document or set copyright info in images before sending them outside of your organization.



In 4.2.c we introduced the concept of metadata embedders, which are essentially the inverse of MetadataExtracters, and are responsible for writing properties into content.

How Does it Work?



The MetadataEmbedder interface has just two methods: isEmbeddingSupported and embed.



Rather than create an entirely separate registry for embedders, the MetadataExtracterRegistry was extended with a getEmbedder(String sourceMimetype) method. Note that currently only embedders which are also extractors can be registered, but in the future support may be added for explicitly registering embedders. You'd usually implement both in the same class anyway. Speaking of...



AbstractMappingMetadataExtracter now implements the MetadataEmbedder interface and contains:



  • A supportedEmbedMimetypes collection that's used in the isEmbeddingSupported call


  • embedMapping that defines the mapping from Alfresco properties to metadata fields


  • An embedInternal method to be overridden by extending classes


Again, just the reverse of the extraction pattern.



For classes extending AbstractMappingMetadataExtracter, the embed mapping can be defined in a properties file in the same location as the extract mapping properties but with an embed suffix, i.e. classpath:/x/y/z/MyExtracter.embed.properties (note that the preferred location for mapping files for extractors and embedders has changed after 4.2.c, see ALF-17891). If no embed properties are found, a reverse mapping of the extract mapping is used by default. Cool, right?
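
A minimal sketch of such a file, assuming the direction is simply the extract mapping with the sides swapped - Alfresco property on the left, metadata field on the right (the field names here are hypothetical):

# MyExtracter.embed.properties - hypothetical embed mapping
namespace.prefix.cm=http://www.alfresco.org/model/content/1.0
cm\:author=author
cm\:title=title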

What About Tika?



'But that's still sooooo... abstract. How are we going to leverage Tika? It doesn't support embedding, does it?'



Well, as a matter of fact, it does - as of version 1.3 (TIKA-775).



The same notion of writing metadata into a binary has been outlined with an interface and a basic implementation in Tika, so of course our TikaPoweredMetadataExtracter builds on that. It overrides the embedInternal method defined in its parent AbstractMappingMetadataExtracter, converts Alfresco properties to Tika metadata fields, and passes them on to a Tika Embedder's embed method, which then passes back the new binary with the metadata embedded.



[Image: Tika embedding]

How Can we Use Embedding?



Our shiny new Alfresco metadata embedder's embed method isn't very useful if we don't have an easy way to call it, so we've added a ContentMetadataEmbedder action executer, which shows up as a standard 'Embed properties as metadata in content' action that can be used in a rule on a folder or executed in a workflow. (After 4.2.c you can find this in alfresco/extension/metadata-embedding-context.xml.sample.)
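
The wiring follows the same pattern as the extracter action. A hedged sketch of what the bean definition might look like (the bean id and the exact property set are assumptions - the .sample file mentioned above is the authoritative version):

<bean id="embed-metadata" class="org.alfresco.repo.action.executer.ContentMetadataEmbedder" parent="action-executer">
    <property name="nodeService">
        <ref bean="NodeService" />
    </property>
    <property name="contentService">
        <ref bean="ContentService" />
    </property>
    <property name="metadataExtracterRegistry">
        <ref bean="metadataExtracterRegistry" />
    </property>
</bean>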



So what kinds of files and metadata does Tika have embed support for? Truth be told, not many at the moment, but the tika-exiftool project does!



tika-exiftool is a wrapper for calls to the ExifTool command-line tool, and contains a Tika Parser and Embedder for image files.



The Media Management module contains an example which brings all of this together with an extension of TikaPoweredMetadataExtracter that uses the Tika Embedder defined in the tika-exiftool project to enable IPTC embedding in image files.



We can add an embed rule to a folder that fires on content update such that when we edit our caption field through Share, the new value is embedded in the file and can be seen using standard image metadata tools, like Photoshop's file info.



[Image: Embed flow]



Sit down and stop clapping, everyone is staring at you. Aw, who cares, go ahead.

What's Next?



We'll be adding embed support for more file and metadata types to Tika and Alfresco in the future including, of course, documents, but in the meantime, what other formats are you anxious to start embedding?

Introduction



So you've installed Alfresco in Amazon AWS, and your contentstore is either on a local ephemeral disk or on EBS.

This guide is to help you migrate from these to S3 using the Alfresco S3 Connector.



The information in this guide was compiled during the contentstore migration from EBS to S3 for one of our large AWS Alfresco users.



There are a variety of reasons to migrate the contentstore to S3. The main one is to increase the resilience of the store - during most of the AWS outages it has been EBS that has been most affected, including data loss (search Google for 'ebs data loss').

With S3's SLA of 'designed for 99.999999999% durability and 99.99% availability of objects over a given year', and Amazon S3 Server Side Encryption (SSE), putting your content on S3 means you will have secure and available content items at all times.

Set up S3



First of all, create a new S3 bucket for you to use.

Make a note of the Bucket name. This will be used in all places tagged <s3_bucket>.



It is also a good idea to secure the bucket more than the default, using IAM - see the AWS documentation for this.
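
As a hedged illustration only (see the AWS documentation for the authoritative syntax), an IAM policy restricting a user to just this bucket might look something like:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::<s3_bucket>"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::<s3_bucket>/*"
    }
  ]
}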



Next, install a tool that will allow you to migrate your existing content to S3, such as the S3tools.



If you are using RHEL 6, the instructions are as follows (for other operating systems, follow the instructions on the s3tools website). As root:

cd /etc/yum.repos.d
wget http://s3tools.org/repo/RHEL_6/s3tools.repo
yum install s3cmd
s3cmd --configure


Follow the instructions and enter the credentials asked for so you can connect to your bucket.



Once set up, check connectivity using:

s3cmd ls


This should list your buckets.

Copy your content to S3



Navigate to your contentstore directory:

cd /<dir_path>/alf_data/contentstore


If you want to check to see what will be uploaded to S3, perform a dry run first:

s3cmd sync --dry-run ./ s3://<s3_bucket>/contentstore/


Once you are happy that all is well, start the upload:

s3cmd sync ./ s3://<s3_bucket>/contentstore/


Navigate to your contentstore.deleted directory (these steps are only needed if you want to keep your deleted files):

cd /<dir_path>/alf_data/contentstore.deleted


If you want to check to see what will be uploaded to S3, perform a dry run first:

s3cmd sync --dry-run ./ s3://<s3_bucket>/contentstore.deleted/


Once you are happy that all is well, start the upload:

s3cmd sync ./ s3://<s3_bucket>/contentstore.deleted/


If your contentstore is not massive and you have space on your ephemeral disks, you can copy your contentstore to 'cachedcontent' - this will mean that the S3 cached content is pre-populated. It is much better to have this on the local ephemeral disk than on EBS.

cp -r contentstore cachedcontent


Alfresco S3 Connector



Download the Alfresco S3 Connector.

Once downloaded, follow the rest of the steps in the documentation linked above to install the module into Alfresco.

alfresco-global.properties



There are some changes you will need to make to your 'alfresco-global.properties'. These are all documented in the Alfresco S3 Connector information. The changes are:

s3.accessKey=<put your account access key or IAM key here>
s3.secretKey=<put your account secret key or IAM secret here>
s3.bucketName=<s3_bucket>
s3.bucketLocation=<see http://docs.aws.amazon.com/general/latest/gr/rande.html#s3_region>
s3.flatRoot=false
s3.encryption=AES256
dir.contentstore=contentstore
dir.contentstore.deleted=contentstore.deleted





If you are using lucene, set the following if it is not already set:

index.recovery.mode=AUTO


Make sure that Alfresco is stopped before you progress any further.

DB Update



Once your content is all in S3, and your Alfresco properties are all configured to use S3 as the contentstore location, there is one final step to perform - updating the database!

One of the tables Alfresco uses has a column that links an item of content to its location. Since we have moved the content to S3, we need to update all these links in the DB. Luckily it's easy :)
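
To make the change concrete, a single content_url value goes from something like the first line below to the second (the date/uuid path is illustrative only - only the scheme changes):

store://2013/5/31/10/15/0d3b1a2c-example.bin
s3://2013/5/31/10/15/0d3b1a2c-example.bin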



First, get the details of your database configuration from 'alfresco-global.properties':

db.name=<db.name>
db.username=<db.username>
db.password=<db.password>
db.host=<db.host>


If the mysql tools are not already installed on your box, install them, e.g.

yum install mysql


Run mysqldump, connecting to your DB, and dump the table called 'alf_content_url'.

The command below does this (you will be prompted for your user's password):

mysqldump -u <db.username> -p -h <db.host> <db.name> alf_content_url > s3_migration.sql


Next, make a backup of this dump in case anything goes hideously wrong :)

cp s3_migration.sql s3_migration.sql.bak


Then, we need to change every store location for each file to point to S3.

This involves changing the values of the 'content_url' column from 'store://...' to 's3://...'.

Here's a command to do this (if you are on Linux):

sed -i 's/store:\/\//s3:\/\//g' s3_migration.sql


Once that completes successfully, you now need to re-import this table data.

Connect to your mysql DB (you will be prompted to enter the user's password):

mysql -u <db.username> -p -h <db.host>


Switch to use the database that Alfresco uses:

use <db.name>;


Import your modified sql file:

source s3_migration.sql;
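
Before you exit, a quick sanity check is worthwhile. If the import went to plan, this should return 0:

SELECT COUNT(*) FROM alf_content_url WHERE content_url LIKE 'store://%';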


Exit mysql.



So, to recap:

  • S3 bucket has been created.
  • S3 command-line tool such as s3cmd has been installed.
  • Content has been copied to S3.
  • The 'Alfresco S3 Connector' module has been installed into your Alfresco instance.
  • alfresco-global.properties has been updated.
  • Alfresco has been stopped.
  • A dump of the 'alf_content_url' table has been made, and a backup of that made.
  • The store location has been modified in the sql dump.
  • The modified dump file has been re-imported into your mysql db.



You are now ready to restart Alfresco...



There are a few methods to check that the S3 connector is all working:

1. Monitor the 'cachedcontent' directory - it is used as a cache for the S3 content so that Alfresco doesn't have to request frequently used content from S3 each time it is used.

2. Upload some new content and check the S3 bucket.

3. Enable logging for jets3t as described in the Pro tips below and see what the logs say.



If things don't work, you could try re-syncing your content to make sure nothing was missed.

Pro tips



You can enable JMX Instrumentation on the S3 connector by adding the following JAVA_OPTS to your Alfresco start scripts:

'-Djets3t.mx -Djets3t.bucket.mx=true -Djets3t.object.mx=true'



Logging - The S3 connector is based on jets3t, so follow the logging information for this tool:

http://jets3t.s3.amazonaws.com/toolkit/guide.html
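
For example, a log4j override raising the jets3t log level might look like the line below (a hedged sketch - the guide above covers the specific loggers in detail):

log4j.logger.org.jets3t=DEBUG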


