Help extracting HTML metadata in WCM

Question asked by ofwr on Jan 14, 2008
Latest reply on Jan 15, 2008 by pmonks
We're attempting to import an existing web site which contains metadata into Alfresco WCM.

I have imported a single .html file into a workspace, extracted the metadata using the default action 'Extract Common Metadata', and moved it into a WCM folder.  Because of the number of files involved, this method is not practical for the entire site: bulk import crashes, and I can't see how to easily automate the process.

I read the 'XML Metadata Extractor Configuration for WCM' example on the wiki and have managed to import multiple .xml files and extract their metadata.  This appears to be the best method; however, I can't find clear documentation or an example of doing the same with .html.  Judging by the example, a different method is required for .html, and the Javadocs indicate that the extractors for all other file types sit in a different class hierarchy.

I'm assuming that the .html implementation should:
* register the HTML extractor in the avmMetadataExtracterRegistry
* construct an HtmlDocumentMetadataExtracter that maps the properties
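
To make clear what "mapping the properties" means here, below is a minimal standalone Java sketch. It does not use the Alfresco API at all: the class name `HtmlMetaSketch`, the regex-based parsing, and the target property names (`cm:author`, `cm:description`) are assumptions chosen only for illustration. It shows the kind of result expected, with `<meta name="..." content="...">` tags read from the page and mapped onto repository property names.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; NOT part of Alfresco.
class HtmlMetaSketch {

    // Simplified pattern: matches <meta name="..." content="..."> only when
    // the name attribute comes first and both values are double-quoted.
    private static final Pattern META = Pattern.compile(
        "<meta\\s+name\\s*=\\s*\"([^\"]*)\"\\s+content\\s*=\\s*\"([^\"]*)\"[^>]*>",
        Pattern.CASE_INSENSITIVE);

    // Returns target-property -> value for every meta tag whose (lower-cased)
    // name appears in the mapping; unmapped tags are ignored.
    static Map<String, String> extract(String html, Map<String, String> mapping) {
        Map<String, String> props = new LinkedHashMap<>();
        Matcher m = META.matcher(html);
        while (m.find()) {
            String target = mapping.get(m.group(1).toLowerCase(Locale.ROOT));
            if (target != null) {
                props.put(target, m.group(2));
            }
        }
        return props;
    }

    public static void main(String[] args) {
        // Assumed mapping from HTML meta names to property names.
        Map<String, String> mapping = new LinkedHashMap<>();
        mapping.put("author", "cm:author");
        mapping.put("description", "cm:description");

        String html = "<html><head><title>t</title>"
            + "<meta name=\"author\" content=\"ofwr\">"
            + "<meta name=\"Description\" content=\"Sample page\">"
            + "<meta name=\"generator\" content=\"x\">"
            + "</head><body></body></html>";

        // prints {cm:author=ofwr, cm:description=Sample page}
        System.out.println(extract(html, mapping));
    }
}
```

The regex is deliberately naive (a real extractor would use an HTML parser), but the intended behaviour is the same: only mapped meta names end up as properties on the imported file.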

I have created an example and imported .html files, but I have had no success getting anything out of the HTML extractor.

Help please.  Are there any examples of importing .html into WCM?