Getting metadata from the content of a document

Question asked by jlabuelo on Jan 29, 2008
Latest reply on Aug 8, 2008 by jlabuelo
Hi there

I am trying to assign a custom type to a document dropped into a Space using a business rule, and I would then also like to fill in the properties of that type automatically, taking the metadata from the "content" of the document.

Let me explain what I am trying to achieve:

a) Move a Word, Excel or PDF document to a Drop Space in my application using CIFS – DONE
b) As soon as the file arrives in the Drop Space, a business rule assigns it a new type, which I have defined – DONE

c) The third step would be to read the content of the document and find in it the values for the metadata of the type defined in step b), so I can assign them to the document.

For example, I have defined the type "CompanyInfomation" which contains two properties:
1) CompanyName
2) CompanyId.

I will load documents into a drop space, and each document will contain, somewhere in its text, "Name of the Company: NAME" and "Id of the Company: Id". I need to identify those labels somehow and extract the values NAME and ID, so I can assign them to the document properties "CompanyName" and "CompanyId".
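The extraction step described above could be sketched with two regular expressions. This is only a minimal sketch: the function name extractCompanyFields is hypothetical, the label wording is taken from the example above, and it assumes the text is already available as a plain string with each label on its own line.

```javascript
// Hypothetical helper (not part of any Alfresco API): pulls the two values
// out of plain text. Assumes each label sits on its own line.
function extractCompanyFields(text) {
    // [ \t]* skips blanks after the colon; (.+) captures the rest of the line
    var nameMatch = /Name of the Company:[ \t]*(.+)/.exec(text);
    // (\S+) captures the id up to the next whitespace character
    var idMatch = /Id of the Company:[ \t]*(\S+)/.exec(text);
    return {
        // strip trailing whitespace, including a "\r" left by Windows line endings
        companyName: nameMatch ? nameMatch[1].replace(/\s+$/, "") : null,
        companyId: idMatch ? idMatch[1] : null
    };
}
```

A missing label yields null for that field, so the calling script can decide whether to leave the property empty or skip the document.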

I am able to assign "fixed values" by hardcoding them in a JavaScript script, but now I would like to get the values from the documents and I don't know exactly how to do it.

I have been reading the wiki page about the MetaDataExtractor, but I did not find how to extract metadata from inside the document (maybe I did not search well, even though I tried hard!). I know how to extract the author, the creation date of the document and so on, but nothing from inside it.
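One thing to keep in mind is that Word, Excel and PDF files are binary, so their raw content is not readable text. A possible way around this, sketched below, is to ask the repository for a plain-text copy first. The helper name readPlainText is hypothetical, and the sketch assumes the Alfresco JavaScript API members node.mimetype, node.content, node.transformDocument(mimetype) and node.remove() behave as documented, and that a text transformer is installed for the format.

```javascript
// Sketch: return the text of a node, converting binary formats first.
// "node" is expected to be an Alfresco ScriptNode such as "document".
function readPlainText(node) {
    if (node.mimetype == "text/plain") {
        return node.content; // already plain text, nothing to convert
    }
    // Assumption: transformDocument creates a converted copy of the node
    // and returns null when the transformation fails
    var textCopy = node.transformDocument("text/plain");
    if (textCopy == null) {
        return null;
    }
    var text = textCopy.content;
    textCopy.remove(); // discard the temporary copy
    return text;
}
```

In a rule script this would be called as readPlainText(document), and the returned string can then be searched for the labels.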

I have tried the JavaScript below, but of course it did not work.

Any ideas about how this should be approached?

Thanks a lot in advance guys!!

(JS code I tried to use)

// First we read the document to find the values
var Sociedad = "";
var CIF = "";

var FileContent = document.content;
var FileLines = FileContent.split("\n");
var Lines = 0;

var words;
var foundSociedad = false;
var foundtradoCIF = false;

// Scan line by line until both values are found or the lines run out
while ((!foundSociedad || !foundtradoCIF) && (Lines < FileLines.length)) {
   var Word_Count = 0;
   words = FileLines[Lines].split(" ");
   // Stop one word early: the value is the word that follows the label
   while ((!foundSociedad || !foundtradoCIF) && (Word_Count < words.length - 1)) {
      if (words[Word_Count] == "Sociedad:") {
        foundSociedad = true;
        Sociedad = words[Word_Count + 1];
      }
      if (words[Word_Count] == "CIF:") {
        foundtradoCIF = true;
        CIF = words[Word_Count + 1];
      }
      Word_Count = Word_Count + 1;
   }
   Lines = Lines + 1;
}


// Now that we have the values we apply them to the properties of the custom type.
document.name = "Doc_" + Sociedad + "_" + CIF;
document.properties["custom:CompanyName"] = Sociedad;
document.properties["custom:CompanyCIF"] = CIF;
document.save();