Open Source Text Mining

Question asked by mbibler on Dec 16, 2006
Latest reply on May 2, 2007 by swamikevala
I am personally delighted to see the important work being accomplished here, and how incredibly swiftly it has been delivered.  I am quite envious of those privileged to be involved with it.

As you look forward, please consider embedding an open source text mining framework, such as GATE, and an easy-to-use web-based rule entry and output analysis UI into the Alfresco suite.

In my humble opinion:
a) I sense the general proximity of these two organizations (3 hours distance?) could facilitate great work.  Both organizations are made up of brilliant, highly motivated inventors, with users who want to see them succeed.
b) Given a user-selected corpus (full set or subset of documents), exposing text relationships within sets of documents would be a major attraction for industries like law enforcement, pharma research, insurance, and any corporation with a legal staff needing in-house e-Discovery capability.  The format transformation services of Alfresco (I suppose these leverage the work of OpenOffice?) seem well suited to producing inputs for a GATE engine.
c) Offering the Alfresco suite as open source is already a monumental accomplishment.  I can imagine the demand if this were the only ECM tool with this capability embedded, with a set of customers demonstrating its value and sharing it openly.  The Big 3 (maybe with the exception of UIMA) don't appear to be giving this full consideration, and understandably so: they are leaving it to demand from their own customers, who probably don't know the functional value well enough to ask for it and have other, longstanding functional needs to solve first.
d) The metadata extraction threads I've read here discuss information extraction and full-text indexing, both valuable functions of text mining.  The information extraction discussed in those threads seems to assume parsing of known structured areas within documents, e.g. email headers, known Office document metadata sections, etc.  I only request a more detailed evaluation of how to incorporate unstructured extraction using a rules framework, and hopefully without waiting until Alfresco v4.4.   :)
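
To illustrate what rule-based extraction from unstructured text might look like, here is a minimal Python sketch. It is not GATE, its rule language, or any Alfresco API: the `Annotation` type, the `RULES` table, and the `extract` function are all hypothetical names made up for this example, and the regular expressions are deliberately simplistic stand-ins for the much richer rules a real framework would offer.

```python
import re
from typing import List, NamedTuple


class Annotation(NamedTuple):
    """One extracted span: what kind of thing it is and where it was found."""
    label: str
    text: str
    start: int
    end: int


# Hypothetical rule table: annotation label -> pattern to look for in free text.
# A web-based rule entry UI would, in effect, let users edit a table like this.
RULES = {
    "Date": re.compile(
        r"\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* \d{1,2}, \d{4}\b"
    ),
    "Email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "Money": re.compile(r"\$\d+(?:,\d{3})*(?:\.\d{2})?"),
}


def extract(text: str) -> List[Annotation]:
    """Apply every rule to the text and return annotations in document order."""
    annotations = [
        Annotation(label, match.group(), match.start(), match.end())
        for label, pattern in RULES.items()
        for match in pattern.finditer(text)
    ]
    return sorted(annotations, key=lambda a: a.start)


if __name__ == "__main__":
    body = ("On Dec 16, 2006 the claimant emailed jdoe@example.com "
            "seeking $12,500.00 in damages.")
    for annotation in extract(body):
        print(annotation.label, annotation.text, annotation.start, annotation.end)
```

The point of the sketch is that the rules run over the body text itself rather than over a known header or metadata section, so the same rule set could be applied to any document once it has been transformed to plain text.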