
Determining the encoding of a text document

Question asked by hbf on Jan 6, 2008
Latest reply on Jan 18, 2008 by derek
Hi,

For an AMP I am developing, I need a way to determine the encoding of a document from an input stream. I want to store the document in Alfresco and need to know both its MIME type (which I know how to determine) and its encoding.

In the Alfresco API I've found that the MimetypeService provides a way:

mimetypeService.getContentCharsetFinder().getCharset(streamSupportingMark, type)

In the code I see that a certain CharactersetFinder implementation (GuessEncodingCharsetFinder) is being run on the stream. In my case it "fails": I have an HTML document containing

<meta http-equiv="content-type" content="text/html; charset=iso-8859-1" />

but GuessEncodingCharsetFinder does not make use of this, since it is a "last resort" encoding guesser that ignores meta-information present in the file (if I am not mistaken).

Is there a plan to add a CharactersetFinder that looks for a "charset=" in the meta-area and "guesses" from this?
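To make the idea concrete, here is a rough sketch of the kind of check I have in mind. It is plain Java and does not implement any Alfresco interface. The class name MetaCharsetSniffer, the 4096-byte probe size and the regular expression are my own assumptions. It only handles ASCII-compatible encodings, and it needs a stream that supports mark/reset.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the Alfresco API.
final class MetaCharsetSniffer {

    // Assumed probe size: how many leading bytes to inspect.
    private static final int PROBE_BYTES = 4096;

    // Matches charset=... inside a <meta ...> tag, with or without quotes.
    private static final Pattern CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)",
            Pattern.CASE_INSENSITIVE);

    /**
     * Returns the charset declared in a meta tag near the start of the
     * stream, or null if none is found or the name is not supported.
     * The stream is reset to its original position before returning.
     */
    static Charset sniff(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(PROBE_BYTES);
        byte[] buf = new byte[PROBE_BYTES];
        int total = 0;
        try {
            int n;
            while (total < buf.length
                    && (n = in.read(buf, total, buf.length - total)) > 0) {
                total += n;
            }
        } finally {
            in.reset();
        }
        // ISO-8859-1 maps every byte to a char, so decoding never fails;
        // the ASCII markup stays readable for ASCII-compatible encodings.
        String head = new String(buf, 0, total, StandardCharsets.ISO_8859_1);
        Matcher m = CHARSET.matcher(head);
        if (!m.find()) {
            return null;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            return null;
        }
    }
}
```

A finder along these lines would run before the guesser and return null when it finds nothing, so that the "last resort" guessing still happens for documents without a declaration.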

Or am I on the wrong track using MimetypeService's getContentCharsetFinder()? I am not sure…

If getContentCharsetFinder() *is* the right approach and I write my own CharactersetFinder, how can I configure Alfresco to use it? (I don't want to change core-services-context.xml.) Of course, I'd contribute my finder…

Many thanks,
Kaspar
