I have added some Chinese Word documents and created a Chinese HTML document. I cannot find these documents when I search with Chinese terms (searching with English terms works).
I have looked into the index using Luke, and the Chinese words do not seem to be indexed. I suspected the StandardAnalyzer was the problem, so I tried different analyzers (a custom one and two from the Lucene sandbox) and recreated the index on each attempt, but still in vain. I have checked the analyzers directly with Lucene and they worked fine (in my test script).
Can anyone tell me if there are other ways to check what happened? Is it a problem in the web layer?
Thanks in advance. :o
The tokeniser is specified at the type level and is localisable.
If there is no localisation then the values from dictionaryModel.xml are used.
In the default configuration there is a default localisation bundle that specifies the tokenisers that are used. This is in the file
dataTypeAnalyzers.properties
You can set these for a particular locale by adding something like
dataTypeAnalyzers_zh_CN.properties
in the language pack. I don't think any of the language bundles include this as yet. If no locale specific file is found it falls back to the default.
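A minimal sketch of that fallback, for illustration only. The class and method names are hypothetical, and the intermediate language-only step (`_zh`) is an assumption borrowed from the usual Java resource bundle convention, not something confirmed for the repository's own lookup code. It simply lists the file names that would be tried, most specific first:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class AnalyzerBundleLookup {
    // Hypothetical helper: candidate properties file names for a locale,
    // ordered from most specific to the default (no locale suffix).
    static List<String> candidates(String baseName, Locale locale) {
        List<String> names = new ArrayList<>();
        String lang = locale.getLanguage();
        String country = locale.getCountry();
        if (!lang.isEmpty() && !country.isEmpty()) {
            names.add(baseName + "_" + lang + "_" + country + ".properties");
        }
        if (!lang.isEmpty()) {
            // Assumed intermediate step: language without country.
            names.add(baseName + "_" + lang + ".properties");
        }
        // The default file is always the last resort.
        names.add(baseName + ".properties");
        return names;
    }

    public static void main(String[] args) {
        for (String name : candidates("dataTypeAnalyzers", Locale.SIMPLIFIED_CHINESE)) {
            System.out.println(name);
        }
    }
}
```

If none of the locale-specific names exists, only the last entry applies, which is why a Chinese document can end up tokenised with the default analyzers.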
The next question is: "How is the locale found?"
The default locale is picked up from the server, or it is set when interacting with the client. So adding a document interactively may produce different tokenisation from indexing triggered by a rule: one path could take the locale from the client while the other uses the default Java locale of the repository.
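One quick way to check this is to print the default locale of the JVM and compare it with the locale you expect the client to send. The sketch below uses only standard `java.util.Locale` calls; the class name, the helper, and the choice of `zh_CN` as the client locale are assumptions for illustration, and the code has to run in (or with the same settings as) the server JVM to be meaningful:

```java
import java.util.Locale;

class LocaleCheck {
    // Hypothetical helper: true when both locales agree on language and
    // country, i.e. they would select the same locale-specific bundle.
    static boolean sameBundleLocale(Locale a, Locale b) {
        return a.getLanguage().equals(b.getLanguage())
                && a.getCountry().equals(b.getCountry());
    }

    public static void main(String[] args) {
        Locale serverDefault = Locale.getDefault();        // JVM default locale
        Locale clientLocale = Locale.SIMPLIFIED_CHINESE;   // assumed client locale
        System.out.println("JVM default locale: " + serverDefault);
        System.out.println("Assumed client locale: " + clientLocale);
        System.out.println("Same bundle locale: "
                + sameBundleLocale(serverDefault, clientLocale));
    }
}
```

If the two differ, content indexed through one path may be tokenised differently from content indexed through the other.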
I hope this helps. Let me know how you get on.
Regards
Andy