
Search and Lucene questions

Question asked by wweber on Nov 17, 2005
Latest reply on Nov 17, 2005 by andy
Greetings,

Looking at the "contentModel.xml" file, I can see the following for the cm:content type:
<type name="cm:content">
<title>Content</title>
<parent>cm:cmobject</parent>
<properties>
    <property name="cm:content">
       <type>d:content</type>
       <mandatory>false</mandatory>
       <!-- Index content in the background -->
       <index enabled="true">
         <atomic>true</atomic>
         <stored>false</stored>
         <tokenised>true</tokenised>
       </index>
    </property>
</properties>
</type>
I understand that the "index" element and its children refer to the Lucene index, not a database index. I have the following questions:

1. If index enabled is set to "false", will this property not be searchable and/or retrievable via the SearchService?

2. If index enabled is set to "false", can we assume that the "atomic", "stored", and "tokenised" settings are irrelevant?

3. Is the comment "Index content in the background" correct? I thought you would only index content in the background if "atomic" is false.

4. In a previous discussion, I learned that Lucene, via the SearchService, is used strictly for searching: whenever data is retrieved from a ResultSetRow (ResultSetRow.getValue), Alfresco always gets the data from the database, not from Lucene. Given that, why would you ever set <stored>true</stored>?

5. The Wiki at http://www.alfresco.org/mediawiki/index.php/Full-Text_Search_Configuration says the following: "The default is that properties are indexed, indexed atomically, that the property value is not stored in the index, and that the property is tokenised when it is indexed." Does that mean that if the entire "index" element and its children are missing under the "property" element these defaults will be used?

6. If I set "tokenised" to true and I have a property that holds the string "Hello World", then the two words will be stored as two tokens with the "StandardAnalyzer", is that right? If I set "tokenised" to false, then the entire phrase "Hello World" will be stored as a single term and not broken up into individual word tokens, right? With "tokenised" set to true, I could search for either "Hello" or "World" and get a hit on this property. With "tokenised" set to false, I would have to search for the entire phrase "Hello World" to get a hit on this property, right? Also, if I set "tokenised" to false, it seems I would be storing the entire text as is, which might take up as much space in the index as setting "stored" to true. Is that right? (My goal is to reduce index size.)

7. If a single property, either on the type itself or via an aspect applied to that type, has atomic set to true, will all properties be indexed atomically, even if they have atomic set to false? In other words, can I index in the background only if ALL properties (including those from aspects) have atomic set to false?

8. I have a stand-alone test client that enters data into the repository and then exits. I noticed that under the "…\alf_data\lucene-indexes\workspace\SpacesStore\delta" directory I am accumulating a number of directories with GUID-style names (like 8b41158a-5766-11da-af5f-cf53901d41b0). Most of these directories contain only a single file, the "segments" file, and the rest are completely empty. I noticed that if I sleep for 60 seconds before my client exits, no such directory is left behind after entering a batch of data into the repository. Are these directories there because their deletion is done in the background and I am exiting before they get a chance to be deleted? Will they ever get deleted? Would it hurt to delete them manually if they contain only the "segments" file?

9. For my testing I have stored 1 million nodes, with 12 properties each, in the repository. I noticed that the "alf_data\lucene-indexes\workspace\SpacesStore\index" directory now has 19 files ending in ".cfs", each about 100 MB in size. I assume that the more of these files I get, the slower my search response will be, because all of them have to be examined to produce my search results. Is there a way to control the size of these files? Could I get better search performance if these files were 300 MB each and there were fewer of them? I also noticed that the content of these files is largely composed of repeating QNames. Perhaps these are references to specific node IDs. It seems that one side effect of long QNames might be a significant increase in the size of both the Lucene index and the database files. If we anticipate storing a lot of data and performance is important, would you recommend using short QNames? That is, instead of something like "http://www.mycompanyname.com/myproject/model/content/1.0" for our namespace, use something like "content1.0" (if we are only using the repository within our company)? Of course, if we use any of your aspects or extend any of your types, we will still get those long QNames. One possibility might be to have our own versions of the aspects and types with shorter QNames. Alternatively, we could edit your contentModel.xml and change the namespace definition (would that make sense?).
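
To make question 5 concrete: if the defaults quoted from the Wiki do apply, I would expect a property declared with no "index" element at all to behave the same as the fragment below. This is only a sketch of my assumption. The property name "my:headline" and its "d:text" type are hypothetical, and the index values simply restate the defaults from the Wiki text.

```xml
<property name="my:headline">
   <type>d:text</type>
   <!-- Assumed explicit form of the defaults quoted from the Wiki:
        indexed, atomic, value not stored, tokenised -->
   <index enabled="true">
      <atomic>true</atomic>
      <stored>false</stored>
      <tokenised>true</tokenised>
   </index>
</property>
```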

Ok, thanks for your patience. Perhaps the answer to these questions will be helpful to others.

--------------------
Thanks again :)
Wayne Weber
Knight Ridder Digital
