Index Version 1

Document created by resplin Employee on Jun 6, 2015

This page is obsolete.

The official documentation is at: http://docs.alfresco.com



Search Indexing

Note: This design has been superseded by Index Version 2 as of Release 1.4.


Requirement


A mechanism is required to search against the property, full-text, content
and semi-structured data in the repository. The structural data takes two forms:
the parent-child relationships between nodes, and the location within
hierarchies used for categorisation.

The persistence of the data may be separate from the index used to search and locate data.

Examples include indexing external content and storing content separately from other information.


Why Lucene?


The intention is to use Lucene as the index and search engine.

It allows the production of an unstructured index with potentially repeating fields.
Each field in the index can optionally be:


  • indexed (available for search)
  • stored  (available in the documents returned by the search)
  • tokenised (broken into tokens for indexing, rather than stored and indexed as is)

Not all documents need to index the same fields.
This is a good match to the extensible content model.
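The three per-field options can be sketched without any Lucene dependency. The following is a minimal illustration only: FieldOptions and SketchDocument are hypothetical names, not Lucene or Alfresco types, and the whitespace/lower-case tokenisation merely stands in for a real analyser.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical per-field options: indexed, stored, tokenised. */
final class FieldOptions {
    final boolean indexed;
    final boolean stored;
    final boolean tokenised;

    FieldOptions(boolean indexed, boolean stored, boolean tokenised) {
        this.indexed = indexed;
        this.stored = stored;
        this.tokenised = tokenised;
    }
}

/** Hypothetical unstructured document: fields may repeat and may differ per document. */
final class SketchDocument {
    /** Values returned with search results (stored fields only). */
    final Map<String, List<String>> stored = new LinkedHashMap<>();
    /** Terms available for search (indexed fields only). */
    final Map<String, List<String>> terms = new LinkedHashMap<>();

    void add(String name, String value, FieldOptions options) {
        if (options.stored) {
            stored.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        }
        if (options.indexed) {
            List<String> fieldTerms = terms.computeIfAbsent(name, k -> new ArrayList<>());
            if (options.tokenised) {
                // Stand-in for an analyser: lower-case and split on whitespace.
                for (String token : value.toLowerCase(Locale.ROOT).split("\\s+")) {
                    if (!token.isEmpty()) {
                        fieldTerms.add(token);
                    }
                }
            } else {
                // Indexed as is: the whole value is a single term.
                fieldTerms.add(value);
            }
        }
    }
}
```

In this sketch a title added with all three options set is searchable by word and returned with results, while an identifier added as indexed only is searchable as a single term but not returned.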

It is not clear whether we should use Lucene to store document content as well as to index it.
If delayed indexing or the non-storage of an attribute requires properties to be obtained via the node service, then all properties will be returned.



Lucene seems an obvious choice as it resolves the following issues:


  1. Databases have varying support for full text search
    • Databases have varying support for indexing foreign content for full text search. It may have to be stored in the database.
    • Some databases have to store the content they index
  2. Databases have varying support for hierarchical queries
    • In some, the implementation is not performant
    • In others, there is no support
    • Bridge tables can be used, but there may be no triggers to support management of the bridge tables.
  3. It would be preferable to execute queries in one place and not have to merge result sets.
  4. Ideally read access permissions would be applied during data access but it could be a post filter.

Lucene has the following disadvantages:


  1. It does not support joins
  2. It duplicates information held in the repository
    • This should be controlled by the data dictionary

We should tokenise each field/attribute according to its type definition.
For example, paths should be treated in a special way.
We should map to the same analyser on the query side.
Integers etc. need to be stored and tokenised in a form that allows lexicographical ordering. The same applies to dates. Timestamps need to be indexed as dates and treated specially in queries.
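One way to obtain lexicographical ordering for numbers and dates is a fixed-width encoding. The sketch below is an assumption about how this could be done, not the encoding actually used: SortableKeys is a hypothetical helper, integers are sign-flipped and zero-padded to 16 hex digits, and timestamps are formatted as fixed-width UTC strings (years 0001 to 9999 only).

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

/** Hypothetical helper: encodes values so that String order matches value order. */
final class SortableKeys {
    // Fixed-width UTC pattern, most significant unit first.
    private static final DateTimeFormatter DATE =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmssSSS").withZone(ZoneOffset.UTC);

    private SortableKeys() {
    }

    /** Flips the sign bit so negatives sort first, then zero-pads to 16 hex digits. */
    static String encodeLong(long value) {
        return String.format("%016x", value ^ Long.MIN_VALUE);
    }

    /** Formats a timestamp so that earlier instants compare as smaller strings. */
    static String encodeDate(Instant timestamp) {
        return DATE.format(timestamp);
    }
}
```

The same encoding would have to be applied to query terms, in line with using the same analyser on the query side, so that range queries compare like with like.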

The data dictionary should control the indexing behaviour.


Issues


Search Issues


Prototypes


Implementation Plan


Search - Plan


Recovery


There are two scenarios:


  • JTA
  • nonJTA

When we are not in the two-phase commit world we have to do more detailed error recovery.
With JTA we will know if we need to recover and just need to know what to do.

For each store we need to keep the following when we prepare a transaction:


  • The things to delete
  • The delta to merge in
  • Whether recovery is managed (JTA or nonJTA)

If we find a nonJTA transaction that still has this information, we need to determine:


  • did everything fail
  • did Hibernate commit and the index update fail
  • did everything succeed

In the JTA world we are told what to recover.

To test the index state:


  • Compare deleted objects in the delta with the database and the index
  • The same for added objects
    • This will decide one way or the other.
  • If we have only updates just recreate the delta and update
    • This will roll back or update to the required state without checking
    • We could scan the index and regenerate the entries until one is different
    • We could rebuild everything as no change is valid
    • We may well have to build the reverse in any case
    • Just do for the first pass
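The index state test above can be sketched as a small classification routine. This is an illustration under stated assumptions: TxOutcome and IndexStateCheck are hypothetical names, object ids are plain strings, the database and index are represented by the sets of ids they currently contain, and each side applies a transaction atomically so that probing a single added or deleted object is enough.

```java
import java.util.Set;

/** Hypothetical outcomes for a nonJTA transaction found during recovery. */
enum TxOutcome {
    EVERYTHING_FAILED,
    DATABASE_COMMITTED_INDEX_FAILED,
    EVERYTHING_SUCCEEDED,
    UPDATES_ONLY
}

/** Hypothetical check comparing a delta with the database and the index. */
final class IndexStateCheck {
    private IndexStateCheck() {
    }

    static TxOutcome classify(Set<String> added, Set<String> deleted,
                              Set<String> inDatabase, Set<String> inIndex) {
        if (!added.isEmpty()) {
            // An added object is in the database only if the database committed.
            String id = added.iterator().next();
            if (!inDatabase.contains(id)) {
                return TxOutcome.EVERYTHING_FAILED;
            }
            return inIndex.contains(id)
                    ? TxOutcome.EVERYTHING_SUCCEEDED
                    : TxOutcome.DATABASE_COMMITTED_INDEX_FAILED;
        }
        if (!deleted.isEmpty()) {
            // A deleted object is still in the database only if the database rolled back.
            String id = deleted.iterator().next();
            if (inDatabase.contains(id)) {
                return TxOutcome.EVERYTHING_FAILED;
            }
            return inIndex.contains(id)
                    ? TxOutcome.DATABASE_COMMITTED_INDEX_FAILED
                    : TxOutcome.EVERYTHING_SUCCEEDED;
        }
        // Only updates: nothing to probe, so recreate the delta and apply it.
        return TxOutcome.UPDATES_ONLY;
    }
}
```

The sketch never reports an index that is ahead of the database, which relies on the index not being committed before the database in the nonJTA case.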

If an index is absent we have to rebuild.

If an index is partially corrupted by deleting an index segment then the index will effectively be broken and should be rebuilt from scratch.

In the non-JTA world we would not commit the index before the database. There is no need to back out a change from the index.


JTA


Support for JTA.

We should switch to the Spring pattern for keeping transactional resources.


XAIndexer


Produced by all internal factories.


  • XAResource getXAResource()

Registration


Conditional on being a JTA or Hibernate transaction manager

JTA


  • Enlist as resource
  • Register a Spring synchronisation that does only a NodeService.save()
    • beforeCommit()
      • NodeService.save()
    • beforeCompletion()
    • afterCompletion()

NodeService.save()


  • integrity first pass
  • rules
  • integrity second pass
  • index flush
  • optional Hibernate flush

nonJTA


    • beforeCommit()
      • NodeService.save()
      • indexer prepare();
    • beforeCompletion()
    • afterCompletion()
      • indexer post action - commit or rollback

This implies we have one synchronisation that optionally does the indexer stuff.
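A single synchronisation with optional indexer handling could look roughly like the sketch below. All types here are hypothetical stand-ins: SketchNodeService and SketchIndexer are not the real interfaces, and the callbacks only loosely mirror the names listed above (the real Spring callbacks take different arguments). A null indexer models the JTA case, where the enlisted XAResource drives prepare, commit and rollback instead.

```java
/** Hypothetical stand-in for the node service; only save() is modelled. */
interface SketchNodeService {
    void save();
}

/** Hypothetical stand-in for the indexer's two-step completion. */
interface SketchIndexer {
    void prepare();

    void commit();

    void rollback();
}

/** One synchronisation for both cases; the indexer part is optional. */
final class IndexSynchronisation {
    private final SketchNodeService nodeService;
    private final SketchIndexer indexer; // null under JTA

    IndexSynchronisation(SketchNodeService nodeService, SketchIndexer indexer) {
        this.nodeService = nodeService;
        this.indexer = indexer;
    }

    void beforeCommit() {
        // Integrity, rules and index flush happen inside save().
        nodeService.save();
        if (indexer != null) {
            indexer.prepare();
        }
    }

    void beforeCompletion() {
        // Nothing to do in this sketch.
    }

    void afterCompletion(boolean committed) {
        if (indexer == null) {
            return; // JTA: the transaction manager completes the XAResource.
        }
        if (committed) {
            indexer.commit();
        } else {
            indexer.rollback();
        }
    }
}
```

Because the index is only committed in afterCompletion(), after the database outcome is known, the index is never committed ahead of the database in the nonJTA case.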

We should be done before the Spring synchronisation.


Integration with Lucene


We have modified Lucene 1.4.3 to address a number of minor issues and enhancements.
These are described in Lucene Extensions and Issues.
