WCM Search

Document created by resplin Employee on Jun 6, 2015

{{Obsolete}}

The official documentation is at: http://docs.alfresco.com



{{AVMWarning}}
AVM


Search


Searching against a WCM store is the same as searching against an ADM store, as described on the Search page, and the search API is the same. Some index fields are modified as described below; these are mostly intended for internal use. PATH is the major exception: it matches paths in the AVM store, where there is no namespace prefix. Paths look like '/a/b/c/d'.

Lucene-based search is only available for the latest snapshot of the head revision for staging. It is not available for user sandboxes, workflow sandboxes, etc.

XPath-based searching is available for all WCM stores, with the caveat that performance may be slow depending on the query and the store structure, as the implementation walks the object model. This follows the same implementation as the ADM stores. XPath searching is always available for all stores and requires no configuration. Searches are always live, not limited to the last snapshot.

A StoreRef is sufficient to identify the store for the search.

A search against an AVM store can be performed via Java, JavaScript, a FreeMarker template, or the node browser. It is not yet exposed via OpenSearch. As the API is the same, searches via web services etc. are the same apart from the store context.


Indexing


Indexing takes place for a limited number of AVM actions. These are:


  • taking a snapshot;
  • creating a store;
  • purging a store; and
  • moving a store.

All other actions have no effect on the index, so new additions to a WCM store will not be found until a snapshot is taken. If the snapshot is indexed asynchronously, the addition will not be found until the snapshot has been indexed in the background.

For synchronous indexing, search works immediately after any snapshot. Any searches in the same transaction will find the changes in the snapshot; after the snapshot is committed, all users will see the changes.

Indexing of a snapshot may take place asynchronously or synchronously. AVM store creates and purges are indexed synchronously. Move is a mixed case: the index for the old store is synchronously deleted, the new index is synchronously created, and the initial snapshot may be indexed asynchronously or synchronously, depending on the configuration.
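The snapshot-triggered model described above can be sketched as follows. This is an illustrative simulation only; the class and method names are hypothetical and are not part of the Alfresco API (the real work happens in AVMSnapShotTriggeredIndexingMethodInterceptor).

```javascript
// Hypothetical sketch: writes accumulate in the store but stay invisible
// to search until a snapshot is taken. In synchronous mode the snapshot
// updates the index before returning, so searches work immediately after.
class AvmStoreSketch {
  constructor() {
    this.files = new Map();   // live (unsnapshotted) content
    this.index = new Map();   // what search actually sees
  }
  write(path, content) {
    this.files.set(path, content); // not yet searchable
  }
  snapshot() {
    // Synchronous mode: index the whole snapshot before returning.
    for (const [path, content] of this.files) {
      this.index.set(path, content);
    }
  }
  search(term) {
    const hits = [];
    for (const [path, content] of this.index) {
      if (content.includes(term)) hits.push(path);
    }
    return hits;
  }
}

const store = new AvmStoreSketch();
store.write('/www/index.html', 'hello world');
console.log(store.search('hello')); // [] - no snapshot yet
store.snapshot();
console.log(store.search('hello')); // ['/www/index.html']
```

In asynchronous mode, the `snapshot()` step above would merely queue an index request, so the second search could still return nothing until the background indexing completes.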


Altered fields in the index compared with ADM


ID
This is a multivalued entry in WCM indexes. It contains the AVM node ref and the full path, including the store. It can be used in the same way as in the ADM index, and will also match the full AVM path to the object.

PARENT
The path to the parent (it needs to match the parent's ID).

ANCESTOR
(Container index entries only.) The IDs of all parent containers, as paths.

PATH
(Container index entries only.) The full AVM path to the container.
This is not the PATH exposed for search, and it is not available via Lucene parsed queries.

ISCATEGORY
Unused, as categories are not yet supported in the AVM index.

Fields that do not exist in the AVM index
FTSSTATUS
LINKASPECT
PRIMARYASSOCIATIONTYPEQNAME
PRIMARYPARENT
QNAME
TX

All other fields are as described in Fields in the index and how they are exposed for queries on the Search page.




Configuration


Configuration is set in a method-interceptor bean definition that wraps AVM calls. This is defined by default in public-service-context.xml and can be overridden. The default configuration is shown below. By default, staging areas are indexed synchronously; there is an extension example for asynchronous indexing. Remember: if you configure asynchronous indexing, your query results may be out of date.



    <bean id='avmSnapShotTriggeredIndexingMethodInterceptor' class='org.alfresco.repo.search.AVMSnapShotTriggeredIndexingMethodInterceptor'>
        <property name='avmService'>
            <ref bean='avmService' />
        </property>
        <property name='indexerAndSearcher'>
            <ref bean='avmLuceneIndexerAndSearcherFactory' />
        </property>
        <property name='enableIndexing'>
            <value>true</value>
        </property>
        <property name='defaultMode'>
            <value>SYNCHRONOUS</value>
        </property>
        <property name='indexingDefinitions'>
            <list>
                <value>SYNCHRONOUS:TYPE:STAGING</value>
                <value>UNINDEXED:TYPE:STAGING_PREVIEW</value>
                <value>UNINDEXED:TYPE:AUTHOR</value>
                <value>UNINDEXED:TYPE:AUTHOR_PREVIEW</value>
                <value>UNINDEXED:TYPE:WORKFLOW</value>
                <value>UNINDEXED:TYPE:WORKFLOW_PREVIEW</value>
                <value>UNINDEXED:TYPE:AUTHOR_WORKFLOW</value>
                <value>UNINDEXED:TYPE:AUTHOR_WORKFLOW_PREVIEW</value>
                <value>ASYNCHRONOUS:NAME:avmAsynchronousTest</value>
                <value>SYNCHRONOUS:NAME:.*</value>
            </list>
        </property>
    </bean>

The AVMSnapShotTriggeredIndexingMethodInterceptor class supports querying the index state if you want to know whether an asynchronous index is up to date. See the Javadoc.


enableIndexing
Disables or enables all AVM indexing.




defaultMode
If none of the indexing definitions match, this is the default indexing mode used.

indexingDefinitions
A list of indexing definitions. Entries are ':'-separated values of the form

(SYNCHRONOUS | ASYNCHRONOUS | UNINDEXED): (TYPE | NAME) : regular expression

Each entry defines a regular expression that is matched against either the name of the store or the WCM UI store type. Entries are tried in order, from first to last; the first match defines the indexing mode.

In the definition above, stores of type staging are indexed synchronously. All other types of store used by the WCM UI are unindexed; this must not be changed, as indexing them is not yet supported. The store named avmAsynchronousTest, used in testing, is indexed asynchronously. All other stores are indexed synchronously. The default mode will never be used, as the catch-all regular expression '.*' is at the end of the list.
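The first-match resolution described above can be sketched with a small helper. The function and variable names here are hypothetical, purely to illustrate the semantics of the definition list (a subset of the default entries is shown):

```javascript
// Sketch of first-match resolution of an indexing mode. Each definition
// is (mode, kind, pattern); TYPE entries match the WCM UI store type,
// NAME entries match the store name. The first matching entry wins.
const DEFINITIONS = [
  ['SYNCHRONOUS',  'TYPE', /^STAGING$/],
  ['UNINDEXED',    'TYPE', /^STAGING_PREVIEW$/],
  ['UNINDEXED',    'TYPE', /^AUTHOR$/],
  ['ASYNCHRONOUS', 'NAME', /^avmAsynchronousTest$/],
  ['SYNCHRONOUS',  'NAME', /^.*$/],   // catch-all at the end
];
const DEFAULT_MODE = 'SYNCHRONOUS';   // used only if nothing matches

function resolveMode(storeName, storeType) {
  for (const [mode, kind, pattern] of DEFINITIONS) {
    const subject = kind === 'TYPE' ? storeType : storeName;
    if (pattern.test(subject)) return mode;
  }
  return DEFAULT_MODE; // unreachable here because of the '.*' catch-all
}

console.log(resolveMode('mysite', 'STAGING'));        // SYNCHRONOUS
console.log(resolveMode('mysite--bob', 'AUTHOR'));    // UNINDEXED
console.log(resolveMode('avmAsynchronousTest', ''));  // ASYNCHRONOUS
```

Note that because the TYPE entries come first, a NAME entry such as avmAsynchronousTest only applies to stores whose type matches none of the TYPE patterns.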

Synchronous indexing indexes everything, including content. There is no way to index metadata synchronously and content asynchronously. Asynchronous indexing indexes nothing up front (not even metadata); it creates an index request that indexes the snapshot in the background at some later time.




XML metadata extraction


If metadata is extracted from XML and stored in an attribute, then you can search for it. There will also be a full-text conversion of the XML to text.

See XML Metadata Extractor Configuration for WCM

The main limitation concerns repeated elements mapping to one attribute as repeating values. Attributes do not support positional queries (via the normal API). So if you have something like



<a>
    <b>
        <c>1</c>
        <d>2</d>
    </b>
    <b>
        <c>3</c>
        <d>4</d>
    </b>
</a>

and this gets pulled into an aspect of type {test}A, with attribute {test}b.c as [1,3] and {test}b.d as [2,4], then a query of the form

+@{test}b.c:'1' +@{test}b.d:'4'

will find a match where one may not be expected.
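A tiny simulation of why this cross-match happens, using plain arrays in place of the multivalued attributes (illustrative only; attribute names follow the example above):

```javascript
// Repeated <b> elements are flattened into independent multivalued
// attributes, so the pairing between c and d values is lost.
const doc = {
  'b.c': ['1', '3'],   // values of all <b>/<c> elements
  'b.d': ['2', '4'],   // values of all <b>/<d> elements
};

// A conjunction of attribute clauses: each clause only asks
// "does this attribute contain this value somewhere?"
function matches(doc, clauses) {
  return clauses.every(([attr, value]) => doc[attr].includes(value));
}

// The equivalent of +@{test}b.c:'1' +@{test}b.d:'4' matches, even though
// 1 and 4 came from *different* <b> elements in the source XML.
console.log(matches(doc, [['b.c', '1'], ['b.d', '4']])); // true
```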

Lucene only supports basic PATH lookup. There are no built-in aggregation functions.

XPath (1.0) can provide this to some extent.

If you want data manipulation, or to process the result set, you have to do it in JavaScript or in a template.


Implementation notes (currently out of date)


Indexing


When a snapshot is made of a store, the changed nodes in the store make up a new overlay index.
A revert to a previous snapshot is treated in the same way; there is no need to store overlays and be able to roll back, although that could be more performant.

PATH is used as the node ID as well as the PATH, and to determine which files are overlaid in the index.
The store information is ignored. This assumes that all stores are rooted at the same point, which is true in practice in the WCM world but does not have to be the case: an overlay at the root of one store could point to a subtree of another store. For now we ignore this. There is a node ID (the DB long ID) which we should use here, suitably encoded in the index as we do for other longs.

Indexing is hooked into the snapshot of a store, and may be on demand for an author's store.

Store types are available to support this.

Indexing is done at the store level for each snapshot of the store. In the first instance, only the latest snapshot will be indexed.
As only the latest snapshot is required, the overlays can be merged into one big overlay that can be applied over other stores. The index will only contain information for the store itself: nothing about layers above or below.

At search time, the indexes for the stores are overlaid. The deletion list needs to be kept for the overlay, as this will involve many base indexes. If we know the store is not layered on any other store, we could throw away the list of overlaid paths, as it will never be used. This is the basis for the first implementation: we only worry about the base index of the staging area.
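The overlay-with-deletion-list idea can be sketched as follows. All names here are hypothetical, chosen only to illustrate how a merged view would hide deleted base entries and prefer overlay entries:

```javascript
// Sketch: an overlay index layered over a base index. The overlay records
// both changed paths and a deletion ("whiteout") list, so the merged view
// can hide base entries that a snapshot removed.
function mergedView(baseIndex, overlay) {
  const view = new Map(baseIndex);
  for (const path of overlay.deleted) view.delete(path); // apply whiteouts
  for (const [path, doc] of overlay.added) view.set(path, doc); // overlay wins
  return view;
}

const base = new Map([['/a/b', 'old b'], ['/a/c', 'c']]);
const overlay = {
  added: new Map([['/a/b', 'new b'], ['/a/d', 'd']]),
  deleted: new Set(['/a/c']),
};
const view = mergedView(base, overlay);
console.log([...view.keys()]); // ['/a/b', '/a/d']
console.log(view.get('/a/b')); // 'new b'
```

If the store sits at the bottom of the stack (as staging does in practice), the deletion list is never consulted by anything above it, which is why the first implementation can ignore it.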




XML Metadata extraction


XML content will not be generally searchable. Metadata will be extracted to suitable aspects depending on the content of
the XML data. The metadata in these aspects will be indexed according to the data dictionary (DD) definitions.

When is this transform done?
As we have no actions, we could just extract to pseudo-attributes for search only (indexed but not existing).

It may be worth adding navigation into XML document types in the XPath navigator, which would give a slow but full XPath search and a reasonable API for in-document search.

Metadata extraction will have to be done at index time, if required.


Changes


Overlay to support a generic ID (remove NodeRef from the index API).

Support overlay types in the index, which always keep the deletion list for overlaying indexes.

Do we index per store and merge? Or, add overlays to the index? Prefer the first.

Support merging overlays (useful in any case), keeping the deletion list. This can be used to merge index deltas.

Index AVM nodes (minor change to existing with Node and Content facades)

Interceptor to index snapshots. Find the nodes unique to the store that changed since the last snapshot, and index them as a chunk.

Build index overlays at search time.

AVM does not have 'Index' or 'Delta' layers; they are all 'Overlay'.

Target: one overlay per store for background merging of overlays.
Could maintain some history of overlays to support point-in-time searching. Overlays have an incrementing numeric ID.

Thread pooling of background threads, as the number of mini-indexes will skyrocket.

Only index stores of type something like staging and authoring.

XML Extraction.




V1


  • Use current index or simple lucene index
    • Fix use of generic ID for layers (done)
  • Index staging store only
    • Explicitly wire up incremental indexing at snapshot time - ask the index to update a bunch of nodes
    • Use a method interceptor for this
  • No index history
    • only index the latest information
    • snapshot - index the forward change set
    • revert - reindex the backward change set
    • No need to keep change sets at the index level
    • Async and sync options
    • include the snapshot revision in the index - to see when an async index has been done (it will be done asynchronously in a TX)
  • XML indexing
  • Sync/Async implementations
  • Long term AVM has a node id we can use as the id (rather than the relative path)
    • Can start with this
  • Paths are expensive to find
    • Better to just get the latest path to an id in a given store - index the path in a single store
    • Path in general is going to be a pain
    • An existing node can get a new path at any time - in the old world this would be a reindex, which is not going to scale for AVM
  • PATH support may not be there. It may be done some other way.
  • XPath support required ?
    • Simple navigator on top of the AVMNodeService should be working out of the box?
    • Check with Kev for requirements here
  • Deploy time index mode
    • Check with Kev whether we want sync, async, or a choice.
  • Catch up with KevR about store types and confirm that the staging area is at the bottom of the stack in practice.
  • Index to do list
    • Store name and snapshot id to index 




  • First cut: we will not worry about the overlay - just index staging in its own right.
  • There is no duplication of data and work unless we index the authoring areas.
    • Authoring will be done in a second pass, when we worry about layering stores in search and indexing.
