Search - Prototype 1

Document created by resplin Employee on Jun 6, 2015

This page is obsolete.

The official documentation is at: http://docs.alfresco.com



Search Prototype


Index structure


  • UUID (The unique identifier for the node)
  • FTS (The full text search entry for the node)
  • PATH (The full path to the node. If there are multiple paths this can be repeated)
  • QNAME (The fully qualified name of the node)
  • NAME (The name of the node without its namespace)
  • ANCESTORID (A repeated field containing the IDs of all the node's ancestors, including itself)
  • A_PATH_NAME (The tokenised path with only names; each token includes [depth] at the start)
  • A_PATH_NS (The tokenised path with only namespaces)
  • A_PATH_QNAME (The tokenised path with the fully qualified node names)
  • REL_PATH_NAME (The tokenised path with only names)
  • REL_PATH_NS (The tokenised path with only namespaces)
  • REL_PATH_QNAME (The tokenised path with the fully qualified node names)
  • LEVEL (The depth of this node relative to the root - may be repeated)
  • WORKSPACEID (The id of the workspace)
  • Attributes, indexed as (Name=@ns:name Value=value) and also under Name=@ns: and Name=name
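
The field list above can be sketched as a function that derives the path-based fields for one node. This is a minimal illustration, not the prototype's code: the function name, the (namespace, name) input shape, and the choice to apply the [depth] prefix to A_PATH_NS and A_PATH_QNAME as well as A_PATH_NAME are assumptions. FTS and attribute fields are left out.

```python
def build_index_fields(uuid, workspace, path, ancestor_ids):
    """Return a dict of field name -> value for one node (hypothetical helper).

    path is a list of (namespace, name) pairs from the root to the node;
    ancestor_ids lists the IDs of the node's ancestors, root first.
    """
    names = [name for _, name in path]
    namespaces = [ns for ns, _ in path]
    qnames = [f"{ns}:{name}" for ns, name in path]

    def absolute(tokens):
        # Prefix each token with its 1-based depth, e.g. "[1]a [2]b".
        # Assumption: the same prefix is used for all three A_PATH_* fields.
        return " ".join(f"[{i}]{t}" for i, t in enumerate(tokens, start=1))

    return {
        "UUID": uuid,
        "WORKSPACEID": workspace,
        "PATH": "/" + "/".join(qnames),
        "QNAME": qnames[-1],
        "NAME": names[-1],
        # The node itself is included, as described in the field list.
        "ANCESTORID": list(ancestor_ids) + [uuid],
        "A_PATH_NAME": absolute(names),
        "A_PATH_NS": absolute(namespaces),
        "A_PATH_QNAME": absolute(qnames),
        "REL_PATH_NAME": " ".join(names),
        "REL_PATH_NS": " ".join(namespaces),
        "REL_PATH_QNAME": " ".join(qnames),
        "LEVEL": len(path),
    }
```

For a node c under /a/b, A_PATH_NAME comes out as "[1]a [2]b [3]c", which is the form used in the example queries on this page.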

To Add


  • Categories
  • Role based read access control

Yes, these are required, as you cannot do a wildcard search inside phrase queries or at the start of a term using the standard query parser
(for example, to find all namespaces)


Comments


  1. Full-path queries are slow, and the * matching is greedy, which means they would also be difficult to use.
  2. 420,000 docs with 50 attributes: ~320 MB
  3. Attributes are indexed in about 3 ms/doc
  4. FTS is indexed in about 45 ms/doc
  5. I suggest we split indexing for full text search and indexing for attributes.
  6. It would make sense to have an index per workspace to keep concurrent access to a minimum. This would still have to be managed, as would the behaviour at transaction boundaries and the atomicity of indexing.
  7. We need to control Reader.delete() and Writer.*() to avoid concurrent and inconsistent operation. We should not rely on the Lucene lock mechanism.
  8. As Lucene stores positional information, we should only need one index, laid out as
    Namespace Name Namespace Name
    We have namespace information at odd positions and names at even positions, so depth is implied.
    The maximum term position gives us our depth. We could also include this as a separate entry.
    So we could do absolute path elements as ANDs, followed by a sequence of relative constraints: 'thing at +n' after the last token, 'anywhere greater than the last token', and then 'no next token'. This should give us powerful absolute and relative path queries.
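
The interleaved layout in comment 8 can be sketched in plain Python, without Lucene, to show how position alone carries namespace, name, and depth. All function names here are hypothetical, and the relative match is only one reading of 'thing at +n': a run of names at consecutive depths below a given depth.

```python
def interleave(path):
    """Return [(position, token)]: namespaces at odd positions, names at even (1-based)."""
    tokens = []
    for depth, (ns, name) in enumerate(path, start=1):
        tokens.append((2 * depth - 1, ns))
        tokens.append((2 * depth, name))
    return tokens


def depth_of(tokens):
    # The largest position always holds a name, at position 2 * depth.
    return max(pos for pos, _ in tokens) // 2


def matches_absolute(tokens, prefix_names):
    """True if the names at depths 1..n equal prefix_names (an AND of position tests)."""
    by_position = dict(tokens)
    return all(by_position.get(2 * d) == name
               for d, name in enumerate(prefix_names, start=1))


def matches_relative(tokens, names, after_depth=0):
    """True if names occur at consecutive depths somewhere below after_depth."""
    by_position = dict(tokens)
    last_start = depth_of(tokens) - len(names) + 1
    for start in range(after_depth + 1, last_start + 1):
        if all(by_position.get(2 * (start + i)) == n for i, n in enumerate(names)):
            return True
    return False
```

With this layout a query such as /a/b//c/d would combine matches_absolute(tokens, ["a", "b"]) with matches_relative(tokens, ["c", "d"], after_depth=2), and a 'no next token' test would compare the end of the match with depth_of(tokens).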

Example XPath queries and lucene translations against the prototype index


In the default workspace

/documents


+A_PATH_NAME:'[1]documents' +WORKSPACEID:'default'



//documents 


+REL_PATH_NAME:'documents' +WORKSPACEID:'default'



//document[jcrfn:like(@title, '%.java')]


Must be done via the API, as you cannot have a wildcard at the start of a term when using the QueryParser.
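
To illustrate why a leading wildcard is awkward, here is a small sketch that evaluates a LIKE-style pattern by scanning candidate terms: with no fixed prefix, there is nothing to narrow the scan. It is plain Python rather than the Lucene API, the helper names are hypothetical, and it assumes % matches any run of characters and _ matches exactly one.

```python
import re


def like_to_regex(pattern):
    """Translate a LIKE pattern (% = any run, _ = one character) into a compiled regex."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def terms_like(terms, pattern):
    """Return the terms that match the whole pattern, checking every term in turn."""
    rx = like_to_regex(pattern)
    return [t for t in terms if rx.fullmatch(t)]
```

For example, terms_like(titles, '%.java') keeps only the titles ending in ".java", but it has to look at every title to do so.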



//author[@name='andy']//documents


Done in two stages (actually, one stage for each attribute predicate with sub-path expressions):


  1. +NAME:'author' +WORKSPACEID:'default' +@name:'andy' (returns UUIDs)
  2. +NAME:'documents' +WORKSPACEID:'default' +ANCESTORID:('ID1' OR 'ID2')
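
The two stages can be simulated over in-memory documents to show how the UUIDs from the first query feed the ANCESTORID clause of the second. This is a sketch under assumptions: each document is a plain dict keyed by the field names above, ANCESTORID holds a list of IDs, and the function name is hypothetical.

```python
def two_stage(docs, workspace, parent_name, attr, value, child_name):
    """Evaluate //parent[@attr='value']//child in two passes over docs."""
    # Stage 1: nodes named parent_name whose attribute matches; collect their UUIDs.
    parent_ids = {
        d["UUID"] for d in docs
        if d["WORKSPACEID"] == workspace
        and d["NAME"] == parent_name
        and d.get(attr) == value
    }
    # Stage 2: nodes named child_name with any stage-1 UUID among their ancestors.
    return [
        d["UUID"] for d in docs
        if d["WORKSPACEID"] == workspace
        and d["NAME"] == child_name
        and parent_ids.intersection(d["ANCESTORID"])
    ]
```

A query with several attribute predicates would repeat stage 1 once per predicate and pass each set of UUIDs on to the next stage.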



/a/b/c


+A_PATH_NAME:'[1]a [2]b [3]c' +WORKSPACEID:'default'



/a/b//c


+A_PATH_NAME:'[1]a [2]b' +NAME:'c' +WORKSPACEID:'default'



/a/b//c/*


+A_PATH_NAME:'[1]a [2]b' +REL_PATH_NAME:'c' +WORKSPACEID:'default'



/a/b//c/d/*


+A_PATH_NAME:'[1]a [2]b' +REL_PATH_NAME:'c d' +WORKSPACEID:'default'
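
The translations for /a/b/c, /a/b//c, /a/b//c/* and /a/b//c/d/* follow a pattern that can be sketched as a small translator. The rules below are inferred from those four examples only, and the function names are hypothetical; anything outside that pattern raises an error rather than guessing.

```python
import re


def parse_steps(xpath):
    """Split '/a/b//c/*' into [('/', 'a'), ('/', 'b'), ('//', 'c'), ('/', '*')]."""
    return re.findall(r"(//|/)([^/]+)", xpath)


def to_lucene(xpath, workspace="default"):
    steps = parse_steps(xpath)
    # The leading '/' steps form the absolute prefix; the rest starts at the first '//'.
    split = next((i for i, (axis, _) in enumerate(steps) if axis == "//"), len(steps))
    absolute, relative = steps[:split], steps[split:]
    if not absolute or any(name == "*" for _, name in absolute):
        raise ValueError("needs an absolute prefix without wildcards")
    if any(axis == "//" for axis, _ in relative[1:]):
        raise ValueError("more than one // step is not covered")

    clauses = ["+A_PATH_NAME:'%s'" % " ".join(
        "[%d]%s" % (i, name) for i, (_, name) in enumerate(absolute, start=1))]
    names = [name for _, name in relative]
    if names:
        if len(names) > 1 and names[-1] == "*" and "*" not in names[:-1]:
            # Trailing /*: the named relative steps go into REL_PATH_NAME.
            clauses.append("+REL_PATH_NAME:'%s'" % " ".join(names[:-1]))
        elif len(names) == 1 and names[0] != "*":
            # A single relative step selects nodes by NAME.
            clauses.append("+NAME:'%s'" % names[0])
        else:
            raise ValueError("relative part not covered by the examples")
    clauses.append("+WORKSPACEID:'%s'" % workspace)
    return " ".join(clauses)
```

Calling to_lucene on a path with a wildcard in the absolute part, such as /a/b/*//c/d/*, raises ValueError, since that case needs more than a single query.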



/a/b/*//c/d/*


+A_PATH_NAME:'[1]a [2]b' +REL_PATH_NAME:'c d' +WORKSPACEID:'default'
+ post filter on the results


The post filter needs to remove /a/b/c/d as it is too short. We could also add LEVEL > 5, but there are still use cases that would require a filter and cannot be done using LEVEL (an absolute path plus a relative path, all with wildcard elements).
A path filter is something we would probably do anyway.
The next proposed index structure and new query element solve this.
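
A post filter of the kind described for /a/b/*//c/d/* can be sketched as a matcher that checks each hit's full path against the XPath steps. This is an illustration under assumptions, not the prototype's filter: the function names are hypothetical, hits are assumed to carry their path as a list of names from the root, and only name steps, * and // are handled.

```python
import re


def parse_steps(xpath):
    """Split '/a/*//c' into [('/', 'a'), ('/', '*'), ('//', 'c')]."""
    return re.findall(r"(//|/)([^/]+)", xpath)


def path_matches(xpath, names):
    """True if the node whose path is names (root first) is selected by xpath."""
    steps = parse_steps(xpath)

    def match(i, j):
        if i == len(steps):
            # Every step is used up: the path must be fully consumed too.
            return j == len(names)
        axis, want = steps[i]
        # A '/' step must match at depth j; a '//' step may match at j or deeper.
        if axis == "//":
            candidates = range(j, len(names))
        else:
            candidates = range(j, min(j + 1, len(names)))
        return any((want == "*" or names[k] == want) and match(i + 1, k + 1)
                   for k in candidates)

    return match(0, 0)


def post_filter(xpath, hits):
    """hits maps UUID -> path names; keep the UUIDs whose path matches xpath."""
    return [uuid for uuid, names in hits.items() if path_matches(xpath, names)]
```

With this filter, a hit at /a/b/c/d/e is dropped for /a/b/*//c/d/* because nothing sits between b and c, while a hit at /a/b/x/c/d/e is kept.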

Next


Search - Prototype 2
