Search - Prototype 2

Document created by resplin Employee on Jun 6, 2015

This page is obsolete.

The official documentation is at: http://docs.alfresco.com



Search Prototype


Revised Index structure


  • UUID (The unique identifier for the node)
  • FTS (The full text search entry for the node)
  • PATH (The full path to the node. If there are multiple paths, this can be repeated)
  • QNAME (The fully qualified name of the node)
  • ANCESTORID (A repeated field containing the IDs of all the node's ancestors, including the node itself)
  • LEVEL (The depth of this node relative to the root - may be repeated)
  • WORKSPACEID (The id of the workspace)
  • Attributes, each indexed as Name=@ns:name with Value=value, and also under Name=@ns: and Name=name
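
As an illustration of the structure above, the following is a minimal sketch of the field entries one node might produce. It is plain Python rather than Lucene code, and the `Node` class, the `index_fields` function, the "/"-joined PATH string, and the convention that LEVEL counts the elements below the root are all assumptions made for the example, not part of the prototype.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Hypothetical in-memory node; the names are illustrative only.
    uuid: str
    workspace_id: str
    qname: str
    paths: list            # each path is a list of QName strings below the root
    ancestor_ids: list     # IDs of all ancestors, including the node itself
    attributes: dict = field(default_factory=dict)  # (ns, name) -> value
    text: str = ""


def index_fields(node):
    """Return the (field name, value) pairs to index for one node."""
    fields = [
        ("UUID", node.uuid),
        ("FTS", node.text),
        ("QNAME", node.qname),
        ("WORKSPACEID", node.workspace_id),
    ]
    for path in node.paths:
        # PATH and LEVEL are repeated once per path to the node.
        fields.append(("PATH", "/" + "/".join(path)))
        fields.append(("LEVEL", str(len(path))))  # assumed: root is depth 0
    for ancestor_id in node.ancestor_ids:
        # ANCESTORID is repeated once per ancestor.
        fields.append(("ANCESTORID", ancestor_id))
    for (ns, name), value in node.attributes.items():
        # Each attribute value is indexed under three field names.
        for field_name in (f"@{ns}:{name}", f"@{ns}:", name):
            fields.append((field_name, value))
    return fields
```

Indexing every attribute under three names trades index size for the ability to query by full name, by namespace only, or by local name only.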

To Add


  • Categories
  • Role based read access control

Proposed Special Property Types


These property types will require special indexing and tokenisation:

  • QName
  • Path
  • Category
  • Security


Lucene Query Extensions


To execute complex structure expressions, a new type of query is required.




  • StructuredFieldQuery
    • StructuredQueryElement
    • Next Query
  • Root
    • Needs to iterate over entries
    • How to mark the root entry
  • Fixed Position
    • Name + position
    • Next fixed or relative clause
  • FixedDepth and/or End
    • Depth
  • Relative Position
    • Name
    • Offset or any
    • Next relative clause
  • Simple tokeniser
    • A path is tokenised as
      • Depth
      • NameSpace + Name (repeating)
      • Optional end marker followed by other paths
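
The outline above can be sketched in a few lines. This is an in-memory illustration only, not a Lucene query implementation: the `END` marker value, the function names, and the tuple encoding of query elements are all hypothetical. It tokenises paths as depth, then repeating (namespace, name) pairs, then an end marker, and matches one path against a chain of fixed-position and relative-position elements with an optional fixed depth.

```python
END = "<END>"  # hypothetical end-of-path marker


def tokenise(paths):
    """Flatten paths into: depth, (namespace, name)..., END, depth, ..."""
    tokens = []
    for path in paths:
        tokens.append(len(path))
        tokens.extend(path)
        tokens.append(END)
    return tokens


def split_paths(tokens):
    """Recover the individual paths from a token stream."""
    paths, i = [], 0
    while i < len(tokens):
        depth = tokens[i]
        paths.append(tokens[i + 1 : i + 1 + depth])
        i += depth + 2  # skip the depth token, the names, and END
    return paths


def matches(path, elements, depth=None):
    """Check one path against a chain of structured query elements.

    Each element is ("fixed", name, position) or ("relative", name, offset),
    where offset counts steps from the previous match and None means "any".
    """
    if depth is not None and len(path) != depth:
        return False

    def step(prev, rest):
        if not rest:
            return True
        kind, name, where = rest[0]
        if kind == "fixed":
            candidates = [where]
        elif where is None:
            candidates = range(prev + 1, len(path))
        else:
            candidates = [prev + where]
        # Try each candidate position, then the next clause in the chain.
        return any(
            0 <= p < len(path) and path[p] == name and step(p, rest[1:])
            for p in candidates
        )

    return step(-1, list(elements))


paths = [[("sys", "a"), ("sys", "b"), ("cm", "c")], [("sys", "x"), ("cm", "c")]]
query = [("fixed", ("sys", "a"), 0), ("relative", ("cm", "c"), None)]
print([matches(p, query) for p in split_paths(tokenise(paths))])  # [True, False]
```

Storing the depth first lets a fixed-depth clause reject a path without scanning its names, and the end marker lets several paths share one field.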

Comments


Issues


  • Impact of renaming
  • Impact of restructuring
  • A bridge table does not make much sense
    • Still have a big update problem - better to split the path in a different way ...
  • Could use indirection for top-level hierarchies
  • Cannot search across two indexes that index the same docs without pulling out all docs and joining on the primary key.
  • Do not see a sensible way of partitioning below the store level
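
One reading of the renaming issue, assuming the full PATH is stored with every node as in the revised index structure: renaming a node changes the PATH of that node and of all its descendants, so all of those documents need re-indexing. The sketch below uses a hypothetical in-memory `index` dictionary and a hypothetical `docs_to_reindex` function to find the affected documents through ANCESTORID.

```python
def docs_to_reindex(index, renamed_id):
    """UUIDs of documents whose PATH changes when `renamed_id` is renamed.

    `index` maps UUID -> {"ANCESTORID": [...], "PATH": [...]}
    (a hypothetical layout used only for this example).
    """
    return sorted(
        uuid for uuid, doc in index.items() if renamed_id in doc["ANCESTORID"]
    )


index = {
    "root": {"ANCESTORID": ["root"], "PATH": ["/"]},
    "a": {"ANCESTORID": ["root", "a"], "PATH": ["/a"]},
    "b": {"ANCESTORID": ["root", "a", "b"], "PATH": ["/a/b"]},
    "c": {"ANCESTORID": ["root", "c"], "PATH": ["/c"]},
}
print(docs_to_reindex(index, "a"))  # ['a', 'b']
```

The closer the renamed node is to the root, the larger the set of documents to update.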

Performance


  • 5 Million Paths on my laptop
    • Returning result sets of 1 or 2 million on simple paths in 1-3 seconds
  • Indexing performance (99 attributes stored + one PATH as above * 5)
    • 3 ms to add a document to an in memory index
    • More efficient to then merge into the on disc index
      • best is around 1 ms per doc
      • Performance degrades as the size of the index increases
      • 200,000 docs = 1 million paths is OK (10 iterations of appending 20,000)
      • 2M docs (10 million paths) is more of a problem: performance keeps decreasing (100 iterations of appending 20,000)
        • slows to 16 ms/doc at the 20th iteration

The indexing slowdown could be due to the heavily repeated common terms in attributes and similar paths.
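
The document and path counts quoted above follow from the batch size and the number of paths per document. A short check of the arithmetic, assuming that "one PATH as above * 5" means five paths per document (the constant and function names are illustrative):

```python
BATCH_SIZE = 20_000   # documents appended per iteration
PATHS_PER_DOC = 5     # assumed reading of "one PATH as above * 5"


def totals(iterations):
    """Return (documents, paths) indexed after the given number of iterations."""
    docs = iterations * BATCH_SIZE
    return docs, docs * PATHS_PER_DOC


print(totals(10))    # (200000, 1000000)
print(totals(20))    # (400000, 2000000)
print(totals(100))   # (2000000, 10000000)
```

On this reading, the slowdown to 16 ms/doc at the 20th iteration occurs at about 400,000 documents, or 2 million paths.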

Different machines, all with the same Java and command-line options:


  • My laptop
    • Write to in-memory index up to 20,000 times (CPU limited): 2.06 ms/doc
    • Merge 10 times to make 200,000 (IO limited): 1.11 ms/doc
  • Same laptop spec + Mandrake 10.1
    • Write: 5.13 ms/doc
    • Merge: 0.67 ms/doc
  • Modo
    • Write: 2.03 ms/doc
    • Merge: 0.39 ms/doc

Getting a document out of the above index


  • 20,000 docs  - all docs - 0.07 ms/doc
  • 200,000 docs - all docs - 0.03 ms/doc

Todo:


  • How fast to delete?
  • How fast to optimise an index?


