Review Of Full Text Search Query Syntax

Document created by resplin Employee on Jun 6, 2015
Version 1

This page is obsolete.

The official documentation is at: http://docs.alfresco.com



3.0 Search
Reviewed for 3.3

Design discussion for the FTS in 3.0. Refer to Full Text Search Query Syntax for the implemented FTS.


Background


Currently we expose the default Lucene query parser syntax for full text search support.
This excludes some advanced Lucene features, such as span queries, which we would like to expose. It also ties us to the Lucene query syntax, which may not embed well if we were to expose a SQL query language. Upgrades are also additional work, as we carry our own customisations to the query parser.

The Data Dictionary (DD) defines indexing behaviour by binding a Lucene tokeniser to index types. So there is only basic control of indexing per property (indexing on/off, tokenisation on/off).


Possible DD extensions for indexing a property


This covers features found across many implementations, which of those features we may support, and how we might do so. In particular, what we would have to store in the index to support them.


Indexed
as it is now

Index priority
FTS importance is not really of much use, as Lucene cannot update in place and requires a delete and re-add; indexing twice would be enough. We could, however, prioritise some documents for FTS indexing over others.

Pluggable indexers in addition to the core
Add an API to allow configurable index extensions. Add your own fields to the Lucene index.
Will require support to do a cascade update for the most common use case (tagging that a file is in some path)
Performance improvements and caching for path searches may work just as well
The other use case is XML metadata extraction without populating alfresco properties

Orderable
Support for sorting, which may overlap with use as an identifier and with FTS
Included with tokenisation

Tokenised
We should be able to support both FTS and SQL-like pattern matching. The tokenisation requirements are different, so an attribute may need indexing twice. The identifier-like indexing may also be more appropriate for ordering.
This will be a comma separated, case-insensitive list of what is required of tokenisation.
ID, FTS, SORT
BOTH -> FTS, SORT
TRUE -> FTS (backward compatibility and default)
FALSE -> ID, SORT
ID and SORT are distinguished to separate IDs that do not support sort
Currently we support BOTH, TRUE and FALSE.
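The proposed attribute values above can be sketched as a small parser. This is illustrative only (the function and constant names are assumptions, not Alfresco code): it maps the backward-compatible shorthands and the comma-separated, case-insensitive list to the set of tokenisation treatments.

```python
# Sketch (not Alfresco code): parse the proposed 'tokenised' DD attribute
# into the set of index treatments it requests.

VALID = {"ID", "FTS", "SORT"}

def parse_tokenised(value):
    """Map a DD 'tokenised' setting to the set of tokenisation treatments."""
    v = value.strip().upper()
    # Backward-compatible shorthands described above.
    if v == "TRUE":              # default: full-text tokenisation only
        return {"FTS"}
    if v == "FALSE":             # identifier-like, also usable for sorting
        return {"ID", "SORT"}
    if v == "BOTH":
        return {"FTS", "SORT"}
    # Otherwise a comma-separated, case-insensitive list such as "id, sort".
    parts = {p.strip().upper() for p in v.split(",") if p.strip()}
    unknown = parts - VALID
    if unknown:
        raise ValueError(f"unknown tokenisation option(s): {sorted(unknown)}")
    return parts
```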

Boost
We do not set document or field boosts at index time. Changing a field boost would require a reindex so all documents with the field have the same boost setting.

case sensitivity
Depends on the analyzer

diacritics
Depends on the analyzer

stemming
Depends on the analyzer

thesaurus/synonyms
Depends on the analyzer

stop words
Depends on the analyzer

language/localisation
localisation is driven by:
the locale of each value for a multi-lingual text property
the locale set on d:content types
If unset, in order:
the locale set on the node
the locale of the user
the server locale
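The fallback order above can be sketched as a first-non-null lookup. The function name, parameter names, and the `en_GB` server default are illustrative assumptions, not Alfresco APIs:

```python
# Sketch of the documented locale fallback order; names are illustrative.

def resolve_locale(value_locale=None, content_type_locale=None,
                   node_locale=None, user_locale=None, server_locale="en_GB"):
    """Return the first locale that is set, in order of precedence."""
    for locale in (value_locale,         # per-value locale (multi-lingual text)
                   content_type_locale,  # locale set on d:content values
                   node_locale,          # locale set on the node
                   user_locale):         # locale of the user
        if locale is not None:
            return locale
    return server_locale                 # final fallback: the server locale
```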

wildcards
Defined by the tokeniser and tokenisation properties

position/ordering
This information is held in Lucene as an offset from the previous token

window/range/distance
This information is available in the index

scope
sentence/paragraph/page/chapter
This information is not available in the index (it does not store the token type)
We would have to add this information somehow

anchoring
start/end/...
start is easy - end is not unless we add special support

cardinality (number of occurrences)
This information is held in the index (and is used for scoring)

exclusion
Documents are excluded via permissions




documents like this
We do not store term vectors. This could be an option for 'more like this'
Doing this analysis on the fly would be expensive.




cross language support
We would have to tokenise again without stop words to make this sensible
This is a lot of extra work (we just use the tokens each tokeniser generates at the moment)
We could use the exact text rather than the token and put these through the standard analyser with no stop words. Each language would then add the words it considers meaningful in some common form without stemming etc.

Index time versus search time
We should expose this to our analyser wrappers (so synonym generation, if we had it, could be index or search side only - and not both) FTS token generation already does this in a weak way ...

Tokenisation bundle.
On a DD property specify the name of a tokenisation bundle to use
Will pick up the tokeniser it defines by locale and property type
Allows mixed tokenisation, property specific tokenisation etc

Query time options


This covers features found across many implementations, which of those features we may support, and how we might do so. In particular, what we can do at search time to support them.


boost
can be set at query time for each individual query

thesaurus/synonyms
dependent upon the analyzer

languages
As selected; more languages generate more language-specific tokens

position/ordering
Supported within Lucene (phrases, proximity, span)

window/range/distance
Supported within Lucene (proximity, span)




scope sentence/paragraph/page/chapter
not supported

anchoring start/end/...
start is possible, end would need special index support

cardinality (number of occurrences)
Included in the scoring
We could expose as a specific part of the query language

exclusion
via ACL

documents like this
See indexer support

cross language support
See indexer support

Pluggable indexing


Support to add customer-defined indexing and search behaviour

Allow additional, user defined fields in the index.

e.g. indexing of XML content via extraction based on DTD definitions (which could be done when content or metadata is indexed)

Requires node context (path etc) to be available.


FTS Syntax


Based on Google with Lucene extensions.


Google like (Part 1)


Search for a single term
banana

Search for conjunctions (the default)
big yellow banana

Search for disjunctions
big OR yellow OR banana

Search for phrases
'Boris the monkey eating a banana'

Not
yellow banana -big

+
the term is used as is
no plurals
no synonyms
no stemming or tokenisation
the word is not treated as a stop word

Synonym expansion for a term
~big yellow banana

Specify the field to search
Google advanced operators
field:term
field:phrase
TYPE:'cm:content'
direct or some other exposure of Lucene fields via property QNames etc
path, aspect support

Proximity
Google separated by one or more words
big * banana

Range
[#]..[#]

Control
order
limit/paging

Notes:

  • To support Google + we would have to index stop words but mostly ignore them at search time.
  • Google + conflicts with the Lucene use of the same token for required (AND should be sufficient)
  • - is not allowed on its own (or reports no matches)
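The Google-like surface syntax of Part 1 can be sketched as a small regex lexer. This is an illustration only (a real implementation would use a generated parser); the `TOKEN` pattern and `lex` function are assumed names, and the classification is limited to the constructs listed above:

```python
import re

# Sketch: classify the Part 1 constructs (terms, OR, phrases, -, +, ~, field:)
# into tagged tokens. Not the real Alfresco parser.
TOKEN = re.compile(r"""
    (?P<phrase>'[^']*')               # quoted phrase
  | (?P<field>\w+(?::\w+)?):(?=\S)    # field prefix, e.g. TYPE: or cm:name:
  | (?P<op>OR)\b                      # explicit disjunction
  | (?P<neg>-)(?=\S)                  # exclusion prefix
  | (?P<exact>\+)(?=\S)               # 'use the term as is' prefix
  | (?P<syn>~)(?=\S)                  # synonym-expansion prefix
  | (?P<term>[^\s]+)                  # bare term (conjunction is the default)
""", re.VERBOSE)

def lex(query):
    """Return (kind, text) pairs for each recognised token in the query."""
    return [(m.lastgroup, m.group()) for m in TOKEN.finditer(query)]
```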

Lucene Extensions (Part 2)


Support AND for explicit conjunctions
big AND yellow AND banana

Wildcards for terms and within phrases

Fuzzy matches
term~

Phrase proximity
phrase~proximity

Range queries (inclusive and exclusive)
{# TO #}
[# TO #]

Query time boosts
term^boost

Not
Also include ! and NOT

Grouping of query elements
general
(big OR large) AND banana
field
title:(big OR large) AND banana
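The Lucene extensions above compose as plain string-building, which a few tiny helpers can illustrate. The helper names are assumptions; they emit query text in the syntax shown above, not real Lucene Query objects:

```python
# Sketch: string builders for the Lucene-style extensions of Part 2.

def fuzzy(term):
    return f"{term}~"                      # fuzzy match: term~

def boost(clause, factor):
    return f"{clause}^{factor}"            # query-time boost: term^boost

def prox(phrase, distance):
    return f'"{phrase}"~{distance}'        # phrase proximity: phrase~proximity

def group(*clauses, op="OR"):
    return "(" + f" {op} ".join(clauses) + ")"   # grouping of query elements

def field(name, clause):
    return f"{name}:{clause}"              # field grouping: title:(...)
```

For example, `field("title", group("big", "large")) + " AND banana"` yields the grouped field query shown above.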




Further extensions (Part 3)


Explicit spans/positions
start
woof[^]

end
woof[$]

separation
yellow banana[0..2]


Occurrences
banana{2}
banana{,2}
banana{2,}
banana{2,4}
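The `{m,n}` occurrence forms above amount to a bounds check on term frequency. In the index that count would come from the stored term frequency (which, as noted earlier, is already used for scoring); this sketch (assumed names) just counts tokens directly:

```python
# Sketch of the {m}, {m,}, {,n}, {m,n} occurrence tests over a token list.

def matches_cardinality(tokens, term, low=0, high=None):
    """True if `term` occurs between `low` and `high` times (inclusive).

    high=None means unbounded, i.e. the {m,} form.
    """
    n = tokens.count(term)
    return n >= low and (high is None or n <= high)
```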


Positions
Phrase (??)
Sentence (s)
yellow[^S] banana[S]
Yellow at the start of a sentence that also contains the word banana

Paragraph (p)




Support to specify languages, tokeniser and thesaurus to use for given terms
field:banana
field:<en_uk>banana




Notes:
End will require special support. The most common requirement is to find files based on the name ending pattern. This can in fact be done (and is perhaps better) against the content mimetype which is already in the index.
Positions look like a pain

Alfresco FTS


See Full_Text_Search_Query_Syntax


Alfresco FTS Query Builder


Register query languages with a query builder that generates Alfresco FTS.
The search service will allow queries in languages like 'ui', 'rm', 'opensearch', 'google', 'share'.
The query will be processed by the query builder and the appropriate definition.

This definition includes:


  • Components to expose
  • Macros for term generation
    • macro expansion to complex queries
    • simple field mapping
  • constraints

To be resolved ...


  • support for well known namespaces (usability)
    • name does not need to be prefixed 
    • name:text  =  cm_name:text




  • support for property mappings to simple aliases
    • name -> cm_name
    • status -> my_aspect.my_property
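The alias mappings above are a straightforward lookup with pass-through for already-qualified names. This sketch uses assumed names (`ALIASES`, `resolve_field`) and the example mappings from the bullets:

```python
# Sketch: per-context alias table rewriting bare field names to prefixed
# property names, falling through unchanged when no mapping exists.

ALIASES = {
    "name": "cm_name",                   # name -> cm_name
    "status": "my_aspect.my_property",   # status -> my_aspect.my_property
}

def resolve_field(field, aliases=ALIASES):
    """Return the mapped property name for an alias, or the field unchanged."""
    return aliases.get(field, field)
```

User-specific tables could be layered on top of a system-wide one, which is one way to keep user mappings out of persisted saved queries, as the bullets below require.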




  • system wide property mappings
    • persistable in queries
    • user mappings (which cannot be persisted in saved queries)
  • Persisted queries
    • Remove and add user preferences for field mappings (out of scope here)




  • TODO:
    • mappings and where they are defined
    • Date format handling + date functions (not included in CMIS)
    • Locale handling
    • Query constraints and functions e.g. TODAY + 2w




  • FTS vs ID
    • If both are available when to use
    • Exact match
    • FTS match
    • FTS pattern match
    • SQL pattern match




  • Embed in CMIS




  • Expose direct (not embedded)

FTS vs Embedded vs RM


  • Selector
    • Embedded -> selector, implied single selector or error
    • UI -> No selector
    • RM -> No selector
  • Fields
    • Embedded - CMIS style (cm_content:'woof')
    • UI - can use mappings to avoid namespacing
    • RM - RM mappings?
  • Field collision (see context above)
    • Embedded - fully specified - no issue
    • UI - fully specified - no issue
    • UI - 'well known' or mapped - no issue
    • UI - no prefix, matches local name in more than one namespace
      • Error
      • OR together
      • Could distinguish by case
      • Namespace search order
    • RM - as UI - specific mappings?
  • Default Field (part of context)
    • Required for RM
    • Already have this idea - contextual in some way?
  • Simple search (part of context - could have a consistent SIMPLE field)
    • A set of default fields
      • Set on the query?
  • Advanced
    • Simple + specific

Resources


http://www.google.com/support/websearch/bin/answer.py?answer=136861
http://www.blackbeltcoder.com/Articles/data/easy-full-text-search-queries
