Structure, Tags, Categories and Query in the public API

Blog Post created by andy1 Employee on May 26, 2017

Introduction

 

Alfresco Content Services has supported structural queries for a long time. You can find documents by how they are filed in a folder structure, how they are categorised and how they have been tagged. You can add new types of category, add categories of your own to existing hierarchies, and decorate the base category object using aspects; all of these categories are discoverable. Under the hood, categories and tags are implemented the same way. Tags are a flat category. Categories are treated as an additional path to a document in a category hierarchy. A document is linked to a category by setting a property to the node ref of one or more categories.

 

In Alfresco Search Services we added support to roll up and facet on some things that are effectively encoded into paths: the SITE and TAG fields. Via the 5.2 public API for search, there is now support for more advanced path-based drill down, navigation and roll up.

 

Just to be clear, TAG and SITE roll ups are available in Share; path and taxonomy faceting is not. You can roll up the node ref in the category property in Share but this is probably not what you want.

 

This post is about the public API for search.

 

So what's new?

 

There are six new pieces of information in the index, exposed as fields that can be used in queries, filter queries and facets. These are:

  • TAG - used to index all the lowercase tags that have been assigned to a node.
  • SITE - used to index the site short name for a node, in any site. Remember a node can be in more than one site, although this is unusual. If a node is not in any site it is assigned a value of "_REPOSITORY_" to reflect this.
  • NPATH - the "name path" to a node. See the example below for how it is indexed.
  • PNAME - the path from the node up through its parents - again, see the example below.
  • APATH - as NPATH but using UUID - the UUID can be used as the key for internationalisation (SOLR 6 only).
  • ANAME - as PNAME but using UUID - the UUID can be used as the key for internationalisation (SOLR 6 only).

 

The search public API in Alfresco Content Services 5.2 exposes filter queries and faceting by field with prefix restrictions. These, in combination with the additional data, support new ways to drill in and roll up data.

 

Example of what is in the index

 

Let's say we have uploaded the file CMIS-v1.1-cs01.pdf into the site woof. The index would then contain:

"PATH":["/{http://www.alfresco.org/model/application/1.0}company_home/{http://www.alfresco.org/model/site/1.0}sites/{http://www.alfresco.org/model/content/1.0}woof/{http://www.alfresco.org/model/content/1.0}documentLibrary/{http://www.alfresco.org/model/content/1.0}CMIS-v1.1-cs01.pdf"]      

"SITE":["woof"]        

"APATH":[
          "0/264ed642-b527-488a-9139-ecde3673e4de",          

          "1/264ed642-b527-488a-9139-ecde3673e4de/e4c94340-8e40-4612,a4e4-354d10f3217e",

          "2/264ed642-b527-488a-9139-ecde3673e4de/e4c94340-8e40-4612-a4e4-354d10f3217e/b9f14a0f-cffb-4409-b8d0-d77e89eca0e2",

          "3/264ed642-b527-488a-9139-ecde3673e4de/e4c94340-8e40-4612-a4e4-354d10f3217e/b9f14a0f-cffb-4409-b8d0-d77e89eca0e2/4f3c4bcd-2ee1-462d-9462-24b7a72acc19",

          "4/264ed642-b527-488a-9139-ecde3673e4de/e4c94340-8e40-4612-a4e4-354d10f3217e/b9f14a0f-cffb-4409-b8d0-d77e89eca0e2/4f3c4bcd-2ee1-462d-9462-24b7a72acc19/340d5e93-89bf-4cc2-9ae5-9f4ffbad2675",

         "F/264ed642-b527-488a-9139-ecde3673e4de/e4c94340-8e40-4612-a4e4-354d10f3217e/b9f14a0f-cffb-4409-b8d0-d77e89eca0e2/4f3c4bcd-2ee1-462d-9462-24b7a72acc19/340d5e93-89bf-4cc2-9ae5-9f4ffbad2675"]

       

"ANAME":[
          "0/340d5e93-89bf-4cc2-9ae5-9f4ffbad2675",          

          "1/4f3c4bcd-2ee1-462d-9462-24b7a72acc19/340d5e93-89bf-4cc2-9ae5-9f4ffbad2675",      

          "2/b9f14a0f-cffb-4409-b8d0-d77e89eca0e2/4f3c4bcd-2ee1-462d-9462-24b7a72acc19/340d5e93-89bf-4cc2-9ae5-9f4ffbad2675",          

          "3/e4c94340-8e40-4612-a4e4-354d10f3217e/b9f14a0f-cffb-4409-b8d0-d77e89eca0e2/4f3c4bcd-2ee1-462d-9462-24b7a72acc19/340d5e93-89bf-4cc2-9ae5-9f4ffbad2675",          

          "4/264ed642-b527-488a-9139-ecde3673e4de/e4c94340-8e40-4612-a4e4-354d10f3217e/b9f14a0f-cffb-4409-b8d0-d77e89eca0e2/4f3c4bcd-2ee1-462d-9462-24b7a72acc19/340d5e93-89bf-4cc2-9ae5-9f4ffbad2675",

          "F/264ed642-b527-488a-9139-ecde3673e4de/e4c94340-8e40-4612-a4e4-354d10f3217e/b9f14a0f-cffb-4409-b8d0-d77e89eca0e2/4f3c4bcd-2ee1-462d-9462-24b7a72acc19/340d5e93-89bf-4cc2-9ae5-9f4ffbad2675"]

"NPATH":[
          "0/Company Home",          

          "1/Company Home/Sites",          

          "2/Company Home/Sites/woof",          

          "3/Company Home/Sites/woof/documentLibrary",          

          "4/Company Home/Sites/woof/documentLibrary/CMIS-v1.1-cs01.pdf",          

          "F/Company Home/Sites/woof/documentLibrary/CMIS-v1.1-cs01.pdf"]

"PNAME":[
          "0/documentLibrary",          

          "1/woof/documentLibrary",          

          "2/Sites/woof/documentLibrary",          

          "3/Company Home/Sites/woof/documentLibrary",          

          "F/Company Home/Sites/woof/documentLibrary"]

Queries

 

To find things by SITE you can just use

SITE:"woof"

It is great to do this in a filter query in the search API as filter queries are cached, reused and warmed.

 

Similarly for TAGs

  TAG:tag

SITE and TAG also support faceting.
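
As a sketch of how these pieces combine, the Python below builds a request body that filters on a site and facets on SITE and TAG. The helper name `site_and_tag_body` is hypothetical. The body layout follows the JSON request bodies used in this post, and the escaping of backslashes and double quotes is a precaution assumed here rather than documented behaviour.

```python
import json


def site_and_tag_body(site, query="*"):
    """Build a search request body restricted to one site, faceting on SITE and TAG."""
    # Assumption: escape characters that would break out of the quoted phrase.
    safe_site = site.replace("\\", "\\\\").replace('"', '\\"')
    return {
        "query": {"query": query},
        # Filter queries are cached, so the site restriction goes here.
        "filterQueries": [{"query": f'SITE:"{safe_site}"'}],
        "facetFields": {"facets": [{"field": "SITE"}, {"field": "TAG"}]},
    }


print(json.dumps(site_and_tag_body("woof"), indent=2))
```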

 

The fun starts with drill down. NPATH can be used for navigation. You can get top level names by asking for a facet on NPATH starting with the prefix "0/". You can then remove the "0/" from the facets returned to get the names of the top level things. Here is the JSON body of the request:

 

{
  "query": {
    "query": "*"
  },
  "facetFields": {
    "facets": [
      {"field": "NPATH", "prefix": "0/"}
    ]
  }
}
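
Stripping the level marker from the returned facet labels is a one-liner. The sketch below assumes you have already pulled the facet labels out of the response into a plain list of strings; the function name `top_level_names` and the sample labels are illustrative only.

```python
def top_level_names(labels, prefix="0/"):
    """Drop the level prefix from each facet label, ignoring labels at other levels."""
    return [label[len(prefix):] for label in labels if label.startswith(prefix)]


# Illustrative labels; a real response will contain many more.
labels = ["0/categories", "0/Company Home"]
print(top_level_names(labels))
```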

The response contains, amongst many others, "0/categories". So let's drill into that another layer. We need the facet prefix "1/categories", and we filter on "0/categories" to restrict results to the branch we are drilling into, as below.

 

{
  "query": {
    "query": "*"
  },

  "filterQueries": [{"query": "NPATH:\"0/categories\""}],
  "facetFields": {
    "facets": [
      {"field": "NPATH", "prefix": "1/categories"}
    ]
  }
}
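
The pattern generalises: filter on the level you are at, and facet with a prefix one level deeper. Here is a hedged Python sketch of that rule. The helper `drill_down_body` and its parameters are hypothetical, and it reproduces the prefix exactly as written in this post (no trailing slash).

```python
def drill_down_body(segments, field="NPATH", query="*"):
    """Build a request body that lists the next layer below the given path segments."""
    body = {"query": {"query": query}}
    depth = len(segments)
    joined = "/".join(segments)
    if segments:
        # Restrict to everything under the current node ...
        body["filterQueries"] = [{"query": f'{field}:"{depth - 1}/{joined}"'}]
        # ... and facet on entries one level deeper.
        prefix = f"{depth}/{joined}"
    else:
        # No segments yet: ask for the top-level names.
        prefix = "0/"
    body["facetFields"] = {"facets": [{"field": field, "prefix": prefix}]}
    return body


print(drill_down_body(["categories"]))
```

Calling it with an empty list gives the top-level request, and calling it with `["categories"]` gives the request body shown above.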

 

This gives us "1/categories/General" and "1/categories/Tags". So let's skip a step or two and count the stuff in the General/Languages category. I have not added the prefix here, to show how you can get counts for all nodes in the hierarchy.

 

 

{
  "query": {
    "query": "*"
  },
  "filterQueries": [{"query": "NPATH:\"2/categories/General/Languages\""}],
  "facetFields": {
    "facets": [
      {"field": "NPATH" }
    ]
  }
}

In a clean repository, this will show the structure of Languages and show how many sub-categories exist.

 

With UUIDs it is pretty much the same: use APATH and ANAME in place of NPATH and PNAME.

 

PNAME

 

PNAME helps with some odd use cases. "I am sure I put it in a folder called .....". It gives faceting based on ancestry. It can highlight common structures for storing data or departure from such a structure: things in odd locations. If you have used folders to encode state you can roll up on this state. (Yes you should probably have used metadata or workflow).

 

PNAME can also be used to count direct members of a category, rather than everything in the category and below. So NPATH can count everything below and PNAME only the direct members. The design uses the same prefix for each. So to get the next category layer with total and direct counts you would use something like:

 

{
  "query": {
    "query": "*"
  },
  "filterQueries": [{"query": "NPATH:\"2/categories/General/Regions\""}],
  "facetFields": {
    "facets": [
      {"field": "NPATH", "prefix": "3/categories/General/Regions", "label": "clade"},
      {"field": "PNAME", "prefix": "3/categories/General/Regions", "label": "direct"}
    ]
  }
}
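
Because both facets share the same prefix, their results can be lined up by label. The Python below is a sketch under assumptions: it takes each facet's results as a plain dict of label to count (how you extract those from the response is left out), the function name `merge_counts` is hypothetical, and the category names and counts are made up for illustration.

```python
def merge_counts(clade, direct, prefix):
    """Combine total (NPATH) and direct (PNAME) counts per category name."""
    merged = {}
    for label, total in clade.items():
        if not label.startswith(prefix):
            continue
        name = label[len(prefix):].lstrip("/")
        # A category with no direct members is absent from the PNAME facet.
        merged[name] = {"total": total, "direct": direct.get(label, 0)}
    return merged


# Made-up facet results for illustration only.
prefix = "3/categories/General/Regions"
clade = {f"{prefix}/North": 5, f"{prefix}/South": 2}
direct = {f"{prefix}/North": 3}
print(merge_counts(clade, direct, prefix))
```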

 

Summary

 

SITE and TAG provide easy query time access to concepts in Share. NPATH and PNAME support queries that will progressively drill into a folder or category structure and support faceting to count the documents and folders in each part of the next layer. APATH and ANAME do the same job with a UUID key to aid internationalisation and a bridge to other public APIs where UUID is ubiquitous.

 

Notes

 

Clade is a term from biological classification that defines a grouping with a common ancestor.

 

The public API cannot currently have more than one field facet on the same field with different configuration, e.g. NPATH with a prefix and NPATH without a prefix. You have to choose one or the other.

 

We used something inspired by PathHierarchyTokenizer to support this functionality. In order to use doc values we parse internally and use a multi-valued string field. We pre-analyse because doc values and tokenisation do not play well together.

 

If you want to exclude the category nodes themselves from the counts, add a filter query:

-TYPE:category

 

Don't forget the workhorse support for structure provided by querying PATH, ANCESTOR and PARENT. You can also use PARENT and ANCESTOR for faceting on NodeRefs.

 

In SOLR 6, fields of type d:category are now indexed "as is" and you can facet on the NodeRef.
