
Introduction

 

Alfresco Content Services has supported structural queries for a long time. You can find documents by how they are filed in a folder structure, how they are categorised and how they have been tagged. You can add new types of category, add new categories of your own to existing hierarchies, and decorate the base category object using aspects; these categories are all discoverable. Under the hood, categories and tags are implemented the same way: tags are a flat category, while categories are treated as an additional path to a document in a category hierarchy. A document is linked to a category by setting a property to the node ref of one or more categories.

 

In Alfresco Search Services we added support to roll up and facet on some things that are effectively encoded into paths - the SITE and TAG fields. Via the 5.2 public API for search, there is now support for more advanced path-based drill-down, navigation and roll-up.

 

Just to be clear, TAG and SITE roll ups are available in Share; path and taxonomy faceting is not. You can roll up the node ref in the category property in Share but this is probably not what you want.

 

This post is about the public API for search.

 

So what's new?

 

There are 6 new bits of information in the index, exposed as fields that can be used in queries, filter queries and facets. These are:

  • TAG - used to index all the lowercase tags that have been assigned to a node.
  • SITE - used to index the site short name for a node, in any site. Remember a node can be in more than one site, although this is unusual. If a node is not in any site it is assigned a value of "_REPOSITORY_" to reflect this.
  • NPATH - the "name path" to a node. See the example below for how it is indexed.
  • PNAME - the path from the node up through its parents - again, see the example below.
  • APATH - as NPATH but using UUID - the UUID can be used as the key for internationalisation (SOLR 6 only).
  • ANAME - as PNAME but using UUID - the UUID can be used as the key for internationalisation (SOLR 6 only).

 

The search public API in Alfresco Content Services 5.2 exposes filter queries and faceting by field with prefix restrictions. These, in combination with the additional data, support new ways to drill in and roll up data.

 

Example of what is in the index

 

Let's say we have uploaded the file CMIS-v1.1-cs01.pdf into the site woof; we would have .....

"PATH":["/{http://www.alfresco.org/model/application/1.0}company_home/{http://www.alfresco.org/model/site/1.0}sites/{http://www.alfresco.org/model/content/1.0}woof/{http://www.alfresco.org/model/content/1.0}documentLibrary/{http://www.alfresco.org/model/content/1.0}CMIS-v1.1-cs01.pdf"]

"SITE":["woof"]

"APATH":[
          "0/264ed642-b527-488a-9139-ecde3673e4de",

          "1/264ed642-b527-488a-9139-ecde3673e4de/e4c94340-8e40-4612-a4e4-354d10f3217e",

          "2/264ed642-b527-488a-9139-ecde3673e4de/e4c94340-8e40-4612-a4e4-354d10f3217e/b9f14a0f-cffb-4409-b8d0-d77e89eca0e2",

          "3/264ed642-b527-488a-9139-ecde3673e4de/e4c94340-8e40-4612-a4e4-354d10f3217e/b9f14a0f-cffb-4409-b8d0-d77e89eca0e2/4f3c4bcd-2ee1-462d-9462-24b7a72acc19",

          "4/264ed642-b527-488a-9139-ecde3673e4de/e4c94340-8e40-4612-a4e4-354d10f3217e/b9f14a0f-cffb-4409-b8d0-d77e89eca0e2/4f3c4bcd-2ee1-462d-9462-24b7a72acc19/340d5e93-89bf-4cc2-9ae5-9f4ffbad2675",

         "F/264ed642-b527-488a-9139-ecde3673e4de/e4c94340-8e40-4612-a4e4-354d10f3217e/b9f14a0f-cffb-4409-b8d0-d77e89eca0e2/4f3c4bcd-2ee1-462d-9462-24b7a72acc19/340d5e93-89bf-4cc2-9ae5-9f4ffbad2675"]

       

"ANAME":[
          "0/340d5e93-89bf-4cc2-9ae5-9f4ffbad2675",          

          "1/4f3c4bcd-2ee1-462d-9462-24b7a72acc19/340d5e93-89bf-4cc2-9ae5-9f4ffbad2675",      

          "2/b9f14a0f-cffb-4409-b8d0-d77e89eca0e2/4f3c4bcd-2ee1-462d-9462-24b7a72acc19/340d5e93-89bf-4cc2-9ae5-9f4ffbad2675",          

          "3/e4c94340-8e40-4612-a4e4-354d10f3217e/b9f14a0f-cffb-4409-b8d0-d77e89eca0e2/4f3c4bcd-2ee1-462d-9462-24b7a72acc19/340d5e93-89bf-4cc2-9ae5-9f4ffbad2675",          

          "4/264ed642-b527-488a-9139-ecde3673e4de/e4c94340-8e40-4612-a4e4-354d10f3217e/b9f14a0f-cffb-4409-b8d0-d77e89eca0e2/4f3c4bcd-2ee1-462d-9462-24b7a72acc19/340d5e93-89bf-4cc2-9ae5-9f4ffbad2675",

          "F/264ed642-b527-488a-9139-ecde3673e4de/e4c94340-8e40-4612-a4e4-354d10f3217e/b9f14a0f-cffb-4409-b8d0-d77e89eca0e2/4f3c4bcd-2ee1-462d-9462-24b7a72acc19/340d5e93-89bf-4cc2-9ae5-9f4ffbad2675"]

"NPATH":[
          "0/Company Home",          

          "1/Company Home/Sites",          

          "2/Company Home/Sites/woof",          

          "3/Company Home/Sites/woof/documentLibrary",          

          "4/Company Home/Sites/woof/documentLibrary/CMIS-v1.1-cs01.pdf",          

          "F/Company Home/Sites/woof/documentLibrary/CMIS-v1.1-cs01.pdf"]

"PNAME":[
          "0/documentLibrary",          

          "1/woof/documentLibrary",          

          "2/Sites/woof/documentLibrary",          

          "3/Company Home/Sites/woof/documentLibrary",          

          "F/Company Home/Sites/woof/documentLibrary"]

Queries

 

To find things by SITE you can just use

SITE:"woof"

It is great to do this in a filter query in the search API as filter queries are cached, reused and warmed.

 

Similarly for TAGs

  TAG:tag

SITE and TAG also support faceting.
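
For example, a search public API request body that counts the tags used within the site woof might look like this (a minimal sketch in the same request format used for the NPATH examples below; woof is just the example site from this post):

{
  "query": {
    "query": "*"
  },
  "filterQueries": [{"query": "SITE:\"woof\""}],
  "facetFields": {
    "facets": [
      {"field": "TAG"}
    ]
  }
}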

 

The fun starts with drill down. NPATH can be used for navigation. You can get top level names by asking for a facet on NPATH starting with the prefix "0/". You can then remove the "0/" from the facets returned to get the names of the top level things. Here is the JSON body of the request:

 

{
  "query": {
    "query": "*"
  },
  "facetFields": {
    "facets": [
      {"field": "NPATH", "prefix": "0/"}
    ]
  }
}

The response contains, amongst many others, "0/categories". So let's drill into that another layer. We need the prefix "1/categories" and we filter based on where we want to drill in - "0/categories" - as below.

 

{
  "query": {
    "query": "*"
  },

  "filterQueries": [{"query": "NPATH:\"0/categories\""}],
  "facetFields": {
    "facets": [
      {"field": "NPATH", "prefix": "1/categories"}
    ]
  }
}

 

This gives us "1/categories/General" and "1/categories/Tags". So let's skip a step or two and count the stuff in the General/Languages category ... I have not added the prefix here to show how you can get counts for all nodes in the hierarchy.

 

 

{
  "query": {
    "query": "*"
  },
  "filterQueries": [{"query": "NPATH:\"2/categories/General/Languages\""}],
  "facetFields": {
    "facets": [
      {"field": "NPATH" }
    ]
  }
}

In a clean repository, this will show the structure of the Languages category and how many sub-categories exist.

 

With UUIDs it is pretty much the same.
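
For example, to repeat the first drill-down step using UUID keys you could facet on APATH with a UUID prefix taken from the index example above (a sketch; 264ed642-b527-488a-9139-ecde3673e4de is the Company Home node in that example):

{
  "query": {
    "query": "*"
  },
  "filterQueries": [{"query": "APATH:\"0/264ed642-b527-488a-9139-ecde3673e4de\""}],
  "facetFields": {
    "facets": [
      {"field": "APATH", "prefix": "1/264ed642-b527-488a-9139-ecde3673e4de"}
    ]
  }
}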

 

PNAME

 

PNAME helps with some odd use cases. "I am sure I put it in a folder called .....". It gives faceting based on ancestry. It can highlight common structures for storing data or departure from such a structure: things in odd locations. If you have used folders to encode state you can roll up on this state. (Yes you should probably have used metadata or workflow).

 

PNAME can also be used to count direct members of a category - rather than everything in the category and below. So NPATH can count everything below and PNAME everything direct. The design uses the same prefix for each. So to get the next category layer with total and direct counts you would use something like:

 

{
  "query": {
    "query": "*"
  },

 

  "filterQueries": [{"query": "NPATH:\"2/categories/General/Regions\""}],
  "facetFields": {
    "facets": [
      {"field": "NPATH", "prefix": "3/categories/General/Regions", "label": "clade"},
      {"field": "PNAME", "prefix": "3/categories/General/Regions", "label": "direct"}
    ]
  }
}

 

Summary

 

SITE and TAG provide easy query time access to concepts in Share. NPATH and PNAME support queries that will progressively drill into a folder or category structure and support faceting to count the documents and folders in each part of the next layer. APATH and ANAME do the same job with a UUID key to aid internationalisation and a bridge to other public APIs where UUID is ubiquitous.

 

Notes

 

Clade is a term from biological classification that defines a grouping with a common ancestor.

 

The public API cannot currently have a field facet on the same field with different configurations, e.g. NPATH with a prefix and NPATH without a prefix. You have to choose one or the other.

 

We used something inspired by PathHierarchyTokenizer to support this functionality. In order to use doc values we parse internally and use a multi-valued string field. We are pre-analysing because doc values and tokenisation do not play well together.

 

If you want to exclude the category nodes themselves from the counts, add a filter query:

-TYPE:category

 

Don't forget the workhorse support for structure provided by querying PATH, ANCESTOR and PARENT. You can also use PARENT and ANCESTOR for faceting on NodeRefs.
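
For example, a filter query on PATH combined with an ANCESTOR facet might look like this (a sketch, assuming the prefixed QName form of the path rather than the full namespace URIs shown in the PATH value above):

{
  "query": {
    "query": "*"
  },
  "filterQueries": [{"query": "PATH:\"/app:company_home/st:sites/cm:woof/cm:documentLibrary//*\""}],
  "facetFields": {
    "facets": [
      {"field": "ANCESTOR"}
    ]
  }
}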

 

In SOLR 6, fields of type d:category are now indexed "as is" and you can facet on the NodeRef.


Document Fingerprints

Posted by andy1 Employee May 12, 2017

Introduction

 

 

Support to find related documents by content fingerprint, given a source document, was added in Alfresco Content Services 5.2 and is exposed as part of the Alfresco Full Text Search Query Language. Document fingerprints can be used to find similar content in general or biased towards containment. The language adds a new FINGERPRINT keyword:

 

FINGERPRINT:<DBID | NODEREF | UUID>

 

By default, this will find documents that have any overlap with the source document. The UUID option is likely to be the most common as UUID is ubiquitous in the public API. To specify a minimum amount of overlap use

 

FINGERPRINT:<DBID | NODEREF | UUID>_<%overlap>

FINGERPRINT:<DBID | NODEREF | UUID>_<%overlap>_<%probability>

 

FINGERPRINT:1234_20

will find documents that have at least 20% overlap with document 1234.

 

FINGERPRINT:1234_20_80

will execute a faster query that is 80% confident that anything returned will overlap by at least 20%.
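
Since FINGERPRINT is part of the Alfresco FTS query language, it can also be used through the 5.2 search public API; a minimal request body might look like this (a sketch reusing the DBID 1234 from the examples above):

{
  "query": {
    "query": "FINGERPRINT:1234_20",
    "language": "afts"
  }
}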

 

Additional information is added to the SOLR 4 or 6 indexes using the rerank template to support fingerprint queries. It makes the indexes ~15% bigger.

 

The basis of the approach taken here and a much wider discussion is presented in Mining of Massive Datasets, Chapter 3. Support to generate text fingerprints was contributed by Alfresco to Apache Lucene via LUCENE-6968. The issue includes some more related material.

 

Similarity and Containment

 

Document similarity covers duplicate detection, near duplicate detection, finding different renditions of the same content, etc. This is important to find and reduce redundant information. Fingerprints can provide a distance measure to other documents, often based on Jaccard distance, to support "more like this" and clustering. The distance can also be used as a basis for graph traversal. Jaccard distance is defined as:

J(A,B) = |A ∩ B| / |A ∪ B|

i.e. the ratio of the amount of common content to the total content of two documents. This distance can be used to compare the similarity of any two documents with any other pair of documents.

 

Containment is related but is more about inclusion. For example, many email threads include parts or all of previous messages. Containment is not symmetrical like the measure of similarity above, and is defined as:

C(A,B) = |A ∩ B| / |A|

It represents how much of the content of a given document is common to another document. This distance can be used to compare a single document (A) to any other document. This is the main use case we are going to consider.

 

Minhashing

 

Minhashing is an example of a text processing pipeline. First, the text is split into a stream of words; these words are then combined into five-word sequences, known as shingles, to produce a stream of shingles. So far, all standard stuff available in SOLR. The 5-word shingles are then hashed, for example, in 512 different ways, keeping the lowest hash value for each hash. This results in 512 repeatably random samples of 5-word sequences from the text, each represented by the hash of the shingle. The same text will generate the same set of 512 minhashes. Similar text will generate many of the same hashes. It turns out that if 10% of all the minhashes from two documents overlap then it is a great estimator that J(A,B) = 0.1.

 

Why 5-word sequences? A good question. Recent work on word embeddings suggests 5 or more words are enough to describe the context and meaning of a central word. Looking at the frequency distributions of 1-, 2-, 3-, 4- and 5-word shingles found on the web, at 5-word shingles the frequency distribution flattens and broadens compared with the trend seen for the shorter shingles.

 

Why 512 hashes? With a well distributed hash function this should give good hash coverage for 2,500 words and around 10% for 25,000, or something like 100 pages of text.

 

We used a 128-bit hash to encode both the hash set position (see later) and hash value to minimise collision compared with a 64 bit encoding including bucket/set position.

 

An example

 

Here are two summaries of the 1.0 and 1.1 CMIS specifications. They demonstrate, amongst other things, how sensitive the measure is to small changes: adding a single word affects 5 shingles.

 

 

The content overlap of the full 1.0 CMIS specification found in the 1.1 CMIS specification, C(1.0, 1.1) ~52%.

 

The MinHashFilter in Lucene

 

The MinHashFilterFactory has four options:

  • hashCount - the number of real hash functions computed for each input token (default: 1)
  • bucketCount - the number of buckets each hash is split into (default: 512)
  • hashSetSize - for each bucket, the size of the minimum set (default:1)
  • withRotation - empty buckets take values from the next with wrap-around (default: true if bucketCount > 1 )

 

Computing the hash is expensive. It is cheaper to split a single hash into sub-ranges and treat each as supplying an independent minhash. Using individual hash functions, each will have a min value - even if it is a duplicate. Splitting the hash into ranges means this duplication can be controlled using the withRotation option. If this option is true you will always get 512 hashes - even for very short documents. Without rotation, there may be fewer than 512 hashes depending on the text length. You can get similar, but biased, sampling using a minimum set of 512.

 

Supporting a min set is in there for fun: it was used in early minhash work. It leads to a biased estimate but may have some interesting features. I have not seen anyone consider what the best combination of options may be. The defaults are my personal opinion from reading the literature linked on LUCENE-6968. There is no support for rehashing based on a single hash as I believe the bucket option is more flexible.  

 

Here are some snippets from our schema.xml:

....

<fieldType name="text_min_hash" class="solr.TextField" positionIncrementGap="100">

   <analyzer type="index">

      <tokenizer class="solr.ICUTokenizerFactory"/>
      <filter class="solr.ICUNormalizer2FilterFactory" name="nfkc_cf" mode="compose" />
      <filter class="solr.ShingleFilterFactory" minShingleSize="5" maxShingleSize="5" outputUnigrams="false" outputUnigramsIfNoShingles="false" tokenSeparator=" " />
      <filter class="org.apache.lucene.analysis.minhash.MinHashFilterFactory" hashCount="1" hashSetSize="1" bucketCount="512" />
   </analyzer>
   <analyzer type="query">
      <tokenizer class="solr.KeywordTokenizerFactory" />
   </analyzer>
</fieldType>

.....

<field name="MINHASH"           type="identifier"  indexed="true"  omitNorms="true"  stored="false" multiValued="true"  required="false"  docValues="false"/>

.....

 <field name="min_hash"               type="text_min_hash"    indexed="true"  omitNorms="true"  omitPositions="true" stored="false"  multiValued="true" />

 

It is slightly misleading as we use the min_hash field as a way of finding an analyser to use internally to fill the MINHASH field. We also have a method to get the MINHASH field values without storing them in the index. Others would most likely have to store it in the index.

 

Query Time

 

At query time, the first step is to get hold of the MINHASH tokens for the source document - we have a custom search component to do that in a sharded environment from information we persist alongside the index. Next you need to build a query. The easy way is to just OR together queries for each hash in a big boolean query. Remember to use setDisableCoord(true) to get the right measure of Jaccard distance. Applying a cut-off simply requires you to call setMinimumNumberShouldMatch() based on the number of hashes you found.
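
As a rough sketch of this simple OR approach (assuming Lucene 6, the MINHASH field from the schema snippets above, and the hash tokens already retrieved for the source document; the class and method below are illustrative, not the actual Alfresco implementation):

import java.util.List;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class FingerprintQueries {

    // ORs together one TermQuery per minhash token from the source document.
    // minimumSimilarity is the overlap cut-off, e.g. 0.2f for 20%.
    public static Query simpleOverlapQuery(List<String> minHashes, float minimumSimilarity) {
        BooleanQuery.Builder builder = new BooleanQuery.Builder();
        builder.setDisableCoord(true); // the score then reflects the fraction of matching hashes
        for (String hash : minHashes) {
            builder.add(new TermQuery(new Term("MINHASH", hash)), BooleanClause.Occur.SHOULD);
        }
        // require a minimum fraction of the source document's hashes to match
        builder.setMinimumNumberShouldMatch((int) (minHashes.size() * minimumSimilarity));
        return builder.build();
    }
}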

 

Building a banded query is more complicated and derived from locality sensitive hashing, described in Mining of Massive Datasets, Chapter 3. Compute the band size as they suggest, based on the similarity and expected true positive rate required. For each band, build a boolean query that must match all the hashes in a band selected from the source docs fingerprint. Then, OR all of these band queries together. Avoid the trap of making a small band at the end by wrapping around to the start. This gives better query speed at the expense of accuracy.

If you are serious about Alfresco in your IT infrastructure you most certainly have a "User Acceptance Tests" environment. If you don't... you really should consider setting one up (don't make Common mistakes)!

When you initially set up your environments everything is quiet, users are not using the system yet and you just don't care about data freshness. However, soon after you go live, the production system will start being fed with data. Basically it is this data, stored in Alfresco Content Services, that makes your system or application valuable.

When the moment comes to upgrade or deploy a new customization/application, you will obviously test it first on your UAT (or pre-production or test, whatever you call it) environment. Yes you will!

When you do so, having a UAT environment that doesn't have an up-to-date set of data can make the tests pointless or more difficult to interpret. This is also true if you plan to kick off performance tests: if the tests are done on a data set that is only one third the size of the production data, they are pointless.

Basically that's why you need to refresh your UAT data with production data every now and then or at least when you know it's going to be needed.

The scope of this document is not to provide you with a step-by-step guide on how to refresh your repository. Alfresco Content Services being a platform, this highly depends on what you actually do with your repository, the kind of customizations you are using and the third-party apps that may be linked to Alfresco. This document will mainly highlight things you should thoroughly check when refreshing your dataset.

 

Prerequisites

Some things must be validated before going further:

  • Production & UAT environments should have the same architecture (same number of servers, same components installed, and so on and so forth...)
  • Production & UAT environments should have the same sizing (while you can forget this for functional tests only, this is a true requirement for performance tests of course)
  • Production & UAT environments should be hosted on different, clearly separated networks (mostly if using cluster)

 

What is it about?

In order to refresh your UAT repository with data from production you will simply go through the normal restore process of an Alfresco repository.

Here I consider backup strategy out of scope... If you don't have proper backups already set up, that's where you should start: Performing a hot backup | Alfresco Documentation

      

The required assets to restore are:

  • Alfresco's database
  • Filesystem repository
  • Indexes

Before you start your fresh UAT

There are a number of things you should check before starting your refreshed environment.

 

Reconfigure cluster

Although the recommendation is to isolate environments, it is better to also specify different cluster configurations for the two environments. That allows for less confusing administration and log analysis, and also prevents information leaking from one network to another in case the isolation is not that good.

When starting a refreshed UAT cluster, you should always make sure you are setting a cluster password or a cluster name that is different from the production cluster's. Doing so prevents cluster communication from happening between nodes that are actually not part of the same cluster:

Alfresco 4.2 onward:

alfresco.hazelcast.password=someotherpassword

Alfresco pre-4.2:

alfresco.cluster.name=uatCluster

On the Share side, it is possible to change more parameters in order to isolate clusters but we will still apply the same logic for the sake of simplicity. Here you would change the Hazelcast password in the custom-slingshot-application-context.xml configuration file inside the {web-extension} directory.

<hz:topic id="topic" instance-ref="webframework.cluster.slingshot" name="slingshot-topic"/>
   <hz:hazelcast id="webframework.cluster.slingshot">
     <hz:config>
       <hz:group name="slingshot" password="notthesamepsecret"/>
       <hz:network port="5801" port-auto-increment="true">
         <hz:join>
           <hz:multicast enabled="true" multicast-group="224.2.2.5" multicast-port="54327"/>
           <hz:tcp-ip enabled="false">
             <hz:members></hz:members>
           </hz:tcp-ip>
        </hz:join>
...

Email notifications

It's very unlikely that your UAT environment needs to send emails or notifications to real users. Your production system is already sending digest and other emails to users and you don't want them to get confused because they received similar emails from other systems. So you have to make sure emails are either:

  • Sent to a black hole destination
  • Sent to some other place where users can't see them

If you really don't care about emails generated by Alfresco, then you can choose the "black hole" option. There are many different ways to do that, among which configuring your local MTA to send all emails to a single local user and optionally linking that user's mailbox to /dev/null (with postfix you could use the canonical_maps directive and mbox storage). Another way to do that would be to use the Java DevNull SMTP server. It is very simple to use as it is just a jar file you can launch:

java -jar -console -p 10025 DevNull.jar

On the other hand, as part of your user tests, you may be interested in knowing and analyzing what emails are generated by your Alfresco instance. In this case you could still use the previous options. Both are indeed able to store emails instead of swallowing them: postfix by not linking the mbox storage to /dev/null, and the DevNull SMTP server by using the "-s /some/path/" option. However, storing emails on the filesystem is not really handy if you want to check their content and the way they render (for instance).

If email is a matter of interest then you can use other products like MailHog or mailtrap.io. Both offer an SMTP server that stores emails for you instead of sending them to the outside world, and they also offer a neat way to visualize them, just like a webmail would do.

 

Mailhog WebUI

Mailtrap.io is a service that also offers advanced features like POP3 (so you can see emails in "real-life" clients), SPAM score testing, content analysis and, for subscription-based users, collaboration features.

 

Whichever option you choose, and depending on the chosen configuration, you'll have to switch the following properties for your UAT Alfresco nodes (an example override follows the list below):

mail.host
mail.port
mail.smtp.auth
mail.smtps.auth
mail.username
mail.password
mail.smtp.starttls.enable
mail.protocol
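
For example, assuming a local MailHog instance listening on its default SMTP port (an assumption; adjust the host, port and credentials to your own setup), the UAT overrides could look like:

mail.host=localhost
mail.port=1025
mail.protocol=smtp
mail.smtp.auth=false
mail.smtps.auth=false
mail.smtp.starttls.enable=false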

Jobs & FSTR synchronisation

Alfresco allows an administrator to schedule jobs and set up replication to another remote repository.

Scheduled jobs are carried over from production environments to UAT if you cloned environments or proceeded to a backup/restore of production data. However you most certainly don't want the same job to run twice from two different environments.

Defining whether or not a job should run in UAT depends on a lot of factors and is very much related to what the job actually does. Here we cannot give a list of precise actions to take in order to avoid problems. It is the administrator's call to review scheduled jobs and decide whether or not they should/can be disabled.

Jobs to review can be found in a Spring bean definitions file like ${extensionRoot}/extension/my-scheduler-context.xml.

One easy way to disable jobs can be to set a cron expression to a far future (or past), for example:

 

<property name="cronExpression">
    <value>0 50 0 * * ? 1970</value>
</property>

The repository can also hold synchronization jobs, mainly those used in File Transfer Receiver setups.

In that case the administrator surely has to disable such jobs (or at least reconfigure them), as you do not want UAT frozen data to be synced to a remote location where live production data is expected!

Disabling this kind of job is pretty simple. You can do it using the Share web UI by going to "Repository \ Data Dictionary \ Transfers \ Default Target group \ Group1" and editing the properties of the "Group1" folder. In the property editor form, just untick the "Activated" checkbox.

 

Repository ID & Cloud synchronization

Alfresco repository IDs must be universally unique. And of course, if you clone environments, you create duplicate repository IDs. One of the well-known issues that can be triggered by duplicate IDs is in hybrid cloud setups, where synchronization is enabled between the production environment and the my.alfresco.com cloud. If your UAT servers connect to the cloud with the production ID you can be sure synchronization will fail at some point and could even trigger data loss on your production system. You really want to avoid that from happening!

One very easy way to prevent this from happening is to simply disable cloud sync on the UAT environment:

system.serverMode=UAT

Any string other than "PRODUCTION" can be used here. Also be aware that this property can only be set in the alfresco-global.properties file.

 

Also, if you are using APIs that need to specify the repository ID when calling Alfresco (like the old CMIS endpoint used to), such API calls may stop working in UAT because the repository ID is now the one from production (in the case where the calls were initially written against the previous ID and do not retrieve it dynamically - which would be a poor approach in most cases).

Starting with Alfresco 4.2, CMIS returns the string "-default-" as the repository ID for all new API endpoints (e.g. AtomPub /alfresco/api/-default-/public/cmis/versions/1.1/atom), while the previous endpoint (e.g. AtomPub /alfresco/cmisatom) returns a Universally Unique IDentifier.

 

If you think you need to change the repository ID, please contact Alfresco support. It makes the procedure heavier (a re-index is expected) and should be thoroughly planned.

 

Carrying unwanted configuration

If you stick to the best practices for production, you probably try to have all your configuration in properties or XML files in the {extensionRoot} directory.

But on the other hand, you may sometimes use the great facilities offered by Alfresco enterprise, such as the JMX interface or the admin console. You must then remember that those tools persist configuration information to the database. This means that, when restoring a database from one environment to another, you may end up starting an Alfresco instance with the wrong parameters.

Here is a quite handy SQL query you can use *before* starting your new Alfresco UAT. It will report all the properties that are stored in the database. You can then make sure none of them is harmful or points to a production system.

SELECT APSVk.string_value AS property, APSVv.string_value AS value
  FROM alf_prop_link APL
    JOIN alf_prop_value APVv ON APL.value_prop_id=APVv.id
    JOIN alf_prop_value APVk ON APL.key_prop_id=APVk.id
    JOIN alf_prop_string_value APSVk ON APVk.long_value=APSVk.id
    JOIN alf_prop_string_value APSVv ON APVv.long_value=APSVv.id
WHERE APL.key_prop_id <> APL.value_prop_id
    AND APL.root_prop_id IN (SELECT prop1_id FROM alf_prop_unique_ctx);

                 property                 |                value
------------------------------------------+--------------------------------------
alfresco.port                            | 8084

Do not try to delete those entries from the database straight away; this is likely to break things!

   

If any property is conflicting with the new environment, it should be removed.

Do it wisely! An administrator should ALWAYS prefer using the "Revert" operations available through the JMX interface!

The "revert()" method is available using jconsole, in the Mbean tab, within the appropriate section:

revert property using jconsole

"revert()" may revert more properties than just the one you target. If unsure how to get rid of a single property, please contact alfresco support.

   

Other typical properties to change:

When moving between environments, the properties below are likely to be different in the UAT environment (that may not be the case for you, or you may have others). As said earlier, they should be set in the ${extensionRoot} folder to a value that is specific to UAT (and they should not be present in the database):

ldap.authentication.java.naming.provider.url
ldap.synchronization.java.naming.security.principal
ldap.synchronization.java.naming.security.credentials
solr.host
solr.port

We've come a really long way in this series of blog posts; we've now covered a majority of the functionality in the 5.2 release through the following posts:

 

 

On reflection, this post should have been the first one in the series as we're going to discuss some capabilities of the API that can be applied to most endpoints! We're also going to cover some lesser-known features provided by our underlying REST API Framework. The remainder of this post will highlight 10 features that we think you should know about.

 

To help demonstrate these we've provided a Postman collection with examples; click the "Run in Postman" button below to import it into your client.

 

 

1. Tickets

Throughout this series we've used basic authentication in all the examples. The APIs also support the Alfresco ticket mechanism. You can POST the following body to http://localhost:8080/alfresco/api/-default-/public/authentication/versions/1/tickets to create a new ticket (1st request in the Postman collection):

{
  "userId": "test",
  "password": "test"
}

 

The response provides the ticket in the id property:

{
  "entry": {
    "id": "TICKET_ed4981b4bbb15fc2713f7caaffd23982d0dd4e5c",
    "userId": "test"
  }
}

 

This ticket can then be used instead of a username and password. Although the REST API supports the alf_ticket query parameter we do not recommend using it; a more secure approach is to use an HTTP header. The basic auth mechanism is still used, i.e. sending an Authorization header, however the base64-encoded username/password is replaced with the base64-encoded ticket.
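
For illustration, a request using the ticket from the response above might look like the following; the placeholder stands for the base64 encoding of the ticket value (this is a sketch, not captured output):

GET /alfresco/api/-default-/public/alfresco/versions/1/nodes/-my-/children HTTP/1.1
Host: localhost:8080
Authorization: Basic <base64 encoding of TICKET_ed4981b4bbb15fc2713f7caaffd23982d0dd4e5c>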

 

The API Explorer shows an example of this approach (select the Authentication API from the drop-down menu and expand the POST /tickets method) and all requests in the Postman collection accompanying this article use this approach.

 

2. Limiting results

By default the API will return a maximum of 100 results in any one request; this can be controlled via the maxItems query parameter.

 

The 2nd request in the Postman collection (http://localhost:8080/alfresco/api/-default-/public/alfresco/versions/1/nodes/-my-/children?maxItems=5 ) shows how you can limit the number of results to five.

 

This query parameter is supported across all collection endpoints.

 

3. Skipping results

By default the API will return results starting from the beginning; it's possible to skip any number of results using the skipCount query parameter. This is typically used for implementing paging or infinite scrolling in clients.

 

The 3rd request in the Postman collection (http://localhost:8080/alfresco/api/-default-/public/alfresco/versions/1/nodes/-my-/children?skipCount=2 ) shows how you can skip the first two results.

 

This query parameter is supported across all collection endpoints.

 

4. Ordering results

By default all collection endpoints (those returning a list of results) have a default sort order; it's possible to change the sort order on some endpoints via the orderBy query parameter.

 

The 4th request in the Postman collection (http://localhost:8080/alfresco/api/-default-/public/alfresco/versions/1/nodes/-my-/children?orderBy=sizeInBytes DESC) shows how you can order the nodes in your home folder by the size of the content, starting with the largest item.

 

DESC and ASC can be used to control the direction of the sorting. As previously mentioned not all endpoints allow ordering to be controlled so you'll need to consult the API Explorer to see whether the orderBy parameter is supported and what properties within the response can be used.

 

5. Filtering results

Sometimes only a subset of the response is required; several endpoints support this via the where query parameter.

 

The where parameter allows you to provide one or more clauses defining what items you want to see in the response. The 5th request in the Postman collection (http://localhost:8080/alfresco/api/-default-/public/alfresco/versions/1/nodes/-my-/children?where=(isFile=true)) shows how you can limit the nodes in your home folder to just the files.

 

The where parameter is specific to each endpoint so you'll need to consult the API Explorer to see firstly, whether the where parameter is supported and secondly, what expressions and clauses can be used.

 

6. Requesting optional information

We have taken what we're calling a "performance first" approach with the API, meaning that each endpoint, by default, only returns the information that is efficient to retrieve. If additional processing is required to obtain the information it is made available via the include query parameter.

 

The 6th request in the Postman collection (http://localhost:8080/alfresco/api/-default-/public/alfresco/versions/1/nodes/-my-/children?include=properties,aspectNames) shows how you can also include the properties and aspects for each node in your home folder when listing its children.

 

As with orderBy and where the include parameter is specific to the endpoint so you'll need to consult the API Explorer to see what extra information is available.

 

7. Limiting the response

Sometimes bandwidth is a major consideration, for example, if you're building a mobile client. To cater for this scenario the API allows you to control the amount of data sent over the wire via the fields query parameter.

 

The 7th request in the Postman collection (http://localhost:8080/alfresco/api/-default-/public/alfresco/versions/1/nodes/-my-/children?fields=id,name ) shows how you can limit the response to only contain the id and name properties, as shown in the response below:

{
  "list": {
    "pagination": {
      "count": 3,
      "hasMoreItems": false,
      "totalItems": 3,
      "skipCount": 0,
      "maxItems": 100
    },
    "entries": [
      {
        "entry": {
          "name": "A Folder",
          "id": "a5634765-ab0f-438a-8efd-bfa4139da8aa"
        }
      },
      {
        "entry": {
          "name": "lorem-ipsum.txt",
          "id": "5516aca4-df8b-43e8-8ff3-707316c60c6e"
        }
      },
      {
        "entry": {
          "name": "image.jpg",
          "id": "fc132aa5-6281-40bf-adee-5731e6ecb653"
        }
      }
    ]
  }
}

 

The fields parameter works in conjunction with the include parameter so you don't have to repeat yourself. For example, say you want to include the id, name and the optional aspectNames properties in the response: you can use fields=id,name&include=aspectNames, and you don't have to specify aspectNames again in the fields parameter.
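
For example, combining the two parameters against the same endpoint used above gives a URL like this (following the pattern of the earlier requests):

http://localhost:8080/alfresco/api/-default-/public/alfresco/versions/1/nodes/-my-/children?fields=id,name&include=aspectNames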

 

The fields parameter is supported universally across all endpoints.

 

8. -me- alias

There are several endpoints across the API that expect a person id as part of the URL. This is OK if the client knows the person id, but there are some scenarios where it might not be known, for example when using tickets.

 

For this scenario the API supports the -me- alias which can be substituted in any URL that expects personId.

 

The 8th request in the Postman collection (http://localhost:8080/alfresco/api/-default-/public/alfresco/versions/1/people/-me- ) shows how this can be used to retrieve the profile information of the currently authenticated user.

 

9. Create multiple entities

Supporting batch operations, i.e. updating the metadata for multiple items simultaneously, is something we plan to support in the future; however, it's a little-known fact that the API already has some basic batch capabilities when it comes to creating entities.

 

Most POST endpoints that create entities actually allow an array of objects to be passed in the body, which creates each one individually but within the same transaction.

 

The 9th request in the Postman collection (http://localhost:8080/alfresco/api/-default-/public/alfresco/versions/1/nodes/-my-/children) shows how two folders can be created with the one request by passing the body shown below:

[
  {
    "name": "Folder One",
    "nodeType": "cm:folder"
  },
  {
    "name": "Folder Two",
    "nodeType": "cm:folder"
  }
]

 

The API returns a standard listing response providing the details on each entity that was created, in this case, the two folders:

{
  "list": {
    "pagination": {
      "count": 2,
      "hasMoreItems": false,
      "totalItems": 2,
      "skipCount": 0,
      "maxItems": 100
    },
    "entries": [
      {
        "entry": {
          "aspectNames": [
            "cm:auditable"
          ],
          "createdAt": "2017-04-12T10:31:12.477+0000",
          "isFolder": true,
          "isFile": false,
          "createdByUser": {
            "id": "test",
            "displayName": "Test User"
          },
          "modifiedAt": "2017-04-12T10:31:12.477+0000",
          "modifiedByUser": {
            "id": "test",
            "displayName": "Test User"
          },
          "name": "Folder One",
          "id": "ecbec6fd-a273-4978-9b95-00a8e783948e",
          "nodeType": "cm:folder",
          "parentId": "062b8b2a-aa7e-4cdd-bfec-7fbcd16ecd85"
        }
      },
      {
        "entry": {
          "aspectNames": [
            "cm:auditable"
          ],
          "createdAt": "2017-04-12T10:31:12.501+0000",
          "isFolder": true,
          "isFile": false,
          "createdByUser": {
            "id": "test",
            "displayName": "Test User"
          },
          "modifiedAt": "2017-04-12T10:31:12.501+0000",
          "modifiedByUser": {
            "id": "test",
            "displayName": "Test User"
          },
          "name": "Folder Two",
          "id": "18c82e9b-5a2f-44bf-bc77-1aca7346a24a",
          "nodeType": "cm:folder",
          "parentId": "062b8b2a-aa7e-4cdd-bfec-7fbcd16ecd85"
        }
      }
    ]
  }
}

 

If the endpoint does not support creating multiple entities an error is returned.

 

10. Include the source entity for a collection

When returning a relationship collection of an entity, for example the children of a node or the members of a site, details of the entity are not included by default; to include them you can use the includeSource query parameter.

 

The last request in the Postman collection (http://localhost:8080/alfresco/api/-default-/public/alfresco/versions/1/nodes/-my-/children?includeSource=true) shows how you'd include details of the user's home folder when listing its children, shown below in the source property:

{
  "list": {
    "pagination": {
      "count": 12,
      "hasMoreItems": false,
      "totalItems": 12,
      "skipCount": 0,
      "maxItems": 100
    },
    "entries": [ ... ],
    "source": {
      "name": "test",
      "createdAt": "2017-02-20T11:01:39.647+0000",
      "modifiedAt": "2017-04-12T10:31:12.509+0000",
      "createdByUser": {
        "id": "admin",
        "displayName": "Administrator"
      },
      "modifiedByUser": {
        "id": "test",
        "displayName": "Test User"
      },
      "isFolder": true,
      "isFile": false,
      "aspectNames": [
        "cm:ownable",
        "cm:auditable"
      ],
      "properties": {
        "cm:owner": {
          "id": "test",
          "displayName": "Test User"
        }
      },
      "nodeType": "cm:folder",
      "parentId": "a9ad3bc4-d30f-4910-bee0-63d497e74a22",
      "id": "062b8b2a-aa7e-4cdd-bfec-7fbcd16ecd85"
    }
  }
}

 

Another example is returning details of a site when listing its members; to do that you'd use the following URL: http://localhost:8080/alfresco/api/-default-/public/alfresco/versions/1/sites/swsdp/members?includeSource=true

 

The includeSource parameter is supported for all endpoints that include an entity and relationship collection.

 

Well, that's it: we've reached the end of this series of blog posts. Thank you very much if you've made it this far. We've covered a lot, but not everything! Despite the long series we've really only scratched the surface, so I would definitely encourage you to experiment with the API Explorer and start building your first application.

 

We're always looking for feedback so please do raise any issues you encounter or any suggestions you have in JIRA and keep an eye out for more posts in the future as we enhance the v1 REST API further.
