harry.peek@alfresco.com

Progress on Solr SQL support for analytics

Blog Post created by harry.peek@alfresco.com Employee on Feb 20, 2018

Later in the year we will be launching a new enterprise analytics product for Alfresco built on top of Search Services and Solr.

 

One of the capabilities it will bring is the ability to write SQL queries against the Solr datastore and so I wanted to share where we are with that SQL support. Below we have documented how the SQL works and also listed what isn't yet supported at the end.

 

I'd love to know what you think about the level of support so do comment or message with any feedback or questions.

 

 

 

SELECT statements

 

The basic syntax of the SQL select is as follows:

Select DBID, cm_creator as creator, `cm_content.size`  as size from alfresco where `cm_content.size` > 1000 order by  `cm_content.size` desc limit 100

In the sections below we’ll explain and expand on the different aspects of this SQL statement.

 

Table

 

For the EA release the only table that can be specified is the alfresco table. The alfresco table contains the documents and fields that have been indexed within the Alfresco Indexing Server’s main alfresco index.

 


Fields

 

Standard Fields

 

Alfresco has a set of standard fields, which can be referred to by name in the SQL field list. The DBID field in the example SELECT statement is an example of a standard field.

 

The most useful ones are: PARENT, PATH, ANCESTOR, TYPE, ASPECT, PROPERTIES, NULLPROPERTIES, FIELDS, LID and  DBID.

 

 

Fields from Content Models

 

Fields from Alfresco’s out of-the-box content models as well as fields from custom content models can be referred to using the content model property qname, as in AFTS and the CMIS query language. The cm_creator field in the example SQL statement refers to the creator field in the out-of-the-box cm content model. Fields that have a unique local name over all prefixes do not need to use the prefix. Note that we use "_" to separate the prefix and the locale name as ":" would have to be escaped.

 

Escaping Fields

 

Fields that include reserved words or special characters will need to be escaped using the back tick character (`). The `cm_content.size` field in the example SQL statement is an example of back tick escaping. The only non-word character that can be used without escaping is the underscore “_”. 

 

Under the covers the implementation uses Apache Calcite which has a list of reserved words that would also need to be escaped. You are most likely to hit reserved keywords picking aliases for fields.

 

Field Aliases

 

SQL field aliases are supported in the field list. Field aliases cannot be used in the WHERE, ORDER BY or HAVING clauses. Field aliases that contain special characters or reserved words will need to be escaped with the back tick.

 

Predicate

 

Alfresco’s SQL predicate is designed to take advantage of the rich search capabilities available in the Alfresco Search Service.

 

Predicates on Text Fields

 

The basic predicate on a text field performs a phrase search. Below is the syntax of a basic predicate on a text field:

select DBID from alfresco where cm_content = ‘hello world’

This will search for the phrase ‘hello world’ in the cm_content field.

 

To gain full control of the search predicate for a specific field you can wrap the predicate in parenthesis and enter the query using Alfresco full text search syntax. For example to search for (hello OR world) in the cm_content field the following search predicate can be used:

select DBID from alfresco where cm_content = ‘(hello OR world)’

 

Predicates on String Identifier Fields

 

Predicates on string identifier fields will perform an exact match on the field.  For example:

select DBID, LID from alfresco where LID = ‘value’

The SQL statement above will perform an exact match on the LID field.

 

Most fields from the content models will perform full text search matches unless the property is defined as tokenised false in the model.
This may not be what you expect.

 

Predicates on Numeric Fields

 

The predicate on numeric fields can perform =, >=, <= and Alfresco Solr range queries.

 

Below are some examples using the =, >=, <=

select DBID, LID from alfresco where DBID = 123

select DBID, LID from alfresco where DBID >= 123

select DBID, LID from alfresco where DBID <= 123

select DBID, LID from alfresco where DBID >= 100 AND DBID <=123

 

Below are examples of Alfresco Solr range queries:

select DBID, LID from alfresco where DBID =’[* TO 123]’

This will select all DBIDs below 123, with inclusive ranges. The square brackets are inclusive ranges.

 

select DBID, LID from alfresco where DBID =’[* TO 123>’

This will select all DBIDs below 123, with an exclusive top range. < and > are exclusive ranges.

 

select DBID, LID from alfresco where DBID =’[123 TO *]’

This will select all DBIDs  above 123, with inclusive ranges.

 

select DBID, LID from alfresco where DBID =’<123 TO *]’

This will select all DBIDs above 123, with an exclusive bottom range.

 

select DBID, LID from alfresco where DBID =’<100 TO 123>’

This will select all DBIDs above 100 and below 123, with an exclusive top and bottom range.

 

Nested Boolean Predicates

 

SQL predicates can be combined with Boolean operators AND, OR and NOT and nested with parenthesis.

select DBID from alfresco where (cm_content = ‘hello world’ AND (DBID >= 100 AND DBID < 200))

 

Order By

 

SQL SELECT statements can contain an ORDER  BY clause with one or more order by fields. String identifiers and numeric fields can be specific in the ORDER BY clause.

 

Below is an example of  an ORDER BY on a numeric field:

select cm_creator, cm_name, exif_manufacturer, audio_trackNumber from alfresco order by audio_trackNumber asc

 

Limit

 

SQL SELECT statements can contain a LIMIT clause. If no limit is specified a default limit of 1000 is set. Caution should be used when increasing the default limit as performance and memory consumption increase as the limit increases.

 

SELECT DISTINCT statements

 

The basic syntax for SELECT DISTINCT syntax is as follows:

select distinct cm_name from alfresco where cm_content = 'world' order by cm_name asc

 

SELECT DISTINCT queries can also have multiple fields and multiple order by fields.

Aggregations Without GROUP BY

 

SQL aggregations without a GROUP BY clause return a single result tuple with the aggregation results. Below is an example:

 

select count(*) as count, avg(`cm_content.size`) as content_size from alfresco where cm_owner = 'xyz'

Aggregate functions

 

Alfresco SQL supports the following aggregation functions:

 

count(*)

sum(numeric_field)

avg(numeric_field)

min(numeric_field)

max(numeric_field)

 

Aggregate Fields

 

 

Any numeric field can be used within the aggregations sum, avg, min and max. As with the basic SELECT statements, aggregation fields defined by content models can be referenced using the content model prefix. Fields that are reserved words or contain special characters will need to be escaped with the back tick character.

 

Aggregate Result Tuple

 

 

If a field alias is specified for an aggregate function then the field alias will appear in the result tuple. If field aliases are not used then the field name for the aggregate functions will appear as follows: EXPR$1, EXPR$2. These values refer to the function expression by the order they appear in the field list, starting from 1. For example the first function that appears in the query will be named EXPR$1 in the result tuples.

 

Aggregations With GROUP BY

 

SQL aggregations with a GROUP BY clause are also supported and take the following form:

 

select `cm_content.mimetype`, count(*) as total_count from alfresco group by `cm_content.mimetype` having count(*) < 4 order by count(*) asc

 

Aggregate functions

 

Alfresco SQL supports the following aggregation functions: count(*), sum(numeric_field), avg(numeric_field), min(numeric_field), max(numeric_field).

 

Aggregation Fields

 

Any numeric field can be used within the aggregations sum, avg, min and max. As with the basic SELECT statements fields defined by content models can be referenced using the content model prefix. Fields that are reserved words or contain special characters will need to be escaped with the back tick character.

 

Group By Fields

 

One or more fields can be specified as group by fields. Fields that are designated as facetable in a content model will provide the best aggregation results.

Aggregate Result Tuples

 

 

If a field alias is specified for an aggregate function then the field alias will appear in the result tuple. If field aliases are not used then the field name for the aggregate functions will appear as follows: EXPR$1, EXPR$2. These values refer to the function expression by the order they appear in the query, starting from 1. For example the first function that appears in the query will be named EXPR$1 in the result tuples.

Order by

 

One or more fields may used in the ORDER BY clause. The ORDER BY clause can include both fields from the field list and aggregate functions. Field aliases cannot be used in the ORDER BY clause. When referring to an aggregate function in the ORDER BY clause the function call as it appears in the field list should be used.

Having

 

 

The HAVING clause is supported for aggregation functions only. Boolean logic and nested HAVING clauses are supported. The following comparison operations are supported in the HAVING clause: =, >=, <=, !=.

 

Limit

A LIMIT clause can be used to limit the number of aggregations results. If no LIMIT clause is provided a default limit of 1000 is applied. 

Time Series Aggregations

 

There is specific support for SQL time series reporting through the use of virtual time dimensions. The section below describes how virtual time dimensions are used.

 

Virtual Time Dimensions

 

Virtual time dimensions for every datetime field are automatically created and stored in the Alfresco Search Service. The three virtual time dimensions supported for the EA release are: _day, _month, _year. To use the virtual time dimensions you simply append the virtual time dimension to any datetime field and use it in the GROUP BY clause.  Below is an example where the _day dimension is appended to the cm_created datetime field.

 

select cm_created_day, count(*) as total from alfresco where cm_created >= 'NOW/DAY' group by cm_created_day

The query above creates a daily time series report using the cm_created_day virtual time dimension.

 

Datetime Predicates

 

A datetime predicate can be used in the WHERE clause to control the datetime range of the time series report. Notice in the example query the where clause:

 

where cm_created >= 'NOW/DAY'

 

This is a datetime predicate on the cm_created field. Its important to note that the virtual time dimension field is only used in the field list and GROUP BY clause. The predicate is applied to the non-virtual datetime field in the index.

 

The example above is using a date math expression to specify a lower boundary for the time series report. The sections below describe how to set lower and upper boundaries using both fixed date and date math predicates.

 

Unbounded Time Series Reports

 

If no datetime predicate is supplied, the following default lower and upper boundaries for the different time dimensions will be used:

 

day

lower: current day minus 1 month

upper: current full day

 

month

lower: current month minus 24 months

upper: current full month

 

year

lower: current year minus 5 years

upper: current full year

 

Fixed Datetime Predicates

 

Fixed datetime predicates are formatted according to a subset of ISO 8601. The SQL statement below shows an example of the fixed date predicate formatting:

 

select cm_created_day, count(*) from alfresco where cm_created >= '2010-02-01T01:01:01Z' and cm_created <= '2010-02-14T23:59:59Z' group by cm_created_day

 

Note that fixed datetime predicates require the full precision shown in the example above. 

 

Date Math Predicates

 

There is also support for a rich set of date math expressions. Below is an example of a time series aggregation using date math predicates:

 

select cm_created_month, count(*) from alfresco where cm_created >= 'NOW/MONTH-6MONTHS' and cm_created <= 'NOW' group by cm_created_month

 

In the example above NOW signifies the current point in time with milli-second precision. NOW/MONTH rounds the current point in time down to the current MONTH. The -6MONTHS subtracts 6 months from the current month. 

 

See the Solr date math guide for more details on date math syntax.

 

Single Dimension Group By

 

Time series aggregations that group by virtual time dimensions are currently limited to using a single group by field. 

 

Order By

 

By default time series aggregation results are sorted in datetime ascending order. An ORDER BY clause can be used to change the direction of the datatime sort or sort by an aggregate result. 

 

Having

A HAVING clause can be used to filter time series aggregations results.

 

JDBC Driver
 

The product includes a JDBC thin client that can be used with Apache Zeppelin and other SQL clients. The JDBC driver can also be used directly from a Java application. Below is sample code for using the JDBC driver:

 

String sql = "select DBID, LID from alfresco where cm_content = 'world' order by DBID limit 10 ";
String alfrescoJson = "{\"tenants\":[\"\"],\"locales\":[\"en_US\"],\"defaultNamespace\":\"http://www.alfresco.org/model/content/1.0\",\"textAttributes\":[],\"defaultFTSOperator\":\"OR\",\"defaultFTSFieldOperator\":\"OR\",\"anyDenyDenies\":true,\"query\":\"name:*\",\"templates\":[],\"allAttributes\":[],\"queryConsistency\":\"DEFAULT\",\"authorities\":[\"GROUP_EVERYONE\",\"ROLE_ADMINISTRATOR\",\"ROLE_AUTHENTICATED\",\"admin\"]}";

Properties props = new Properties();
props.put("alfresco.shards", "http://localhost:8983/solr/alfresco")
props.put("json", alfrescoJson);

String connectionString = "jdbc:alfresco://localhost:8983?collection=alfresco";
Connection con = null;
Statement stmt = null;
ResultSet rs = null;
   
try {
        con = DriverManager.getConnection(connectionString, props);
        stmt = con.createStatement();
        rs = stmt.executeQuery(sql);
        int i=0;
        while (rs.next()) {
            System.out.println(rs.getString("DBID"));
        }
    } finally {
        try { rs.close(); } catch(Exception e) {}
        try { stmt.close();} catch(Exception e) {}
        try { con.close();} catch(Exception e) {}
    }
}

Currently Unsupported SQL Constructs

Below are some commonly used SQL features that are not yet supported, some of these we are working at closing at the moment but we'd love to know which you would find most valuable. Leave a comment below or send me a message to give us your feedback. 

 

  • SELECT * : currently specific fields must be listed in the field list
  • SELECT COUNT(DISTINCT field): Currently count(distinct field) queries are not supported
  • HAVING limitations: currently the HAVING clause can only be applied to aggregation functions
  • JOINS : currently this no support for joins
  • SUB-QUERIES: currently there is no support for sub-queries
  • UNION: currently there is no support for union queries
  • IS NULL
  • IS NOT NULL
  • LIKE
  • CMIS QL functions IN_TREE, IN_FOLDER, SCORE, CONTAINS
  • IN

Outcomes