Database Schema Validation and Dumping Tool

Document created by resplin Employee on Jun 6, 2015
Version 1Show Document
  • View in full screen mode

Obsolete Pages{{Obsolete}}

The official documentation is at: http://docs.alfresco.com




Repository database schema validation and comparison


Background and motivation


The schemacomp tool may be used when troubleshooting or examining the database schema for an Alfresco repository. The tool has two main functions:


  • Producing schema dumps as XML files.
  • Validating a database schema.

Schema dumps are not new to the Alfresco repository - prior to the introduction of this tool however, the only way to judge the validity of the schema was to examine the file manually and with simple text tools such as the Unix diff command. This tool performs a certain amount of automatic comparison that should remove much of the effort needed in making these comparisons.

If any changes are made to the database schema during server start-up (such as when installing Alfresco afresh) then the schemacomp tool performs both schema dumping and validation as described below. The dumps and validation are made both pre-upgrade (i.e. before the schema changes) and post-upgrade.


Definition of terms


The terms below are used throughout the rest of this document.


Database object
A schema, sequence, table, column, index, primary key or foreign key.

Reference schema
The definitive representation of an Alfresco repository schema for a given schema version on a vendor specific RDBMS. The reference schema is a model for what should be present in the database after installing or upgrading an Alfresco repository to particular version. A reference schema is presented in the same XML format as a schema dump. For example a schema reference may be produced for MySQL on version 5025 of the Alfresco repository schema.

Target schema
The database schema that will be compared and validated with respect to a reference schema. For example, if installing an Alfresco repository from scratch, then the newly created schema will be a target schema for comparison against the appropriate reference schema.

Performing schema dumps


Schema dumps are XML representations of the RDBMS schema. They should conform to the XSD:

http://www.alfresco.org/repo/db-schema db-schema.xsd

The XSD file is embedded in the repository.

As mentioned above, a schema dump is performed automatically during repository server startup if there were changes made to the database schema. The Alfresco log will indicate if any dumps were performed - entries such as these will be present:


2012-01-30 17:46:58,517  INFO  [domain.schema.SchemaBootstrap] [main] Normalized schema dumped to file
  /tomcat/temp/Alfresco/Alfresco-schema-PostgreSQLDialect-pre-upgrade-alf_-5548956643327704619.xml.
2012-01-30 17:46:58,518  INFO  [domain.schema.SchemaBootstrap] [main] Normalized schema dumped to file
  /tomcat/temp/Alfresco/Alfresco-schema-PostgreSQLDialect-pre-upgrade-avm_-2166257481854030130.xml.
2012-01-30 17:46:58,518  INFO  [domain.schema.SchemaBootstrap] [main] Normalized schema dumped to file
  /tomcat/temp/Alfresco/Alfresco-schema-PostgreSQLDialect-pre-upgrade-jbpm_-2230905975269998715.xml.
2012-01-30 17:46:58,519  INFO  [domain.schema.SchemaBootstrap] [main] Normalized schema dumped to file
  /tomcat/temp/Alfresco/Alfresco-schema-PostgreSQLDialect-pre-upgrade-act_-8103448407472298481.xml.


Similar entries for the post-upgrade files will also be present.

Note: that the legacy tool is still included and will create dumps of its own - the log messages look similar but shouldn't be confused with the new-format dumps.


Triggering dumps by JMX

In addition to automatic dumping, dumps can be manually invoked by use of the JMX interface.

Please note: this is an enterprise only feature.

The JMX category Alfresco > DatabaseInformation > SchemaExport contains two operations:

java.util.List dumpSchemaToXML()

and

java.util.List dumpSchemaToXML(String prefixList)

The first operation takes no parameters and when invoked will create four dump files one for each prefix 'alf_', 'act_', 'jbpm_' and 'avm_'. The prefix means that only tables and sequences whose names begin with the prefix will be included in the dump. Related items - such as the indexes belonging to a particular table - will be dumped regardless of name.

The second variation takes a single String parameter and is a comma-separated list of prefixes that you wish to dump. If this operation were invoked with the parameter 'alf_acl_, alf_node_' for example, then two files would be created (one for each prefix). The tables dumped in the first file would include alf_acl_change_set and alf_acl_member, tables in the second file would include alf_node_aspects and alf_node_assoc. Neither file would include alf_locale or alf_permission since they do not carry one of the supplied prefixes.

Both of these calls will result in the log showing the location of the dumped files, but they also return a List of path names. JConsole will helpfully display these lists in a copy/paste friendly manner.


Performing schema validation


As for schema dumps, schema validation can happen either due to a schema change during repository start -up, or triggered manually via JMX.

Schema validation is performed in two steps differencing and validation.


Step 1. Differencing


Differencing produces similar information to that obtained by using the Unix tool diff against a known 'good' reference schema dump and a potentially problematic target schema dump. However, since the tool is designed for performing a comparison between two database schemas, rather than arbitrary text, the output is more specific about the types of difference. The types of difference that can be reported are:

A database object appears in both the reference and target schemas, but has differences in its properties. For example if an index appears in both schemas but has a different name.

A database object appears in the reference schema but no corresponding object has been identified in the target database.

A database object appears in the target schema but no corresponding object has been identified in the reference database.

One advantage of schemacomp differencing over traditional diff tool comparisons is that an index is not recognised by the exact text appearing in a dump, but: by which table the index belongs to; which columns are indexed and in what order. If an index has the expected name and belongs to the correct table but has the wrong columns, or the correct columns in the wrong order - then differences will be reported. Or conversely, if the correct table has an index with the correct columns in the correct order but has the wrong index name - then this will be reported. The name can be ignored during comparisons (useful for auto-generated index names) or can be taken into account - part of the task of producing reference schema files is to specify this behaviour using DbValidator objects (explained later).


Index related example

Supposing we have the following index defined in the reference schema (which for schema version 5025, we do!)









Index name:permission_id
Parent table:alf_access_control_entry
Columns:permission_id, authority_id, allowed, applies

This index is specified in the schema reference file in this way (parts ommitted for brevity):


<table name='alf_access_control_entry'>
  <!-- column definitions, primary keys and foreign keys ommitted -->
  <indexes>
    <index name='permission_id' unique='true'>
      <columnnames>
        <columnname>permission_id</columnname>
        <columnname>authority_id</columnname>
        <columnname>allowed</columnname>
        <columnname>applies</columnname>
      </columnnames>
    </index>
    <!-- further index definitions ommitted -->
  </indexes>
</table>


When the target schema's index is compared against this reference then firstly a list of candidate matches are produced. There may be more than one matching index in the target schema, in which case a redundant database object warning is issued.

Candidate matches are produced dependent on object type. For indexes:


  1. If the parent table is the same and the index name is the same - then it is considered the same index.
  2. If the name is different but the parent table is the same and the columns indexed are the same (and in the same order) - then it is is considered to be the same index.

Taking the first scenario for matching and using the permission_id index defined above, then if the permission_id index in the target database has the allowed and applies columns in the reverse order than is expected, the log file would notify us of validation problems:


2012-01-31 11:24:24,280  WARN  [domain.schema.SchemaBootstrap] [RMI TCP Connection(11)-10.244.50.71]
    Schema validation found 2 potential problems, results written to:
    /tomcat/temp/Alfresco/Alfresco-PostgreSQLDialect-Validation-alf_-5903917616348258838.txt


The contents of the report file would look something like this:


Difference: expected index .alf_access_control_entry.permission_id.columnNames[2]='allowed',
    but was .alf_access_control_entry.permission_id.columnNames[2]='applies'
Difference: expected index .alf_access_control_entry.permission_id.columnNames[3]='applies',
    but was .alf_access_control_entry.permission_id.columnNames[3]='allowed'


Each line shows a problem with a particular database property. Here it tells us that the property at the path .alf_access_control_entry.permission_id.columnNames[2] has the value applies but according to the reference schema should be allowed. The leading dot of the path can be ignored (the schema name would be present before the leading dot in the case of Oracle for example), then we have the table name alf_access_control_entry, the index name permission_id within that and a zero-indexed list property within that - the third item (index 2) is the property at fault: columnNames[2].

Similarly, the next line tells us that the next item in the column name list (columnNames[3]) has the value allowed but was expected to be 'applies'.


Step 2. Validation


Validation allows the application of more complex rules than whether there is a difference between two property values. Validation is performed by DbValidator objects. A chain of DbValidator objects is associated with each database object in the reference schema - each of these is executed in turn and given the chance to create validation errors based on the corresponding object in the target schema.

If an index has not been given a specific name then the RDBMS will auto-generate one at creation time. This means that the reference schema cannot specify the exact name that the index in the target database will have. This would lead to schema differences being reported if it were not for the use of validators. A NameValidator<code> may be specified for such an index:


<index name='SQL120116153558430' unique='true'>
  <validators>
    <validator class='org.alfresco.util.schemacomp.validator.NameValidator'>
      <properties>
        <property name='pattern'>SQL[0-9]+</property>
      </properties>
    </validator>
  </validators>
  <columnnames>
    <columnname>ID</columnname>
  </columnnames>
</index>


This example is from a DB2 schema reference file (<code>Schema-Reference-ALF.xml) and tells us that although in the original reference schema the index was named SQL120116153558430 any index having the appropriate parent table, column names (and column order) is valid as long as the name matches the regular expression SQL[0-9]+.

When the validator is invoked, it checks that the name property of the index matches the supplied regular expression. In addition to this, the validator reports - when asked - that it takes responsibility for the name property of the index. This stops the schemacomp tool from applying the differencing logic to the property. A DbValidator can choose to apply its validation in addition to the differencing logic by not taking sole responsibility for any properties. Conversely a validator can also take sole responsibility for an entire database object in which case no differencing logic is applied to any part of the object.

A similar problem to the auto-generated name problem is when a database object is created automatically. DB2 creates indexes on the fly under certain circumstances. It is not known whether these indexes will exist at the time the schemacomp tool will be run. Furthermore, the indexes are an implementation detail for DB2 rather than an explicit declaration on how the Alfresco schema should appear. To suppress such errors an IgnoreObjectValidator may be used - it takes responsibility for validation of the associated database object, but performs no actual validation.


Another index related example

Supposing an index is expected to be auto-generated and is defined in the schema reference file as:


<index name='SQL120116153558430' unique='true'>
  <validators>
    <validator class='org.alfresco.util.schemacomp.validator.NameValidator'>
      <properties>
        <property name='pattern'>SQL[0-9]+</property>
      </properties>
    </validator>
  </validators>
  <columnnames>
    <columnname>ID</columnname>
  </columnnames>
</index>


Perhaps a specific unsupported upgrade path has introduced an unexpected schema change - it may not be a problem as such, but it is important that differences are highlighted so that a decision can be made on whether the difference represents a problem and whether a fix will need to be made. On running the schemacomp tool, the following might be observed in the log files:


2012-01-31 14:28:50,697  WARN  [domain.schema.SchemaBootstrap] [main] Schema validation found 1 potential problems, results written to:
    /tomcat/temp/Alfresco/Alfresco-DB2Dialect-Validation-Post-Upgrade-alf_-4048062354335481885.txt
2012-01-31 14:28:51,440  INFO  [domain.schema.SchemaBootstrap] [main] Compared database schema with reference schema (all OK):
    class path resource [alfresco/dbscripts/create/org.hibernate.dialect.DB2Dialect/Schema-Reference-AVM.xml]
2012-01-31 14:28:53,326  INFO  [domain.schema.SchemaBootstrap] [main] Compared database schema with reference schema (all OK):
    class path resource [alfresco/dbscripts/create/org.hibernate.dialect.DB2Dialect/Schema-Reference-JBPM.xml]
2012-01-31 14:28:54,682  INFO  [domain.schema.SchemaBootstrap] [main] Compared database schema with reference schema (all OK):
    class path resource [alfresco/dbscripts/create/org.hibernate.dialect.DB2Dialect/Schema-Reference-ACT.xml]


We can see that the AVM, JBPM and ACT database objects are all as expected, but there is a difference between the target schema and the ALF (alf_ prefixed database objects) schema reference. Looking at that file we can see that an index that is expected to have been auto-generated has been created with an explicit name:


Validation: index ALFUSER.ALF_ACCESS_CONTROL_ENTRY.SQL120131142718040.name='idx_alf_ace_auth' fails to match rule: name must match pattern 'SQL[0-9]+'


Specifically, the error report is stating that the index defined in the schema reference having the name SQL120131142718040 belonging to the table ALF_ACCESS_CONTROL_ENTRY is expected to be named in the same way: prefixed with SQL then a string of one or more digits.


Triggering validation by JMX

In addition to automatic validation, validation can be manually invoked by use of the JMX interface.

Please note: this is an enterprise only feature.

The JMX category Alfresco >  DatabaseInformation > SchemaValidator contains one operation:

void validateSchema()

The operation takes no parameters and returns nothing, however if the operation is invoked then validation will be performed and the Alfresco log will show the results:


2012-01-31 14:51:46,770  INFO  [domain.schema.SchemaBootstrap] [RMI TCP Connection(13)-10.244.50.71] Compared database schema
    with reference schema (all OK): class path resource
    [alfresco/dbscripts/create/org.hibernate.dialect.PostgreSQLDialect/Schema-Reference-ALF.xml]
2012-01-31 14:51:47,360  INFO  [domain.schema.SchemaBootstrap] [RMI TCP Connection(13)-10.244.50.71] Compared database schema
    with reference schema (all OK): class path resource
    [alfresco/dbscripts/create/org.hibernate.dialect.PostgreSQLDialect/Schema-Reference-AVM.xml]
2012-01-31 14:51:49,847  INFO  [domain.schema.SchemaBootstrap] [RMI TCP Connection(13)-10.244.50.71] Compared database schema
    with reference schema (all OK): class path resource
    [alfresco/dbscripts/create/org.hibernate.dialect.PostgreSQLDialect/Schema-Reference-JBPM.xml]
2012-01-31 14:51:50,910  INFO  [domain.schema.SchemaBootstrap] [RMI TCP Connection(13)-10.244.50.71] Compared database schema
    with reference schema (all OK): class path resource
    [alfresco/dbscripts/create/org.hibernate.dialect.PostgreSQLDialect/Schema-Reference-ACT.xml]


In the example above there were no problems found in the target schema.
Troubleshooting

Attachments

    Outcomes