Auditing Draft - fieldenms/tg GitHub Wiki

This document outlines the idea of auditing and how it might be implemented and introduced in TG applications.

Auditing, or more specifically - data auditing, is the activity of recording operations on data: its creation, modification, and deletion; other operations, such as access, can also be recorded.

In the context of TG, data is comprised of entity records in a database, thus data auditing in TG means entity auditing. In fact, there already exists a minor manifestation of such auditing - entity version, represented by property version that is incremented upon each update to an entity record. Entity auditing has also been implemented for some specific cases where it applies to a single entity type. This kind of ad-hoc auditing takes the form of an entity type that is specifically defined for a particular domain entity type (e.g., VehicleAudit for Vehicle). While this approach has been serving well on a small scale (a couple of entity types), it is limited in that it has to be maintained as any other entity type, which hinders its wide applicability in practice.

What we are interested in is general entity auditing, which would enable auditing of all entity types in a TG application without significantly increasing the maintenance cost.

The expected benefits of data auditing are:

Traceability of all operations on data.
Improved evolvability of TG applications.

Auditing concepts

Audit entity - an entity type that models the audit of another entity type, called the audited entity.
Audited entity - an entity type that has a corresponding audit entity type.

Traceability of operations on data

Crucial for responses to application security incidents.

Evolvability

Improvements to evolvability gained from auditing will enable the following:

Ability to restore a TG application to a particular version, including the database state.

First-class support for the application version

Reification of the application version would enable the system to identify whether its running version is compatible with a database in any given state.

Application version would have to be persisted. For example, entity type ApplicationVersion could be introduced. Then, a database at any given state could be identified as compatible/incompatible with any given application version. The compatibility of a database with an application is defined by a correspondence between the database schema and entity definitions in the application.

Combined with data auditing, first-class support for application versions would enable a newer application version to adapt an older version of the database by populating it with the missing data.

The updating of a persisted application version would have to be performed manually whenever the actual application version changes. For example, the application developer, after releasing a new application version, would launch it against the now previous version of the database by including a specifcic startup option (e.g., persistNewVersion), which would instruct the application to persist its version. In an equivalent scenario, but without such a flag, the application could detect the mismatch between its running version and the persisted one, and issue a corresponding warning or an error.

Modelling approaches

Requirements for the audit entity structure:

Efficient data retrieval/insertion.
Efficient use of space.
Efficient reconstruction of entity versions for informational purposes.

Ad-hoc auditing in detail

This section describes the implementation of ad-hoc auditing in TG applications.

Ad-hoc audit entities take the form of a persistent entity that extends the audited one, thereby inheriting its structure. Additionally, the audit entity's structure contains:

Reference to the audited entity (its ID).
Audit date.
Descriptors of changed properties via a one-to-many association.

An audit entity record represents a particular version of an audited entity instance after the change. Therefore, for each audited entity instance, its contents are equal to those of its latest audit record. This requires a genesis audit record to be created with the contents of the audited entity instance at the moment of its creation (effectively recording the change from the moment of its non-existence).

This structure involves a lot of redundancy as it stores values for all properties, instead of storing a diff (i.e., only those properties whose values changed). In practice, one or two properties change at once, on average. The practical applicability of this structure in the general case needs to be evaluated (i.e., amount of storage required and the prospects of scalability).

Approach 1: Rectangular structure

The rectangular structure approach posits that audit entities have the same structure as entities audited by them. This is the approach taken by ad-hoc auditing.

This approach benefits from its simplicity. Its major limitation is the lack of evolvability - the audit entity's structure is tied to that of the audited entity. This poses a challenge: how should the audit entity's definition adapt to a change in the audited entity's definition? Say a property is removed from an audited entity. Since the audit entity extends the audited one, this structural change affects it too, requiring the corresponding DB column to be deleted, which will cause the loss of history of changes for that property.

Evolvability: reintroduce removed properties in the audit entity

To avoid data loss, a removed property could be added to the the audit entity. This solution feels hacky and does not lend itself well to evolvability. Moreover, it does not cover the case when the audited entity introduces a new property with the same name as the removed one - the audit entity's definition of the removed property will shadow it, making the new property non-existent in the audit entity.

Evolvability: an audit entity type per entity definition version

With this approach, each change to the definition of an entity, which makes it incompatible with the previous definition (e.g., a destructive change such as removal of a property), results in a new audit entity type. Effectively, the history of an entity type's definition is reflected in its corresponding audit entity types.

For all entity types e, let auditTypes(e) be the set of its audit entity types; currentAuditType(e) - its currently active audit entity type; and previousAuditTypes(e) can be defined as auditTypes(e) - currentAuditType(e).
When the definition of entity type e is changed, its current audit entity type a_i is replaced by a new audit entity type a_{i+1}. Afterwards, the following holds:
- a_i is a member of previousAuditTypes(e);
- currentAuditType(e) = a_{i+1}.

The following table demonstrates one possible history of entity Vehicle and the corresponding changes that would be applied to its audit entity types.

event	currentAuditType	previousAuditTypes
`Vehicle` is created	`VehicleAudit1`	`{}`
A new property is added	`VehicleAudit2`	`{VehicleAudit1}`
A property is removed	`VehicleAudit3`	`{VehicleAudit1, VehicleAudit2}`

This approach has a hidden cost in the form of additional complexity at the platform level, which would have to manage a mapping between audited entity types and their audit types. The effects on performance from retrieving data from multiple tables need to be considered as well.

The nature of changes that require the creation of a new audit type has to be defined.

The process of creating an audit entity type can be automated with source code generation. The generative tool should provide a reasonable degree of control over the process, given that there may exist an interpretation of a change that cannot be correctly identified by the tool.

It would be helpful to have verification of audit entity types (i.e., whether the structures of an entity type ands its current audit entity type match).

Approach 2: Diff structure

With this approach, each audit entity type is mapped to a single table that captures diffs, i.e., only those properties that changed between different entity versions. Audit entities would have the following structure:

Reference to the audited entity.
The property that changed (unique identifier, more than its name).
The new value.
Meta-information (e.g., date, transaction id).

For each audited entity instance, it would be required to have a "genesis" record for each property to record their initial values.

The challenging part of this approach is in the persistence of property values, which may have arbitrary types, requiring something akin to dynamic typing for the relational model.

Dynamic typing for the relational model

Persist values using 1 column, which stores their serialised representation (type tag + value).
Persist values using 2 columns: a type tag and a serialised value.

Approach 3: A table for each property type

With this approach, each unique property type has a corresponding table that is used to store changes for all entity types, which is possible due to universal uniqueness of entity identifiers.

Pros:

Fits nicely into the relational model.

Cons:

Inefficient data retrieval: the whole table needs to be searched to find records associated with a particular entity.

This could be addressed by splitting tables further: a table for each existing combination of an entity type and a property type (existing means that a table would be created only for those property types that are used in an entity type).
Inefficient entity reconstruction.

Approach 4: A table for each property

With this approach, a table is created for each property across all entity types.

Pros:

Fits nicely into the relational model.
Efficient entity reconstruction: a join on all property tables. Cons:
Large number of tables (> 10k); needs to be assessed further to understand whether this would affect performance.

Modelling the audit entity graph

The audit entity graph has the same shape as the graph of audited entities.

One powerful feature would be the ability to obtain an entity graph at any given point in time, enabling the inspection of history of the whole graph. This idea gives rise to the following question: should audit records reference live data or historic data?

Live data is comprised of records of audited entities. Historic data is comprised of audit records (i.e., records of audit entities).

If audit entities refer to live data, the audit entity graph is connected to the live entity graph. At the database level, this means that an audit record stores IDs of live entities.

If audit entities refer to historic data, the audit entity graph is (almost) disconnected from the live entity graph. It is almost disconnected because each audit record must contain a reference to the audited instance.

This contrast resembles the difference between mutable and immutable data structures.

For example, consider the following entity types:

class WorkActivity extends AbstractEntity<...> {
  @IsProperty
  @MapTo
  WorkActivityType type;
}

class WorkActivityType extends AbstractEntity<String> {}

If audit records for WorkActivity refer to WorkActivityType (live data), then, depending on the sophistication of the historic data retrieval algorithm, the following can happen: audit records for WorkActivity will contain only the current versions of WorkActivityType they refer to. Alternatively, the data retriveal algorithm could search audit records for WorkActivityType, and match them with audit records for WorkActivity by the audit date, to produce results that precisely reflect historic data.

On the other hand, if audit records always refer to historic data, the data retrieval algorithm could be simpler. For example, the audit entity type for WorkActivity would have property type: WorkActivityTypeAudit.

Synthetic audit entity

One of the primary functions of audit entities is an informational one - the display of historic data. This function can be modelled with a synthetic entity.

One challenge of this approach relates to the structure of synthetic entities used for such purposes. Using the rectangular shape approach, such a synthetic entity type would represent a union of all version of an audit entity type (i.e., all properties that ever existed in the audited entity type).

Persistence

This section is about the interplay of persistence and auditing.

Saving

Several strategies are applicable for the saving of an audit entity instance:

The audit entity is saved in the same transaction as the audited entity.

This strategy makes sense when an entity type must be audited, and a failure to save an audit record is unacceptable. If saving of an audit record fails, the saving of the audited entity fails as well.
The audit entity is saved in a different (typically, subsequent) transaction than the audited entity.

In contrast to the first approach, this strategy is applicable when a failure to save an audit record is acceptable.

The ability to choose a saving strategy can be provided in the form of annotations on entity types.

Deletion

When deleting an audited entity instance, its audit records have to be taken into account.

In general, users of TG applications are warned when deleting entities that have active references from other entities. This approach may need to be modified, as audit records will be refering to the audited entities, making the reference count always positive. One approach is to ignore references to audited entities from their audit entities, and allow its deletion without even issuing a warning.

Table Partitioning

A partitioning strategy based on date requires manual maintenance of partitions at regular time intervals. This involves splitting and merging of existing partitions to migrate old records and to update the partitioning function so that it correctly identifies new records.

Performance tests revealed that table partitioning does not significantly improve performance when inserting audit records or searching them. This suggests that either there was not enough test data, or that gains from table partitioning are to be sought elsewhere. On the latter note, the paper "Partitioned Table and Index Strategies Using SQL Server 2008" by Ron Talmage and Solid Quality Mentors provides some insights with a "sliding window" scenario, where table partitioning is leveraged to instantenously swap in new data and age out old data.

Also, from the said paper:

Some conditions that might indicate a table could benefit from partitioning are:

Index maintenance on the table is costly or time-consuming and could benefit from reindexing partitions of the table rather than the whole table at once.

Data must be aged out of the table periodically, and the delete process is currently too slow or blocks users trying to query the table.

New data is loaded periodically into the table, the load process is too slow or interferes with queries on the table, and the table data lends itself to a partition column based on ascending date or time.

Another advantage for performance gained from table partitioning comes from partition elimination that applies to queries. If a query specifies a condition on a partitioning column, then the data can be accessed efficiently by considering only those partitions that satisfy the condition. For example, a table with a partition for each year enables efficient access to data in any range of years. This is different from paging, which works by following a strict sequential order.

Also, queries that enforce ordering (via ORDER BY) can greatly benefit from table partitioning if the ordering is based on the partition column.

Overall, it seems that table partitioning may provide advantages when:

A lot of data is inserted at once, locking a table for too long due to expensive index update operations.
Small portions of data are inserted frequently enough to decrease throughput due to temporary table locking that happens concurrently with reads.
Bottlenecks in queries can be addressed by partition elimination.
Queries use ordering based on the partition column.

In the context of TG, audit-prop entity types could benefit from table partitioning, but it would make sense only if their tables were sufficiently large (> 10m records).
If audit-prop entity types are versioned like audit-entity types, then such a table is not likely to grow large enough.
On the other hand, if there was a single audit-prop entity type for all versions of a specific audit-entity type, then we could talk about table partitioning (using audit-entity type version as the partitioning column).