CKAN Transformations - ckan/ckan GitHub Wiki

Purpose

The goal is to give CKAN basic data transformation abilities. These transformations will operate on data held in the CKAN datastore.

Initially this will cover only the basics: importing data into the datastore, changing the type of a column, and renaming a column.
In future it could include many other transformation options, such as data cleaning, aggregation, or normalization.

Implementation

The model will have one extra table named "transformation" with the following fields.

  • id (UUID): Unique id.
  • resource_id (UUID): CKAN resource id.
  • transformation_type (TEXT): Type of process.
  • config (JSON string): Config options to pass on to this transformation type.
  • stage (INT): Stage this transformation happens in.
  • task_data (JSON): Data returned by the external service. This can be used to save logs, or links to logs, of that service.
  • complete (BOOL): Whether the transformation process completed this stage successfully.
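
As a rough illustration of the fields listed above, a row of the proposed table could be modelled as follows. This is only a sketch using a plain Python dataclass rather than CKAN's actual model layer; the class name, the defaults, and the example type names ("import", "rename_column") and config keys are assumptions, not part of the proposal.

```python
import uuid
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class Transformation:
    """One row of the proposed "transformation" table (illustrative only)."""
    resource_id: str                  # CKAN resource id
    transformation_type: str          # type of process
    stage: int                        # stage this transformation happens in
    config: Dict[str, Any] = field(default_factory=dict)  # options for this type
    task_data: Optional[Dict[str, Any]] = None  # data returned by the external service
    complete: bool = False            # True once this stage has succeeded
    id: str = field(default_factory=lambda: str(uuid.uuid4()))  # unique id


# Hypothetical two-stage pipeline for one resource: import, then rename a column.
resource_id = str(uuid.uuid4())
pipeline = [
    Transformation(resource_id, "import", stage=1),
    Transformation(resource_id, "rename_column", stage=2,
                   config={"from": "col_a", "to": "name"}),
]
```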

Each transformation type will have its own set of rules and will call an external service to do the transformation. The only external service initially supported will be the datapusher. The datapusher extension will also manage moving the process on to the next stage.

The first stage will normally consist of the datapusher storing the CSV/Excel data in the datastore.

Each stage will call back to CKAN to report its outcome and, if it succeeded, request the next stage when one is available.
If a stage fails, the process will stop and the log stored against that stage will give information on the failure. There will be an option to rerun all the tasks or to retry only the one that failed. There will also be an option to clear all the transformations and configure the whole process again from scratch.
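
The callback and retry behaviour described above could be sketched roughly as follows. This is a minimal in-memory illustration, not the datapusher extension's code: the function names are hypothetical, and each stage is represented as a plain dict with the "stage", "complete" and "task_data" fields from the table.

```python
from typing import Any, Dict, List, Optional

Stage = Dict[str, Any]


def record_result(stages: List[Stage], stage: int, success: bool,
                  task_data: Dict[str, Any]) -> Optional[int]:
    """Store a stage's result; return the next stage to request, or None."""
    current = next(s for s in stages if s["stage"] == stage)
    current["complete"] = success
    current["task_data"] = task_data  # e.g. logs or links to logs
    if not success:
        return None  # the process stops at the failed stage
    later = sorted(s["stage"] for s in stages if s["stage"] > stage)
    return later[0] if later else None  # None means every stage is done


def retry_stage(stages: List[Stage]) -> Optional[int]:
    """Stage to retry: the lowest-numbered stage that has not completed."""
    pending = [s["stage"] for s in stages if not s["complete"]]
    return min(pending) if pending else None


def rerun_all(stages: List[Stage]) -> Optional[int]:
    """Reset every stage and return the first stage to request again."""
    for s in stages:
        s["complete"] = False
        s["task_data"] = None
    return min((s["stage"] for s in stages), default=None)


stages = [
    {"stage": 1, "complete": False, "task_data": None},
    {"stage": 2, "complete": False, "task_data": None},
]
next_stage = record_result(stages, 1, True, {"log": "stage 1 ok"})
after_failure = record_result(stages, 2, False, {"log": "stage 2 failed"})
```

Because the process stops at the first failure, the lowest-numbered incomplete stage is the one that failed, which is why `retry_stage` can find it without a separate "failed" flag.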

Process Diagram

      +------------+           +---------------------+         +-----------------------+
      |   CKAN     |           |   CKAN Datapusher   |         |   Datapusher Service  |
      +------------+           +---------------------+         +-----------------------+
      | UI to Start+---------->|     Request         +-------->|                       |
      | Process    |           |     Stage 1         |         |     Run               |
      +------------+           +---------------------+         |     Stage 1           |
      |            | if fail   |  Store Stage 1      |         |                       |
      | Show       |<----------+     Result          |<--------+                       |
      | Stage 1    |           |                     |         +-----------------------+
      | Log        |           |     Request         |         |                       |
      |            |           |     Stage 2         +-------->|     Run               |
      +------------+           +---------------------+         |     Stage 2           |
      | Show       | if fail   |  Store Stage 2      |         |                       |
      | Stage 2    |<----------+     Result          |<--------+                       |
      | Log        |           |                     |         +-----------------------+
      +------------+           |                     |
      | Show       |<----------+   Report Success    |
      | Success    |           +---------------------+
      +------------+

Interface

The datastore tab of the resource edit page will list the stages and offer an option to configure the last stage added. Each stage that has been run will have an expandable view of the datapusher log for that stage; this view will be expanded automatically for failed stages.

There will be a button to reload the data with the same transformations, for use when the underlying resource data has changed and needs reprocessing. There will also be a button to clear all the transformations and start adding transformations from scratch.