Relational DB - OpenData-tu/documentation GitHub Wiki
|Version|Date|Modified by|Summary of changes|
|-------|----|-----------|------------------|
| 0.1 | 2017-05-29 | Paul Wille | Initial version |
Relational Database for Datasource Metadata
We decided to have a separate system component to manage the metadata of data sources that shall be imported, that already have been imported, or that are still being imported. This system fulfills several tasks that are important for organizing the importing/ETL process: it connects the ETL framework with our data-storage component and gives potential external or new users a graphical interface to register and describe new data sources beforehand.
Purpose of this system
This system has three main purposes:
- Letting users register and describe a data source that is going to be imported
- Providing (static) metadata about data sources to the system and supplying validation information for the importing pipeline
- Enabling visitors or users who want to use the existing data (rather than import new data) to see the listed resources our system contains
Enable validation for the importing/ETL-process
While designing the ETL importing pipeline we faced the problem that data fragments (JSON documents) that are prepared and processed for insertion must be validated beforehand. That task is logically linked to the data-storage component and not to the importing process itself, as
- the ETL framework should not validate its own output (as it, or its creator, is biased)
- the database should ensure that it does not import corrupted data fragments that would break queries etc.
Therefore we initially decided to have an insertion gateway running independently that listens to a queue for fully processed data, validates it, and inserts it into the database. We discovered some disadvantages with that solution:
- The ETL pipeline would need another distinct worker that is deployed and runs independently. This means it has to be part of the CD process, which seemed like overhead, as such workers would also have to be managed with regard to scaling, and for many databases there are existing components that manage the insertion process (also equipped with hooks that can listen to a queue, e.g. Elastic's Filebeat for Elasticsearch).
- It would still have to be connected to the workers of the ETL process for the same source, which means a well-defined and stable communication channel would have to be provided as well.
- Error handling via the communication channel would be spread over two actually independent components, coupling them too tightly.
With this in mind we came up with the idea of a separate relational database and an associated system that administers the data sources that have been (or still are being) imported into our system and that is also responsible for providing the validation criteria, i.e. the format of data fragments, to the ETL process.
With a solution like this we get a concrete description of a data source that is accurate enough to provide validation information. The validation can then run inside the ETL process (which we originally wanted to avoid) without compromising independence, because the definition of what is valid still comes from an external component and is not hard-coded into the ETL process. This will be achieved by providing an API within this system that returns a document suitable for comparing against and validating a processed JSON document. We will most likely use JSONT, which describes JSON templates of a certain format, comparable to XSLT-based validation of XML documents.
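To make this concrete, here is a minimal sketch of how an ETL worker could use such an API. The base URL and endpoint path are assumptions (nothing of the sort is specified yet), and JSON Schema (via the `jsonschema` package) stands in for the JSONT-based comparison mentioned above:

```python
import requests
from jsonschema import validate, ValidationError

METADATA_API = "http://metadata-service/api"  # hypothetical base URL of this system


def fetch_validation_template(source_id):
    """Ask the metadata system for the validation document of one data source.

    The endpoint name is an assumption; the wiki only states that such an API
    will exist and return a template describing valid processed fragments.
    """
    response = requests.get(f"{METADATA_API}/sources/{source_id}/validation-template")
    response.raise_for_status()
    return response.json()


def fragment_is_valid(fragment, template):
    """Compare a processed JSON fragment against the externally provided template.

    JSON Schema is used here as a stand-in for the JSONT-based comparison
    mentioned in the text.
    """
    try:
        validate(instance=fragment, schema=template)
        return True
    except ValidationError:
        return False


# Inside an ETL worker: validate before handing the fragment to the storage component.
template = fetch_validation_template(source_id=42)
fragment = {"sensor": "pm10-station-01", "timestamp": "2017-05-29T12:00:00Z", "value": 17.3}
if fragment_is_valid(fragment, template):
    pass  # enqueue the fragment for insertion into the data-storage component
```

The important property is that the ETL worker only executes the comparison; the definition of validity still lives in, and is served by, this metadata system.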
<<<<< TODO add reasons >>>>>
<<<<< TODO why relational >>>>>
Register Sources
For each data source there is important, static metadata that must be known before importing the source. Therefore we chose not to manage it in the same database as the extracted and transformed data points of the sensor data sources, but in a separate system. With this we can tune the data-storage component towards fulfilling its main tasks (<<<< reference to storage requirements >>>>>) and do not mix up meta-information and actual sensor data in one place.
The meta-information that has to be provided to our system prior to importing mainly consists of the following (a sketch of such a record follows the list):
- name of the source
- begin date of measurements (from what point in time does data exist)
- whether the source is still active (i.e. data will be appended to the source in the future)
  - if so: at what schedule
  - if not: the end date of measurements
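As an illustration, a registration record could look like the following sketch; the class name, field names and example values are hypothetical and simply mirror the bullet points above:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class DatasourceRegistration:
    """Hypothetical shape of the static metadata captured when registering a source."""
    name: str                                # name of the source
    measurements_begin: date                 # from what point in time data exists
    active: bool                             # data is still being appended
    schedule: Optional[str] = None           # import schedule, only if still active
    measurements_end: Optional[date] = None  # end date, only if no longer active


# Example: an inactive source whose measurements ended in 2016 (values are made up).
legacy_source = DatasourceRegistration(
    name="Air quality measurements (historic)",
    measurements_begin=date(2005, 1, 1),
    active=False,
    measurements_end=date(2016, 12, 31),
)
```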
Provide scheduling information
For the time being it is not yet clear whether the schedule provided as metadata will also serve as an actual cron-like schedule that triggers importing tasks, or whether it will only serve as meta-information that is not actively used by the running system.
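If the first option were chosen, the stored schedule could be interpreted as a cron expression. The following sketch (using the `croniter` package) shows only one possible interpretation, not a decided design:

```python
from datetime import datetime
from croniter import croniter  # third-party package for parsing cron expressions


def next_import_run(schedule: str, now: datetime) -> datetime:
    """Interpret the stored schedule string as a cron expression and
    compute when the next import task would have to be triggered."""
    return croniter(schedule, now).get_next(datetime)


# Example: a source whose metadata says new data is published every day at 03:00.
print(next_import_run("0 3 * * *", datetime(2017, 5, 29, 12, 0)))
# -> 2017-05-30 03:00:00
```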
Schema
It is important to state that this system and the related database will use a relational schema, although the imported sources themselves will not be saved in such a manner. The given schema is therefore only a relational modeling of the structure of sensor data.
The relational schema is the following: Link for now, later embedded picture
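Since the actual diagram is only linked, the following SQLAlchemy sketch is a purely illustrative example of how source metadata and the structure of sensor data could be modeled relationally; the table and column names are hypothetical and do not reproduce the linked schema:

```python
from sqlalchemy import Column, Integer, String, Boolean, Date, ForeignKey
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()


class Datasource(Base):
    """One registered data source and its static metadata."""
    __tablename__ = "datasource"
    id = Column(Integer, primary_key=True)
    name = Column(String, nullable=False)
    measurements_begin = Column(Date, nullable=False)
    measurements_end = Column(Date)          # NULL while the source is still active
    active = Column(Boolean, nullable=False)
    schedule = Column(String)                # cron-like string, see above
    fields = relationship("SensorField", back_populates="datasource")


class SensorField(Base):
    """Relational modeling of the structure of a sensor-data fragment."""
    __tablename__ = "sensor_field"
    id = Column(Integer, primary_key=True)
    datasource_id = Column(Integer, ForeignKey("datasource.id"), nullable=False)
    name = Column(String, nullable=False)    # e.g. "timestamp", "value", "unit"
    type = Column(String, nullable=False)    # e.g. "datetime", "float", "string"
    datasource = relationship("Datasource", back_populates="fields")
```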
Further potential functionality
Furthermore, this system would also be the place for other future improvements and additional information about our open-data system.
As this system is the place where a user registers and describes data sources, it would also be a good place to contain:
- documentation of the framework
- potential authentication management (tokens per source, etc.)
- administration of sources