ETL Framework Use Cases and Requirements
Authorship
Version | Date | Modified by | Summary of changes |
---|---|---|---|
0.1 | 2017-05-19 | Paul Wille | First draft |
0.2 | 2017-05-20 | Andres Ardila | Added section regarding usage of term ETL, fixed spelling, & tried to improve readability, etc. |
0.3 | 2017-07-02 | Andres Ardila | Corrections & rewording |
Terminology
The term ETL was used during early discussions as an analogy, to ease understanding of the discrete and sequential nature of the steps involved in such a process, namely extract, transform, and load. This, however, is arguably not a good choice of terminology to adopt and carry forward in the project, as it has a well-established and specific meaning in the Data Warehousing domain (read "relational"). Although we aim to build a process analogous to ETL, we are not building a Data Warehouse per se; we would therefore rather avoid confusing future users by adopting domain-specific terminology.
Alternative terms include Batch processing, Read-Process-Write, //TODO find consensus...
ETL Framework Use Cases and Requirements
This section covers the requirements for the ETL Framework and not the ETL process itself.
As agreed upon in the meeting on May 18th, 2017 <please insert wiki link>, we shall build an extensible framework that covers several use cases for inserting data into our system. Each of them shall be met by providing capabilities within our framework to the end user/data provider, so that, independent of the use case, each data source can be integrated with the least possible coding and configuration effort, reusing as many components provided by the framework as possible.
Recurring tasks when uploading / possible reusable components
Extracting data
Retrieving the data in the form (and format) in which it is provided will be similar in many cases. This has two components:
Transport/protocol
Possible ways of receiving the data could be:
- FTP file server
- HTTP(S) access to resource
- API access to resource
- [...]
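To illustrate how these transports could sit behind a single reusable component, here is a minimal Python sketch; the class and method names (`Extractor`, `fetch`, etc.) are assumptions for illustration, not an agreed design:

```python
import ftplib
import io
import urllib.request


class Extractor:
    """Common interface: fetch the raw bytes of a source, regardless of transport."""

    def fetch(self) -> bytes:
        raise NotImplementedError


class HttpExtractor(Extractor):
    """Covers plain HTTP(S) resources as well as simple API endpoints."""

    def __init__(self, url: str):
        self.url = url

    def fetch(self) -> bytes:
        with urllib.request.urlopen(self.url) as response:
            return response.read()


class FtpExtractor(Extractor):
    """Downloads a single file from an FTP server."""

    def __init__(self, host: str, path: str, user: str = "anonymous", password: str = ""):
        self.host, self.path, self.user, self.password = host, path, user, password

    def fetch(self) -> bytes:
        buffer = io.BytesIO()
        with ftplib.FTP(self.host, self.user, self.password) as ftp:
            ftp.retrbinary(f"RETR {self.path}", buffer.write)
        return buffer.getvalue()
```

Each concrete importer would then only pick (or configure) the extractor matching its source, instead of re-implementing transport logic.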
Data Formats
Data formats can also vary but the most common will probably be:
- Machine-readable:
  - JSON
  - Delimiter-separated values (DSV):
    - CSV
    - SSV (?)
    - TSV
  - XML
  - Spreadsheet: Excel (XLS, OOXML), OpenDocument
- Non-machine-readable/parseable:
  - Within a normal HTML page
  - Inside a PDF document (out of scope for a semester project IMO; can be extended in the future by a user who might need this functionality. (AA))
  - (Plain) text of any kind
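A rough sketch of how the machine-readable formats could be normalized into one internal representation; the list-of-dicts representation and the function name are assumptions for illustration only:

```python
import csv
import io
import json


def parse_records(raw: bytes, fmt: str, delimiter: str = ",") -> list[dict]:
    """Turn raw source bytes into a list of flat records, independent of the input format."""
    text = raw.decode("utf-8")
    if fmt == "json":
        data = json.loads(text)
        return data if isinstance(data, list) else [data]
    if fmt == "dsv":  # covers CSV/SSV/TSV via the delimiter argument
        return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))
    raise ValueError(f"Unsupported format: {fmt}")
```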
Unit Conversion
Sensor data points can be of many kinds. Units will occur in many different contexts and can mean different things (which is probably more of a schema issue within the data storage component), so a context has to be provided. In addition, units will often be supplied in forms that need to be converted (e.g. non-metric units for length, mass, volume, etc. are probably best avoided).
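A minimal sketch of what a reusable unit-conversion component might look like; the unit names and the choice of canonical units are illustrative assumptions:

```python
# Conversion functions into the (assumed) canonical metric/SI units.
TO_CANONICAL = {
    ("inch", "metre"): lambda v: v * 0.0254,
    ("mile", "metre"): lambda v: v * 1609.344,
    ("pound", "kilogram"): lambda v: v * 0.45359237,
    ("fahrenheit", "celsius"): lambda v: (v - 32) * 5 / 9,
}


def convert(value: float, source_unit: str, target_unit: str) -> float:
    """Convert a measured value into the canonical unit expected by the storage schema."""
    if source_unit == target_unit:
        return value
    try:
        return TO_CANONICAL[(source_unit, target_unit)](value)
    except KeyError:
        raise ValueError(f"No conversion defined from {source_unit} to {target_unit}")
```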
Reformatting given data to expected data format
Given a schema that makes sense from a data storage point of view (e.g. one defining the relation between sensors, sensor stations, and concrete measurements, together with their geolocation and point in time), data sources will need to be reformatted to meet the requirements of the data storage and therefore of the gateway that controls it and accepts requests.
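To make this concrete, here is a hypothetical transformation step mapping a flat source record onto a target record with sensor, station, geolocation, timestamp and value; all field names are assumptions, since the actual schema is owned by the data storage component:

```python
def to_target_schema(source_record: dict) -> dict:
    """Map one raw source record onto the (assumed) schema expected by the storage gateway."""
    return {
        "sensor_id": f"{source_record['station']}/{source_record['sensor']}",
        "station_id": source_record["station"],
        "location": {
            "lat": float(source_record["latitude"]),
            "lon": float(source_record["longitude"]),
        },
        "timestamp": source_record["measured_at"],  # ideally normalized to ISO 8601 / UTC
        "value": float(source_record["value"]),
        "unit": source_record.get("unit", "unknown"),
    }
```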
Use Cases
The overall use cases of insertion are the following:
- repeated import on a specific schedule (the frequency should be abstracted)
- crawling data for a given time period in the past for which data is available
- ad-hoc uploading of data (data that is already present and shall be uploaded in bulk)
Obviously the framework must support all of these. While ad-hoc uploading may not require a dedicated component within the uploading framework (calling the API with upload requests may be sufficient), the other cases will need a controller/scheduler infrastructure of some kind.
We will therefore have to design and implement an infrastructure for handling and automatically executing import tasks based on a given configuration for each data source.
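As a sketch of what such a per-data-source configuration handed to the scheduler might contain (all keys, values, and source names are hypothetical), the three use cases could be expressed declaratively:

```python
# Hypothetical per-data-source import configurations handed to the scheduler.
IMPORT_CONFIGS = [
    {   # use case 1: repeated, scheduled import
        "source": "air-quality-berlin",
        "mode": "scheduled",
        "interval_minutes": 60,          # frequency abstracted from the concrete scheduler
    },
    {   # use case 2: one-off crawl of historical data
        "source": "air-quality-berlin",
        "mode": "backfill",
        "from": "2015-01-01",
        "to": "2017-05-01",
    },
    {   # use case 3: ad-hoc bulk upload, no scheduling needed
        "source": "manual-upload",
        "mode": "ad-hoc",
    },
]
```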
Resulting Requirements for the Framework
ETL Pipeline
Our framework must provide core components for the import process that can be reused by many individual import processes, as well as a controlling infrastructure that schedules and automates imports.
[...]
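One possible, purely illustrative shape for such a pipeline of reusable components: a sequence of callables that the controlling infrastructure executes in order.

```python
from typing import Callable, Iterable

# A pipeline step takes a stream of records and yields a (possibly transformed) stream.
Step = Callable[[Iterable[dict]], Iterable[dict]]


def run_pipeline(records: Iterable[dict], steps: list[Step]) -> Iterable[dict]:
    """Run a sequence of reusable transformation steps over the extracted records."""
    for step in steps:
        records = step(records)
    return records


# Hypothetical usage: a pipeline assembled from framework-provided and source-specific steps.
# pipeline = [drop_empty_rows, convert_units, map_to_target_schema]
# results = run_pipeline(extracted_records, pipeline)
```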
Interfaces to the data storage
At the end of the ETL pipeline/process stands the final insertion of the data into the storage component, whatever its form and format. The framework ultimately has to validate the outcome of the ETL pipeline.
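A minimal sketch of how the framework could validate pipeline output before handing it to the storage gateway; the required fields are assumptions, as the real set is dictated by the storage schema:

```python
REQUIRED_FIELDS = {"sensor_id", "station_id", "location", "timestamp", "value"}


def validate(record: dict) -> list[str]:
    """Return a list of validation errors for one record; an empty list means it may be inserted."""
    errors = [f"missing field: {field}" for field in sorted(REQUIRED_FIELDS - record.keys())]
    if "location" in record:
        loc = record["location"]
        if not (-90 <= loc.get("lat", 0) <= 90 and -180 <= loc.get("lon", 0) <= 180):
            errors.append("location out of range")
    return errors
```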
Non-functional Requirements
Any libraries and/or frameworks on which our solution is built must be OSS.