ETL Framework Use Cases and Requirements
Authorship
Version | Date | Modified by | Summary of changes |
---|---|---|---|
0.1 | 2017-05-19 | Paul Wille | First draft |
0.2 | 2017-05-20 | Andres Ardila | Added section regarding usage of term ETL, fixed spelling, & tried to improve readability, etc. |
0.3 | 2017-07-02 | Andres Ardila | Corrections & rewording |
Terminology
The term ETL was used during early discussions as an analogy, to ease understanding of the discrete and sequential nature of the steps involved in such a process, namely extract, transform, and load. This, however, is arguably not a good choice of terminology to adopt and carry forward in the project, as it has a well-established and specific meaning in the Data Warehousing domain (read "relational"). Although we aim to build a process analogous to ETL, we are not building a Data Warehouse per se; we would therefore rather avoid confusing future users by adopting domain-specific terminology.
Alternative terms include Batch processing, Read-Process-Write, //TODO find consensus...
ETL Framework Use Cases and Requirements
This section covers the requirements for the ETL Framework and not the ETL process itself.
As agreed upon in the meeting on May 18th, 2017 <please insert wiki link>, we shall build an extensible framework that covers several use cases for inserting data into our system. Each of them shall be met by providing capabilities within our framework to the end user/data provider, so that, independent of the use case, each data source can be integrated with the least possible coding and configuration effort, reusing as many components provided by the framework as possible.
Recurring tasks when uploading / possible reusable components
Extracting data
Retrieving the data in the form (and format) in which it is provided will be similar in many cases. This has two components:
Transport/protocol
Possible ways of receiving the data could be:
- FTP file server
- HTTP(S) access to resource
- API access to resource
- [...]
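To illustrate how these transports could sit behind a single reusable component, here is a minimal Python sketch; the class and method names (`Extractor`, `fetch`, etc.) are assumptions for illustration, not an agreed design:

```python
import ftplib
import io
import urllib.request


class Extractor:
    """Common interface: fetch the raw bytes of a source, regardless of transport."""

    def fetch(self) -> bytes:
        raise NotImplementedError


class HttpExtractor(Extractor):
    """Covers plain HTTP(S) resources as well as simple API endpoints."""

    def __init__(self, url: str):
        self.url = url

    def fetch(self) -> bytes:
        with urllib.request.urlopen(self.url) as response:
            return response.read()


class FtpExtractor(Extractor):
    """Downloads a single file from an FTP server."""

    def __init__(self, host: str, path: str, user: str = "anonymous", password: str = ""):
        self.host, self.path, self.user, self.password = host, path, user, password

    def fetch(self) -> bytes:
        buffer = io.BytesIO()
        with ftplib.FTP(self.host, self.user, self.password) as ftp:
            ftp.retrbinary(f"RETR {self.path}", buffer.write)
        return buffer.getvalue()
```

Each concrete importer would then only pick (or configure) the extractor matching its source, instead of re-implementing transport logic.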
Data Formats
Data formats can also vary but the most common will probably be:
- Machine-readable:
  - JSON
  - Delimiter-separated values (DSV):
    - CSV
    - SSV (?)
    - TSV
  - XML
  - Spreadsheet: Excel (XLS, OOXML), OpenDocument
- Non-machine-readable/parseable:
  - Within a normal HTML page
  - Inside a PDF document (out of scope for a semester project IMO; can be extended in the future by a user who might need this functionality. (AA))
  - (Plain) text of any kind
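A rough sketch of how the machine-readable formats could be normalized into one internal representation; the list-of-dicts representation and the function name are assumptions for illustration only:

```python
import csv
import io
import json


def parse_records(raw: bytes, fmt: str, delimiter: str = ",") -> list[dict]:
    """Turn raw source bytes into a list of flat records, independent of the input format."""
    text = raw.decode("utf-8")
    if fmt == "json":
        data = json.loads(text)
        return data if isinstance(data, list) else [data]
    if fmt == "dsv":  # covers CSV/SSV/TSV via the delimiter argument
        return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))
    raise ValueError(f"Unsupported format: {fmt}")
```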
Unit Conversion
Sensor data points can be of many kinds. Units will occur in many different contexts and can mean different things (which is probably more of a schema issue within the data storage component), so a context has to be provided. In addition, units will often be supplied in forms that need to be converted (e.g. non-metric units for length, mass, volume, etc. are probably best avoided).
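A minimal sketch of what a reusable unit-conversion component might look like; the unit names and the choice of canonical units are illustrative assumptions:

```python
# Conversion functions into the (assumed) canonical metric/SI units.
TO_CANONICAL = {
    ("inch", "metre"): lambda v: v * 0.0254,
    ("mile", "metre"): lambda v: v * 1609.344,
    ("pound", "kilogram"): lambda v: v * 0.45359237,
    ("fahrenheit", "celsius"): lambda v: (v - 32) * 5 / 9,
}


def convert(value: float, source_unit: str, target_unit: str) -> float:
    """Convert a measured value into the canonical unit expected by the storage schema."""
    if source_unit == target_unit:
        return value
    try:
        return TO_CANONICAL[(source_unit, target_unit)](value)
    except KeyError:
        raise ValueError(f"No conversion defined from {source_unit} to {target_unit}")
```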
Reformatting given data to expected data format
Given a schema that makes sense from a data storage point of view (e.g. one defining the relation between sensors, sensor stations, and concrete measurements, together with their geolocation and point in time), data sources will need to be reformatted to meet the requirements of the data storage and therefore of the gateway that controls it and accepts requests.
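To make this concrete, here is a hypothetical transformation step mapping a flat source record onto a target record with sensor, station, geolocation, timestamp and value; all field names are assumptions, since the actual schema is owned by the data storage component:

```python
def to_target_schema(source_record: dict) -> dict:
    """Map one raw source record onto the (assumed) schema expected by the storage gateway."""
    return {
        "sensor_id": f"{source_record['station']}/{source_record['sensor']}",
        "station_id": source_record["station"],
        "location": {
            "lat": float(source_record["latitude"]),
            "lon": float(source_record["longitude"]),
        },
        "timestamp": source_record["measured_at"],  # ideally normalized to ISO 8601 / UTC
        "value": float(source_record["value"]),
        "unit": source_record.get("unit", "unknown"),
    }
```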
Use Cases
The overall use cases of insertion are the following:
- repeated import on a specific schedule (the frequency should be abstracted)
- crawling data for a given time period in the past for which data is available
- ad-hoc uploading of data (data that is already present and shall be uploaded in bulk)
Obviously the framework must support all of these. While ad-hoc uploading may not require a dedicated component within the uploading framework (calling the API with upload requests may be sufficient), the other cases will need a controller/scheduler infrastructure of some kind.
We will therefore have to design and implement an infrastructure for handling and automatically executing import tasks based on a given configuration for each data source.
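As a sketch of what such a per-data-source configuration handed to the scheduler might contain (all keys, values, and source names are hypothetical), the three use cases could be expressed declaratively:

```python
# Hypothetical per-data-source import configurations handed to the scheduler.
IMPORT_CONFIGS = [
    {   # use case 1: repeated, scheduled import
        "source": "air-quality-berlin",
        "mode": "scheduled",
        "interval_minutes": 60,          # frequency abstracted from the concrete scheduler
    },
    {   # use case 2: one-off crawl of historical data
        "source": "air-quality-berlin",
        "mode": "backfill",
        "from": "2015-01-01",
        "to": "2017-05-01",
    },
    {   # use case 3: ad-hoc bulk upload, no scheduling needed
        "source": "manual-upload",
        "mode": "ad-hoc",
    },
]
```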
Resulting Requirements for the Framework
ETL Pipeline
Our framework must provide core components for the import process that can be reused by many individual import processes, as well as a controlling infrastructure that schedules and automates imports.
[...]
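One possible, purely illustrative shape for such a pipeline of reusable components: a sequence of callables that the controlling infrastructure executes in order.

```python
from typing import Callable, Iterable

# A pipeline step takes a stream of records and yields a (possibly transformed) stream.
Step = Callable[[Iterable[dict]], Iterable[dict]]


def run_pipeline(records: Iterable[dict], steps: list[Step]) -> Iterable[dict]:
    """Run a sequence of reusable transformation steps over the extracted records."""
    for step in steps:
        records = step(records)
    return records


# Hypothetical usage: a pipeline assembled from framework-provided and source-specific steps.
# pipeline = [drop_empty_rows, convert_units, map_to_target_schema]
# results = run_pipeline(extracted_records, pipeline)
```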
Interfaces to the data storage
At the end of the ETL pipeline/process stands the final insertion of the data into the storage component, whatever its form and format. The framework ultimately has to validate the outcome of the ETL pipeline.
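A minimal sketch of how the framework could validate pipeline output before handing it to the storage gateway; the required fields are assumptions, as the real set is dictated by the storage schema:

```python
REQUIRED_FIELDS = {"sensor_id", "station_id", "location", "timestamp", "value"}


def validate(record: dict) -> list[str]:
    """Return a list of validation errors for one record; an empty list means it may be inserted."""
    errors = [f"missing field: {field}" for field in sorted(REQUIRED_FIELDS - record.keys())]
    if "location" in record:
        loc = record["location"]
        if not (-90 <= loc.get("lat", 0) <= 90 and -180 <= loc.get("lon", 0) <= 180):
            errors.append("location out of range")
    return errors
```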
Non-functional Requirements
Any libraries and/or frameworks on which our solution is built must be OSS.