# ETL Overview

## Revision Protocol

- 2017-05-12 - Initially created by Paul Wille and Andres Ardila
- 2017-05-20 - Spelling, etc.

## Extract

Fetching a resource via a protocol.

At the end of this process, we want whatever dataset is there to be ready for transformation into the format we will use later in the import pipeline.

Between each of these steps there is some form of checkpointing: the data is verified and checked before continuing on to the next step (a minimal sketch follows).
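As an illustration of such checkpointing, here is a minimal sketch of a pipeline runner that validates each step's output before handing it on; all names (`run_pipeline`, the step/check pairs) are hypothetical, not part of any existing codebase.

```python
from typing import Any, Callable, Iterable, Tuple

Step = Callable[[Any], Any]    # one pipeline stage (extract, transform, ...)
Check = Callable[[Any], bool]  # the checkpoint verifying that stage's output

def run_pipeline(data: Any, steps: Iterable[Tuple[Step, Check]]) -> Any:
    """Run each step and verify its output before moving on."""
    for step, check in steps:
        data = step(data)
        if not check(data):  # checkpoint between steps
            raise ValueError(f"checkpoint failed after {step.__name__}")
    return data
```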

### Fetching

#### Different Data Sources

Data may arrive from several kinds of sources (a fetch sketch follows this list):

- Files (raw data)
  - FTP
  - HTTP
  - ...
- Datasets (ready-to-use APIs)
  - HTTP
  - ...
- Databases
  - local
  - remote origin
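The following sketch fetches raw bytes from the file-based sources above using only the Python standard library; the example URL is a placeholder, not a real dataset location.

```python
import urllib.request
from urllib.parse import urlparse

def fetch(url: str) -> bytes:
    """Fetch raw bytes from an HTTP(S) or FTP source."""
    scheme = urlparse(url).scheme
    if scheme not in ("http", "https", "ftp"):
        raise ValueError(f"unsupported protocol: {scheme}")
    with urllib.request.urlopen(url) as response:
        return response.read()

# Placeholder URL, not a real dataset location:
# raw = fetch("https://example.org/dataset.csv")
```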

#### Different Formats

Any of these can additionally be compressed or packed in an archive (a decompression sketch follows this list):

- JSON
- CSV
- XML
- Plain text
- ...
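One way to handle the "everything can be compressed" case is to sniff the compression first and only then parse the payload. A sketch assuming gzip/zip archives and the formats above (XML handling omitted for brevity; single-member archives assumed):

```python
import csv
import gzip
import io
import json
import zipfile

def decompress(raw: bytes) -> bytes:
    """Unwrap gzip or zip payloads; pass anything else through unchanged."""
    if raw[:2] == b"\x1f\x8b":        # gzip magic number
        return gzip.decompress(raw)
    if raw[:4] == b"PK\x03\x04":      # zip magic number
        with zipfile.ZipFile(io.BytesIO(raw)) as zf:
            return zf.read(zf.namelist()[0])  # assumes a single member
    return raw

def parse(raw: bytes, fmt: str):
    """Parse the decompressed payload according to its declared format."""
    text = decompress(raw).decode("utf-8")
    if fmt == "json":
        return json.loads(text)
    if fmt == "csv":
        return list(csv.DictReader(io.StringIO(text)))
    return text  # plain-text fallback
```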

## Transform

Takes the data prepared by the Extract step, in whatever format it arrives (CSV, XML, JSON, ...), and transforms it into the target format, with the right attribute names, right units, etc.

### Schema Mapping

Maps the "schema" we receive to the "schema" we expect (a mapping sketch follows this list):

- Renaming attributes
- Moving attributes to where we expect them
- Altering the hierarchy of the data
- Ordering/sorting data
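A sketch of such a mapping, covering renaming and hierarchy changes; every field name here is made up for illustration and does not reflect any actual source schema:

```python
def map_schema(record: dict) -> dict:
    """Map a received record onto the schema we expect (hypothetical fields)."""
    return {
        "station": {
            "id": record["stat_id"],        # renamed attribute
            "name": record["stationName"],  # renamed attribute
        },
        "location": {                       # altered hierarchy: flat fields
            "lat": record["lat"],           # grouped into a sub-object
            "lon": record["long"],
        },
    }

# map_schema({"stat_id": 7, "stationName": "Mitte", "lat": 52.5, "long": 13.4})
```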

### Conversion

Besides mapping, some conversion will be needed as well (data formats, variable types).
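For instance, converting types and units might look like this sketch, which assumes (hypothetically) temperatures arriving as Fahrenheit strings and dates in a day-first local format:

```python
from datetime import datetime

def convert(record: dict) -> dict:
    """Convert variable types and units into the ones we expect."""
    return {
        # string -> float, Fahrenheit -> Celsius
        "temperature_c": (float(record["temp_f"]) - 32) * 5 / 9,
        # local date string -> ISO 8601
        "date": datetime.strptime(record["date"], "%d.%m.%Y").date().isoformat(),
    }

# convert({"temp_f": "68.0", "date": "12.05.2017"})
# -> {"temperature_c": 20.0, "date": "2017-05-12"}
```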

### Logical Alteration

Probably optional in ETL, because this should be handled by Elasticsearch or whatever data store we use.

- Aggregation: combining data points if they are too granular (see the sketch below)
- Grouping/combining data points, etc.
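If this step were done in the pipeline after all, aggregation could look like the following sketch, which rolls hourly measurements up into daily averages; the field names and the ISO-timestamp assumption are illustrative only:

```python
from collections import defaultdict
from statistics import mean

def aggregate_daily(points: list[dict]) -> list[dict]:
    """Combine hourly data points into one averaged point per day."""
    by_day: dict[str, list[float]] = defaultdict(list)
    for p in points:
        by_day[p["timestamp"][:10]].append(p["value"])  # assumes ISO timestamps
    return [{"date": day, "value": mean(vals)} for day, vals in by_day.items()]
```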

## Load

Validating and storing the preprocessed data.
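A sketch of this step: validate each record against the expected schema and hand the valid ones to the data store. The required fields are assumptions, and `store` is any object exposing a hypothetical `index(record)` method, e.g. a thin wrapper around an Elasticsearch client (one plausible choice given the mention above).

```python
from typing import Iterable

REQUIRED_FIELDS = {"station", "date", "value"}  # assumed target schema

def validate(record: dict) -> bool:
    """Check that required fields are present and the value is numeric."""
    return REQUIRED_FIELDS <= record.keys() and isinstance(record["value"], (int, float))

def load(records: Iterable[dict], store) -> int:
    """Validate records and hand the valid ones to the data store."""
    loaded = 0
    for record in records:
        if validate(record):
            store.index(record)  # hypothetical storage interface
            loaded += 1
    return loaded
```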