Database Architecture - OpenData-tu/documentation GitHub Wiki

Version 0.2

Version  Date        Modified by  Summary of changes
0.2      2017-05-29  Nico Tasche  Added data structure
0.1      2017-05-29  Nico Tasche  Initial version

Architecture of Database backend

Introduction

This document describes the basic architecture of the database system that holds the collected open environmental data.

Architecture Overview

TODO: put in a drawing of the architecture overview

The database backend consists of two databases.

  1. A relational database
  2. A NoSQL database used as a data lake, which holds all data collected from the data sources

The Relational Database

The relational database holds metadata about the data sources as well as configuration data and security information, and serves as an abstraction layer to Elasticsearch.

The NoSQL Database

The NoSQL database serves as a data lake and holds all the collected data.

Requirements

  • scalable into the petabyte range
  • hundreds of thousands of requests per minute
  • high availability
  • partition tolerance
  • immediate consistency is NOT necessary

Chosen Database

Elasticsearch has been chosen for the project because it scales well and is fast. It is a horizontally scaling database that uses inverted indices for full-text search and BKD trees for numerical and geospatial data.
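
To illustrate what an inverted index is, here is a toy sketch in Python. It is not how Elasticsearch implements the structure internally; the function name `build_inverted_index` and the sample texts are made up for illustration only.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lower-cased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


# Hypothetical sample documents, not taken from the project.
docs = {
    1: "ambient temperature sensor",
    2: "humidity sensor",
}
index = build_inverted_index(docs)
print(sorted(index["sensor"]))       # [1, 2]
print(sorted(index["temperature"]))  # [1]
```

A term lookup is then a single dictionary access instead of a scan over every document, which is the property that makes full-text search fast.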

Data model

Against my better judgment, we are going to implement type-focused data management, with one index per type of measurement. This means the data is structured as follows:

Sample with luftdaten.info
index: "ambient_temperature_2017...",
doctype: "measurement",
{
    "source_id": "luftdaten_info",
    "device": "141",
    "record_id": "unique ID for one set of sensor data",
    "timestamp": "2017-06-06T00:02:10",
    "timestamp_record": "timestamp of when the data was added to the database",
    "location":{
        "lat": 48.779,
        "lon": 9.160
    },
    "height": 122.5,
    "licence": "gotta find out",
    "sensor": "BME280",
    "observation_value": 17.62,
    "quality": 5
}
Field              Value
source_id          name of the data provider, as given during the setup process
device             name of the device, for example the name of a weather station
record_id          unique ID for one set of sensor data
timestamp          time at which the measurement was taken, as reported by the data source
timestamp_record   time at which the data was saved in this database
location           location as an object, with lat and lon named explicitly
sensor             name of the sensor
height             (optional) height in meters above sea level
licence            mandatory, even when it is empty or unknown
observation_value  the value of the measurement; its type is implied by the index
quality            (optional) local quality indicator
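
The field list above could translate into an Elasticsearch mapping roughly as follows. This is a minimal sketch, not a decided design: the field types (`keyword`, `date`, `geo_point`, `float`, `integer`) are assumptions, the field names follow the sample document, and `missing_fields` is a hypothetical helper that treats every field not marked optional as required. How the `properties` block is wrapped in an index-creation request depends on the Elasticsearch version.

```python
# Assumed field types; the document itself does not specify them.
MEASUREMENT_PROPERTIES = {
    "source_id": {"type": "keyword"},
    "device": {"type": "keyword"},
    "record_id": {"type": "keyword"},
    "timestamp": {"type": "date"},
    "timestamp_record": {"type": "date"},
    "location": {"type": "geo_point"},
    "height": {"type": "float"},
    "licence": {"type": "keyword"},
    "sensor": {"type": "keyword"},
    "observation_value": {"type": "float"},
    "quality": {"type": "integer"},
}

# Fields the field list marks as optional.
OPTIONAL_FIELDS = {"height", "quality"}


def missing_fields(doc):
    """Return the sorted names of required fields that are absent from doc."""
    required = set(MEASUREMENT_PROPERTIES) - OPTIONAL_FIELDS
    return sorted(required - set(doc))


# An incomplete document, to show what the check reports.
partial = {"source_id": "luftdaten_info", "device": "141", "observation_value": 17.62}
print(missing_fields(partial))
```

The check only looks at key presence, so an empty `licence` value still passes, in line with the rule that the field is mandatory even when its value is empty or unknown.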
Why it is a bad idea to do it this way

One ground rule with Elasticsearch is never to let an index grow indefinitely. With this layout, an unknown number of data sources may add their data to the same index, so an index can grow beyond the hardware capacity of an Elasticsearch node. It then has to be reindexed, which is computationally very expensive.

Why do we still do it

It is a drawback that we have to accept due to additional requirements, and one that we can handle by monitoring the Elasticsearch cluster.
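
Such monitoring could be as simple as comparing per-index store sizes against a limit. The sketch below assumes the sizes have already been fetched from the cluster (for example via the index stats or `_cat/indices` API) into a plain dict. The function name `oversized_indices`, the index names and the 50 GiB limit are hypothetical placeholders, not project values.

```python
GIB = 1024 ** 3


def oversized_indices(sizes_in_bytes, limit_bytes):
    """Return the names of indices larger than limit_bytes, largest first."""
    over = [(size, name) for name, size in sizes_in_bytes.items() if size > limit_bytes]
    return [name for size, name in sorted(over, reverse=True)]


# Hypothetical sizes; real values would come from the cluster.
sizes = {
    "index_a": 62 * GIB,
    "index_b": 12 * GIB,
    "index_c": 51 * GIB,
}
print(oversized_indices(sizes, limit_bytes=50 * GIB))  # ['index_a', 'index_c']
```

An index that shows up in this list is a candidate for being split or reindexed before it outgrows a node.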