Database Architecture - OpenData-tu/documentation GitHub Wiki

Version 0.2

Version	Date	Modified by	Summary of changes
0.2	2017-05-29	Nico Tasche	Added data structure
0.1	2017-05-29	Nico Tasche	Initial version

Architecture of Database backend

Introduction

This document describes the basic architecture of the database system, which holds the collected open environment data.

Architecture Overview

TODO: put in a drawing of the architecture overview

The database backend consists of two databases.

A relational database, which holds metadata about the data sources, configuration data and security information
A NoSQL database, which serves as a data lake and holds all data collected from the data sources

The Relational Database

The relational database holds metadata about the data sources, as well as configuration data and security information, and serves as an abstraction layer to Elasticsearch.

The NoSQL Database

The NoSQL database serves as a data lake and holds all the collected data.

Requirements

scalable into the petabyte range
hundreds of thousands of requests per minute
high availability
partition tolerance
immediate consistency is NOT necessary

Chosen Database

Elasticsearch has been chosen for the project because it is very scalable and fast. It is a horizontally scaling database that uses inverted indices for full-text search and BKD trees for numerical and geospatial data.

Data model

Against my better judgment, we are going to implement type-focused data management: each index holds one type of measurement. The data is structured as follows:

Sample with luftdaten.info
index: "ambient_temperature_2017...",
doctype: "measurement",
{
    "source_id": "luftdaten_info",
    "device": "141",
    "record_id": "uid for one set of sensordata",
    "timestamp": "2017-06-06T00:02:10",
    "timestamp_record": "this is gonna be the timestamp, when the data is included into the database",
    "location":{
        "lat": 48.779,
        "lon": 9.160
    },
    "height": 122.5,
    "license": "gotta find out",
    "sensor": "BME280",
    "observation_value": 17.62,
    "quality": 5
}
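
The following is a minimal sketch of how an Elasticsearch mapping for the sample document could be expressed. It is not taken from the project: every field type below is an assumption chosen to fit the sample values (`keyword` for identifiers, `date` for the two timestamps, `geo_point` for `location`, numeric types for the measured values), and the names `MEASUREMENT_PROPERTIES` and `build_mapping` are hypothetical.

```python
import json

# Hypothetical mapping for the "measurement" documents shown above.
# All field types are assumptions chosen to match the sample values.
MEASUREMENT_PROPERTIES = {
    "source_id": {"type": "keyword"},
    "device": {"type": "keyword"},
    "record_id": {"type": "keyword"},
    "timestamp": {"type": "date"},
    "timestamp_record": {"type": "date"},
    "location": {"type": "geo_point"},  # accepts an object with lat and lon
    "height": {"type": "float"},
    "license": {"type": "keyword"},
    "sensor": {"type": "keyword"},
    "observation_value": {"type": "float"},
    "quality": {"type": "integer"},
}


def build_mapping(properties):
    """Wrap the field definitions in the "properties" envelope of a mapping."""
    return {"properties": dict(properties)}


if __name__ == "__main__":
    # Print the mapping as JSON so it can be reviewed before it is used.
    print(json.dumps(build_mapping(MEASUREMENT_PROPERTIES), indent=2))
```

Declaring `location` as `geo_point` and the numeric fields explicitly is what would let Elasticsearch use its geospatial and numerical index structures instead of guessing the types from the first document.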

Field	Description
`source_id`	name of the data provider, as given during the setup process
`device`	name of the device, for example the name of a weather station
`record_id`	unique ID for one set of sensor data
`timestamp`	timestamp from the data source stating when the measurement was taken
`timestamp_record`	time when the data was saved in this database
`location`	location as an object with explicitly named `lat` and `lon` fields
`sensor`	name of the sensor
`height` (optional)	height in meters above sea level
`license`	mandatory, even when it is empty or unknown
`observation_value`	the value of the measurement; its type is implied by the index
`quality` (optional)	local quality indicator
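
The field rules can be checked before a document is written to the database. The sketch below is only an illustration under stated assumptions: it treats every field of the sample that is not marked optional as mandatory, and the names `MANDATORY_FIELDS`, `missing_fields` and `location_is_valid` are hypothetical, not part of the project.

```python
# Assumption: every field of the sample that is not marked "(optional)"
# in the table is treated as mandatory.
MANDATORY_FIELDS = (
    "source_id", "device", "record_id", "timestamp", "timestamp_record",
    "location", "sensor", "license", "observation_value",
)


def missing_fields(doc):
    """Return the mandatory keys that are absent from doc.

    A key with an empty value still counts as present, because the
    license has to be there even when it is empty or unknown.
    """
    return [field for field in MANDATORY_FIELDS if field not in doc]


def location_is_valid(doc):
    """Check that location is an object with numeric lat and lon in range."""
    loc = doc.get("location")
    if not isinstance(loc, dict):
        return False
    lat, lon = loc.get("lat"), loc.get("lon")
    if not all(isinstance(v, (int, float)) for v in (lat, lon)):
        return False
    return -90 <= lat <= 90 and -180 <= lon <= 180


if __name__ == "__main__":
    sample = {
        "source_id": "luftdaten_info",
        "device": "141",
        "record_id": "uid for one set of sensordata",
        "timestamp": "2017-06-06T00:02:10",
        "timestamp_record": "2017-06-06T00:05:00",  # invented example value
        "location": {"lat": 48.779, "lon": 9.160},
        "sensor": "BME280",
        "license": "",  # empty, but present
        "observation_value": 17.62,
    }
    print(missing_fields(sample), location_is_valid(sample))
```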

Why it is a bad idea to do it this way

One ground rule with Elasticsearch is never to let an index grow indefinitely. With type-focused indices, an undefined number of data sources may want to add their data to the same index. An index may therefore grow beyond the hardware capacity of an Elasticsearch node, in which case it has to be reindexed, which is computationally very expensive.

Why we still do it

It is a drawback that we will have to accept due to additional requirements, and one that we can handle by monitoring the search cluster.
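
Monitoring for oversized indices could look roughly like the sketch below. It assumes the index sizes have already been fetched as a list of dictionaries with an `index` name and a `store.size` in bytes, which is the shape the `_cat/indices` API is expected to return when called with `format=json&bytes=b`. The function name `oversized_indices`, the threshold and the index names in the usage example are hypothetical.

```python
def oversized_indices(index_stats, max_bytes):
    """Return (index name, size in bytes) for every index above max_bytes.

    index_stats is a list of dicts with the keys "index" and "store.size",
    where "store.size" holds the size in bytes as an int or numeric string.
    The result is sorted with the largest index first.
    """
    flagged = []
    for entry in index_stats:
        size = int(entry["store.size"])
        if size > max_bytes:
            flagged.append((entry["index"], size))
    return sorted(flagged, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Invented example data; real values would come from the cluster.
    stats = [
        {"index": "example_index_a", "store.size": "1200"},
        {"index": "example_index_b", "store.size": "98000"},
    ]
    for name, size in oversized_indices(stats, max_bytes=50_000):
        print(f"index {name} has grown to {size} bytes")
```

Running such a check periodically would give early warning before an index approaches the capacity of a node, so that the expensive reindexing can be planned rather than forced.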