Database Architecture - OpenData-tu/documentation GitHub Wiki
Version 0.2
| Version | Date | Modified by | Summary of changes |
|---|---|---|---|
| 0.2 | 2017-05-29 | Nico Tasche | Added data structure |
| 0.1 | 2017-05-29 | Nico Tasche | Initial version |
Architecture of Database backend
Introduction
This document describes the basic architecture of the database system, which holds the collected open environment data.
Architecture Overview
TODO: put in a drawing of the architecture overview
The database backend consists of two databases:
- A relational database
- A NoSQL database, used as a data lake to hold all data collected from the data sources
The Relational Database
The relational database holds metadata about the data sources, configuration data, and security information. It also serves as an abstraction layer to Elasticsearch.
The NoSQL Database
The NoSQL database serves as a data lake and holds all the collected data.
Requirements
- scalable into the petabyte range
- hundreds of thousands of requests per minute
- high availability
- partition tolerance
- immediate consistency is NOT necessary
Chosen Database
Elasticsearch has been chosen for the project because it is highly scalable and fast. It is a horizontally scaling database that uses inverted indices for full-text search and BKD trees for numerical and geospatial data.
Data model
Against my better judgment, we are going to implement type-focused data management, that is, each index holds one type of measurement. The data is structured as follows:
Sample with luftdaten.info
index: "ambient_temperature_2017...",
doctype: "measurement",
{
  "source_id": "luftdaten_info",
  "device": "141",
  "record_id": "unique ID for one set of sensor data",
  "timestamp": "2017-06-06T00:02:10",
  "timestamp_record": "timestamp of when the data was added to the database",
  "location": {
    "lat": 48.779,
    "lon": 9.160
  },
  "height": 122.5,
  "licence": "to be determined",
  "sensor": "BME280",
  "observation_value": 17.62,
  "quality": 5
}
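To make the sample more concrete, here is a minimal sketch of an index mapping that could back such a document. The field names are taken from the sample; the field types (`keyword`, `date`, `geo_point`, `float`, `integer`) and the helper name `build_mapping` are assumptions and not part of the project. The typed layout with the doctype as a key corresponds to Elasticsearch 5.x/6.x.

```python
import json

# Assumed field types for the sample document; only the field names come
# from the sample, the type choices are illustrative.
MEASUREMENT_PROPERTIES = {
    "source_id": {"type": "keyword"},
    "device": {"type": "keyword"},
    "record_id": {"type": "keyword"},
    "timestamp": {"type": "date"},
    "timestamp_record": {"type": "date"},
    "location": {"type": "geo_point"},  # accepts an object with lat and lon
    "height": {"type": "float"},
    "licence": {"type": "keyword"},
    "sensor": {"type": "keyword"},
    "observation_value": {"type": "float"},
    "quality": {"type": "integer"},
}


def build_mapping(doctype="measurement"):
    """Wrap the properties in a typed mapping body (hypothetical helper)."""
    return {"mappings": {doctype: {"properties": MEASUREMENT_PROPERTIES}}}


if __name__ == "__main__":
    # Print the body that would be sent when creating an index.
    print(json.dumps(build_mapping(), indent=2))
```

With types like these, the numeric and geospatial fields would be the ones served by BKD trees, while the `keyword` fields would be looked up through the inverted index.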
| Field | Value |
|---|---|
| source_id | name of the data provider, as given during the setup process |
| device | name of the device, for example the name of a weather station |
| timestamp | time when the measurement was taken, as reported by the data source |
| timestamp_record | time when the data was saved in this database |
| location | location as an object with lat and lon explicitly named |
| sensor | name of the sensor |
| height (optional) | height in meters above sea level |
| license | mandatory, even when it is empty or unknown |
| observation_value | the value of the measurement; its type is implied by the index |
| quality_indicator (optional) | local quality indicator |
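The field rules could be checked before a document is written to the database. The following is a minimal sketch, not the project's importer: the function name `validate_measurement`, the coordinate ranges, and the timestamp format are assumptions, and the field names follow the sample document (for example `licence`).

```python
from datetime import datetime

# Fields treated as mandatory in this sketch; height and quality are optional.
MANDATORY_FIELDS = (
    "source_id", "device", "timestamp", "timestamp_record",
    "location", "sensor", "licence", "observation_value",
)


def validate_measurement(doc):
    """Return a list of problems; an empty list means the document passed."""
    problems = [f"missing field: {name}" for name in MANDATORY_FIELDS
                if name not in doc]

    location = doc.get("location")
    if isinstance(location, dict):
        lat, lon = location.get("lat"), location.get("lon")
        if not (isinstance(lat, (int, float)) and -90 <= lat <= 90):
            problems.append("location.lat must be a number in [-90, 90]")
        if not (isinstance(lon, (int, float)) and -180 <= lon <= 180):
            problems.append("location.lon must be a number in [-180, 180]")
    elif location is not None:
        problems.append("location must be an object with lat and lon")

    # Assumed timestamp layout, matching the sample value 2017-06-06T00:02:10.
    for name in ("timestamp", "timestamp_record"):
        value = doc.get(name)
        if value is not None:
            try:
                datetime.strptime(value, "%Y-%m-%dT%H:%M:%S")
            except (TypeError, ValueError):
                problems.append(f"{name} is not formatted like 2017-06-06T00:02:10")

    if "height" in doc and not isinstance(doc["height"], (int, float)):
        problems.append("height must be a number")
    return problems
```

Only the presence of the licence field is checked, so an empty string is accepted; this mirrors the rule that the field is mandatory even when its value is empty or unknown.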
Why it is a bad idea to do it this way
One ground rule with Elasticsearch is never to let an index grow indefinitely. Because an unknown number of data sources may want to add their data to the database, an index might grow beyond the hardware capacity of an Elasticsearch node. In that case it has to be reindexed, which is computationally very expensive.
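The concern above can be illustrated with a rough back-of-the-envelope calculation. All numbers and names in this sketch (`days_until_node_full`, node capacity, document size, ingestion rate) are hypothetical, and the estimate ignores replicas, compression, and indexing overhead.

```python
def days_until_node_full(node_capacity_bytes, sources,
                         docs_per_source_per_day, bytes_per_doc):
    """Estimate how many days a single index takes to reach node capacity."""
    daily_growth = sources * docs_per_source_per_day * bytes_per_doc
    if daily_growth <= 0:
        raise ValueError("daily growth must be positive")
    return node_capacity_bytes / daily_growth


if __name__ == "__main__":
    # Hypothetical figures: 1 TB node, 500-byte documents, 10,000 documents
    # per source per day.
    for sources in (1_000, 2_000, 4_000):
        days = days_until_node_full(10**12, sources, 10_000, 500)
        print(f"{sources} sources: about {days:.0f} days")
```

Each doubling of the number of sources halves the time until the index no longer fits on one node, which is why an open-ended number of sources per index is risky.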
Why do we still do it
It is a drawback that we have to accept because of additional requirements. It is only one drawback, and we can handle it by monitoring the search cluster.
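What such monitoring might look like is sketched below. This is an assumption about one possible check, not the project's monitoring setup: the function name `indices_near_capacity`, the index names, and the 80% warning threshold are hypothetical, and the index sizes are passed in as a plain dictionary (in practice they could come from the cluster's index statistics).

```python
def indices_near_capacity(index_sizes_bytes, node_capacity_bytes, warn_ratio=0.8):
    """Return the names of indices at or above warn_ratio of node capacity."""
    limit = node_capacity_bytes * warn_ratio
    return sorted(name for name, size in index_sizes_bytes.items()
                  if size >= limit)


if __name__ == "__main__":
    # Hypothetical sizes in bytes for two indices on a 1 TB node.
    sizes = {"index_a": 850 * 10**9, "index_b": 300 * 10**9}
    for name in indices_near_capacity(sizes, node_capacity_bytes=10**12):
        print(f"warning: {name} is close to node capacity")
```

A warning from a check like this would leave time to act before an index outgrows a node and an expensive reindex becomes unavoidable.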