Data Architecture - Show-your-Heart/ShowYourHeart-docs GitHub Wiki

Data architecture Show Your Heart

Process deliverable diagram of the Show Your Heart model (created in draw.io)

The data architecture is composed of four parts:

• Data warehouse
• Data sources
• Transformation process
• Data delivery

Data warehouse

The data warehouse (DWH) database system will be PostgreSQL. PostgreSQL is one of the best open-source databases and provides extensions that help with the transformation and query processes.

Data sources

Data from the SYH application is accessed through PostgreSQL's Foreign Data Wrapper (FDW). FDW exposes external databases as linked servers, so queries can treat all data sources as if they lived in the same database. It can expose the SYH database and other sources, such as CSV files, at the same time.
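As a rough sketch of what the FDW setup looks like, the function below builds the `postgres_fdw` DDL statements that would expose the SYH application database inside the DWH. The server name (`syh_app`), schema name (`syh_raw`), and connection details are illustrative assumptions, not the actual deployment values.

```python
# Sketch of the postgres_fdw setup DDL for the DWH (all names are
# illustrative). In practice these statements are run once on the
# PostgreSQL DWH by an administrator.
def fdw_setup_sql(host: str, dbname: str, user: str, password: str) -> list[str]:
    """Build the statements that expose the SYH application database
    as foreign tables inside the DWH."""
    return [
        "CREATE EXTENSION IF NOT EXISTS postgres_fdw;",
        f"CREATE SERVER syh_app FOREIGN DATA WRAPPER postgres_fdw "
        f"OPTIONS (host '{host}', dbname '{dbname}');",
        f"CREATE USER MAPPING FOR CURRENT_USER SERVER syh_app "
        f"OPTIONS (user '{user}', password '{password}');",
        "CREATE SCHEMA IF NOT EXISTS syh_raw;",
        "IMPORT FOREIGN SCHEMA public FROM SERVER syh_app INTO syh_raw;",
    ]

statements = fdw_setup_sql("syh-db.internal", "showyourheart", "dwh_reader", "secret")
for s in statements:
    print(s)
```

After `IMPORT FOREIGN SCHEMA`, the application tables can be queried from the DWH as if they were local, e.g. `SELECT * FROM syh_raw.some_table;`.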

Transformation process

The transformation process is responsible for transforming the data in the source databases into the DWH schema. The applications used for the transformation process are:

• DBT: a library that transforms data into new models with SQL queries. It combines the simplicity and power of SQL with automatically generated data lineage documentation, and it can also build and run tests and track the execution time of each transformation.
• Airflow: an application from the Apache stack that schedules different types of processes, such as Bash or Python scripts. It has a powerful logging system, can raise system alerts when a process fails, and supports defining dependencies between processes.
• Python: for extra transformations that cannot be done with DBT we use Python, usually with Pandas. These programs can be called from Airflow.
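A minimal sketch of the kind of Pandas transformation that would sit in this Python step (the table and column names are invented for illustration; in the real pipeline such a function would be wrapped in an Airflow task):

```python
import pandas as pd

# Hypothetical transformation step that is awkward to express in DBT/SQL:
# pivot per-entity donation rows into a wide, per-year table.
def pivot_donations(df: pd.DataFrame) -> pd.DataFrame:
    """Aggregate donations per entity and year into a wide table."""
    return (
        df.groupby(["entity", "year"], as_index=False)["amount"].sum()
          .pivot(index="entity", columns="year", values="amount")
          .fillna(0)
    )

raw = pd.DataFrame({
    "entity": ["A", "A", "B", "A"],
    "year":   [2023, 2023, 2023, 2024],
    "amount": [10.0, 5.0, 7.0, 2.0],
})
wide = pivot_donations(raw)
print(wide)
```

In Airflow this would typically run inside a `PythonOperator`, so its scheduling, logging, and failure alerts are handled by the scheduler like any other task.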

Data delivery

We have three ways to deliver data:

• PDFs/Sheets
• Web dashboards
• API

PDFs/Sheets

To export data in PDF or spreadsheet format we use JasperReports. JasperReports is one of the main applications for building powerful reports with charts and exporting them in different formats, such as PDF or XLSX. It supports different data sources, including relational databases and XML files. Reports can be generated from the command line with Java, and that command can be integrated inside Airflow. The reports generated with JasperReports can be sent by mail through an SMTP server.
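One way the Airflow integration could look is a task that shells out to the Java report generator. The `report-runner.jar` wrapper and its arguments below are placeholders, not a real JasperReports CLI; they only illustrate building the command an Airflow operator would execute.

```python
# Hypothetical sketch: wrap a JasperReports export as a shell command that
# an Airflow task could run. `report-runner.jar` and its flags are
# placeholders for whatever Java entry point the deployment uses.
def jasper_export_cmd(report_path: str, out_dir: str, fmt: str = "pdf") -> list[str]:
    """Build the command line that renders one report into `out_dir`."""
    return [
        "java", "-jar", "report-runner.jar",  # placeholder wrapper jar
        "--report", report_path,
        "--format", fmt,
        "--output", out_dir,
    ]

cmd = jasper_export_cmd("/reports/entity_balance.jasper", "/exports", "pdf")
print(" ".join(cmd))
# An Airflow task would then execute it, e.g. subprocess.run(cmd, check=True)
```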

Web dashboards

For web dashboards we use Apache Superset. Apache Superset can build different types of charts and combine them into dashboards. It supports self-service charts and sending reports via mail. It also allows implementing row-level security, so that each user can only view the rows in the data source that belong to their entities.

API

The third way to deliver data is through API endpoints. We use FastAPI, a library that makes it simple to build endpoints with user security. It also provides a Swagger UI for viewing the documentation of each endpoint and executing it.

Other parts in the architecture

This architecture also exposes Prometheus monitoring endpoints to track server availability and CPU and memory levels. We also have a data governance website with the data lineage generated by DBT and the downstream lineage through Superset datasets, charts, and dashboards.
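For reference, a monitoring endpoint returns metrics in the Prometheus text exposition format. The helper below formats one gauge by hand purely to show the format; real deployments would use an exporter such as node_exporter or the `prometheus_client` library instead (metric and label names here are invented).

```python
# Sketch of the Prometheus text exposition format a monitoring endpoint
# serves. Hand-formatting is for illustration only.
def render_metric(name: str, value: float, help_text: str, labels=None) -> str:
    """Render one gauge metric in Prometheus text format."""
    label_str = ""
    if labels:
        inner = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        label_str = "{" + inner + "}"
    return (f"# HELP {name} {help_text}\n"
            f"# TYPE {name} gauge\n"
            f"{name}{label_str} {value}\n")

text = render_metric("dwh_cpu_usage_ratio", 0.42,
                     "CPU usage of the DWH server", {"host": "dwh-01"})
print(text)
```

The Prometheus server scrapes this text over HTTP on a schedule and stores the samples as time series.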

Web analytics

There is the option to implement web analytics with Matomo, one of the leading open-source web analytics platforms. Matomo is GDPR compliant and anonymises all user data.

Costs

The cost of this architecture without Matomo is 70 €/month. It includes:

• A server with all components installed
• Daily backups
• Server monitoring

The cost of Matomo Cloud is 220 €/year.