Technology stack
This section is about identifying the right technology for the project's goals.
Table of Contents
- 1. General
- 2. Dashboard solution
- 3. Data pipelines
- 4. Database Management Systems
- 5. Data Science Tools
- 6. Distributed Computing Systems
1. General
For our frontend, we decided to use Dash together with Flask and build a one-page application, which provides a smoother user experience. Dash was mainly used for the data visualization and the interaction possibilities with those graphs, while Flask was responsible for the routing and the data stream. Our application ran inside a Gunicorn web server on a Google Cloud virtual machine. Although we could have used CSV files as data storage as well, a database is the more versatile data storage solution. To ease the implementation of our dashboard with a different dataset, we decided to implement a PostgreSQL database. To connect the database with our Python-based application, we used the Psycopg2 connector. The data cleaning, manipulation, and preparation were done on a separate virtual machine. The Python library Pandas was used for the handling of our dataset, while the XGBoost algorithm was used to predict the maintenance need of each vehicle.
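As a rough illustration of this setup, the sketch below mounts a Dash app on a Flask server so that Gunicorn can serve it; the layout and names are placeholders, not the project's actual code.

```python
# Minimal sketch (placeholder names): a Dash app mounted on a Flask
# server, so the combined application can be served by Gunicorn.
import dash
import flask
from dash import dcc, html

server = flask.Flask(__name__)            # Flask handles routing and data endpoints
app = dash.Dash(__name__, server=server)  # Dash renders the one-page UI

app.layout = html.Div([
    html.H1("Fleet Analytics Dashboard"),
    dcc.Graph(id="example-graph"),        # placeholder visualization
])

if __name__ == "__main__":
    app.run_server(debug=True)            # development server only

# In production, Gunicorn serves the underlying Flask instance, e.g.:
#   gunicorn app:server --bind 0.0.0.0:8080
```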
2. Dashboard solution
The Python Dash web dashboard solution was chosen for this project. It is easier to scale and provides more complex and dynamic data visualization possibilities than an Angular/Express dashboard with visualization libraries such as ngx-charts. The Shiny dashboard solution was discarded because most team members prefer Python (Shiny uses the programming language R).
2.1. Dash
Dash is a productive Python framework for building web applications. Written on top of Flask, Plotly.js, and React.js, Dash is ideal for building data visualization apps with highly custom user interfaces in pure Python. It is particularly suited for anyone who works with data in Python.
- Python framework Dash for building web dashboards (built on top of Flask, Plotly.js, and React.js)
- batch processing of the telematics data (leaving the option open to implement stream processing later)
Characteristics
- Dash apps are rendered in the web browser. You can deploy your apps to servers and then share them through URLs.
- Since Dash apps are viewed in the web browser, Dash is inherently cross-platform and mobile ready.
- Every aesthetic element of the app is customizable: The sizing, the positioning, the colors, the fonts. Dash apps are built and published in the Web, so the full power of CSS is available.
- CSS and default styles are kept out of the core library for modularity, independent versioning, and to encourage Dash app developers to customize the look and feel of their apps. The Dash core team maintains a core style guide.
- While Dash apps are viewed in the web browser, you don’t have to write any JavaScript or HTML. Dash provides a Python interface to a rich set of interactive web-based components.
- Dash ships with a Graph component that renders charts with plotly.js. Plotly.js is a great fit for Dash: it’s declarative, open source, fast, and supports a complete range of scientific, financial, and business charts.
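To make the last two points concrete, here is a hypothetical sketch of that interaction model: a dropdown updates a plotly.js chart through a Dash callback, entirely in Python. The data and component names are made up for illustration.

```python
# Hypothetical example: an interactive Dash graph without any JavaScript.
import pandas as pd
import plotly.express as px
from dash import Dash, Input, Output, dcc, html

# Toy telematics data (illustrative only)
df = pd.DataFrame({
    "vehicle": ["A", "A", "B", "B"],
    "day": [1, 2, 1, 2],
    "mileage": [120, 90, 150, 110],
})

app = Dash(__name__)
app.layout = html.Div([
    dcc.Dropdown(
        options=[{"label": v, "value": v} for v in df["vehicle"].unique()],
        value="A",
        id="vehicle-dropdown",
    ),
    dcc.Graph(id="mileage-graph"),  # rendered with plotly.js
])

@app.callback(Output("mileage-graph", "figure"), Input("vehicle-dropdown", "value"))
def update_graph(vehicle):
    # Re-render the chart whenever the dropdown selection changes
    subset = df[df["vehicle"] == vehicle]
    return px.line(subset, x="day", y="mileage", title=f"Mileage of vehicle {vehicle}")

if __name__ == "__main__":
    app.run_server(debug=True)
```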
2.2. Angular dashboard with ngx-charts
- Python web framework with Flask or Django
- Python data analytics tool (e.g. Pandas)
- dashboard with Angular, Angular Material UI components, and a respective data visualization tool (e.g. ngx-charts)
2.3. Shiny
Shiny is an open source R package that provides an elegant web framework for building web applications and dashboards straight from R. It helps you turn your analyses into interactive web applications without requiring HTML, CSS or JavaScript knowledge. You have two package options for building Shiny dashboards: flexdashboard and shinydashboard:
2.3.1. Flexdashboard
Flexdashboard provides easy interactive dashboards for R, using R Markdown to publish a group of related data visualizations as a dashboard. It supports a wide variety of components, including htmlwidgets; base, lattice, and grid graphics; tabular data; gauges; value boxes; and text annotations. It can only run interactive code client-side (in embedded JavaScript).
Characteristics
- Flexible: easy-to-specify row and column-based layouts with intelligent resizing to fill the browser
- Adapted for display on mobile devices
- Offers storyboard layouts for presenting sequences of visualizations
- Static or dynamic
- CSS flexbox layout
2.3.2. Shinydashboard
Shinydashboard is a more complex dashboard package that uses Shiny UI code and adds considerably more functionality to your dashboard. It can implement any layout and contains more specialized widgets designed to work in a dashboard layout.
Characteristics
- Uses Shiny UI code
- Dynamic
- Bootstrap grid layout
- Not quite as easy to use as flexdashboard
3. Data pipelines
For this project, data needs to be gathered from various sources and made available for analysis and visualization. Data can either be processed only once in a batch or continuously in a stream. This project will use batch processing first; if there is enough time left, stream processing will be implemented. The reference for this section and further information can be found here: [Big Data Battle: Batch Processing vs Stream Processing](https://medium.com/@gowthamy/big-data-battle-batch-processing-vs-stream-processing-5d94600d8103)
An all-in-one solution for building data pipelines is [Apache Hadoop](https://subscription.packtpub.com/book/application_development/9781788995092/1/ch01lvl1sec14/overview-of-the-hadoop-ecosystem), a collection of open-source software utilities that facilitate using a network of many computers to solve problems involving massive amounts of data and computation. (Diagram: brief overview of the Hadoop ecosystem in the Apache technology stack.)
3.1. Batch
Batch processing means processing blocks of data that have already been stored over a period of time, for example all the transactions that a major financial firm has performed in a week. Such data can comprise millions of records per day and can be stored as files or database records.
Possible solutions
- Spark: [Apache Spark quick start guide](https://spark.apache.org/docs/latest/quick-start.html) (see the sketch after this list)
- Ignite
- direct solution: uploading a CSV file with the data into the source code of the application
- Hadoop MapReduce
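A minimal PySpark sketch of the batch option could look like the following; the file name and columns are assumptions, not the project's actual schema.

```python
# Batch processing sketch with Apache Spark (assumed file and columns).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("fleet-batch").getOrCreate()

# Read a block of already-stored telematics data in one batch
trips = spark.read.csv("telematics.csv", header=True, inferSchema=True)

# Example aggregation: average speed per vehicle
result = trips.groupBy("vehicle_id").agg(F.avg("speed").alias("avg_speed"))
result.show()

spark.stop()
```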
3.2. Streaming
Stream processing allows analytics results to be obtained in real time: data are processed as they arrive and can be fed into an analytics tool as soon as they are generated. Apache Spark enables distributed data analysis (also in the cloud). (Figure: how Spark processes data in real time.)
Possible solution
- Cron job (manually fetching the data every x seconds)
- Delta Lake (Databricks): costs too much (Apache HBase as an alternative)
- A virtual machine like [Google Cloud](https://console.cloud.google.com/getting-started?pli=1) for the data simulation and streaming of the data to the application. Possible streaming solutions are:
- Apache Kafka
- Apache Flink
- Apache Storm
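For illustration, a stream processing job with Spark Structured Streaming consuming an Apache Kafka topic could be sketched as follows; the broker address and topic name are placeholders, and the job additionally needs the spark-sql-kafka connector package.

```python
# Stream processing sketch: Spark Structured Streaming reading from Kafka.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("fleet-stream").getOrCreate()

# Subscribe to a Kafka topic; records arrive as key/value pairs
events = (spark.readStream
               .format("kafka")
               .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
               .option("subscribe", "telematics")                    # placeholder topic
               .load())

# Process records as they are generated; here they are simply
# decoded and written to the console for inspection
query = (events.selectExpr("CAST(value AS STRING)")
               .writeStream
               .format("console")
               .start())
query.awaitTermination()
```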
4. Database Management Systems
PostgreSQL was chosen from these options: PostgreSQL, SQLite, and MySQL. The reason for this is that it has better Python support than MySQL. Furthermore, SQLite is not supported by the Google Cloud database service.
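A minimal sketch of connecting the PostgreSQL database to the Python application with the Psycopg2 connector (see section 1) might look like this; the credentials and table name are placeholders.

```python
# Psycopg2 sketch (placeholder credentials and table): query PostgreSQL
# and load the result into a Pandas DataFrame for the dashboard.
import pandas as pd
import psycopg2

conn = psycopg2.connect(
    host="localhost",
    dbname="fleet",
    user="dashboard",
    password="secret",
)

# Pull query results straight into a DataFrame
vehicles = pd.read_sql_query("SELECT * FROM vehicles;", conn)
conn.close()

print(vehicles.head())
```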
Resources:
- Why is PostgreSQL a good database choice
- Which database engine to choose for Django app?
- Why it is not possible to setup a SQLite database on a Google Cloud VM
5. Data Science Tools
Due to the team's prior knowledge, Pandas, NumPy, and XGBoost were chosen as the data science Python libraries for data analytics.
- Pandas: library for data analysis and manipulation in Python
- NumPy: provides all of the basic functions in scientific computing and is able to process lots of data quickly
- Modin: enables working with large data sets while keeping the Pandas syntax
- Wget: utility for non-interactive download of files from the Web.
- Pendulum: Python package to ease datetimes manipulations
- XGBoost: implements machine learning algorithms under the gradient boosting framework. It provides parallel tree boosting (also known as GBDT or GBM) that solves many data science problems.
Resource: Beginner’s Guide to Data Science Libraries in Python
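As a rough sketch of the XGBoost workflow behind the maintenance prediction mentioned in section 1, the example below trains a gradient-boosted classifier on toy data; the features and labels are made up for illustration.

```python
# XGBoost sketch with fabricated data (illustrative only).
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split

# Toy feature matrix, e.g. mileage, engine hours, error-code count
X = np.random.rand(200, 3)
y = np.random.randint(0, 2, 200)  # 1 = vehicle needs maintenance

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

# Gradient-boosted decision trees (GBDT)
model = xgb.XGBClassifier(n_estimators=100, max_depth=4)
model.fit(X_train, y_train)

print("accuracy:", model.score(X_test, y_test))
```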
6. Distributed Computing Systems
Processing the data could be distributed using one of these solutions (if there is enough time left):
- Ray: framework for building and running distributed applications (see the sketch below)
- Dataproc with PySpark: provides tools for batch processing, querying, streaming, and machine learning
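A minimal Ray sketch (the function and data are purely illustrative) shows how a per-vehicle computation could be dispatched as parallel remote tasks:

```python
# Ray sketch: run an illustrative per-vehicle analysis as parallel tasks.
import ray

ray.init()  # start or connect to a Ray cluster

@ray.remote
def analyze(vehicle_id):
    # Placeholder for a real per-vehicle computation
    return vehicle_id * 2

# Dispatch the tasks across the cluster and collect the results
futures = [analyze.remote(i) for i in range(8)]
print(ray.get(futures))
```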