Technology stack
This section is about identifying the right technology for the project's goals.
Table of Contents
- 1. General
- 2. Dashboard solution
- 3. Data pipelines
- 4. Database Management Systems
- 5. Data Science Tools
- 6. Distributed Computing Systems
1. General
For our frontend, we decided to use Dash together with Flask and build a one-page application, which provides a smoother user experience. Dash was mainly used for the data visualization and the interaction possibilities with those graphs, while Flask was responsible for the routing and the data stream. Our application ran inside a Gunicorn web server on a Google Cloud virtual machine. Although we could have used CSV files as data storage as well, a database is the more versatile data storage solution. To ease the implementation of our dashboard with a different dataset, we decided to implement a PostgreSQL database. To connect the database with our Python-based application, we used the Psycopg2 connector. The data cleaning, manipulation, and preparation were done on a separate virtual machine. The Python library Pandas was used for the handling of our dataset, while the XGBoost algorithm was used to predict the maintenance need of each vehicle.
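As a rough illustration of this setup, the sketch below mounts a Dash app on a Flask server so that Gunicorn can serve it; the layout and names are placeholders, not the project's actual code.

```python
# Minimal sketch (placeholder names): a Dash app mounted on a Flask
# server, so the combined application can be served by Gunicorn.
import dash
import flask
from dash import dcc, html

server = flask.Flask(__name__)            # Flask handles routing and data endpoints
app = dash.Dash(__name__, server=server)  # Dash renders the one-page UI

app.layout = html.Div([
    html.H1("Fleet Analytics Dashboard"),
    dcc.Graph(id="example-graph"),        # placeholder visualization
])

if __name__ == "__main__":
    app.run_server(debug=True)            # development server only

# In production, Gunicorn serves the underlying Flask instance, e.g.:
#   gunicorn app:server --bind 0.0.0.0:8080
```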
2. Dashboard solution
The Python Dash web dashboard solution was chosen for this project. It is easier to scale and provides more complex and dynamic data visualization possibilities than an Angular/Express dashboard with visualization libraries such as ngx-charts. The Shiny dashboard solution was discarded because most team members prefer Python (Shiny uses the programming language R).
2.1. Dash
Dash is a productive Python framework for building web applications. Written on top of Flask, Plotly.js, and React.js, Dash is ideal for building data visualization apps with highly custom user interfaces in pure Python. It is particularly suited for anyone who works with data in Python.
- Python framework Dash for building web dashboards (built on top of Flask, Plotly.js, and React.js)
- batch processing of the telematics data (leaving the option open to implement stream processing later)
Characteristics
- Dash apps are rendered in the web browser. You can deploy your apps to servers and then share them through URLs.
- Since Dash apps are viewed in the web browser, Dash is inherently cross-platform and mobile ready.
- Every aesthetic element of the app is customizable: The sizing, the positioning, the colors, the fonts. Dash apps are built and published in the Web, so the full power of CSS is available.
- CSS and default styles are kept out of the core library for modularity, independent versioning, and to encourage Dash app developers to customize the look and feel of their apps. The Dash core team maintains a core style guide.
- While Dash apps are viewed in the web browser, you don’t have to write any JavaScript or HTML. Dash provides a Python interface to a rich set of interactive web-based components.
- Dash ships with a Graph component that renders charts with plotly.js. Plotly.js is a great fit for Dash: it’s declarative, open source, fast, and supports a complete range of scientific, financial, and business charts.
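To make the last two points concrete, here is a hypothetical sketch of that interaction model: a dropdown updates a plotly.js chart through a Dash callback, entirely in Python. The data and component names are made up for illustration.

```python
# Hypothetical example: an interactive Dash graph without any JavaScript.
import pandas as pd
import plotly.express as px
from dash import Dash, Input, Output, dcc, html

# Toy telematics data (illustrative only)
df = pd.DataFrame({
    "vehicle": ["A", "A", "B", "B"],
    "day": [1, 2, 1, 2],
    "mileage": [120, 90, 150, 110],
})

app = Dash(__name__)
app.layout = html.Div([
    dcc.Dropdown(
        options=[{"label": v, "value": v} for v in df["vehicle"].unique()],
        value="A",
        id="vehicle-dropdown",
    ),
    dcc.Graph(id="mileage-graph"),  # rendered with plotly.js
])

@app.callback(Output("mileage-graph", "figure"), Input("vehicle-dropdown", "value"))
def update_graph(vehicle):
    # Re-render the chart whenever the dropdown selection changes
    subset = df[df["vehicle"] == vehicle]
    return px.line(subset, x="day", y="mileage", title=f"Mileage of vehicle {vehicle}")

if __name__ == "__main__":
    app.run_server(debug=True)
```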
2.2. Angular dashboard with ngx-charts
- Python web framework with Flask or Django
- Python data analytics tool (e.g. Pandas)
- dashboard with Angular, Angular Material UI components, and a respective data visualization tool (e.g. ngx-charts)
2.3. Shiny
Shiny is an open source R package that provides an elegant web framework for building web applications and dashboards straight from R. It helps you turn your analyses into interactive web applications without requiring HTML, CSS or JavaScript knowledge. You have two package options for building Shiny dashboards: flexdashboard and shinydashboard:
2.3.1. Flexdashboard
Flexdashboard provides easy interactive dashboards for R, using R Markdown to publish a group of related data visualizations as a dashboard. It supports a wide variety of components, including htmlwidgets; base, lattice, and grid graphics; tabular data; gauges; value boxes; and text annotations. It can only run interactive code client-side (in embedded JavaScript).
Characteristics
- Flexible: easy-to-specify row and column-based layouts with intelligent resizing to fill the browser
- Adapted for display on mobile devices
- Offers storyboard layouts for presenting sequences of visualizations
- Static or dynamic
- CSS flexbox layout
2.3.2. Shinydashboard
Shinydashboard is a more complex dashboard package that uses Shiny UI code and adds considerably more functionality to your dashboard. It can implement any layout and contains more specialized widgets designed to work in a dashboard layout.
Characteristics
- Uses Shiny UI code
- Dynamic
- Bootstrap grid layout
- Not quite as easy to use as flexdashboard
3. Data pipelines
For this project, data needs to be gathered from various sources and made available for analysis and visualization. Data can either be processed only once in a batch or continuously in a stream. This project will use batch processing first; if there is enough time left, stream processing will be implemented. The reference for this section and further information can be found here: [Big Data Battle: Batch Processing vs Stream Processing](https://medium.com/@gowthamy/big-data-battle-batch-processing-vs-stream-processing-5d94600d8103)
An all-in-one solution for building data pipelines is [Apache Hadoop](https://subscription.packtpub.com/book/application_development/9781788995092/1/ch01lvl1sec14/overview-of-the-hadoop-ecosystem), a collection of open-source software utilities that facilitate using a network of many computers to solve problems involving massive amounts of data and computation. (Diagram: brief overview of the Hadoop ecosystem in the Apache technology stack.)
3.1. Batch
Batch processing means processing blocks of data that have already been stored over a period of time, for example all the transactions that a major financial firm has performed in a week. Such data can comprise millions of records per day and can be stored as files or database records.
Possible solutions
- Spark: [Apache Spark quick start guide](https://spark.apache.org/docs/latest/quick-start.html) (see the sketch after this list)
- Ignite
- direct solution: uploading a CSV file with the data into the source code of the application
- Hadoop MapReduce
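A minimal PySpark sketch of the batch option could look like the following; the file name and columns are assumptions, not the project's actual schema.

```python
# Batch processing sketch with Apache Spark (assumed file and columns).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("fleet-batch").getOrCreate()

# Read a block of already-stored telematics data in one batch
trips = spark.read.csv("telematics.csv", header=True, inferSchema=True)

# Example aggregation: average speed per vehicle
result = trips.groupBy("vehicle_id").agg(F.avg("speed").alias("avg_speed"))
result.show()

spark.stop()
```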
3.2. Streaming
Stream processing allows analytics results to be obtained in real time: data are processed as they arrive and can be fed into an analytics tool as soon as they are generated. Apache Spark enables distributed data analysis (also in the cloud). (Figure: how Spark processes data in real time.)
Possible solution
- Cron job (manually fetching the data every x seconds)
- Delta Lake (Databricks): costs too much (Apache HBase as an alternative)
- A virtual machine like [Google Cloud](https://console.cloud.google.com/getting-started?pli=1) for the data simulation and streaming of the data to the application. Possible streaming solutions are:
- Apache Kafka
- Apache Flink
- Apache Storm
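For illustration, a stream processing job with Spark Structured Streaming consuming an Apache Kafka topic could be sketched as follows; the broker address and topic name are placeholders, and the job additionally needs the spark-sql-kafka connector package.

```python
# Stream processing sketch: Spark Structured Streaming reading from Kafka.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("fleet-stream").getOrCreate()

# Subscribe to a Kafka topic; records arrive as key/value pairs
events = (spark.readStream
               .format("kafka")
               .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
               .option("subscribe", "telematics")                    # placeholder topic
               .load())

# Process records as they are generated; here they are simply
# decoded and written to the console for inspection
query = (events.selectExpr("CAST(value AS STRING)")
               .writeStream
               .format("console")
               .start())
query.awaitTermination()
```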
4. Database Management Systems
PostgreSQL was chosen from these options: PostgreSQL, SQLite, and MySQL. The reason for this is that it has better Python support than MySQL. Furthermore, SQLite is not supported by the Google Cloud database service.
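A minimal sketch of connecting the PostgreSQL database to the Python application with the Psycopg2 connector (see section 1) might look like this; the credentials and table name are placeholders.

```python
# Psycopg2 sketch (placeholder credentials and table): query PostgreSQL
# and load the result into a Pandas DataFrame for the dashboard.
import pandas as pd
import psycopg2

conn = psycopg2.connect(
    host="localhost",
    dbname="fleet",
    user="dashboard",
    password="secret",
)

# Pull query results straight into a DataFrame
vehicles = pd.read_sql_query("SELECT * FROM vehicles;", conn)
conn.close()

print(vehicles.head())
```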
Resources:
- Why is PostgreSQL a good database choice
- Which database engine to choose for Django app?
- Why it is not possible to setup a SQLite database on a Google Cloud VM
5. Data Science Tools
Due to the team's prior knowledge, Pandas, NumPy, and XGBoost were chosen as the data science Python libraries for data analytics.
- Pandas: library for data analysis and manipulation in Python
- NumPy: provides all of the basic functions in scientific computing and is able to process lots of data quickly
- Modin: enables working with large data sets while keeping the Pandas syntax
- Wget: utility for non-interactive download of files from the Web.
- Pendulum: Python package to ease datetimes manipulations
- XGBoost: implements machine learning algorithms under the gradient boosting framework. It provides parallel tree boosting (also known as GBDT or GBM) that solves many data science problems.
Resource: Beginner’s Guide to Data Science Libraries in Python
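As a rough sketch of the XGBoost workflow behind the maintenance prediction mentioned in section 1, the example below trains a gradient-boosted classifier on toy data; the features and labels are made up for illustration.

```python
# XGBoost sketch with fabricated data (illustrative only).
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split

# Toy feature matrix, e.g. mileage, engine hours, error-code count
X = np.random.rand(200, 3)
y = np.random.randint(0, 2, 200)  # 1 = vehicle needs maintenance

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

# Gradient-boosted decision trees (GBDT)
model = xgb.XGBClassifier(n_estimators=100, max_depth=4)
model.fit(X_train, y_train)

print("accuracy:", model.score(X_test, y_test))
```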
6. Distributed Computing Systems
Processing the data could be distributed using one of these solutions (if there is enough time left):
- Ray: framework for building and running distributed applications (see the sketch below)
- Dataproc with PySpark: provides tools for batch processing, querying, streaming, and machine learning
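A minimal Ray sketch (the function and data are purely illustrative) shows how a per-vehicle computation could be dispatched as parallel remote tasks:

```python
# Ray sketch: run an illustrative per-vehicle analysis as parallel tasks.
import ray

ray.init()  # start or connect to a Ray cluster

@ray.remote
def analyze(vehicle_id):
    # Placeholder for a real per-vehicle computation
    return vehicle_id * 2

# Dispatch the tasks across the cluster and collect the results
futures = [analyze.remote(i) for i in range(8)]
print(ray.get(futures))
```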