TransferRoom Data Engineer Assignment - ahmedmaken/TransferRoom GitHub Wiki

Data Engineer Assignment

Objective

Design a scalable and efficient data architecture that ingests large volumes of data for both OLTP and OLAP use cases. Emphasize best practices that improve performance, optimize costs, and uphold strong security standards.

Current Solution

Customers leverage the platform via conventional native web and mobile experiences.
In addition, large and expensive OLAP and OLTP processes run continuously, alongside a large number of scheduled processes and jobs.
All these processes connect to the primary production SQL Server database.

Issues with current solution

SQL Server is not inherently horizontally scalable, so as the volume of data increases it will become difficult to manage operationally
The production server is used for all analytical use cases, with no real-time read replica. This can severely degrade production performance, cause queries to fail, and even lead to data loss
No data model optimised for data extraction exists on SQL Server
No role-based access management is in place
No PII and sensitive data handling is in place
No data testing and validation is in place
No monitoring and alerting is in place for job completion, failure notifications, cost alerts, and resource consumption alerts
No infrastructure as code is present

What do we need?

Data warehouse capable of ingesting real-time and batch data
Medallion architecture to separate the ingestion layer from the reporting layer
Config-driven data pipelines to enable code reuse
CI/CD
PII and sensitive data handling
Monitoring and alerting
Infrastructure as code
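
The medallion layering listed above can be illustrated with a minimal sketch in plain Python. It is not the proposed implementation: the record fields (`id`, `category`, `amount`, `ts`) and the function names are hypothetical, and in the actual platform each layer would be a table in the warehouse rather than an in-memory list. The point is only the separation of concerns: bronze keeps raw data as received, silver validates and types it, and gold holds reporting-ready aggregates.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical raw records as they might arrive from a source system.
RAW_EVENTS = [
    {"id": "1", "category": " books ", "amount": "1500", "ts": "2024-01-05T10:00:00"},
    {"id": "2", "category": "books", "amount": "not-a-number", "ts": "2024-01-06T11:30:00"},
    {"id": "3", "category": "games", "amount": "900", "ts": "2024-01-07T09:15:00"},
]


def to_bronze(raw_events, ingested_at):
    """Bronze: keep records untouched, adding only ingestion metadata."""
    return [{**event, "_ingested_at": ingested_at} for event in raw_events]


def to_silver(bronze_rows):
    """Silver: cast types, trim strings, and drop rows that fail validation."""
    silver = []
    for row in bronze_rows:
        try:
            silver.append({
                "id": int(row["id"]),
                "category": row["category"].strip(),
                "amount": int(row["amount"]),
                "ts": datetime.fromisoformat(row["ts"]),
            })
        except (KeyError, ValueError):
            # Assumption: a real pipeline would quarantine the row instead.
            continue
    return silver


def to_gold(silver_rows):
    """Gold: reporting-ready aggregate, here the total amount per category."""
    totals = defaultdict(int)
    for row in silver_rows:
        totals[row["category"]] += row["amount"]
    return dict(totals)


bronze = to_bronze(RAW_EVENTS, ingested_at="2024-01-08T00:00:00")
silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

Because reporting queries read only from gold, changes to ingestion logic in bronze do not directly affect dashboards, which is the separation the list item asks for.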

Proposed Solution: Architecture

Real-time data ingestion

Delta Live Tables (DLT) pipelines
Delta Live Tables streaming tables

Handling PII and sensitive data

Pull sensitive and PII columns into a separate schema in the bronze layer
Mask sensitive and PII data using encryption techniques and dynamic masking functions
Limit access to sensitive datasets using role-based access management
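
The three steps above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the platform's implementation: the column names, the role name, and all function names are hypothetical, and keyed hashing (HMAC-SHA256) stands in for the encryption mentioned above because it is available in the standard library. In the warehouse itself, the same effect would more likely come from separate schemas, column masks, and grants.

```python
import hashlib
import hmac

PII_COLUMNS = {"email", "phone"}      # hypothetical sensitive column names
PII_READER_ROLES = {"data_steward"}   # hypothetical privileged role


def split_pii(row, key_column="user_id"):
    """Split a record into a non-sensitive part and a PII part sharing the key."""
    public = {k: v for k, v in row.items() if k not in PII_COLUMNS}
    private = {k: v for k, v in row.items() if k in PII_COLUMNS or k == key_column}
    return public, private


def pseudonymize(value, secret):
    """Keyed SHA-256 digest: stable across rows, so masked values still join."""
    return hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()


def read_pii(private_row, role, secret, key_column="user_id"):
    """Return clear values to privileged roles and masked values to everyone else."""
    if role in PII_READER_ROLES:
        return dict(private_row)
    return {
        k: v if k == key_column else pseudonymize(str(v), secret)
        for k, v in private_row.items()
    }


# Assumption: a real key would come from a secret store, never from source code.
SECRET = b"demo-only-secret"
row = {"user_id": 42, "email": "a@example.com", "phone": "+4512345678", "plan": "pro"}

public, private = split_pii(row)
analyst_view = read_pii(private, "analyst", SECRET)
steward_view = read_pii(private, "data_steward", SECRET)
```

Unlike encryption, a keyed hash cannot be reversed to recover the original value, so this variant suits analytics joins; where the clear value must be recoverable, a reversible scheme with managed keys would be needed instead.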

Version Control

One repo per data source
Config driven data pipelines
Single repo for infrastructure
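
A config-driven pipeline, as listed above, could look roughly like the following sketch. The step names, the config layout, and the `run_pipeline` helper are all hypothetical; the idea is that each data source repo would ship only a config like `PIPELINE_CONFIG`, while the step implementations are shared and reused.

```python
def rename_columns(rows, mapping):
    """Rename keys according to mapping; unmapped keys are kept as-is."""
    return [{mapping.get(k, k): v for k, v in row.items()} for row in rows]


def drop_nulls(rows, columns):
    """Drop rows where any of the given columns is missing or None."""
    return [r for r in rows if all(r.get(c) is not None for c in columns)]


def cast_columns(rows, types):
    """Cast the listed columns using a small set of named type casters."""
    casters = {"int": int, "float": float, "str": str}
    return [
        {k: casters[types[k]](v) if k in types else v for k, v in r.items()}
        for r in rows
    ]


# Shared registry: configs refer to steps by name only.
STEP_REGISTRY = {
    "rename_columns": rename_columns,
    "drop_nulls": drop_nulls,
    "cast_columns": cast_columns,
}


def run_pipeline(config, rows):
    """Apply each configured step in order, passing its options as keyword args."""
    for step in config["steps"]:
        func = STEP_REGISTRY[step["name"]]
        rows = func(rows, **step.get("options", {}))
    return rows


# Hypothetical per-source config; in practice this might live in a YAML or JSON file.
PIPELINE_CONFIG = {
    "source": "orders",
    "steps": [
        {"name": "rename_columns", "options": {"mapping": {"Id": "id", "Amt": "amount"}}},
        {"name": "drop_nulls", "options": {"columns": ["amount"]}},
        {"name": "cast_columns", "options": {"types": {"id": "int", "amount": "float"}}},
    ],
}

rows = [{"Id": "1", "Amt": "9.5"}, {"Id": "2", "Amt": None}]
result = run_pipeline(PIPELINE_CONFIG, rows)
print(result)
```

Adding a new source then means writing a new config rather than new pipeline code, and a unit test of each registered step covers every pipeline that uses it.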

Platform Engineering

Infrastructure as code using Terraform / Azure Resource Manager

CI/CD

Implementation of CI/CD using GitHub Actions

Formatting
Linting
Open-source vulnerability scan
Infrastructure plan and deploy
Unit tests
Integration tests
Preprod deployment
Production release

Monitoring and alerting

Monitor the platform using Azure Application Insights
Monitor data pipelines using a monitoring framework
Ingest pipeline attributes against tags in the transactional database
Create Grafana dashboards
Create Slack/Teams alerts for Airflow jobs
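
As a rough sketch of the alerting item above, the following plain-Python example turns job run records into alert messages and wraps each one in the JSON body that a Slack incoming webhook accepts (an object with a `text` field). The run record fields, the job names, and the duration threshold are hypothetical, and the example deliberately stops short of sending anything over the network.

```python
import json


def build_alerts(job_runs, max_duration_s):
    """Return one message per failed run or per run exceeding the duration limit."""
    alerts = []
    for run in job_runs:
        if run["status"] == "failed":
            error = run.get("error", "no error message")
            alerts.append(f"Job {run['job']} failed: {error}")
        elif run["duration_s"] > max_duration_s:
            alerts.append(
                f"Job {run['job']} ran {run['duration_s']}s, "
                f"over the {max_duration_s}s limit"
            )
    return alerts


def to_slack_payload(message):
    """Serialize a message as the JSON body for a Slack incoming webhook."""
    return json.dumps({"text": message})


# Hypothetical run records, e.g. collected from the scheduler's metadata.
JOB_RUNS = [
    {"job": "ingest_orders", "status": "success", "duration_s": 120},
    {"job": "ingest_users", "status": "failed", "duration_s": 15, "error": "timeout"},
    {"job": "build_gold", "status": "success", "duration_s": 4000},
]

alerts = build_alerts(JOB_RUNS, max_duration_s=3600)
payloads = [to_slack_payload(a) for a in alerts]
```

Keeping alert construction separate from delivery makes the rules easy to unit test, and a Teams channel would need its own payload format alongside the Slack one.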