Data Engineer Assignment
Objective
Design a scalable and efficient data architecture that seamlessly ingests large volumes of data for both OLTP and OLAP use cases. Emphasize best practices to enhance performance, optimize costs, and uphold the highest security standards.
Current Solution
- Customers leverage the platform via conventional native web and mobile experiences.
- In addition, large and expensive OLAP and OLTP processes run continuously, alongside a large number of scheduled processes and jobs.
- All these processes connect to the primary production SQL Server database.
Issues with the current solution
- SQL Server is not inherently horizontally scalable, which means that as the volume of data increases, it will become difficult to manage operationally
- The production server is used for all analytical use cases without a real-time replica. This can severely degrade production performance, cause queries to fail, and even lead to data loss
- No data modelling optimised for data extraction is in place on SQL Server
- No role-based access management is in place
- No PII and sensitive data handling is in place
- No data testing and validation is in place
- No monitoring and alerting is in place for job completion, failure notifications, cost alerts and resource consumption alerts
- No infrastructure as code is present
What do we need?
- Data warehouse capable of ingesting real-time and batch data
- Medallion architecture to separate the ingestion layer from the reporting layer
- Config-driven data pipelines to enable code reuse
- CI/CD
- PII and sensitive data handling
- Monitoring and alerting
- Infrastructure as code
Proposed Solution: Architecture
Real-time data ingestion
- Delta Live Tables (DLT) pipelines (see the sketch below)
- Delta Live streaming tables
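A minimal sketch of how bronze/silver streaming tables could be declared in a DLT pipeline, assuming raw events land as JSON files in cloud storage; the path, table names and columns below are hypothetical, and `spark` is provided by the DLT runtime.

```python
import dlt
from pyspark.sql.functions import col, current_timestamp

# Hypothetical landing path for raw transactional events.
RAW_EVENTS_PATH = "abfss://landing@<storage-account>.dfs.core.windows.net/events/"

@dlt.table(
    name="bronze_events",
    comment="Raw events ingested incrementally with Auto Loader",
)
def bronze_events():
    # Auto Loader (cloudFiles) picks up new files as they arrive,
    # giving near-real-time ingestion without manual file tracking.
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load(RAW_EVENTS_PATH)
        .withColumn("_ingested_at", current_timestamp())
    )

@dlt.table(
    name="silver_events",
    comment="Cleaned events for downstream OLAP models",
)
@dlt.expect_or_drop("valid_event_id", "event_id IS NOT NULL")
def silver_events():
    # Streaming read from bronze keeps the silver layer incremental,
    # and the expectation above doubles as basic data validation.
    return dlt.read_stream("bronze_events").select(
        col("event_id"), col("event_type"), col("payload"), col("_ingested_at")
    )
```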
Handling PII and sensitive data
- Pull sensitive and PII columns into a separate schema in the bronze layer
- Mask sensitive and PII data using encryption techniques and dynamic functions (see the sketch below)
- Limit access to sensitive datasets using role-based access management
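As an illustration, a hedged sketch of column-level masking in PySpark, assuming a hypothetical `bronze_pii.customers` table with `email` and `phone` columns; the salt would come from a secret scope or key vault rather than being hard-coded.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import sha2, concat, col, lit

spark = SparkSession.builder.getOrCreate()

# Hypothetical salt; in practice this would be read from a secret scope / key vault.
SALT = "replace-with-secret"

def mask_pii(df, pii_columns):
    """Replace PII columns with salted SHA-256 hashes so joins still work
    on the hashed values but raw values are never exposed downstream."""
    for c in pii_columns:
        df = df.withColumn(c, sha2(concat(col(c).cast("string"), lit(SALT)), 256))
    return df

# Hypothetical bronze table holding raw customer records.
customers = spark.table("bronze_pii.customers")
masked = mask_pii(customers, ["email", "phone"])
masked.write.mode("overwrite").saveAsTable("silver.customers_masked")
```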
Version Control
- One repo per data source
- Config-driven data pipelines (see the sketch below)
- Single repo for infrastructure
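A minimal sketch of a config-driven entry point, assuming a per-source JSON config that declares the tables to load; the file layout, keys and paths below are hypothetical.

```python
import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def run_ingestion(config_path: str) -> None:
    """Read a per-source config file and run the same generic load logic for
    every table it declares, so adding a table is a config change, not new code."""
    with open(config_path) as f:
        config = json.load(f)

    for table in config["tables"]:
        df = (
            spark.read.format(table.get("format", "parquet"))
            .load(table["source_path"])
        )
        (
            df.write.mode(table.get("mode", "append"))
            .saveAsTable(f'{config["target_schema"]}.{table["name"]}')
        )

# Example config (hypothetical):
# {
#   "target_schema": "bronze_transfers",
#   "tables": [
#     {"name": "players", "source_path": "/mnt/landing/players", "format": "json"},
#     {"name": "clubs",   "source_path": "/mnt/landing/clubs"}
#   ]
# }
run_ingestion("configs/transfers_source.json")
```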
Platform Engineering
- Infrastructure as code using Terraform / Azure Resource Manager (see the sketch below)
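Since the examples on this page are kept in Python, the sketch below uses Pulumi's Azure Native provider as a Python stand-in for the equivalent Terraform/ARM definitions; the resource names are hypothetical.

```python
import pulumi
from pulumi_azure_native import resources, storage

# Resource group for the data platform (name and region are hypothetical).
rg = resources.ResourceGroup("data-platform-rg", location="westeurope")

# ADLS Gen2 account for the lake (hierarchical namespace enabled).
lake = storage.StorageAccount(
    "datalake",
    account_name="trdatalake001",  # hypothetical; must be globally unique
    resource_group_name=rg.name,
    location=rg.location,
    kind=storage.Kind.STORAGE_V2,
    sku=storage.SkuArgs(name=storage.SkuName.STANDARD_LRS),
    is_hns_enabled=True,
)

pulumi.export("lake_account_name", lake.name)
```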
CI/CD
Implementation of CI/CD using GitHub Actions, covering:
- Formatting
- Linting
- Open-source vulnerability scanning
- Infrastructure plan and deploy
- Unit tests (see the sketch below)
- Integration tests
- Pre-production deployment
- Production release
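As an illustration of the unit-test stage, a hedged pytest sketch that exercises the hypothetical `mask_pii` helper from the PII section against a local Spark session; in a real repo the helper would be imported from the pipeline package rather than inlined.

```python
import pytest
from pyspark.sql import SparkSession
from pyspark.sql.functions import sha2, concat, col, lit

# Inlined for brevity; normally imported, e.g. `from pipelines.masking import mask_pii`.
def mask_pii(df, pii_columns, salt="test-salt"):
    for c in pii_columns:
        df = df.withColumn(c, sha2(concat(col(c).cast("string"), lit(salt)), 256))
    return df

@pytest.fixture(scope="session")
def spark():
    return SparkSession.builder.master("local[1]").appName("unit-tests").getOrCreate()

def test_mask_pii_hides_raw_values(spark):
    df = spark.createDataFrame([("a@example.com", "Alice")], ["email", "name"])
    masked = mask_pii(df, ["email"]).collect()[0]

    # Raw email must no longer appear, while non-PII columns are untouched.
    assert masked["email"] != "a@example.com"
    assert len(masked["email"]) == 64  # hex-encoded SHA-256
    assert masked["name"] == "Alice"
```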
Monitoring and alerting
- Monitor the platform using Azure Application Insights
- Monitor data pipelines using a monitoring framework
- Ingest pipeline run attributes against tags in the transactional database
- Create Grafana dashboards
- Create Slack/Teams alerts for Airflow jobs (see the sketch below)
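A minimal sketch of a failure alert for Airflow jobs, assuming Airflow 2.4+ and a Slack incoming-webhook URL in an environment variable; the DAG, task and variable names are hypothetical.

```python
import os
import requests
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.empty import EmptyOperator

def notify_slack_on_failure(context):
    """Failure callback: post the failed task and run details to Slack."""
    ti = context["task_instance"]
    message = (
        f":red_circle: Task `{ti.task_id}` in DAG `{ti.dag_id}` failed "
        f"(run {context['run_id']})."
    )
    requests.post(os.environ["SLACK_WEBHOOK_URL"], json={"text": message}, timeout=10)

default_args = {
    "owner": "data-engineering",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
    "on_failure_callback": notify_slack_on_failure,
}

with DAG(
    dag_id="transfers_ingestion",
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",
    catchup=False,
    default_args=default_args,
) as dag:
    EmptyOperator(task_id="placeholder_ingest_step")
```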