Data integration design
Data integration involves combining data from different sources to provide a centralised view that supports producing, transforming and testing data products. Analytics, reporting and operations that rely on data from multiple systems depend on a well designed data integration approach. An ONS-specific example is the Management Information system for Social Surveys, which takes input data from various sources (including Fusion, T-Mobile, reference data on a shared drive and an application called Blaise) and integrates it into a data warehouse. The awareness level of this skill is not currently defined in the GDD framework, but we include resources here to get started with it.
Awareness
Awareness: explain the basics and benefits of data integration design.
Skills:
- Basic knowledge of data formats (CSV, Excel, JSON)
- Be able to manually merge data using spreadsheets
- Understand what APIs and databases are (at a high level)
- Awareness of issues such as duplicates or mismatched IDs.
Tools: Excel
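Awareness-level work is spreadsheet based, but the duplicate and mismatched-ID issues listed above can also be illustrated with a few lines of pandas. The tables and values below are invented for the example.

```python
import pandas as pd

# Two small invented tables: survey responses and a staff reference list.
responses = pd.DataFrame({
    "staff_id": ["A01", "A02", "A02", "a03"],   # A02 is duplicated; a03 is lower case
    "hours":    [7.5, 6.0, 6.0, 8.0],
})
reference = pd.DataFrame({
    "staff_id": ["A01", "A02", "A03"],
    "team":     ["Field", "Field", "Office"],
})

# Duplicates: the same staff member appears twice in the responses.
print(responses[responses.duplicated("staff_id", keep=False)])

# Mismatched IDs: "a03" does not match "A03" until the casing is standardised.
merged = responses.merge(reference, on="staff_id", how="left")
print(merged[merged["team"].isna()])                        # rows that failed to match
responses["staff_id"] = responses["staff_id"].str.upper()   # one simple fix
```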
Informal learning:
Learning Hub: Awareness of Data Linkage – even data engineers who do not belong to the linkage engineer sub-role will need awareness-level knowledge of data linkage.
Working
Working: deliver data solutions in accordance with agreed organisational standards that ensure services are resilient, scalable and future-proof. Be able to write scripts/code for integration.
Skills:
- Be able to use Python/PySpark/SQL to join or merge data (illustrated in the sketch below)
- Understand primary keys, foreign keys and joins
- Be able to pull data from databases
Tools: Python (Pandas), SQL, PySpark
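As a rough illustration of the skills above, the sketch below uses Python's built-in sqlite3 module and pandas to pull two tables out of a database and join them on a primary/foreign key pair. The table names, columns and values are invented for the example.

```python
import sqlite3
import pandas as pd

# Hypothetical in-memory database standing in for a real source system.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE surveys (survey_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE responses (
        response_id INTEGER PRIMARY KEY,
        survey_id   INTEGER REFERENCES surveys(survey_id),  -- foreign key
        value       REAL
    );
    INSERT INTO surveys   VALUES (1, 'LFS'), (2, 'OPN');
    INSERT INTO responses VALUES (10, 1, 3.2), (11, 1, 4.8), (12, 2, 2.1);
""")

# Pull data out of the database with SQL, joining on the key columns.
df = pd.read_sql_query("""
    SELECT s.name, r.response_id, r.value
    FROM responses AS r
    JOIN surveys   AS s ON s.survey_id = r.survey_id
""", con)

# The same join expressed in pandas after reading each table separately.
surveys = pd.read_sql_query("SELECT * FROM surveys", con)
responses = pd.read_sql_query("SELECT * FROM responses", con)
merged = responses.merge(surveys, on="survey_id", how="inner")
print(df)
print(merged)
```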
Informal learning:
Practitioner
Practitioner: select and implement the appropriate technologies to deliver resilient, scalable and future-proofed data solutions. Be able to build scalable integration pipelines.
Skills:
- Design ETL pipelines that combine data from multiple sources
- Resolve schema mismatches and handle slowly changing dimensions
- Implement data mapping and transformation logic
- Use orchestration tools and monitor data flows.
Tools: Custom ONS packages, including Plumb-ETL, DE Utils and DLH Utils, and Apache Airflow (for GCP)
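Here is a minimal PySpark sketch of the first three skills above: two invented source extracts with mismatched schemas are mapped onto a single target schema, combined and written out. The column names, values and output path are made up for illustration.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("toy-integration").getOrCreate()

# Two hypothetical extracts with mismatched schemas: different column names
# and a date held as a string in each source.
legacy = spark.createDataFrame(
    [("A01", "2024-01-31", 120.0)], ["staff_id", "period", "amount"])
modern = spark.createDataFrame(
    [("A02", "2024-02-29", 95.5)], ["employee_id", "reporting_date", "value_gbp"])

# Transform: map each source onto a single agreed target schema.
target_cols = ["staff_id", "reporting_date", "amount_gbp"]
legacy_std = legacy.select(
    F.col("staff_id"),
    F.to_date("period").alias("reporting_date"),
    F.col("amount").alias("amount_gbp"),
)
modern_std = modern.select(
    F.col("employee_id").alias("staff_id"),
    F.to_date("reporting_date").alias("reporting_date"),
    F.col("value_gbp").alias("amount_gbp"),
)

# Load: union the standardised sources and write a single output table.
combined = legacy_std.unionByName(modern_std).select(target_cols)
combined.write.mode("overwrite").parquet("/tmp/integrated_amounts")
```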
Informal learning:
Cloud Skills Boost Course: Building Batch Data Pipelines on Google Cloud. This is a long course (approximately 12 hours); it covers everything from beginner questions such as "What is an ETL pipeline?" up to "How do I facilitate a multi-service ETL pipeline on GCP?".
ONS colleagues will probably want to follow the Python track over the Java track.
Cloud Skills Boost Lab: Writing an ETL Pipeline using Apache Beam and Dataflow (Python)
Cloud Skills Boost Lab: Basic Airflow lab
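To give a flavour of the orchestration side, here is a minimal Airflow DAG sketch with placeholder extract, transform and load tasks. It assumes Airflow 2.4 or later (which introduced the schedule keyword); in a real pipeline the callables would invoke actual integration code rather than print statements.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder callables; in a real pipeline these would call extraction,
# transformation and load code (for example a PySpark job or a de-utils stage).
def extract():
    print("extract from source systems")

def transform():
    print("apply mapping and transformation logic")

def load():
    print("load into the warehouse")

with DAG(
    dag_id="toy_integration_pipeline",
    schedule="@daily",          # Airflow 2.4+ keyword; older releases use schedule_interval
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> transform_task >> load_task
```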
Expert
Expert: establish standards, keep them up to date, ensure adherence to them, and keep abreast of best practice in industry and across government.
Skills:
- Architect batch integrations (possibly also real-time integrations, although we are not aware of ONS currently running real-time integration)
- Design for scalability, security and compliance
- Govern metadata, data lineage and change data capture (CDC)
Tools: Kafka, Snowflake, Cloud Composer (GCP)
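As a rough sketch of the CDC point above, the following PySpark snippet applies a small, invented change-event feed in batch: it keeps the latest event per key and drops keys whose final event is a delete. A real CDC pipeline would read the feed from a source such as Kafka rather than an inline DataFrame.

```python
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.appName("toy-cdc").getOrCreate()

# Hypothetical change-data-capture feed: one row per change event.
changes = spark.createDataFrame(
    [
        ("A01", "insert", 100.0, "2024-03-01 09:00:00"),
        ("A01", "update", 110.0, "2024-03-01 10:30:00"),
        ("A02", "insert",  55.0, "2024-03-01 09:15:00"),
        ("A02", "delete", None,  "2024-03-01 11:00:00"),
    ],
    ["staff_id", "op", "amount", "event_time"],
)

# Keep only the latest event per key, then drop keys whose final event is a delete.
latest = Window.partitionBy("staff_id").orderBy(F.col("event_time").desc())
current_state = (
    changes
    .withColumn("rn", F.row_number().over(latest))
    .filter("rn = 1")
    .filter(F.col("op") != "delete")
    .drop("rn", "op")
)
current_state.show()
```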
General resources
We have a few in-house tools and standards that are useful at all levels of this skill.
ons-python-template
At the more fundamental level we have an ONS Python project template.
Link: ONSdigital/ons-python-template
De-Utils
Open-source data engineering utility package that contains useful functions for working with HDFS, Spark and ETL pipelines. The functions in this repo are not tailored to a specific project and can be used more widely. As the package is open source, anyone can contribute, either by changing existing code or by contributing a function they have written themselves.
Link: ONSdigital/de-utils
Plumb-etl
Python library that contains pipeline stages, transformers and utilities for creating ETL PySpark pipelines. Teams within Data Engineering in DEOD are in the process of integrating Plumb-ETL into their own data processes. Plumb-ETL can provide a standardised pipeline infrastructure, making communication and staff transitions between data engineering teams easier and enabling cross-team collaboration in maintaining and updating the infrastructure.
DLH_Utils
Although this learning roadmap is intended to be team-agnostic, we include DLH_Utils here because of the wide need for data linkage across DGO teams.
Link: Data-Linkage/DLH_Utils: Library containing common profiling, cleaning and linkage functions