Data integration design - ONSdigital/DRAFT_DE_learning_roadmap GitHub Wiki

Data integration involves combining data from different sources to provide a centralised view that supports producing, transforming and testing data-related products. Analytics, reporting and operations that rely on data from multiple systems depend on a well-designed integration. An ONS-specific example is the Management Information system for Social Surveys, which takes input data from various sources, including Fusion, T-Mobile, reference data on a shared drive and an app called Blaise, and integrates it into a data warehouse. The awareness level of this skill is not currently defined in the GDD framework, but we include resources here to help you get started.

Awareness

Awareness: explain the basics and benefits of data integration design.

Skills:

  • Basic knowledge of data formats (CSV, Excel, JSON)
  • Be able to manually merge data using spreadsheets
  • Understand what APIs and databases are (at a high level)
  • Awareness of issues such as duplicates or mismatched IDs (see the sketch below)
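
The tooling at this level is just a spreadsheet, but the same checks can be made concrete in a few lines of pandas. A minimal sketch, using entirely made-up IDs, columns and values:

```python
import pandas as pd

# Two made-up extracts that should share the same ID column.
cases = pd.DataFrame({"case_id": ["A1", "A2", "A2", "A3"], "status": ["open", "open", "open", "closed"]})
contacts = pd.DataFrame({"case_id": ["A1", "A4"], "phone": ["0123", "0456"]})

# Duplicates: the same ID appears more than once in one source.
print(cases[cases.duplicated("case_id", keep=False)])

# Mismatched IDs: IDs present in one source but missing from the other.
print(set(contacts["case_id"]) - set(cases["case_id"]))
```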

Tools: Excel

Informal learning:

Learning Hub: Awareness of Data Linkage – even data engineers who do not belong to the linkage engineer sub-role will need awareness-level knowledge of data linkage.

Working

Working: deliver data solutions in accordance with agreed organisational standards that ensure services are resilient, scalable and future-proof. Be able to write scripts/code for integration.

Skills:

  • Be able to use Python/PySpark/SQL to join or merge data (see the sketch below)
  • Understand primary keys, foreign keys and joins
  • Be able to pull data from databases
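
As a flavour of what this looks like in practice, here is a minimal pandas sketch of a key-based join, with the equivalent SQL in a comment. The table and column names are invented for illustration:

```python
import pandas as pd

# Invented extracts: 'respondents' carries the primary key, 'responses' holds a foreign key to it.
respondents = pd.DataFrame({"respondent_id": [1, 2, 3], "region": ["North", "South", "East"]})
responses = pd.DataFrame({"response_id": [10, 11], "respondent_id": [1, 3], "score": [7, 9]})

# A left join keeps every respondent, even those with no response yet.
merged = respondents.merge(responses, on="respondent_id", how="left")
print(merged)

# Equivalent SQL:
# SELECT r.respondent_id, r.region, s.response_id, s.score
# FROM respondents AS r
# LEFT JOIN responses AS s
#   ON s.respondent_id = r.respondent_id
```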

Tools: Python (Pandas), SQL, PySpark

Informal learning:

Practitioner

Practitioner: select and implement the appropriate technologies to deliver resilient, scalable and future-proofed data solutions. Be able to build scalable integration pipelines.

Skills:

  • Design ETL pipelines that combine data from multiple sources
  • Resolve schema mismatches and handle slowly changing dimensions (see the sketch after this list)
  • Implement data mapping and transformation logic
  • Use orchestration tools and monitor data flows
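
A minimal PySpark sketch of resolving a schema mismatch between two sources before combining them. The source names, columns and date formats are assumptions made up for illustration, not taken from any ONS pipeline:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("schema-align-sketch").getOrCreate()

# Made-up sources describing the same entity with different column names, types and date formats.
source_a = spark.createDataFrame([("A1", "2024-01-05", "3")], ["case_id", "created", "visits"])
source_b = spark.createDataFrame([(7, "A2", "05/01/2024")], ["n_visits", "id", "created_date"])

# Map each source onto an agreed target schema before combining.
aligned_a = source_a.select(
    F.col("case_id"),
    F.to_date("created", "yyyy-MM-dd").alias("created"),
    F.col("visits").cast("int").alias("visits"),
)
aligned_b = source_b.select(
    F.col("id").alias("case_id"),
    F.to_date("created_date", "dd/MM/yyyy").alias("created"),
    F.col("n_visits").cast("int").alias("visits"),
)

combined = aligned_a.unionByName(aligned_b)
combined.show()
```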

Tools: Custom ONS packages including Plumb-ETL, DE Utils and DLH Utils, plus Apache Airflow (for GCP)

Informal learning:

Cloud Skills Boost Course: Building Batch Data Pipelines on Google Cloud. This is a long course (approximately 12 hours): it covers everything from beginner questions such as "What is an ETL pipeline?" up to "How do I run a multi-service ETL pipeline on GCP?".

ONS colleagues will probably want to follow the Python track over the Java track.

Cloud Skills Boost Lab: Writing an ETL Pipeline using Apache Beam and Dataflow (Python)
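
Before the lab, it may help to see what a minimal Beam pipeline looks like when run locally with the direct runner. This sketch is not the lab's own code and the element values are made up:

```python
import apache_beam as beam

# A tiny local pipeline: create a few records, transform them, and print the result.
with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Create" >> beam.Create(["north,3", "south,5"])
        | "Parse" >> beam.Map(lambda line: line.split(","))
        | "ToDict" >> beam.Map(lambda parts: {"region": parts[0], "count": int(parts[1])})
        | "Print" >> beam.Map(print)
    )
```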

Cloud Skills Boost Lab: Basic Airflow lab
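
To give a flavour of what the Airflow lab covers, here is a minimal DAG sketch (newer Airflow 2 releases; older ones use schedule_interval instead of schedule). The DAG ID, task names and schedule are assumptions, not part of any ONS pipeline:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("pull data from the source systems")


def load():
    print("write the combined data to the warehouse")


# A two-step daily DAG: extract runs before load.
with DAG(
    dag_id="example_integration",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> load_task
```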

Expert

Expert: establish standards, keep them up to date, ensure adherence to them, and keep abreast of best practice in industry and across government.

Skills:

  • Architect batch integrations (and potentially real-time integrations, although we are not aware of ONS currently running real-time integration)
  • Design for scalability, security and compliance
  • Govern metadata, data lineage and change data capture (CDC), as sketched below
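
A minimal PySpark sketch of the change data capture idea in the last bullet: take a feed of change records, keep the latest change per key, and apply it to a target table. The column names, operation flags and values are all made up for illustration:

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cdc-apply-sketch").getOrCreate()

# Made-up CDC feed: one row per change, with an operation flag (U = upsert, D = delete).
changes = spark.createDataFrame(
    [(1, "Alice", "U", "2024-01-02"),
     (1, "Alicia", "U", "2024-01-03"),
     (2, "Bob", "D", "2024-01-02")],
    ["id", "name", "op", "changed_at"],
)

# Keep only the most recent change per key.
latest = (
    changes
    .withColumn("rn", F.row_number().over(Window.partitionBy("id").orderBy(F.col("changed_at").desc())))
    .filter("rn = 1")
    .drop("rn")
)

# Made-up current state of the target table.
target = spark.createDataFrame([(1, "Alice"), (2, "Bob"), (3, "Carol")], ["id", "name"])

# Apply the changes: remove keys that changed, then add back the surviving upserts (deletes are dropped).
updated = (
    target.join(latest.select("id"), "id", "left_anti")
    .unionByName(latest.filter("op != 'D'").select("id", "name"))
)
updated.show()
```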

Tools: Kafka, Snowflake, Cloud Composer (GCP)

General resources

We have a few in-house tools and standards which are useful at all levels of this skill.

ons-python-template

At the more fundamental level we have an ONS Python project template.

Link: ONSdigital/ons-python-template

De-Utils

An open-source data engineering utility package containing useful functions for working with HDFS, Spark and ETL. The functions in this repo are not tailored to a specific project and can be reused more widely. As the package is open source, anyone can contribute, either by changing existing code or by contributing a function they have written themselves.

Link: ONSdigital/de-utils

Plumb-etl

A Python library that contains pipeline stages, transformers and utilities for creating PySpark ETL pipelines. Teams within Data Engineering in DEOD are in the process of integrating plumb-etl into their own data processes. Plumb-ETL can provide a standardised pipeline infrastructure, resulting in easier communication and staff transition between data engineering teams, as well as cross-team collaboration in maintaining and updating the infrastructure.

Link: ONSdigital/plumb-etl: Library containing common PySpark ETL pipeline stages, transformers and various utilities.

DLH_Utils

Although this learning roadmap is intended to be team-agnostic, we include DLH_Utils here because of the wide need for data linkage across DGO teams.

Link: Data-Linkage/DLH_Utils: Library containing common profiling, cleaning and linkage functions