Home - ONSdigital/DRAFT_DE_learning_roadmap GitHub Wiki
Welcome to the DRAFT Data Engineering Learning Roadmap Homepage
This pathway is still under development and is missing many learning resources. Any questions can be directed to the Head of Profession for Data Engineering, the Head of the Data Engineering Community of Practice, or the Capability Team in DGO, DALI.
The purpose of this learning roadmap is to be a single place to find resources for learning core data engineering skills. These courses are only suggestions; you do not have to learn from these resources, and if you have alternatives you would like to share, please add them. The learning roadmap is not:
- Project specific - project specific training should be managed by your team and treated as separate from the learning roadmap.
- A brain dump - this is not intended to be a collection of 100 ways of how to learn git (as an example); it's supposed to be a curated list of the most useful and relevant training.
- Reinventing the wheel - there are already roadmaps and broad learning resources that cover more than one of the sub-topics here. See, for example, the DAPCats Wiki, Data Knowledge Hub, roadmap.sh, r/dataengineering, Cloud Skills Boost, and the DGO apprentice SharePoint, all of which contain many useful training links. What it means to be a data engineer in the ONS overlaps with what it means to be a data engineer elsewhere, so we recommend you use these resources. In this roadmap, however, we intend to collect learning resources that support all expectations of an ONS data engineer.
How to use this roadmap
Remember the 10:20:70 split
- 10% Formal training (e.g. structured learning, eLearning, classes)
- 20% Relationships (e.g. coaching, feedback, performance discussions)
- 70% Job experience
The roadmap is structured primarily according to the GDD role description for a data engineer; secondary to that, it follows the ONS Data Engineer Role Profile, which is currently undergoing restructuring. First, select a skill; then look through the different levels of that skill.
The different levels of expertise are:
- 0 = No knowledge/skills
- 1 = Awareness: You can describe the fundamentals of the skill and demonstrate basic knowledge of some of the skill's tools and techniques.
- 2 = Working: You can apply the skill with some support and adopt the most appropriate tools and techniques.
- 3 = Practitioner: You can apply the skill without support, determine and use the most appropriate tools and techniques, and share knowledge and experience of the skill.
- 4 = Expert: You can lead and guide a team or organisation in the skill's best practices, and teach the skill's advanced tools and techniques.
It is not enough to just do the training courses. To be an expert, for example, you would have to have led a team to complete a project where that skill was used. Working knowledge would require you to have used the skill in a work project.
Please note: if, for example, you are looking to upskill to practitioner level in a skill, please don't just look at the practitioner section of the appropriate page. You will usually find lots of helpful resources in the general resources section first.
This learning roadmap does not represent the total training offering of DGO. If you think a level 5 data engineering apprenticeship or bootcamp might benefit you, or if you are interested in doing some pair programming or knowledge sharing, please contact Kayla directly.
Keeping a CPD log
A requirement of being part of the Government Digital and Data profession is maintaining your Continuous Professional Development (CPD). Keeping a log is a great way to ensure you are fulfilling this requirement; it also gives you a solid basis for development chats in check-ins and helps you track progress towards your personal development plan. You can find a template here, but you do not have to use it - record your development in the way that suits you best.
Platforms
The further you travel on your data journey, the more you are likely to specialise in one particular platform or set of tools. However, there are always core concepts to master as well. Whatever cloud platform you are working on, you should always try to follow the government cloud first policy, although this is not always possible.
The Cloud Data Platform (CDP) can be thought of as an 'extra layer' on top of AWS cloud services, where Cloudera handles the pains of managing infrastructure; if we were to use cloud services directly (e.g. use AWS itself), we would have to manage the infrastructure ourselves. There are positives and negatives to using cloud services directly versus via a third party like Cloudera, and which is preferable depends on the use case. At the ONS, we don't currently have access to all Cloudera services; these will be added gradually over time. Additionally, both Cloudera and GCP have changed, added, and removed services over time, and will continue to do so. In the spirit of platform agnosticism, we include here a table of cloud components and the names of the most frequently used services on the two main ONS data platforms. This table is far from exhaustive; there are multiple GCP database services, for example, but Cloud SQL is likely to be the one ONS colleagues will use for our business needs.
Component | CDP Service | GCP Service |
---|---|---|
Data lake | AWS S3 Buckets - accessed from HUE | Cloud Storage |
Data warehouse | SQL Server and HDFS with Hive Tables | BigQuery |
Data transfer | Polybase | Cloud Functions to initiate Dataproc |
Data transformation | PySpark via Cloudera Machine Learning | PySpark via Dataproc (Cloudflow if a team created a brand new pipeline) |
Database | SQL Server with SSMS | SQL Server with Cloud SQL |
Version control | GitLab | GitHub |
You can find platform-specific training for GCP on Cloud Skills Boost, and ONS-specific GCP training on Confluence (e.g. the CATD homepage). There is a small amount of free Cloudera training on their website, with lots of ONS-specific CDP training on the DAPCats wiki.
Technology
We appreciate that users of this roadmap may want to upskill in particular technologies rather than skills. Unfortunately, there is no easy way to search the wiki on an on-net machine without cloning the wiki and searching it locally (e.g. Ctrl+F in an IDE). So instead, we recommend where to look for each technology here.
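If you do clone the wiki, a small script can stand in for Ctrl+F across every page at once. Below is a minimal sketch, assuming the wiki has already been cloned locally (GitHub wikis are usually cloneable as `<repo>.wiki.git`; the function name here is ours, not part of any tooling):

```python
import os

def search_wiki(root, term):
    """Case-insensitively search all Markdown files under `root` for `term`.

    Returns a sorted list of file paths whose contents contain the term.
    """
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".md"):
                continue  # wiki pages are Markdown files
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="ignore") as fh:
                if term.lower() in fh.read().lower():
                    matches.append(path)
    return sorted(matches)

# Example: search_wiki("DRAFT_DE_learning_roadmap.wiki", "pyspark")
```

Any pages that mention the technology you are interested in will be listed, which you can then open in the wiki itself.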
SQL
If you are looking to upskill yourself in SQL, please look in Data analysis and synthesis. There are also some labs involving SQL on Cloud Skills Boost and Cloudera Training.
Python
If you are looking to upskill yourself in Python, first please see the guide for getting started with Python on an on-net laptop on the ASAP wiki. Then see Data analysis and synthesis and Programming and build (data engineering).
PySpark
For a similar set of resources for PySpark, first see the DAPCats wiki, then see Data analysis and synthesis and Programming and build (data engineering).
Essential Skills
The list of essential skills for any data engineer:
- Communicating between the technical and non-technical
- Data analysis and synthesis
- Data development process
- Data innovation
- Data integration design
- Data modelling
- Metadata management
- Problem resolution (data) including Agile working
- Programming and build (data engineering)
- Technical understanding
- Testing including test driven development (TDD)
Note that data engineers have four skills in common with data architects and so some resources you find on the communicating, data analysis, data modelling, and metadata management pages have been taken directly from the data architect learning pathway. We thank the DAB team for their help with these sections.
The data engineer role profile also outlines essential responsibilities for a data engineer at each level. The role profile includes the following key responsibilities:
- Implement data flows to connect operational systems, data for analytics and business intelligence (BI) systems
- Document source-to-target mappings
- Re-engineer manual data flows to enable scaling and repeatable use
- Support the build of data streaming systems
- Write ETL (extract, transform, load) scripts and code to ensure the ETL process performs optimally
- Develop business intelligence reports that can be reused
- Build accessible data for analysis
Behaviours
As well as skills, demonstrating behaviours at appropriate levels is an essential part of progression. The behaviours related to the data engineering role are:
- Making Effective Decisions
- Working Together
- Changing and Improving
- Delivering at Pace
- Leadership
- Communicating and Influencing
You can find descriptions of these behaviours and more information here.
Development of this roadmap
This roadmap is under construction; we attempt to measure its completeness in this section. Each cell gives the estimated percentage complete for that skill at that level (N/A where the level does not apply).
GDD Skill | Awareness | Working | Practitioner | Expert | TOTAL COMPLETE % |
---|---|---|---|---|---|
Comms | 50 | 50 | 50 | 50 | 50 |
Analysis | 100 | 100 | 25 | N/A | 75 |
Data development | 100 | 100 | 100 | 0 | 75 |
Data innovation | 50 | 50 | 50 | 75 | 56 |
Data integration | 100 | 100 | 100 | 25 | 81 |
Data modelling | 100 | 100 | 100 | 100 | INT REVIEW READY |
Metadata | 75 | 75 | 75 | 25 | 63 |
Problem resolution | 100 | 100 | 100 | 100 | EXT REVIEW READY |
Programming & Build | 100 | 50 | 0 | 0 | 38 |
Technical understanding | 100 | 100 | 100 | 100 | INT REVIEW READY |
Testing | 100 | 100 | 100 | N/A | INT REVIEW READY |
TOTAL COMPLETE % | 89 | 84 | 73 | 53 | |
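The TOTAL COMPLETE % figures appear to be the mean of the numeric cells in each row (and column), skipping N/A entries and rounding halves upward. A minimal sketch of that calculation, with the function name and the use of `None` for N/A being our own conventions:

```python
import math

def percent_complete(scores):
    """Mean of the numeric level scores, ignoring N/A entries (None),
    rounded half-up to the nearest whole percent."""
    numeric = [s for s in scores if s is not None]
    return math.floor(sum(numeric) / len(numeric) + 0.5)

# Example rows from the table above (None marks N/A):
analysis = [100, 100, 25, None]       # -> 75
integration = [100, 100, 100, 25]     # -> 81
```

The same rule reproduces the column totals when applied down each level's column across all eleven skills.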