Home - ONSdigital/DRAFT_DE_learning_roadmap GitHub Wiki
Welcome to the DRAFT Data Engineering Learning Roadmap Homepage
This pathway is still under development and is missing many learning resources. Any questions can be directed to the Head of Profession for Data Engineering, the Head of the Data Engineering Community of Practice, or the Capability Team in DGO, DALI.
The purpose of this learning roadmap is to be a single place to find resources for learning core data engineering skills. These courses are only suggestions; you do not have to learn from these resources, and if you have alternatives you would like to share, please add them. The learning roadmap is not:
- Project specific - project specific training should be managed by your team and treated as separate from the learning roadmap.
- A brain dump - this is not intended to be a collection of 100 ways of how to learn git (as an example); it's supposed to be a curated list of the most useful and relevant training.
- Reinventing the wheel - there are already roadmaps and broad learning resources that cover more than one of the sub-topics here. See, for example, the DAPCats Wiki, Data Knowledge Hub, roadmap.sh, r/dataengineering, Cloud Skills Boost, and the DGO apprentice SharePoint, all of which contain many useful training links. What it means to be a data engineer in the ONS overlaps with what it means to be a data engineer elsewhere, so we recommend you use these resources. In this roadmap, however, we intend to collect learning resources that support all expectations of an ONS data engineer.
How to use this roadmap
Remember the 10:20:70 split
- 10% Formal training (e.g. structured learning, eLearning, classes)
- 20% Relationships (e.g. coaching, feedback, performance discussions)
- 70% Job experience
The roadmap is structured primarily according to the GDD role description for a data engineer; secondary to that, it follows the ONS Data Engineer Role Profile, which is currently undergoing restructuring. First, select a skill; then look through the different levels of that skill.
The different levels of expertise are:
- 0 = No knowledge/skills
- 1 = Awareness: You can describe the fundamentals of the skill and demonstrate basic knowledge of some of the skill's tools and techniques.
- 2 = Working: You can apply the skill with some support and adopt the most appropriate tools and techniques.
- 3 = Practitioner: You can apply the skill without support, determine and use the most appropriate tools and techniques, and share knowledge and experience of the skill.
- 4 = Expert: You can lead and guide a team or organisation in the skill's best practices, and teach the skill's advanced tools and techniques.
It is not enough to just do the training courses. To be an expert, for example, you would have to have led a team to complete a project where that skill was used. Working knowledge would require you to have used the skill in a work project.
Please note: if, for example, you are looking to upskill to practitioner level in a skill, please don't just look at the practitioner section of the appropriate page. You will usually find lots of helpful resources in the general resources section first.
This learning roadmap does not represent the total training offering of DGO. If you think a level 5 data engineering apprenticeship or bootcamp might benefit you, or if you are interested in doing some pair programming or knowledge sharing, please contact Kayla directly.
Keeping a CPD log
A requirement of being part of the Government Digital and Data profession is maintaining your Continuous Professional Development (CPD). Keeping a log is a great way to ensure you are fulfilling this requirement; it also gives you a solid basis for development chats in check-ins and helps you track progress towards your personal development plan. You can find a template here, but you do not have to use it - record your development in the way that suits you best.
Platforms
The further you travel on your data journey, the more you are likely to specialise in one particular platform or set of tools. However, there are always core concepts to master as well. Whatever cloud platform you are working on, you should always try to follow the government cloud first policy, although this is not always possible.
The Cloud Data Platform (CDP) can be thought of as an 'extra layer' on top of AWS cloud services, where Cloudera handles the pains of managing infrastructure; if we were to use cloud services directly (e.g. use AWS itself), we would have to manage the infrastructure ourselves. There are positives and negatives to using cloud services directly versus via a third party like Cloudera, and which is preferable depends on the use case. At the ONS, we don't currently have access to all Cloudera services; these will be added gradually over time. Additionally, both Cloudera and GCP have changed, added, and removed services over time, and will continue to do so. In the spirit of platform agnosticism, we include here a table of cloud components and the names of the most frequently used services on the two main ONS data platforms. This table is far from exhaustive; there are multiple GCP database services, for example, but Cloud SQL is likely to be the one ONS colleagues will use for our business needs.
Component | CDP Service | GCP Service |
---|---|---|
Data lake | AWS S3 Buckets - accessed from HUE | Cloud Storage |
Data warehouse | SQL Server and HDFS with Hive Tables | BigQuery |
Data transfer | Polybase | Cloud Functions to initiate Dataproc |
Data transformation | PySpark via Cloudera Machine Learning | PySpark via Dataproc (Cloudflow if a team created a brand new pipeline) |
Database | SQL Server with SSMS | SQL Server with Cloud SQL |
Version control | GitLab | GitHub |
You can find platform-specific training for GCP on Cloud Skills Boost, and ONS-specific GCP training on Confluence (e.g. the CATD homepage). There is a small amount of free Cloudera training on their website, with lots of ONS-specific CDP training on the DAPCats wiki.
Technology
We appreciate that users of this roadmap may want to upskill in particular technologies rather than skills. Unfortunately, there is no easy way to search the wiki on an on-net machine without cloning the wiki and searching it locally (e.g. Ctrl+F in an IDE). So instead, we recommend where to look for each technology here.
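If you do clone the wiki, a small script can stand in for Ctrl+F across every page at once. Below is a minimal sketch, assuming the wiki has already been cloned locally (GitHub wikis are usually cloneable as `<repo>.wiki.git`; the function name here is ours, not part of any tooling):

```python
import os

def search_wiki(root, term):
    """Case-insensitively search all Markdown files under `root` for `term`.

    Returns a sorted list of file paths whose contents contain the term.
    """
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".md"):
                continue  # wiki pages are Markdown files
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="ignore") as fh:
                if term.lower() in fh.read().lower():
                    matches.append(path)
    return sorted(matches)

# Example: search_wiki("DRAFT_DE_learning_roadmap.wiki", "pyspark")
```

Any pages that mention the technology you are interested in will be listed, which you can then open in the wiki itself.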
SQL
If you are looking to upskill yourself in SQL, please look in Data analysis and synthesis. There are also some labs involving SQL on Cloud Skills Boost and Cloudera Training.
Python
If you are looking to upskill yourself in Python, first please see the guide for getting started with Python on an on-net laptop on the ASAP wiki. Then see Data analysis and synthesis and Programming and build (data engineering).
PySpark
For a similar set of resources for PySpark, first see the DAPCats wiki, then see Data analysis and synthesis and Programming and build (data engineering).
Essential Skills
The list of essential skills for any data engineer:
- Communicating between the technical and non-technical
- Data analysis and synthesis
- Data development process
- Data innovation
- Data integration design
- Data modelling
- Metadata management
- Problem resolution (data) including Agile working
- Programming and build (data engineering)
- Technical understanding
- Testing including test driven development (TDD)
Note that data engineers have four skills in common with data architects and so some resources you find on the communicating, data analysis, data modelling, and metadata management pages have been taken directly from the data architect learning pathway. We thank the DAB team for their help with these sections.
The data engineer role profile also outlines essential responsibilities for a data engineer at each level. The role profile includes the following key responsibilities:
- Implement data flows to connect operational systems, data for analytics and business intelligence (BI) systems
- Document source-to-target mappings
- Re-engineer manual data flows to enable scaling and repeatable use
- Support the build of data streaming systems
- Write ETL (extract, transform, load) scripts and code to ensure the ETL process performs optimally
- Develop business intelligence reports that can be reused
- Build accessible data for analysis
Behaviours
As well as skills, demonstrating behaviours at appropriate levels is an essential part of progression. The behaviours related to the data engineering role are:
- Making Effective Decisions
- Working Together
- Changing and Improving
- Delivering at Pace
- Leadership
- Communicating and Influencing
You can find descriptions of these behaviours and more information here.
Development of this roadmap
This roadmap is under construction; we attempt to measure its completeness in this section. Each cell gives the estimated percentage complete for that skill at that level (N/A where the level does not apply).
GDD Skill | Awareness | Working | Practitioner | Expert | TOTAL COMPLETE % |
---|---|---|---|---|---|
Comms | 50 | 50 | 50 | 50 | 50 |
Analysis | 100 | 100 | 25 | N/A | 75 |
Data development | 100 | 100 | 100 | 0 | 75 |
Data innovation | 50 | 50 | 50 | 75 | 56 |
Data integration | 100 | 100 | 100 | 25 | 81 |
Data modelling | 100 | 100 | 100 | 100 | INT REVIEW READY |
Metadata | 75 | 75 | 75 | 25 | 63 |
Problem resolution | 100 | 100 | 100 | 100 | EXT REVIEW READY |
Programming & Build | 100 | 50 | 0 | 0 | 38 |
Technical understanding | 100 | 100 | 100 | 100 | INT REVIEW READY |
Testing | 100 | 100 | 100 | N/A | INT REVIEW READY |
TOTAL COMPLETE % | 89 | 84 | 73 | 53 | |
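The TOTAL COMPLETE % figures appear to be the mean of the numeric cells in each row (and column), skipping N/A entries and rounding halves upward. A minimal sketch of that calculation, with the function name and the use of `None` for N/A being our own conventions:

```python
import math

def percent_complete(scores):
    """Mean of the numeric level scores, ignoring N/A entries (None),
    rounded half-up to the nearest whole percent."""
    numeric = [s for s in scores if s is not None]
    return math.floor(sum(numeric) / len(numeric) + 0.5)

# Example rows from the table above (None marks N/A):
analysis = [100, 100, 25, None]       # -> 75
integration = [100, 100, 100, 25]     # -> 81
```

The same rule reproduces the column totals when applied down each level's column across all eleven skills.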