Data analysis and synthesis - ONSdigital/DRAFT_DE_learning_roadmap GitHub Wiki
For a data engineer, data analysis and synthesis means examining, interpreting and analysing data to help make informed decisions. The GDD framework only describes the Working and Practitioner levels for data engineering. Because of the wide range of data engineering skill levels we have at the ONS, we have also included a description for the Awareness level here, so that apprentices and colleagues coming from other professions can find appropriate resources.
- Awareness: explain the benefits of quantitative analysis, and how data profiling can be useful.
- Working: undertake data profiling and source system analysis; present clear insights to colleagues to support the end use of the data.
- Practitioner: understand and help teams to apply a range of techniques for data profiling and source system analysis from a complex single source; bring multiple data sources together in a conformed model for analysis.
General Resources
Coffee & Coding: monthly informal meetings where projects are presented, with time for Q&A at the end.
RAP network: Similar to Coffee & Coding but focussed on reproducible analytical pipelines.
For getting started with Python on ONS on-net machines, please see the ASAP guide (note that you must be connected to the VPN to access the link).
At Working and Practitioner levels you will probably start to use PySpark, so you may also find the DAP Community a useful place to get information.
And of course there is always Viva Engage if you can't find your answer elsewhere.
Awareness
Data analysis and synthesis is a skill separate from programming and build, but in practice all data engineers will be expected to use Python or another programming language to perform data analysis and/or synthesis. In this section we cover the creation of statistics and data profiling using common Python libraries and/or SQL where appropriate.
At the level of awareness you should be able to explain the benefits of quantitative analysis, and how data profiling can be useful.
Learning Hub: Awareness of coding tools – this course will familiarise you with the basic principles of coding tools, which are essential for a data engineer doing data analysis.
Article: Data profiling – this resource was recommended to the apprentices by DAB when they were first getting started with data profiling.
Article: PySpark vs Pandas – when working in the ONS, especially on DAP, you will often have to choose between PySpark and Python pandas to write code.
Learning Hub: Foundations of SQL – SQL is an essential tool for any data engineer and can often be easier and quicker than Python for very simple data queries. It is also the foundational language on which the Google BigQuery dialect is built.
ydata-profiling: not strictly a learning resource, but it's a nifty package for doing some quick data profiling of a pandas DataFrame.
SQL Noir - a fun way to learn some SQL.
SQL Premier League - similar to the above but focussed on sports (lots of sports, not just football).
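To make the idea of data profiling concrete, here is a minimal sketch using only pandas (the dataset and column names are made up for illustration). It computes the kinds of summary a profiling tool such as ydata-profiling produces at much greater depth: row counts, missing values, distinct values and duplicate rows.

```python
import pandas as pd

# A tiny illustrative dataset (invented for this example).
df = pd.DataFrame({
    "region": ["North", "South", "South", None, "North"],
    "turnover": [120.0, 85.5, 85.5, 40.0, None],
})

# Basic profile: row count, missing values, distinct values, duplicates.
profile = {
    "rows": len(df),
    "missing_per_column": df.isna().sum().to_dict(),
    "distinct_per_column": df.nunique().to_dict(),
    "duplicate_rows": int(df.duplicated().sum()),
}
print(profile)
```

Even a quick profile like this tells you whether a source is complete enough to use and where cleaning effort should go before any deeper analysis.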
Working
You know how to undertake data profiling and source system analysis and can present clear insights to colleagues to support the end use of the data.
Learning Hub: Python - DataFrames, Manipulation, and Cleaning
Learning Hub: Data Visualisation in Python – data visualisation is not something a data engineer should expect to master, but it's good to be able to create quick plots for the purposes of understanding.
Learning Hub: Introduction to PySpark – again, PySpark isn't necessary for the theory behind data analysis and synthesis, but at the ONS it has been essential and is the language of choice for large-scale data processing in a distributed environment.
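At the Working level, profiling and insight usually start with cleaning. The sketch below (invented data and column names) shows a typical pattern with pandas: standardise messy text fields, coerce numerics so bad values become NaN rather than errors, then produce a small summary you could share with colleagues.

```python
import pandas as pd

# Hypothetical source extract with common quality problems:
# inconsistent casing, stray whitespace, and non-numeric entries.
raw = pd.DataFrame({
    "region": ["north", "North ", "SOUTH", "south"],
    "employees": ["10", "25", "7", "not known"],
})

# Standardise text fields and coerce numerics, keeping bad values as NaN.
clean = raw.assign(
    region=raw["region"].str.strip().str.title(),
    employees=pd.to_numeric(raw["employees"], errors="coerce"),
)

# A quick summary to present to colleagues.
summary = clean.groupby("region")["employees"].agg(["count", "mean"])
print(summary)
```

Note that `count` here counts non-missing values only, so the summary itself doubles as a profiling check: a low count flags a column that needed more cleaning.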
Practitioner
There is no single recommended package for conducting data profiling in ONS data engineering. ydata-profiling does work with PySpark, but individual engineers will have their favourite packages.
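The Practitioner description above mentions bringing multiple data sources together in a conformed model. As a minimal sketch (the sources, keys and columns here are entirely hypothetical), the core steps are: conform the join keys to a shared format, merge, and check the match quality with pandas' `indicator` option.

```python
import pandas as pd

# Two hypothetical source systems describing the same businesses,
# with different column names and key formats.
payroll = pd.DataFrame({"ref": ["A1", "B2"], "employees": [12, 30]})
turnover = pd.DataFrame({"business_ref": ["a1", "c3"], "turnover_k": [250, 90]})

# Conform the keys: align column names and normalise case.
payroll = payroll.rename(columns={"ref": "business_ref"})
turnover["business_ref"] = turnover["business_ref"].str.upper()

# Outer join keeps unmatched records from both sides; the _merge
# column shows which source each row came from.
conformed = payroll.merge(turnover, on="business_ref", how="outer", indicator=True)
print(conformed)
```

Inspecting the `_merge` column is a quick source-system-analysis check: rows marked `left_only` or `right_only` reveal coverage gaps between the systems before the conformed model is used downstream.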