User Guide - Youth-Transitions/resources GitHub Wiki

Introduction

The Longitudinal Education Outcomes (LEO) dataset is a data product created by the Department for Education (DfE), available for approved third-party researchers to access and analyse.

In this guide, we provide an overview of the process of gaining access to this dataset, along with some practical tips on working with it.

We have also provided some exemplar scripts, which you could even ingest into the SRS environment for your own LEO projects. These scripts cover the processing of all the component datasets which relate to the post-16 activities and transitions of young people.

Examples of projects where we have used this processing include:

Prerequisites

To apply to use LEO, or to work on a LEO project led by someone else, you must become an ONS accredited researcher.

Access to LEO is provided through the SRS environment, which can only be accessed from approved sites within the UK. If your organisation is not already approved, you will either need to gain an Approved Organisational Connectivity (AOC) agreement or use the secure facilities provided by ONS. Information on all these options is available here. Note that home access is not available for LEO.

Applying to use LEO

Accredited researchers can apply to use LEO through the ONS Project Accreditation Service for SRS (PASS). For information and guidance, see the page on applying to access the LEO dataset.

An application to use LEO consists of:

A project application (completed through PASS)
Either a UKSA ethics self-assessment or evidence of a project being approved by an ethics committee
A completed variable request form

It generally takes between 2 and 4 months from an application being submitted to the data being made available in the ONS Secure Research Service (SRS).

General hints and tips

Largely due to issues of size, source data is provided in a set of SQL Server databases. Some knowledge of SQL will prove beneficial, though there is a very basic quick-start guide available within the user guides folder inside SRS.

LEO data is provided under Part 5, Chapter 5, Section 64 of the Digital Economy Act 2017.

Schools, colleges, universities and other corporate bodies cannot be directly identified in LEO. Rather than seeing identifiers such as LAESTAB, URN and UKPRN, you will see pseudonymised versions of them. If you wish to ingest additional organisation-level data, you will need to submit it to the LEO pseudonymisation service ([email protected]). This can take several weeks to complete.

Home access to LEO is not permitted, even where an organisation has home access arrangements for other projects in SRS.

Working in SQL Server

In LEO, source data is provided in linked databases.

NPD_i2 contains data from the National Pupil Database
ILR_i2 contains data from the Individualised Learner Record
HESA_i2 contains data from the HESA student record
LEO_i2 contains data relating to employment, earnings and benefits, plus bridging tables of identifiers

You will be able to look at (query) the data in the source databases but not change the data or create any new tables in these databases.

In addition, you will be given access to your own project workspace where you can create new tables and views, and store any data you plan to ingest into SRS and wish to work with alongside LEO data in the SQL environment.

The example scripts provided use local references (table names without a database prefix) for write operations, so if you plan to use them you will need to run the code in the context of your project database.
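The read-from-source, write-to-project pattern described above can be sketched as follows. This is a minimal illustration only: in-memory SQLite databases stand in for SQL Server so that the example runs anywhere, and the table `example_pupils`, the column `AcademicYear` and the output table `my_cohort` are hypothetical names, not real LEO objects.

```python
import sqlite3

# ASSUMPTION: SQLite stands in for SQL Server purely for illustration.
# "main" plays the role of your project database; the attached database
# plays the role of the NPD_i2 source database.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS NPD_i2")

# Set up a toy source table (hypothetical name and columns). In SRS the
# source databases already exist and you cannot create tables in them.
conn.execute(
    "CREATE TABLE NPD_i2.example_pupils "
    "(PupilMatchingRefAnonymous TEXT, AcademicYear TEXT)"
)
conn.executemany(
    "INSERT INTO NPD_i2.example_pupils VALUES (?, ?)",
    [("P1", "2018/19"), ("P2", "2018/19"), ("P3", "2019/20")],
)

# Read with a database-qualified name, write with a local (unqualified)
# name. The unqualified table is created in whichever database is the
# current context, which here is "main", i.e. the project database.
conn.execute(
    "CREATE TABLE my_cohort AS "
    "SELECT PupilMatchingRefAnonymous "
    "FROM NPD_i2.example_pupils "
    "WHERE AcademicYear = '2018/19'"
)


def table_names(schema):
    """List the tables held in one of the connected databases."""
    rows = conn.execute(
        f"SELECT name FROM {schema}.sqlite_master "
        "WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [row[0] for row in rows]


project_tables = table_names("main")
source_tables = table_names("NPD_i2")
cohort_rows = conn.execute("SELECT COUNT(*) FROM my_cohort").fetchone()[0]

print(project_tables)  # ['my_cohort']
print(source_tables)   # ['example_pupils']
print(cohort_rows)     # 2
```

In SQL Server itself, the same idea is normally expressed by switching context with a `USE` statement and reading from the source with a three-part name (database.schema.object), while leaving the target table name unqualified.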

Limitations

We assume users have a basic knowledge of SQL.

You may need to edit the scripts slightly to get them to execute. This is because they depend to some extent on which tables and columns you have available to you on your LEO project. We tend to work with a common, fixed core of tables and columns from project to project. Here is an example of the LEO Severe Variables Workbook from a previous project of ours which shows which tables and columns we typically work with.

In the absence of a common person-level identifier across the source datasets that compose LEO, we have to rely on person-level matching carried out within government. Several person-level identifiers are used in LEO: PupilMatchingRefAnonymous is the primary identifier in the National Pupil Database, and EduKey is the primary identifier used in the DWP/HMRC datasets (employment, earnings, benefits). These identifiers are themselves derived from person-level matching across component datasets using name, date of birth, gender and postcode. Names, postcodes and dates of birth are not available in LEO, so analysts cannot rematch the data.

Information on the quality of matching is not provided in LEO. Records may be over-matched (i.e. records relating to two or more different real-world people have been joined together) or under-matched (i.e. multiple records that belong to the same real-world person have not been matched).
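Although match quality is not reported, one possible sanity check is to look for identifiers that do not map one-to-one. The sketch below is our own suggestion rather than a documented LEO procedure. It assumes a bridging table holding one row per pair of PupilMatchingRefAnonymous and EduKey values; the table name `example_bridge` and its contents are hypothetical, and SQLite is used only so the example is runnable.

```python
import sqlite3

# ASSUMPTION: a bridging table with one row per identifier pair.
# The table name and the rows below are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE example_bridge "
    "(PupilMatchingRefAnonymous TEXT, EduKey TEXT)"
)
conn.executemany(
    "INSERT INTO example_bridge VALUES (?, ?)",
    [
        ("P1", "E1"),  # clean one-to-one link
        ("P2", "E2"),  # P2 links to two EduKeys
        ("P2", "E3"),
        ("P3", "E4"),  # E4 links to two pupil identifiers
        ("P4", "E4"),
    ],
)

# Pupil identifiers linked to more than one EduKey.
pmr_with_many_edukeys = [
    row[0]
    for row in conn.execute(
        "SELECT PupilMatchingRefAnonymous FROM example_bridge "
        "GROUP BY PupilMatchingRefAnonymous "
        "HAVING COUNT(DISTINCT EduKey) > 1 "
        "ORDER BY PupilMatchingRefAnonymous"
    )
]

# EduKeys linked to more than one pupil identifier.
edukeys_with_many_pmr = [
    row[0]
    for row in conn.execute(
        "SELECT EduKey FROM example_bridge "
        "GROUP BY EduKey "
        "HAVING COUNT(DISTINCT PupilMatchingRefAnonymous) > 1 "
        "ORDER BY EduKey"
    )
]

print(pmr_with_many_edukeys)  # ['P2']
print(edukeys_with_many_pmr)  # ['E4']
```

Identifiers flagged this way may hint at a matching problem in one of the component datasets, but the identifiers alone cannot tell you whether a given case is over-matching or under-matching, and a clean result does not show that the matching is correct.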

The scripts implicitly impose our standards and assumptions. We welcome suggestions for improvements.