01 Introduction to privacy-enhancing federated analysis and DataSHIELD
Research increasingly depends on analysing sensitive data from multiple sources that cannot easily be shared. Making data available for use may be impeded by ethico-legal processes or by fear of losing control of the data. As a result, obtaining permission to gather several datasets in a central location takes considerable work. Sharing analysis plans and gathering results centrally can be an administrative challenge and leaves little room for exploratory analysis.
DataSHIELD provides a novel solution that can circumvent some of these challenges. The key feature of DataSHIELD is that data stay on a server at each of the institutions responsible for the data. Each institution controls who can access its data. The platform only allows a user to pass analysis commands to each server and receive results that are designed to disclose only summary data. For example, the user can fit a linear model to the data but cannot see the residuals. DataSHIELD is built so that it is not possible for a user to request the whole dataset or gain access to detailed data. Other disclosure protection features are built in, such as requiring a minimum cell count of data points in a summary. Thus DataSHIELD can be used to analyse data without physically sharing it with users and without giving access to individual records. We refer to this as federated analysis.
More information about DataSHIELD can be found here.
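The minimum cell count idea described above can be illustrated with a minimal sketch. This is plain Python written for illustration only and is not DataSHIELD code or its API. The names `safe_mean`, `safe_counts`, `DisclosureError` and the threshold of 5 are all assumptions made for the example; in practice the threshold would be set by whoever is responsible for the data.

```python
# Illustrative sketch only: not DataSHIELD's implementation or API.
MIN_CELL_COUNT = 5  # hypothetical threshold chosen for this sketch


class DisclosureError(Exception):
    """Raised when a summary would be based on too few records."""


def safe_mean(values, min_count=MIN_CELL_COUNT):
    """Return (count, mean), refusing groups smaller than min_count."""
    n = len(values)
    if n < min_count:
        raise DisclosureError(f"{n} records is below the minimum of {min_count}")
    return n, sum(values) / n


def safe_counts(categories, min_count=MIN_CELL_COUNT):
    """Tabulate categories, suppressing any cell smaller than min_count."""
    counts = {}
    for category in categories:
        counts[category] = counts.get(category, 0) + 1
    # Small cells are replaced by None so they cannot identify individuals.
    return {c: (k if k >= min_count else None) for c, k in counts.items()}
```

The point of the design is that the caller only ever receives the aggregate. A request that would summarise too few records is refused or suppressed rather than answered.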
To enable federated analysis to happen using DataSHIELD, the following steps are needed:
- Each participating group needs to load the data required for analysis to a server running DataSHIELD.
- The data then need to be harmonised to common scales and measures, a simple example being that weights should be converted to kg.
- A user wishing to run an analysis is then provided with an account and permissions to run the code.
- The user can then run the analysis across all the participating groups and build an overall result.
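The four steps above can be sketched end to end in a few lines. Again, this is a hypothetical Python illustration under stated assumptions, not DataSHIELD itself: the `SiteServer` class, the `harmonise_weight` and `pooled_mean` functions, the user name and the minimum count of 3 are all invented for the example. Each simulated site loads its own data, harmonises weights to kg, checks the user's permission, and returns only a count and a sum, which the user then combines into an overall mean.

```python
# Illustrative sketch only: all names here are hypothetical.
LB_TO_KG = 0.45359237  # kilograms per international pound


def harmonise_weight(value, unit):
    """Convert a weight to kg so that all sites use a common scale."""
    if unit == "kg":
        return value
    if unit == "lb":
        return value * LB_TO_KG
    raise ValueError(f"unsupported unit: {unit}")


class SiteServer:
    """Stands in for one institution's server; raw data never leave it."""

    def __init__(self, weights, unit, authorised_users, min_count=3):
        # Steps 1 and 2: load the data and harmonise them on the server.
        self._weights_kg = [harmonise_weight(w, unit) for w in weights]
        self._authorised_users = set(authorised_users)
        self._min_count = min_count

    def summary(self, user):
        # Step 3: only users with an account and permission may run code.
        if user not in self._authorised_users:
            raise PermissionError(f"{user} may not query this server")
        n = len(self._weights_kg)
        if n < self._min_count:
            raise ValueError("too few records to release a summary")
        # Only aggregate values are returned, never individual records.
        return {"n": n, "sum": sum(self._weights_kg)}


def pooled_mean(servers, user):
    """Step 4: combine per-site summaries into one overall mean."""
    summaries = [server.summary(user) for server in servers]
    total_n = sum(s["n"] for s in summaries)
    total_sum = sum(s["sum"] for s in summaries)
    return total_sum / total_n


site_a = SiteServer([70.0, 80.0, 90.0], "kg", authorised_users={"analyst"})
site_b = SiteServer([154.0, 176.0, 198.0], "lb", authorised_users={"analyst"})
overall_mean_kg = pooled_mean([site_a, site_b], "analyst")
```

Because a mean can be rebuilt exactly from per-site counts and sums, the pooled result matches what a central analysis of the combined data would give, without any site releasing individual records.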