About - CERIT-SC/funnel-gdi GitHub Wiki
This repository is a fork of ohsu-comp-bio/funnel that we use for developing a customised Task Execution Service for the European Genomic Data Infrastructure (commonly referred to as the GDI).
- Keep our codebase up-to-date with the latest changes from the upstream.
- Develop our add-ons and customisation in a separate branch (master-gdi).
- If a feature or fix could serve a wider audience, introduce the code to the ohsu-comp-bio repository.
Similarly to the original Funnel, we use semantic versioning. Original Funnel versions contain three components: <major>.<minor>.<patch>. Our releases add a fourth component, which starts at 1 and is increased by one with every release that is based on the same upstream Funnel version.
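The numbering rule can be sketched in a few lines of Python. This is only an illustration of the scheme described above: the function name, the example version numbers, and the absence of any prefix such as "v" are assumptions, not part of the actual release tooling.

```python
import re
from typing import NamedTuple, Optional


class GdiVersion(NamedTuple):
    major: int
    minor: int
    patch: int
    gdi: int  # fourth component added by this fork

    def __str__(self) -> str:
        return f"{self.major}.{self.minor}.{self.patch}.{self.gdi}"


# Three numeric components, optionally followed by a fourth one.
_VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)(?:\.(\d+))?$")


def next_gdi_version(upstream: str, previous: Optional[str]) -> GdiVersion:
    """Return the next fork version for a release based on `upstream`.

    `upstream` is a three-component Funnel version; `previous` is the
    last four-component fork release, or None if there is none yet.
    """
    m = _VERSION_RE.match(upstream)
    if m is None or m.group(4) is not None:
        raise ValueError(f"not a three-component version: {upstream!r}")
    base = tuple(int(part) for part in m.group(1, 2, 3))

    if previous is not None:
        p = _VERSION_RE.match(previous)
        if p is None or p.group(4) is None:
            raise ValueError(f"not a four-component version: {previous!r}")
        if tuple(int(part) for part in p.group(1, 2, 3)) == base:
            # Same upstream version: increase the fourth component by one.
            return GdiVersion(*base, int(p.group(4)) + 1)

    # First release on top of this upstream version starts at 1.
    return GdiVersion(*base, 1)
```

For example, calling `next_gdi_version("0.11.0", "0.11.0.1")` with these made-up numbers yields the fourth component 2, while a new upstream version resets it to 1.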
We release only amd64 and arm64 Docker images next to the GitHub repository.
Software binaries can also be built from the source code of the repository.
In the GDI, we envision a task execution service next to each infrastructure that hosts some restricted genomic data. If it's legally difficult to move genomic data to some external research environment, this approach enables bringing data analysis scripts to the research data.
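To illustrate what "bringing the analysis script to the data" could look like, the sketch below builds a task message in the shape defined by the GA4GH Task Execution Service schema (`name`, `inputs`, `outputs`, `executors`). The task name, container image, command, file paths and storage URLs are hypothetical placeholders; the storage URL schemes actually accepted by a GDI deployment are not specified here.

```python
import json


def build_tes_task(image: str, command: list, input_url: str, output_url: str) -> dict:
    """Build a minimal TES task: fetch one input, run one container, store one output."""
    return {
        "name": "gdi-analysis-example",  # hypothetical task name
        "inputs": [
            # The service downloads this URL to the given path before execution.
            {"url": input_url, "path": "/data/input.vcf.gz", "type": "FILE"},
        ],
        "outputs": [
            # The service uploads this path to the given URL after execution.
            {"url": output_url, "path": "/data/result.txt", "type": "FILE"},
        ],
        "executors": [
            {
                "image": image,
                "command": command,
                # Standard output of the command is written to this file.
                "stdout": "/data/result.txt",
            }
        ],
    }


def to_request_body(task: dict) -> str:
    """Serialise the task as the JSON body of a task-creation request."""
    return json.dumps(task, indent=2)
```

A researcher would submit such a document to the local task execution service; only the script and its parameters travel, while the genomic data stays inside the hosting infrastructure.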
Researchers and medical doctors may request access to the data through the User Portal of the GDI (development version here). Data access requests will be reviewed by multiple parties, including the data-provider. Once the access has been approved, the users will be issued a visa to access the data (see GA4GH Passports).
Authorised researchers learn which local task execution service to use from the e-mail they receive when access is approved. Alternatively, there can be a central task execution service, an aggregator, that interacts with the local task execution services.
Researchers authenticate via a central authentication service, such as Life Science AAI, which also supports the GA4GH Passports standard. This means that the services implement the OpenID Connect (OIDC) protocol for requesting user authentication and for obtaining details (including passports and visas) about their authenticated users.
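As a rough illustration of how visas travel with the user details, the sketch below reads the `ga4gh_passport_v1` claim (a list of visa JWTs) and extracts the `ga4gh_visa_v1` object from each visa, as defined by the GA4GH Passports specification. It deliberately skips signature verification, which a real service must perform; the helper names and the way the user-info document is obtained are assumptions.

```python
import base64
import json


def _b64url_decode(segment: str) -> bytes:
    """Decode a base64url segment, restoring the padding JWTs omit."""
    padding = "=" * (-len(segment) % 4)
    return base64.urlsafe_b64decode(segment + padding)


def jwt_payload(token: str) -> dict:
    """Decode the payload of a compact JWT WITHOUT verifying its signature."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected a compact JWT with three segments")
    return json.loads(_b64url_decode(parts[1]))


def visas_from_userinfo(userinfo: dict) -> list:
    """Return the `ga4gh_visa_v1` objects found in a passport claim.

    `userinfo` is the (already fetched) user-details document; users
    without a passport simply yield an empty list.
    """
    visas = []
    for visa_jwt in userinfo.get("ga4gh_passport_v1", []):
        claim = jwt_payload(visa_jwt).get("ga4gh_visa_v1")
        if claim is not None:
            visas.append(claim)
    return visas


def fake_jwt(payload: dict) -> str:
    """Build an unsigned, illustration-only token for trying the code out."""
    def enc(obj: dict) -> str:
        raw = json.dumps(obj).encode()
        return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()

    return f"{enc({'alg': 'none'})}.{enc(payload)}.not-a-signature"
```

Each returned visa object carries fields such as `type` and `value`, which a storage can use to decide whether the user may read a given dataset.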
Although the Task Execution Service and Funnel are not directly related to the GA4GH Passports standard, it is important that the service can interact with storages that support it (e.g. sensitive-data-archive and GA4GH htsget). Therefore, in the GDI, we need to add support for these storages.
In addition, some storages may serve data that is encrypted according to the Crypt4GH specification. The Task Execution Service specification does not define which data-retrieval protocols must be supported. Retrieving the data could be delegated to researchers, who would then have to decompress and decrypt the data using their own scripts. However, that would make the service harder to use for its end-users. Therefore, in the GDI, we need these features built into the product, while we still need to figure out how to simplify adding more data-retrieval protocols in the future.
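One small building block of such built-in support is recognising whether a downloaded file is Crypt4GH-encrypted at all. According to the Crypt4GH file format, a file starts with the 8-byte magic string `crypt4gh`, followed by a 4-byte little-endian version number and a 4-byte little-endian header packet count. The sketch below only inspects this preamble; it does not decrypt anything, and the function name is an assumption.

```python
import struct
from typing import Optional, Tuple

CRYPT4GH_MAGIC = b"crypt4gh"
PREAMBLE_SIZE = 16  # 8-byte magic + 4-byte version + 4-byte packet count


def read_crypt4gh_preamble(data: bytes) -> Optional[Tuple[int, int]]:
    """Return (version, header_packet_count), or None if `data` does not
    start with a Crypt4GH preamble."""
    if len(data) < PREAMBLE_SIZE or data[:8] != CRYPT4GH_MAGIC:
        return None
    # Both integers are unsigned 32-bit little-endian values.
    version, packet_count = struct.unpack("<II", data[8:PREAMBLE_SIZE])
    return version, packet_count


def needs_decryption(first_bytes: bytes) -> bool:
    """Decide whether an input file should go through a decryption step."""
    return read_crypt4gh_preamble(first_bytes) is not None
```

A service could call `needs_decryption` on the first 16 bytes of each input and route only the matching files through decryption, leaving plain files untouched.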
Another important aspect for the GDI is the data release procedure: the files produced during the research are not downloadable before they have been reviewed by a designated party (e.g. the original data-provider). This could be achieved through careful deployment: the output files are stored in a designated S3 storage, where the researcher can list the files but cannot download them until permission is given after the data review.
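One possible way to express "list but not download" is an AWS-style bucket policy, sketched below as a Python dictionary. This is an assumption about how a deployment might be configured, not a description of the GDI setup: the bucket name and principal are hypothetical, and S3-compatible storages may support only a subset of this policy language.

```python
import json


def review_bucket_policy(bucket: str, researcher_arn: str) -> dict:
    """Build a policy that lets a researcher list output files but
    blocks downloads while the files await review."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowListing",
                "Effect": "Allow",
                "Principal": {"AWS": researcher_arn},
                "Action": "s3:ListBucket",
                # Listing is granted on the bucket itself.
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                "Sid": "DenyDownloadUntilReviewed",
                "Effect": "Deny",
                "Principal": {"AWS": researcher_arn},
                "Action": "s3:GetObject",
                # Downloads are denied for every object in the bucket.
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
        ],
    }


def policy_json(bucket: str, researcher_arn: str) -> str:
    """Serialise the policy for use with the storage's admin tooling."""
    return json.dumps(review_bucket_policy(bucket, researcher_arn), indent=2)
```

Under this assumed setup, releasing reviewed files would mean either lifting the deny statement or copying the approved files to a storage that the researcher can read.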
Some noteworthy aspects about GDI-specific task execution services:
- Researchers must not see task executions of other users.
- Administrators may see and cancel task executions of the researchers.
- Administrators may use task execution logs for estimating usage costs.
- Administrators define a small number of storages that researchers may use for input and output files.
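The access rules in this list can be sketched as a few small checks. The types and function names below are hypothetical and do not reflect Funnel's actual code; in particular, letting researchers cancel their own tasks is an assumption that the list above does not state.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class User:
    name: str
    is_admin: bool = False


@dataclass(frozen=True)
class Task:
    id: str
    owner: str  # name of the user who submitted the task


def can_view(user: User, task: Task) -> bool:
    """Researchers see only their own tasks; administrators see all."""
    return user.is_admin or task.owner == user.name


def can_cancel(user: User, task: Task) -> bool:
    """Administrators may cancel any task; owners cancelling their own
    tasks is an assumption of this sketch."""
    return user.is_admin or task.owner == user.name


def visible_tasks(user: User, tasks: Iterable[Task]) -> List[Task]:
    """Filter a task listing down to what `user` is allowed to see."""
    return [task for task in tasks if can_view(user, task)]


def storage_allowed(url: str, allowed_prefixes: Tuple[str, ...]) -> bool:
    """Accept only URLs under an administrator-defined storage prefix.

    Prefixes should end with "/" so that "s3://bucket/" does not also
    match "s3://bucket-other/".
    """
    return url.startswith(allowed_prefixes)
```

Applying `visible_tasks` to every listing request and `storage_allowed` to every input and output URL at submission time would enforce the first and last rule in one place each.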