# EHR Pipeline Architecture and Requirements [DRAFT]

## Requirements

### General requirements
- Patient data never leaves the secure cloud environment
- Data files are not downloaded to local machines for processing
- All pipeline functionality is containerized
- OAuth is used for authentication whenever possible
- EHR pipeline is executed within a specific user group

### Raw file processing
- Files received from study sites undergo various validations, and the results of these validations are saved to a report.
1.1 The report is available as a flat file in a GCS bucket, as well as a BigQuery table.
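
A minimal sketch of publishing the validation report to both targets described in 1.1. It assumes the report has already been rendered as CSV text; the bucket, blob, and table names are placeholders rather than the pipeline's actual conventions.

```python
from google.cloud import bigquery, storage

def publish_report(report_csv: str, bucket_name: str, blob_name: str, table_id: str) -> None:
    """Write the validation report to a GCS flat file and load it into BigQuery."""
    # 1.1: save the report as a flat file in a GCS bucket.
    storage.Client().bucket(bucket_name).blob(blob_name).upload_from_string(
        report_csv, content_type="text/csv"
    )
    # 1.1: load the same file into a BigQuery table.
    job = bigquery.Client().load_table_from_uri(
        f"gs://{bucket_name}/{blob_name}",
        table_id,  # e.g. "project.dataset.validation_report" (placeholder)
        job_config=bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.CSV,
            skip_leading_rows=1,
            autodetect=True,
            write_disposition="WRITE_APPEND",
        ),
    )
    job.result()  # wait for the load job to finish
```
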
- The pipeline supports CSV and Parquet file formats.
2.1 CSV file formatting is validated.
2.1.1 Files are in UTF-8 or ASCII format.
2.1.2 All special characters (e.g. quotes within values) are properly escaped.
2.1.3 Carriage returns or new line characters within values are properly formatted.
2.2 Parquet file schema is converted to the NCI CCC standardized schema.
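
A minimal sketch of the encoding and formatting checks in 2.1, using only the Python standard library; `validate_csv` is a hypothetical helper, not the pipeline's actual code.

```python
import csv
from pathlib import Path

def validate_csv(path: str) -> list[str]:
    """Return a list of formatting issues found in a delivered CSV file."""
    issues: list[str] = []
    raw = Path(path).read_bytes()
    try:
        text = raw.decode("utf-8")  # 2.1.1: ASCII is a strict subset of UTF-8
    except UnicodeDecodeError as err:
        return [f"file is not UTF-8/ASCII: {err}"]

    # 2.1.2 / 2.1.3: unescaped quotes or stray newlines inside values surface
    # either as parser errors or as rows with an unexpected number of columns.
    reader = csv.reader(text.splitlines(keepends=True))
    widths: set[int] = set()
    try:
        for row in reader:
            widths.add(len(row))
    except csv.Error as err:
        issues.append(f"malformed quoting near line {reader.line_num}: {err}")
    if len(widths) > 1:
        issues.append(f"inconsistent column counts: {sorted(widths)}")
    return issues
```
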
- Data files are evaluated for their adherence to the OMOP CDM schema.
3.1 The pipeline supports designating some tables and columns within the schema as "required".
3.2 The pipeline is able to evaluate files against v5.3 and v5.4 of the CDM.
3.3 The pipeline can identify which data files in a delivery represent OMOP CDM tables and which do not.
3.4 The pipeline can identify which columns in each data file are part of the OMOP CDM and which are not.
3.5 Pipeline validation supports the flexible nature of some data types within the CDM.
3.6 The pipeline identifies rows of data which contain values that do not meet data type specifications in the CDM.
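
A sketch of the table- and column-level checks in 3.3 and 3.4. The `CDM_COLUMNS` dictionary below is a small illustrative excerpt, not the full v5.3/v5.4 specification; in practice the pipeline would load the column lists from the published CDM DDL for the selected version (3.2).

```python
import pyarrow.parquet as pq

# Illustrative excerpt only; the real check covers every table and column
# in the selected CDM version.
CDM_COLUMNS = {
    "person": {"person_id", "gender_concept_id", "year_of_birth",
               "race_concept_id", "ethnicity_concept_id"},
    "observation_period": {"observation_period_id", "person_id",
                           "observation_period_start_date",
                           "observation_period_end_date",
                           "period_type_concept_id"},
}

def check_file(table_name: str, parquet_path: str) -> dict:
    expected = CDM_COLUMNS.get(table_name.lower())
    if expected is None:
        return {"table_in_cdm": False}                           # 3.3
    delivered = {name.lower() for name in pq.read_schema(parquet_path).names}
    return {
        "table_in_cdm": True,
        "missing_from_delivery": sorted(expected - delivered),   # 3.4
        "not_in_cdm": sorted(delivered - expected),               # 3.4
    }
```
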
- Raw data files are recreated as Parquet files with a standardized schema.
4.1 Raw data may need to be converted or edited due to the flexible nature of some data types within the CDM.
4.1.1 Timezone information, if provided, is stripped from datetime and time values.
4.1.2 Integer fields populated with non-fractional float values (i.e. #.0) are converted from float to integer.
4.2 Columns in the CDM that are missing from the delivered data are added to the Parquet file.
4.2.1 Missing _concept_id columns are populated with 0.
4.2.2 Missing non-required columns are populated with NULL values.
4.2.3 Missing required columns are populated with blank, 1970-01-01, -1, or -1.0 values (depending on their data type).
4.3 CDM tables that are missing from the delivery are recreated as blank Parquet files.
4.4 Missing standardized derived tables are populated following OHDSI guidelines.
4.4.1 drug_era script
4.4.2 condition_era script
4.5 Missing observation_period tables are populated following OHDSI conventions.
4.6 If the cdm_source table is not populated by the site, the pipeline creates a row containing the site name and the date of dataset delivery.
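
A DuckDB sketch of the conversions in 4.1 and 4.2 for a single table. The paths and column list are placeholders; in practice the SELECT would be generated from the CDM schema rather than hand-written.

```python
import duckdb

con = duckdb.connect()
con.execute("""
    COPY (
        SELECT
            CAST(person_id AS BIGINT)         AS person_id,
            CAST(birth_datetime AS TIMESTAMP) AS birth_datetime,    -- 4.1.1: drop timezone info
            CAST(year_of_birth AS INTEGER)    AS year_of_birth,     -- 4.1.2: 1970.0 -> 1970
            0                                 AS gender_concept_id, -- 4.2.1: missing _concept_id column
            CAST(NULL AS BIGINT)              AS provider_id        -- 4.2.2: missing non-required column
        FROM read_parquet('raw/site_a/person.parquet')
    ) TO 'standardized/site_a/person.parquet' (FORMAT PARQUET)
""")
```
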
- Primary keys are recalculated.
5.1 The person_id values in all OMOP tables are replaced with the corresponding subject's Connect ID.
5.2 To prevent collisions, primary keys (other than person_id) are recreated using a deterministic algorithm that generates values which are unique across all sites.
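
One possible deterministic scheme for 5.2, shown as an illustration rather than the pipeline's actual algorithm: hash the site, table, and original key into a stable 63-bit integer, so re-runs reproduce the same values and cross-site collisions become vanishingly unlikely.

```python
import hashlib

def remap_primary_key(site: str, table: str, original_key: int | str) -> int:
    """Deterministically map (site, table, original key) to a BIGINT-safe surrogate key."""
    digest = hashlib.sha256(f"{site}|{table}|{original_key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") & 0x7FFF_FFFF_FFFF_FFFF  # keep it non-negative

# The same inputs always yield the same key; different sites hash different input strings.
assert remap_primary_key("site_a", "visit_occurrence", 12345) == \
       remap_primary_key("site_a", "visit_occurrence", 12345)
```
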
- The pipeline can be executed with multiple scopes.
6.1 The pipeline can process files from the latest delivery from all sites within a single run.
6.1.1 The pipeline automatically identifies the latest delivery from a site.
6.1.2 The pipeline ensures each delivery is processed exactly once, preventing duplicate processing.
6.2 The pipeline can be executed against a single delivery from a single site.
6.2.1 Users will specify the site and delivery at the time of pipeline execution.
6.2.2 A single delivery from a single site can be reevaluated by the pipeline any number of times.
6.2.3 The output from executing the pipeline against a single delivery from a single site overwrites any existing data.
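
A sketch of 6.1.1, assuming deliveries land under a prefix of the form `<site>/<YYYY-MM-DD>/` in a GCS bucket (a hypothetical layout). Requirement 6.1.2 could then be met by recording each processed `(site, delivery)` pair, for example in a BigQuery tracking table, and skipping pairs that are already present.

```python
from google.cloud import storage

def latest_delivery(bucket_name: str, site: str) -> str | None:
    """Return the latest delivery prefix for a site, e.g. 'site_a/2024-06-01/'."""
    client = storage.Client()
    blobs = client.list_blobs(bucket_name, prefix=f"{site}/", delimiter="/")
    list(blobs)  # the iterator must be consumed before .prefixes is populated
    deliveries = sorted(blobs.prefixes)  # ISO dates sort chronologically
    return deliveries[-1] if deliveries else None
```
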
- The pipeline provides a mechanism to modify OMOP data for a given site.
7.1 The pipeline accepts a SQL script written against a standard OMOP schema, and can execute that script against specified sites.
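
A sketch of 7.1, assuming each site's OMOP tables live in a site-specific BigQuery dataset and that correction scripts reference tables through a `{dataset}` placeholder; both conventions are assumptions for illustration, not the pipeline's actual design.

```python
from pathlib import Path
from google.cloud import bigquery

def apply_site_fix(script_path: str, project: str, site_dataset: str) -> None:
    """Run a standard-OMOP SQL script against one site's dataset."""
    sql = Path(script_path).read_text().format(dataset=f"{project}.{site_dataset}")
    bigquery.Client(project=project).query(sql).result()  # block until the job completes

# Hypothetical usage: apply_site_fix("fixes/remap_units.sql", "my-project", "site_a_omop")
```
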
- The pipeline upgrades data from OMOP Common Data Model version 5.3 to version 5.4, making all necessary structural modifications.
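
As one example of the structural changes between v5.3 and v5.4, the admission/discharge columns on visit_occurrence were renamed. A DuckDB sketch of that single rename (the full upgrade touches several other tables, and the paths here are placeholders):

```python
import duckdb

duckdb.connect().execute("""
    COPY (
        SELECT
            * EXCLUDE (admitting_source_concept_id, admitting_source_value,
                       discharge_to_concept_id, discharge_to_source_value),
            admitting_source_concept_id AS admitted_from_concept_id,
            admitting_source_value      AS admitted_from_source_value,
            discharge_to_concept_id     AS discharged_to_concept_id,
            discharge_to_source_value   AS discharged_to_source_value
        FROM read_parquet('v5.3/visit_occurrence.parquet')
    ) TO 'v5.4/visit_occurrence.parquet' (FORMAT PARQUET)
""")
```
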
- The pipeline harmonizes vocabulary versions across different sites.
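
One common approach to harmonization, shown as an illustration rather than a committed design: load a single reference vocabulary release and re-map each site's concept IDs to standard concepts through the 'Maps to' relationship. The paths are placeholders, and the sketch assumes one 'Maps to' target per concept.

```python
import duckdb

con = duckdb.connect()
con.execute("""
    -- Re-map condition concepts to the standard concept in the shared reference
    -- vocabulary; concept IDs with no 'Maps to' entry are left unchanged.
    COPY (
        SELECT
            co.* EXCLUDE (condition_concept_id),
            COALESCE(cr.concept_id_2, co.condition_concept_id) AS condition_concept_id
        FROM read_parquet('standardized/site_a/condition_occurrence.parquet') AS co
        LEFT JOIN read_parquet('vocabulary/concept_relationship.parquet') AS cr
               ON cr.concept_id_1 = co.condition_concept_id
              AND cr.relationship_id = 'Maps to'
    ) TO 'harmonized/site_a/condition_occurrence.parquet' (FORMAT PARQUET)
""")
```
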

## Architecture
- The pipeline follows a serverless architecture in which containerized applications are hosted in Google Cloud Run.
- Functionality is driven by API calls made by a centralized orchestrator - Airflow, hosted in Google Cloud Composer.
- DuckDB is used as an OLAP solution to process raw data files.
- Non-OHDSI tooling deployed by the pipeline is written such that functionality can be independently executed against individual files.
- Files are processed concurrently to improve pipeline performance.
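
A minimal sketch of the orchestration pattern described above: an Airflow task (Cloud Composer) calls a containerized Cloud Run service over HTTPS with an OIDC identity token, and dynamic task mapping fans out over individual files. The service URL, endpoint, and file list are placeholders.

```python
import pendulum
import requests
import google.auth.transport.requests
import google.oauth2.id_token
from airflow.decorators import dag, task

CLOUD_RUN_URL = "https://ehr-file-processor-xxxxx.a.run.app"  # placeholder service URL

@dag(schedule=None, start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def ehr_pipeline():
    @task
    def validate_file(gcs_uri: str) -> dict:
        # Authenticate to Cloud Run with the environment's service-account identity token (OAuth/OIDC).
        token = google.oauth2.id_token.fetch_id_token(
            google.auth.transport.requests.Request(), audience=CLOUD_RUN_URL
        )
        resp = requests.post(
            f"{CLOUD_RUN_URL}/validate",  # hypothetical endpoint
            json={"file": gcs_uri},
            headers={"Authorization": f"Bearer {token}"},
            timeout=300,
        )
        resp.raise_for_status()
        return resp.json()

    # Fan out one mapped task per file so files are processed concurrently.
    validate_file.expand(gcs_uri=["gs://bucket/site_a/2024-06-01/person.parquet"])

ehr_pipeline()
```
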