OMOP Pipeline User Guide - Analyticsphere/ehr-pipeline-documentation GitHub Wiki
Table of Contents
0. Background
- 0.1 Incoming Data Delivery Requirements
- 0.2 Pipeline Configuration Options
- 0.3 Study Site Configuration
- 0.4 Pipeline Operation
- 0.5 Pipeline Logging
- 0.6 OMOP Pipeline Directory Structure
1. File Discovery
2. File Conversion
3. File Validation
4. File Normalization
- 4.1 Core Normalization Functions
- 4.2 Invalid Row Processing
- 4.3 Connect ID Handling
- 4.4 Normalization Output
5. CDM Standardization
6. Connect Participant Filtering
- 6.1 Exclusion Rules
- 6.2 Connect Data Export
- 6.3 Table-Level Filtering Behavior
- 6.4 Connect Reporting Artifacts
7. Vocabulary Harmonization
- 7.1 Overview and Purpose
- 7.2 Harmonization Stages
- 7.3 Domain-Based Table Reassignment
- 7.4 Consolidation and Primary Key Deduplication
- 7.5 Harmonization Outputs
8. Derived Table Generation
9. BigQuery Loading and CDM Finalization
10. Reporting and Analyzer Outputs
- 10.1 Delivery Report CSV
- 10.2 Data Quality Dashboard (DQD)
- 10.3 Achilles
- 10.4 PASS
- 10.5 Atlas Results Tables
- 10.6 HTML Delivery Report
0. Background
The OMOP pipeline is an automated workflow that prepares OMOP deliveries for use in BigQuery and then runs downstream analyses and reporting.
The pipeline is made up of three deployed components:
- ccc-omop-file-processor: Python service that validates, standardizes, harmonizes, and loads OMOP files (link)
- ccc-omop-analyzer: R-based service that runs DQD, Achilles, PASS, Atlas results setup, and HTML report generation (link)
- ccc-orchestrator: Airflow workflows that coordinate the pipeline run (link)
These components are deployed and maintained separately, but work together to complete a single pipeline run.
The ccc-omop-file-processor and ccc-omop-analyzer components expose API endpoints that perform specific tasks within a pipeline run.
The ccc-orchestrator component calls those endpoints in the correct order and monitors the run as it progresses. That ordered Airflow workflow is called a Directed Acyclic Graph, or DAG. In this guide, "DAG" and "pipeline" may be used interchangeably.
0.1 Incoming Data Delivery Requirements
Connect deliveries must use one consistent set of technical choices within a single delivery. File format, CDM version, and date/datetime formatting must be consistent across all files in that delivery. These choices may change in later deliveries.
File Format
The pipeline supports OMOP data files that are individual flat files of formats:
- .csv
- .csv.gz
- .parquet
CSV File Requirements
- UTF-8 or ASCII encoded
- Comma-delimited
- RFC 4180-style escaping for quotes, commas, backslashes, and other special characters within field values
- Consistent line endings using either LF or CRLF
- Field values must be single-line values with no embedded line breaks or carriage returns
- Exactly one header row, positioned as the first row
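The CSV requirements above can be satisfied with standard tooling. The sketch below is an illustrative (not site-provided) example of writing a compliant file with Python's csv module: UTF-8 text, comma-delimited, RFC 4180 quoting, LF line endings, a single header row, and no embedded line breaks in field values.

```python
import csv
import io

def write_person_csv(rows):
    """Write rows as a compliant CSV string: RFC 4180 quoting,
    LF line endings, exactly one header row (illustrative sketch)."""
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerow(["person_id", "gender_concept_id", "birth_datetime"])
    for row in rows:
        # Embedded line breaks are not allowed; collapse them defensively.
        writer.writerow([str(v).replace("\n", " ").replace("\r", " ") for v in row])
    return buf.getvalue()
```

Values containing commas or quotes are escaped by the csv module according to RFC 4180 (double-quote wrapping with doubled inner quotes), which matches the escaping requirement above.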
File Naming
- File names must exactly match the lowercase OMOP table name plus the correct extension (e.g. person.csv)
- Column names should be lowercase and match the OMOP CDM exactly
Date and Datetime Format
- Date and datetime values must use one consistent format across all files in a delivery
- ISO 8601 formats are strongly preferred, for example 2024-03-15 and 2024-03-15 14:30:00
Schema Requirements
- All OMOP tables in a delivery must use the same CDM version
- Deliveries may use OMOP CDM 5.3, 5.3.1, or 5.4
Subject Identifiers
- Every OMOP table that contains person_id must include both the Study ID (generated by the study site) and the Connect ID (generated by the Connect Coordinating Center)
- An additional connect_id column must be included to store the Connect ID
- Exclude participants whose Connect ID is unknown or cannot be provided
0.2 Pipeline Configuration Options
Configuration is split across the three pipeline components. At a high level, the available configuration includes:
- ccc-orchestrator: Cloud Composer environment variables that define which processor and analyzer services to call, which Connect dataset to query, and which target OMOP CDM version and target vocabulary version the pipeline should standardize to
- ccc-omop-file-processor: Cloud Run service and job deployment settings, storage behavior, vocabulary file location, logging configuration, and temporary DuckDB storage configuration
- ccc-omop-analyzer: analyzer service and job deployment settings, authentication and secret configuration, artifact output locations, and job inputs for DQD, Achilles, PASS, Atlas results setup, and report generation
For the complete option lists and current variable names, see the README files for each of the individual components.
0.3 Study Site Configuration
Site-specific information is stored in dags/dependencies/ehr/config/site_config.yml in the Airflow DAG project. The DAG reads this file to determine where to find deliveries, which OMOP version was delivered, how to parse dates, where to load BigQuery tables, and how to query Connect data.
The fields currently read by the DAG are shown below:
site:
'synthea_54':
display_name: 'Synthea Synthetic Data'
gcs_bucket: 'synthea_cdm54'
file_delivery_format: '.csv'
project_id: 'nih-nci-dceg-connect-dev'
cdm_bq_dataset: 'synthea_cdm54'
analytics_bq_dataset: 'synthea_atlas_results'
omop_version: '5.4'
date_format: '%Y-%m-%d'
datetime_format: '%Y-%m-%d %H:%M:%S'
overwrite_site_vocab_with_standard: true
site_connect_id: 13
Field definitions:
- display_name: Human-readable label used in reports and analyzer outputs
- gcs_bucket: Bucket name only, without gs://
- file_delivery_format: Which file extension the DAG will look for in the delivery folder
- project_id: GCP project for BigQuery and Cloud Run job execution
- cdm_bq_dataset: BigQuery dataset that receives the OMOP CDM tables. Can be the same as analytics_bq_dataset
- analytics_bq_dataset: BigQuery dataset used by DQD, Achilles, and Atlas results tables. Can be the same as cdm_bq_dataset
- omop_version: OMOP CDM version delivered by the site
- date_format and datetime_format: Site-specific parsing formats used during normalization
- overwrite_site_vocab_with_standard:
  - true: load the configured target vocabulary into BigQuery and skip site-delivered vocabulary files
  - false: do not load the target vocabulary; site-delivered vocabulary files are loaded if present
- site_connect_id: Site identifier used in the Connect export query
Some site configuration files may also include post_processing, but the current DAG does not read that field.
0.4 Pipeline Operation
The pipeline follows an API-driven architecture orchestrated by Airflow. The production DAG runs daily and can also be triggered manually from Airflow. A non-exhaustive list of tasks executed by the pipeline, in order, includes:
- Check processor service health
- If optimized vocabulary files for the target vocabulary are not already available, create them
- Find the latest date-based delivery for each site
- Query the BigQuery pipeline log table to decide whether each delivery needs processing; if so:
- Create artifact directories and build file configuration objects
- Convert incoming files to working Parquet files
- Validate table names and column names against the delivered OMOP CDM version
- Normalize data types, fill defaults, and isolate invalid rows
- Upgrade delivered CDM files to the target CDM version when required
- Remove data rows for patients who do not meet Connect eligibility rules
- Populate the cdm_source table, if needed
- Run eight vocabulary harmonization stages on clinical tables
- Generate derived OMOP tables
- Clear the BigQuery CDM dataset and load data tables
- Generate the delivery report CSV
- Run DQD, Achilles, and PASS in parallel
- Generate the interactive HTML delivery report
- Mark the delivery complete in the pipeline log table
Multiple sites can be processed within a single pipeline run, and multiple files are processed in parallel during runtime.
0.5 Pipeline Logging
The pipeline records execution state in three places:
- Airflow task logs
- Cloud Run service and job logs
- a BigQuery logging table managed by the processor service
The BigQuery logging table tracks one row per site + delivery_date. The key states are:
- started: written when get_unprocessed_files begins processing a delivery
- running: refreshed by most downstream tasks while work is in progress
- error: written when a task fails
- completed: written only after the analyzer stage, Atlas table creation, and HTML report generation succeed
The delivery is selected for processing when:
- no log row exists for the site and delivery date, or
- the existing row has status error
The delivery is skipped when the latest log row has status:
- started
- running
- completed
Operational notes:
- Reprocessing a delivery usually means removing or correcting its BigQuery log row and rerunning the workflow
- The pipeline log table is updated through BigQuery DML, so concurrent writes can still cause transient "Too many DML statements outstanding against table" errors during busy runs
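The selection rule described above can be summarized in a few lines. This is a hypothetical helper for illustration; the real processor evaluates the same rule by querying the BigQuery log table.

```python
# Statuses come from the pipeline log table; None means no row exists yet.
PROCESS_IF = {None, "error"}                     # new delivery, or last run failed
SKIP_IF = {"started", "running", "completed"}    # in progress or already done

def should_process(latest_status):
    """Return True when a delivery is new or previously failed."""
    if latest_status in PROCESS_IF:
        return True
    if latest_status in SKIP_IF:
        return False
    raise ValueError(f"unexpected pipeline log status: {latest_status!r}")
```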
0.6 OMOP Pipeline Directory Structure
The processor creates the following directory layout under each delivery:
gs://{site_bucket}/{YYYY-MM-DD}/
└── artifacts/
    ├── converted_files/
    │   ├── person.parquet
    │   ├── condition_occurrence.parquet
    │   └── ...
    ├── invalid_rows/
    │   ├── person.parquet
    │   ├── condition_occurrence.parquet
    │   └── ...
    ├── connect_data/
    │   └── participant_status.parquet
    ├── harmonized_files/
    │   ├── condition_occurrence/
    │   │   ├── condition_occurrence_source_target_remap.parquet
    │   │   ├── condition_occurrence_target_remap.parquet
    │   │   ├── condition_occurrence_target_replacement.parquet
    │   │   └── condition_occurrence_domain_check.parquet
    │   └── ...
    ├── omop_etl/
    │   ├── condition_occurrence/
    │   │   └── condition_occurrence.parquet
    │   └── ...
    ├── derived_files/
    │   ├── condition_era.parquet
    │   ├── drug_era.parquet
    │   └── observation_period.parquet
    ├── delivery_report/
    │   ├── delivery_report_{site}_{date}.csv
    │   ├── omop_delivery_report.html
    │   └── tmp/
    │       ├── delivery_report_part_{uuid1}.parquet
    │       └── ...
    ├── dqd/
    │   ├── dqdashboard_results.json
    │   ├── dqdashboard_results.csv
    │   └── errors/
    ├── achilles/
    │   ├── achilles_results.csv
    │   └── results/
    └── pass/
        ├── pass_overall.csv
        ├── pass_table_level.csv
        ├── pass_field_level.csv
        ├── pass_composite_overall.csv
        └── pass_composite_components.csv
Key points:
- converted_files/ is the main working area for file-level processing. Conversion, normalization, CDM upgrade, Connect filtering, and cdm_source population all write back into this area.
- harmonized_files/ stores intermediate vocabulary harmonization outputs by source table.
- omop_etl/ stores the final, consolidated harmonized tables that are loaded to BigQuery and used to generate derived tables.
- derived_files/ stores generated OMOP tables such as condition_era, drug_era, and the standardized observation_period.
- delivery_report/tmp/ stores small Parquet artifacts that are later consolidated into the final CSV report.
- dqd/, achilles/, and pass/ are populated by the analyzer stage, not by the processor stage.
The artifact directory structure is created by the create_artifact_directories processor endpoint.
1. File Discovery
1.1 Folder Structure Requirements
The pipeline runs daily to check for new deliveries. Automatic discovery and execution relies on this file and directory structure:
gs://{site_bucket}/{YYYY-MM-DD}/{files}
Where:
- {site_bucket} is the site bucket configured in site_config.yml
- {YYYY-MM-DD} is the delivery folder name
- {files} are the delivered OMOP files using the configured file_delivery_format
Only the most recent top-level folder that parses as YYYY-MM-DD is considered for each site.
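The folder-selection rule can be sketched as follows. This is an illustrative helper, not the pipeline's actual discovery code: it keeps only names that parse as YYYY-MM-DD and returns the most recent one.

```python
from datetime import datetime

def latest_delivery_folder(folder_names):
    """Return the most recent top-level folder that parses as
    YYYY-MM-DD; folders with other names are ignored."""
    dated = []
    for name in folder_names:
        try:
            dated.append((datetime.strptime(name.strip("/"), "%Y-%m-%d"), name))
        except ValueError:
            continue  # not a date-named delivery folder
    return max(dated)[0:2][1] if dated else None
```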
1.2 Discovery Process
The file discovery phase begins in id_sites_to_process and get_unprocessed_files.
id_sites_to_process does the following:
- Reads the site list from site_config.yml
- Calls create_optimized_vocab once for the target vocabulary version
- Finds the latest delivery folder for each site
- Queries the BigQuery pipeline log table for that site and delivery date
- Returns only the deliveries that are new or previously failed
get_unprocessed_files then:
- Writes the started log entry for each selected site delivery
- Creates the artifact directories in the delivery folder
- Calls get_file_list with the site bucket, delivery date, and configured file extension
- Builds one FileConfig object per file, containing information from the site_config.yml file
1.3 Processing Decision
If there are no deliveries to process, the end_if_all_processed task short-circuits the workflow by skipping the remaining tasks.
2. File Conversion
Incoming files are standardized into Parquet by the convert_file DAG task, which executes the process_incoming_file processor logic through a Cloud Run job. All converted files are written to artifacts/converted_files/. Note that the note_nlp.offset field has special handling logic in the pipeline, as offset is a reserved keyword.
All converted Parquet files, regardless of incoming data file type:
- use lowercase file names
- use lowercase, cleaned column names
- store all columns as strings at this stage
2.1 Processing Parquet Input
Incoming Parquet files are validated for readability and then copied into artifacts/converted_files/.
2.2 Processing CSV and CSV.GZ Input
Incoming .csv and .csv.gz files are converted to Parquet using DuckDB. During this conversion, invalid characters, erroneous formatting, and other common CSV issues are corrected.
The pipeline detects the file encoding before reading the CSV, and then attempts to convert to Parquet using strict parsing rules. If that fails, it retries with more permissive DuckDB CSV options:
- store_rejects=True
- ignore_errors=True
- parallel=False
If conversion fails with the permissive rules, the entire task is failed.
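The strict-then-permissive retry can be sketched as a simple wrapper. The convert callable below is a hypothetical stand-in for the processor's DuckDB conversion call; only the option names come from the pipeline's documented permissive settings.

```python
STRICT_OPTIONS = {}  # default DuckDB CSV parsing: fail on malformed input
PERMISSIVE_OPTIONS = {"store_rejects": True, "ignore_errors": True, "parallel": False}

def convert_with_fallback(convert, path):
    """Try strict conversion first; on failure, retry once with
    permissive options. A second failure propagates, failing the task."""
    try:
        return convert(path, **STRICT_OPTIONS)
    except Exception:
        return convert(path, **PERMISSIVE_OPTIONS)
```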
3. File Validation
Converted Parquet files are validated by the validate_file task. Validation compares the converted file against the delivered OMOP CDM version declared for the site.
Validation creates report artifacts but does not itself transform the file.
3.1 Table Name Validation
The processor derives the OMOP table name from the file name and checks whether that table exists in the CDM schema JSON for the delivered version.
Artifacts are created for:
- valid table names
- invalid table names
3.2 Column Name Validation
If the table name is valid, the processor compares the file's columns to the schema definition for that table.
Artifacts are created for:
- valid column names
- invalid column names
- missing schema columns
These artifacts are stored in delivery_report/tmp/ and later merged into the final delivery report CSV.
4. File Normalization
The normalize_file task runs the processor's normalization logic through a Cloud Run job. It rewrites the working Parquet file in artifacts/converted_files/ so downstream tasks can assume a consistent schema and consistent data types.
4.1 Core Normalization Functions
Normalization performs these operations:
- Data type conversion
  - casts each OMOP column to its target type
  - tries to parse date and datetime columns with the site's configured formats
  - falls back to default values when a required field cannot be parsed
- Schema completion
  - adds missing OMOP columns
  - fills missing required columns with placeholder defaults
  - fills missing _concept_id columns with 0
  - drops extra columns that are not part of the OMOP table schema
- Column standardization
  - lowercases column names
  - writes columns in consistent OMOP schema order
- Primary key generation
  - for tables with surrogate primary keys, builds a deterministic composite key from the other column values
  - Note: primary keys at this stage can still collide when rows are duplicated or moved during harmonization, so a later harmonization step deduplicates them
- Special-case handling
  - person.birth_datetime uses dedicated parsing logic
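Two of the operations above can be illustrated with a small sketch. Everything here is hypothetical (the fallback default and hashing scheme are made up for illustration); the point is the pattern: parse with the site's configured format and fall back on failure, and derive surrogate keys deterministically from the row's other values so reruns produce the same key.

```python
import hashlib
from datetime import datetime

def parse_date_or_default(raw, date_format="%Y-%m-%d", default="1970-01-01"):
    """Parse a required date field with the site's configured format,
    returning a placeholder default when parsing fails."""
    try:
        return datetime.strptime(raw, date_format).date().isoformat()
    except (TypeError, ValueError):
        return default

def surrogate_key(row):
    """Deterministic composite key built from the other column values
    (illustrative hashing scheme, not the processor's actual one)."""
    payload = "|".join(f"{k}={row[k]}" for k in sorted(row))
    return int.from_bytes(hashlib.sha256(payload.encode()).digest()[:8], "big")
```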
4.2 Invalid Row Processing
Normalization also splits rows into valid and invalid groups.
A row is considered invalid when a required field cannot be cast to its appropriate type after applying the pipeline's parsing and fallback logic. In contrast:
- if a required column is missing entirely, the pipeline inserts a default value
- if a required column is present but NULL, the pipeline inserts a default value

Those rows are not treated as invalid.
For invalid rows, the processor:
- writes the rejected rows to artifacts/invalid_rows/{table}.parquet
- removes those rows from the normalized working table
Report artifacts for valid and invalid row counts are created as well.
4.3 Connect ID Handling
Normalization also looks for site-delivered Connect identifiers. If any column name contains connectid or connect_id, that column is treated as the authoritative identifier and is used to populate person_id.
This happens in all OMOP tables that contain a person_id column.
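The column detection described above amounts to a substring match on column names. The helper below is an illustrative sketch (the case-insensitive match is an assumption, since converted files already use lowercase column names).

```python
def find_connect_id_column(columns):
    """Return the first column whose name contains 'connectid' or
    'connect_id'; this column would then be used to populate person_id."""
    for col in columns:
        name = col.lower()
        if "connectid" in name or "connect_id" in name:
            return col
    return None
```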
4.4 Normalization Output
After normalization:
- the normalized table overwrites the file in artifacts/converted_files/
- invalid rows are written to artifacts/invalid_rows/. If there are no invalid rows, a file with 0 rows will be created.
- row-count artifacts are written to delivery_report/tmp/
5. CDM Standardization
The cdm_upgrade task upgrades delivered OMOP files to the configured target CDM version. In the current pipeline, the supported upgrade path is from 5.3 to 5.4.
5.1 Upgrade Determination
For each file:
- if the delivered CDM version already matches the target CDM version, the task does nothing
- otherwise, the processor checks whether that table changed between the two versions
5.2 Table Handling During Upgrade
The current 5.3 -> 5.4 handling is:
- Removed tables
  - attribute_definition is deleted from the processed delivery because it does not exist in CDM 5.4
- Changed tables
  - Version-specific SQL is applied to these tables: visit_occurrence, visit_detail, procedure_occurrence, device_exposure, measurement, observation, note, location, metadata, cdm_source
- New tables
  - episode, episode_event, and cohort are not generated during file upgrade; they are created later in BigQuery during dataset finalization
- Unchanged tables
  - tables that are the same in both 5.3 and 5.4 pass through without modification
5.3 Upgrade Implementation
Upgrade SQL scripts are stored under reference/sql/cdm_upgrade/.
The processor:
- selects the script for the relevant table and version transition
- runs the SQL against the normalized working Parquet file
- overwrites the same working Parquet file in
artifacts/converted_files/
If the table was removed in the target CDM version, the processor deletes the processed Parquet artifact (not the source file) instead of rewriting it.
6. Connect Participant Filtering
Data for patients who are not consented and verified Connect participants, or whose participation status prohibits use of their EHR data (i.e. revoked HIPAA authorization, withdrawn consent, or a data destruction request), are removed. Connect participant filtering runs after normalization and CDM upgrade so that filtering rules are applied to a standardized, consistent data structure.
Filtering runs once per pipeline execution and uses the participant information that is current at the time of execution. If a participant's status later changes, the pipeline can be rerun against an already processed delivery to apply the updated status. Additional filtering outside of the EHR pipeline is completed prior to data being released for research.
6.1 Exclusion Rules
The participant filter removes rows when any one of the following is true:
- the participant's Connect ID is missing, non-numeric, or -1 (the pipeline default value)
- the participant's Connect ID in the EHR data is absent from the Connect BigQuery Participant table for that site
- the participant is not Verified
- the participant has withdrawn consent
- the participant has revoked HIPAA authorization
- the participant has requested data destruction
Internally, the processor applies the following Connect concept ID rules:
- verification status (914594314) must equal 197316935
- exclusion flags for HIPAA revocation, withdrawn consent, and data destruction requests (773707518, 747006172, and 831041022, respectively) are triggered when their concept ID equals 353358909
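The concept-ID rules above reduce to a simple predicate. The sketch below treats a participant row as a flat dict keyed by concept ID, which is a simplification of the real Connect export; the concept IDs themselves come from the rules above.

```python
VERIFICATION_FIELD = "914594314"   # verification status field
VERIFIED = "197316935"             # value meaning "Verified"
TRIGGERED = "353358909"            # value meaning the flag applies
EXCLUSION_FLAGS = (
    "773707518",  # HIPAA authorization revoked
    "747006172",  # consent withdrawn
    "831041022",  # data destruction requested
)

def is_excluded(participant):
    """Return True when the participant fails verification or any
    exclusion flag is triggered (illustrative flat-dict model)."""
    if participant.get(VERIFICATION_FIELD) != VERIFIED:
        return True
    return any(participant.get(flag) == TRIGGERED for flag in EXCLUSION_FLAGS)
```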
6.2 Connect Data Export
The retrieve_connect_data task runs once per site delivery and calls the processor's get_connect_data logic. It queries the Connect BigQuery Participant table for the current participant list associated with the site, along with their verification status and participation variables, then writes the result to artifacts/connect_data/participant_status.parquet.
This task also generates report artifacts identifying participants who should be excluded per the rules above, or should be included in the EHR delivery but are missing.
6.3 Table-Level Filtering Behavior
The filter_participants task runs once per file against the working Parquet file in artifacts/converted_files/. Filtering is applied to tables that contain a Connect ID; tables without a Connect ID are not filtered. Tables without a Connect ID contain metadata or vocabulary information - not clinical data.
Rows are retained only if they belong to a patient who can be verified as a Connect participant and who does not meet any exclusion rule.
Report artifacts are generated describing the number of rows removed; artifacts are generated even if 0 rows are removed.
6.4 Connect Reporting Artifacts
The Connect filtering stage adds report artifacts for:
- counts of rows removed (by table) because the Connect ID was missing or invalid
- counts of rows removed (by table) because the Connect ID in the EHR data was not found in the Connect Participant BigQuery table
- counts of rows removed (by table) because at least one of the participant exclusion rules applied
- a list of Connect IDs present in the OMOP delivery but missing from the Connect Participant BigQuery table
- a list of eligible Connect IDs missing from the delivery
7. Vocabulary Harmonization
Vocabulary harmonization standardizes clinical concept usage to the configured target vocabulary version. Harmonization applies only to these clinical tables:
- visit_occurrence
- condition_occurrence
- drug_exposure
- procedure_occurrence
- device_exposure
- measurement
- observation
- note
- specimen
7.1 Overview and Purpose
Vocabulary harmonization exists because concept meanings, mappings, and domains change across vocabulary releases. The goal is to produce a final dataset whose clinical tables are aligned to one target vocabulary version, even when sites delivered data built against older vocabulary releases.
7.2 Harmonization Stages
The vocabulary harmonization process is split into eight stages:
1. source_target - remaps source concept IDs to updated target mappings. Run per file
2. target_remap - remaps non-standard target concepts when a newer standard mapping exists. Run per file
3. target_replacement - replaces concepts that have direct replacement relationships. Run per file
4. domain_check - verifies that the concept domain still matches the OMOP table where the row currently lives. Run per file
5. omop_etl - transforms rows into their destination OMOP tables. Run per file
6. consolidate_etl - merges per-file ETL outputs into one consolidated table per destination OMOP table. Run per delivery
7. discover_tables_for_dedup - inspects consolidated ETL outputs and identifies which destination tables need primary key deduplication. Run per ETL'ed file
8. deduplicate_single_table - rewrites each identified destination table so primary keys are unique. Run per ETL'ed file
7.3 Domain-Based Table Reassignment
During domain_check, the harmonizer assigns each row a target_table based on the current domain of its harmonized target concept. During omop_etl, rows are written into the destination OMOP table named in target_table.
If the current concept domain still matches the source table, the row stays in that table. If the concept now belongs to a different supported OMOP domain, the row is written to the corresponding destination table. If the domain is unknown or does not map to one of the supported harmonized domains, target_table defaults to the source table.
This means harmonization can create destination tables that were not present in the original delivery. For example:
- a site does not deliver a note table
- a vocabulary update changes some harmonized concepts to the Note domain
- omop_etl writes those rows to the note destination table
As a result, the harmonized output can include a valid note table even though the site did not originally deliver one.
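The routing rule above can be expressed as a small lookup with a fallback. The domain-to-table map below is an assumption drawn from the list of harmonized clinical tables; the real pipeline determines domains from the target vocabulary.

```python
# Assumed map from concept domain to harmonized destination table.
DOMAIN_TO_TABLE = {
    "Visit": "visit_occurrence",
    "Condition": "condition_occurrence",
    "Drug": "drug_exposure",
    "Procedure": "procedure_occurrence",
    "Device": "device_exposure",
    "Measurement": "measurement",
    "Observation": "observation",
    "Note": "note",
    "Specimen": "specimen",
}

def target_table(source_table, concept_domain):
    """Route a row to the table matching its harmonized concept domain;
    unknown or unsupported domains default to the source table."""
    return DOMAIN_TO_TABLE.get(concept_domain, source_table)
```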
7.4 Consolidation and Primary Key Deduplication
The first five stages operate per source file. After that:
- consolidate_etl merges all destination fragments for one site delivery
- discover_tables_for_dedup writes a temporary table-config JSON file so the DAG can fan out deduplication work
- deduplicate_single_table runs in parallel across the discovered destination tables
These site-level consolidation and table discovery processes are required because domain_check and omop_etl can generate tables that were not in the site's original delivery; these newly generated tables require FileConfig objects in order to be processed by the DAG.
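The discovery check amounts to comparing total and distinct primary key counts, and deduplication then rewrites the table so each key appears once. Both helpers below are illustrative sketches; in particular, keep-first is just one possible policy, and the actual rewrite is performed by the processor, not shown here.

```python
def needs_dedup(key_values):
    """A destination table needs deduplication when any primary key
    value appears more than once."""
    return len(key_values) != len(set(key_values))

def keep_first_per_key(rows, pk):
    """Illustrative dedup policy: keep the first row seen per key."""
    seen, out = set(), []
    for row in rows:
        if row[pk] not in seen:
            seen.add(row[pk])
            out.append(row)
    return out
```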
7.5 Harmonization Outputs
The harmonization stage produces two distinct artifact areas:
- artifacts/harmonized_files/ - per-source intermediate mapping outputs
- artifacts/omop_etl/ - final consolidated destination tables used for BigQuery loading, derived table generation, and downstream reporting
8. Derived Table Generation
The DAG generates derived OMOP tables after vocabulary harmonization is complete.
8.1 Purpose of Derived Tables
Derived OMOP tables are part of the standard CDM, but they are produced from other OMOP tables rather than loaded directly from source systems. The pipeline generates these according to OHDSI and THEMIS guidelines.
8.2 Generated Tables
The pipeline generates these tables:
- condition_era
  - requires condition_occurrence
  - groups related condition records into eras
- drug_era
  - requires drug_exposure
  - groups drug exposures into eras
- observation_period
  - requires person, visit_occurrence, and death
  - is always standardized by the pipeline, even when a site delivered its own observation_period
observation_period uses one of three SQL paths:
- visit_occurrence plus death, if both are present
- visit_occurrence only, if death is absent
- if neither table is present, a generic observation period is created
Because load_derived_tables runs after load_remaining, the generated observation_period replaces any site-delivered observation_period table in BigQuery.
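The three-way path selection can be sketched as follows; the path names are hypothetical labels, not the actual SQL script names.

```python
def observation_period_path(available_tables):
    """Choose which of the three observation_period SQL paths applies,
    based on which source tables exist in the delivery."""
    tables = set(available_tables)
    if {"visit_occurrence", "death"} <= tables:
        return "visit_plus_death"
    if "visit_occurrence" in tables:
        return "visit_only"
    return "generic"
```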
8.3 Implementation Approach
Derived table generation:
- checks whether the required source tables exist
- reads harmonized tables from artifacts/omop_etl/ when the source table was vocabulary-harmonized
- reads working files from artifacts/converted_files/ or artifacts/omop_etl/
- executes the relevant SQL script from reference/sql/derived_tables/
- writes the result to artifacts/derived_files/{table}.parquet
Current implementation details:
- drug_era uses a two-part SQL flow because it is more resource-intensive than the other derived tables
- when a required source table is missing, the processor logs a warning and skips writing that derived table. The task does not fail.
9. BigQuery Loading and CDM Finalization
9.1 Data Loading Sequence
The BigQuery load order is:
1. prepare_bq
   - deletes all tables in the site's CDM dataset before loading the new delivery
2. load_harmonized_tables
   - loads consolidated artifacts/omop_etl/ tables to BigQuery
   - skips cleanly when no harmonized tables were created for the delivery
3. load_target_vocab
   - loads target vocabulary tables only when overwrite_site_vocab_with_standard is true
4. load_remaining
   - loads the remaining processed Parquet files from artifacts/converted_files/
   - skips:
     - vocabulary tables when standard vocabulary loading is enabled
     - clinical tables already loaded from omop_etl/
     - cdm_source, which is loaded later
5. load_derived_tables
   - loads all Parquet files in artifacts/derived_files/
   - replaces same-named tables already in BigQuery, including observation_period
6. cleanup
   - loads cdm_source.parquet
   - creates missing OMOP tables
   - normalizes BigQuery date and datetime column types through the OMOP DDL
After cleanup, the DAG generates the delivery report CSV and then starts the analyzer phase.
10. Reporting and Analyzer Outputs
After the CDM dataset is finalized in BigQuery, the pipeline produces two kinds of reporting outputs:
- a delivery-level CSV generated by ccc-omop-file-processor
- downstream analyzer outputs generated by the separately deployed ccc-omop-analyzer
The analyzer phase uses a small set of R packages, including OHDSI DataQualityDashboard, OHDSI Achilles, PASS, and omopDeliveryReport. It also creates the BigQuery results tables needed for OHDSI ATLAS.
10.1 Delivery Report CSV
The generate_report_csv task runs after cleanup and produces the final delivery report CSV in artifacts/delivery_report/. This step is executed by ccc-omop-file-processor. The remaining outputs in this section are produced by ccc-omop-analyzer.
The report is assembled from temporary artifact Parquet files written throughout pipeline execution. It serves as a structured summary of the delivery and is one input to the final HTML report.
The final file name is:
artifacts/delivery_report/delivery_report_{site}_{delivery_date}.csv
10.2 Data Quality Dashboard (DQD)
After the report CSV is created, the DAG triggers the analyzer job ccc-omop-analyzer-dqd-job to run OHDSI DataQualityDashboard.
DQD runs standardized data quality checks against the finalized OMOP dataset and writes these artifacts to GCS:
- artifacts/dqd/dqdashboard_results.json
- artifacts/dqd/dqdashboard_results.csv
- artifacts/dqd/errors/*.txt when DQD produces error files
The dqdashboard_results.csv output is also written to the analytics BigQuery dataset as a table named dqdashboard_results.
10.3 Achilles
The DAG also triggers ccc-omop-analyzer-achilles-job to run OHDSI Achilles.
Achilles produces database characterization outputs and writes supporting results tables in BigQuery for ATLAS and related OHDSI tooling.
The Achilles job also generates these artifacts in GCS:
- artifacts/achilles/achilles_results.csv
- artifacts/achilles/results/**/*.json
10.4 PASS
The DAG also triggers ccc-omop-analyzer-pass-job to run PASS.
PASS produces additional scoring outputs about the suitability of the finalized dataset for downstream use.
Current PASS outputs are written to artifacts/pass/ and include:
- pass_overall.csv
- pass_table_level.csv
- pass_field_level.csv
- pass_composite_overall.csv
- pass_composite_components.csv
10.5 Atlas Results Tables
Once DQD, Achilles, and PASS finish successfully, the analyzer service endpoint /create_atlas_results_tables creates additional BigQuery tables used by OHDSI ATLAS.
10.6 HTML Delivery Report
After the Atlas results tables are created, the analyzer service endpoint /generate_delivery_report uses omopDeliveryReport to combine the following into a single HTML delivery report:
- the delivery report CSV
- the DQD results
- the PASS outputs
The output is written to:
artifacts/delivery_report/omop_delivery_report.html
After both analyzer service calls complete, mark_delivery_complete writes the completed status to the pipeline log table.