# Initial EHR Ingestion [DRAFT]
1. Schema validation (see the sketch after this step)
- Use Python + DuckDB
- Confirm the delivery has all expected v5.3 CDM fields and tables
- Identify any fields and tables in the delivery that do not exist in v5.3
- Identify rows with invalid data types
- Basic checks to ensure DataQualityDashboard (DQD) will run (e.g., at least one row in the cdm_source table, vocabulary tables are present, etc.)
- Generate report:
- Failing fields and rows with invalid data types
- Vocabulary version (`SELECT vocabulary_version FROM vocabulary WHERE vocabulary_id = 'None'`)
- Row counts per table
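A minimal sketch of the Python + DuckDB check, assuming the delivery is a directory with one CSV per CDM table and that the expected v5.3 table/field list has been exported to a reference CSV (`cdm_v5.3_fields.csv` is hypothetical; the OHDSI CommonDataModel repository publishes field-level CSVs that could fill this role):

```python
import csv
from pathlib import Path

import duckdb

DELIVERY_DIR = Path("delivery")                 # assumption: one <table>.csv per CDM table
EXPECTED_FIELDS = Path("cdm_v5.3_fields.csv")   # hypothetical reference with table,field columns

# Build the expected v5.3 schema as {table: {field, ...}}
expected: dict[str, set[str]] = {}
with open(EXPECTED_FIELDS, newline="") as f:
    for row in csv.DictReader(f):
        expected.setdefault(row["table"].lower(), set()).add(row["field"].lower())

con = duckdb.connect()
for csv_path in sorted(DELIVERY_DIR.glob("*.csv")):
    table = csv_path.stem.lower()
    # DESCRIBE over read_csv_auto yields the delivered column names without a full load
    cols = {
        r[0].lower()
        for r in con.execute(f"DESCRIBE SELECT * FROM read_csv_auto('{csv_path}')").fetchall()
    }
    n_rows = con.execute(f"SELECT count(*) FROM read_csv_auto('{csv_path}')").fetchone()[0]
    if table not in expected:
        print(f"{table}: not a v5.3 CDM table ({n_rows} rows)")
        continue
    missing = sorted(expected[table] - cols)
    extra = sorted(cols - expected[table])
    print(f"{table}: {n_rows} rows; missing fields: {missing}; fields not in v5.3: {extra}")
```

Per-table row counts and the field comparison feed the report above; data-type checks can be layered in by `TRY_CAST`-ing each column to its expected v5.3 type in DuckDB and counting the rows that fail.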
2. Load CSV files into BigQuery (BQ) tables (see the sketch after this step)
- Use Python + DuckDB to automate
- Can be done manually for now
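A sketch of the automated load, assuming the validated CSVs from step 1 sit in `delivery/` and the target dataset already exists; `PROJECT` and `DATASET` are placeholders. The load itself uses the official `google-cloud-bigquery` client, with DuckDB reserved for the local CSV checks in step 1:

```python
from pathlib import Path

from google.cloud import bigquery

PROJECT = "my-gcp-project"   # assumption: replace with the real project ID
DATASET = "ehr_omop"         # assumption: target BQ dataset for this delivery

client = bigquery.Client(project=PROJECT)
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # header row
    autodetect=True,       # let BQ infer types; swap in explicit schemas once stable
)

for csv_path in sorted(Path("delivery").glob("*.csv")):
    table_id = f"{PROJECT}.{DATASET}.{csv_path.stem.lower()}"
    with open(csv_path, "rb") as f:
        job = client.load_table_from_file(f, table_id, job_config=job_config)
    job.result()  # wait for completion; raises on load errors
    print(f"Loaded {table_id}")
```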
3. Execute DataQualityDashboard
- Collect/store results.json and results.csv files
- Write results to BQ (manually, as the package does not support writing them to BQ automatically; see the sketch after this step)
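Until the package writes to BigQuery directly, a short script can store both artifacts. A sketch, assuming DQD wrote its output to `./dqd_output`; the bucket and table names are placeholders:

```python
from google.cloud import bigquery, storage

PROJECT = "my-gcp-project"                    # assumption
TABLE_ID = f"{PROJECT}.ehr_omop.dqd_results"  # assumption: our choice of results table
BUCKET = "ehr-dqd-artifacts"                  # assumption: GCS bucket for raw artifacts

# Archive the raw results.json alongside the tabular results
storage.Client(project=PROJECT).bucket(BUCKET).blob("results.json") \
    .upload_from_filename("dqd_output/results.json")

# Load results.csv into BQ, replacing any previous run for this delivery
client = bigquery.Client(project=PROJECT)
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
with open("dqd_output/results.csv", "rb") as f:
    client.load_table_from_file(f, TABLE_ID, job_config=job_config).result()
```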
4. Execute Achilles (see the sketch after this step)
- Run package as-is
- Run the achilles_results_concept_counts script (pulled from the Atlas package)
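Achilles is an R package, so one option that keeps orchestration in Python is shelling out to Rscript. Here `scripts/run_achilles.R` is a hypothetical wrapper that would build the DatabaseConnector connection details and call `Achilles::achilles()` plus the concept-counts script; only the subprocess plumbing is sketched:

```python
import subprocess

# scripts/run_achilles.R is hypothetical; it would run Achilles::achilles()
# and the achilles_results_concept_counts script against the CDM schema.
result = subprocess.run(
    ["Rscript", "scripts/run_achilles.R", "--cdm-schema", "ehr_omop"],  # schema name is a placeholder
    capture_output=True,
    text=True,
)
print(result.stdout)
if result.returncode != 0:
    raise RuntimeError(f"Achilles run failed:\n{result.stderr}")
```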
5. Refresh OHDSI tools
- Make API calls to refresh ATLAS and Ares (see the sketch after this step)
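ATLAS reads from WebAPI, so the refresh can be plain HTTP. A minimal sketch, assuming a Broadsea-style WebAPI at `http://localhost:8080/WebAPI`; the `/source/refresh` endpoint reloads the configured sources, while the `cdmresults` cache-refresh path is an assumption to verify against the deployed WebAPI version:

```python
import requests

WEBAPI_URL = "http://localhost:8080/WebAPI"  # assumption: Broadsea default
SOURCE_KEY = "EHR_OMOP"                      # assumption: source key registered in WebAPI

# Reload WebAPI's list of CDM sources so ATLAS sees the refreshed schema
requests.get(f"{WEBAPI_URL}/source/refresh", timeout=60).raise_for_status()

# Rebuild the cached Achilles results ATLAS reads for this source
# (endpoint name is an assumption; check the deployed WebAPI version)
requests.get(f"{WEBAPI_URL}/cdmresults/{SOURCE_KEY}/refreshCache", timeout=300).raise_for_status()
```

Ares itself is refreshed by regenerating its data index, typically via the AresIndexer R package, which could be wrapped with the same Rscript pattern used for Achilles above.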
Future state TODOs:
- Harmonize vocabulary versions across different deliveries (e.g., how to map to newer standard concepts)
- Set up Broadsea (ATLAS, HADES, Ares, etc.) on cloud VM
- Containerize the R, RStudio, Java, rJava, and OHDSI library installation