Extract Locations And Change log - som-shahlab/femr GitHub Wiki

FEMR v2 Extracts

STARR-OMOP

The following are the canonical locations of the MEDS extracts:

On shahlab-compute: /labs/shahlab/datasets/ehr_ml_extracts/meds_extracts

On GCP: gs://extract_backups/

All MEDS extracts follow the naming schema starr_omop_cdm5_confidential_${TYPE}_${DATE}_extract

TYPE is the name of the extract (1pcent indicates a 1% extract, lite indicates no text) and DATE is the official STARR-OMOP version date.

Currently, we have only one set of extracts, starr_omop_cdm5_confidential_2024_03_08_extract and starr_omop_cdm5_confidential_lite_2024_03_24_extract, which are our V1 MEDS extracts.
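The naming schema above can be sketched in Python as a pair of helpers that build and parse extract names. This is only an illustration: the function names are hypothetical and not part of FEMR, and treating TYPE as optional is an assumption based on the full extract name listed above, which carries no TYPE segment.

```python
import re
from datetime import date
from typing import Optional, Tuple

PREFIX = "starr_omop_cdm5_confidential"

# Assumption: TYPE is a single alphanumeric token (e.g. "lite", "1pcent")
# and may be absent, as in the full extract listed above.
_NAME_RE = re.compile(
    r"^starr_omop_cdm5_confidential_"
    r"(?:(?P<type>[A-Za-z0-9]+)_)?"
    r"(?P<date>\d{4}_\d{2}_\d{2})_extract$"
)


def extract_name(version_date: date, extract_type: Optional[str] = None) -> str:
    """Build an extract directory name from a STARR-OMOP version date and optional TYPE."""
    parts = [PREFIX]
    if extract_type:
        parts.append(extract_type)
    parts.append(version_date.strftime("%Y_%m_%d"))
    parts.append("extract")
    return "_".join(parts)


def parse_extract_name(name: str) -> Tuple[Optional[str], date]:
    """Split an extract directory name into (TYPE or None, version date)."""
    match = _NAME_RE.match(name)
    if match is None:
        raise ValueError(f"not a MEDS extract name: {name!r}")
    year, month, day = (int(part) for part in match.group("date").split("_"))
    return match.group("type"), date(year, month, day)


if __name__ == "__main__":
    print(extract_name(date(2024, 3, 24), "lite"))
    print(parse_extract_name("starr_omop_cdm5_confidential_2024_03_08_extract"))
```

Parsing the name rather than hard-coding it makes it easier to check which STARR-OMOP version date a given extract directory corresponds to.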

FEMR v1 Extracts

STARR-OMOP

Version 2:

  • First functional version
  • /local-scratch/nigam/projects/ethanid/som-rit-phi-starr-prod.starr_omop_cdm5_deid_2022_09_05_extract2

Version 3:

  • The visit_detail table is now processed, so inpatient transfers between units are added as events. All our existing code will automatically use transfers as features for learning, and we can now use transfers in labeling functions.
  • Metadata is now stored in the database. This means visit end dates, visit_ids, clarity_tables, omop_tables, etc. are now persisted. Previously those were dropped during the ETL process.
  • Events that occur well before birth (1 month prior to birth) are now dropped because they stem from ETL errors. This filter drops 50,790 events, a tiny fraction of our total number of events.
  • Every database now has a unique id generated during the extraction process. This will help avoid bugs where people mix extract versions.
  • I optimized our database serialization logic, reducing the size of our database by 4%, which should also improve speed by about 4%. Note that the 4% figure includes the additional visit_detail info; the optimization itself is around 7-8%.
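
The pre-birth filter described in the Version 3 notes can be sketched as follows. This is not the actual ETL code: the `(timestamp, code)` event representation, the function name, the example codes, and the reading of "1 month" as 30 days are all assumptions made for illustration.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

# Hypothetical event representation: (timestamp, code).
Event = Tuple[datetime, str]

# Assumption: "1 month prior to birth" is approximated as 30 days.
PRE_BIRTH_GRACE = timedelta(days=30)


def drop_pre_birth_events(birth: datetime, events: List[Event]) -> Tuple[List[Event], int]:
    """Keep events at or after (birth - grace period); return kept events and the dropped count."""
    cutoff = birth - PRE_BIRTH_GRACE
    kept = [event for event in events if event[0] >= cutoff]
    return kept, len(events) - len(kept)


if __name__ == "__main__":
    birth = datetime(2000, 1, 1)
    events = [
        (datetime(1999, 11, 1), "code/a"),   # two months before birth
        (datetime(1999, 12, 15), "code/b"),  # within the grace period
        (datetime(2000, 6, 1), "code/c"),    # after birth
    ]
    kept, dropped = drop_pre_birth_events(birth, events)
    print(len(kept), dropped)
```

Returning the dropped count alongside the kept events makes it easy to report how many events a filter like this removes across a whole extract.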

Version 4:

  • Ontology structure is now retained for flowsheet values. The parent of a flowsheet code is now the flowsheet form, so you can better track which flowsheet values come from which form.
  • Units are now saved for all codes with values. e.units contains the units for an event. Note that both string and numeric codes can come with units attached.
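
As an illustration of how the units field might be used, the sketch below groups event values by code and units so that values recorded in different units are not mixed. The `Event` dataclass here is a hypothetical stand-in, not the FEMR class; only the `units` attribute name comes from the note above, and the codes and values are made up.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple, Union

Value = Union[float, str]


@dataclass
class Event:
    """Hypothetical stand-in for an extract event (not the FEMR class)."""
    code: str
    value: Optional[Value] = None
    units: Optional[str] = None


def values_by_code_and_units(events: List[Event]) -> Dict[Tuple[str, Optional[str]], List[Value]]:
    """Group values by (code, units); events without a value are skipped."""
    grouped: Dict[Tuple[str, Optional[str]], List[Value]] = defaultdict(list)
    for e in events:
        if e.value is None:
            continue
        grouped[(e.code, e.units)].append(e.value)
    return dict(grouped)


if __name__ == "__main__":
    events = [
        Event("example/weight", 70.0, "kg"),
        Event("example/weight", 154.0, "lb"),
        Event("example/weight", 71.5, "kg"),
        Event("example/visit"),  # no value, so no units to track
    ]
    for key, values in values_by_code_and_units(events).items():
        print(key, values)
```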

Version 5:

  • source_code field (which contains the source clarity codes) added
  • /local-scratch/nigam/projects/ethanid/som-rit-phi-starr-prod.starr_omop_cdm5_deid_2022_09_05_extract_v5

Version 6:

  • Original concept id mappings to map things back to OMOP concept ids
  • Add 3 more months of data
  • Visit rework, fixing dropped visits and improving the visit timing
  • omop_table now fixed to properly refer to the correct OMOP table
  • /local-scratch/nigam/projects/ethanid/som-rit-phi-starr-prod.starr_omop_cdm5_deid_2023_01_02_extract_v6

Version 7:

  • Fixing more dropped visits
  • /local-scratch/nigam/projects/ethanid/som-rit-phi-starr-prod.starr_omop_cdm5_deid_2023_01_02_extract_v7

Version 8:

  • Switch to a more recent version of the data with fewer filtered patients
  • New extract binary version (binary version 3), so it requires the latest CLMBR branch
  • No longer dropping short patients
  • [Carina] /share/pi/nigam/data/som-rit-phi-starr-prod.starr_omop_cdm5_deid_2023_02_08_extract_v8
  • [GCP] gs://michael-2/som-rit-phi-starr-prod.starr_omop_cdm5_deid_2023_02_08_extract_v8

Version 9:

  • 6 more months of data
  • Completely rework handling of source codes so that the extracts are 25% lighter and automatically featurize source codes
  • Enable handling of data elements that are not mappable to any standard OMOP concept
  • More dictionary compression to shrink extracts by 25%
  • Fix drug code ontology
  • Improve visit_detail processing by taking advantage of OMOP visit ontologies
  • Fix FEMR handling of observation and measurement tables to process value_as_concept_id
  • Fix the following bugs in STARR-OMOP
    • Typos in visit mappings
    • Incorrectly duplicated visit rows
    • Poorly designed duplicate source concept ids
    • Bad mapping to ICD10 instead of ICD10CM
    • value_as_concept_id for measurement and observation tables
  • [Carina] /share/pi/nigam/data/som-rit-phi-starr-prod.starr_omop_cdm5_deid_2023_08_13_extract_v9_lite (no notes, no metadata)
  • [GCP] gs://extract_backups/som-rit-phi-starr-prod.starr_omop_cdm5_deid_2023_08_13_extract_v9 (includes notes and metadata)

MERATIVE

  • [GCP] gs://motor_paper_backup/MOTOR_NERO/truven_v2/truven_extract_v2

  • Used for the MOTOR paper

MIMIC-IV

  • [GCP] gs://extract_backups/femr_mimic_extract.tar.gz

  • Used for the MOTOR paper. Based on a MIMIC-IV OMOP ETL from Lawrence.