N3C National COVID Cohort Collaborative - onetomapanalytics/Meta_Data GitHub Wiki

N3C - National COVID Cohort Collaborative

General description

  1. Database primary purpose - Provide a large, centralized data resource to allow research teams to study COVID-19 and identify potential treatments as the pandemic evolves.
  2. Overall data type - Health outcomes
  3. Dataset type - Longitudinal
  4. Data source - Electronic Health Records (EHR)
  5. Data level - Patient level
  6. Geographic location of the data collection sites - United States (more than 83 institutions around the country)
  7. Sponsor, manager, or home institution - National Center for Advancing Translational Sciences (NCATS), National Institutes of Health (NIH)
  8. Date range - Jan 1, 2018 to most recent data partner extraction date
  9. Geolocation data - Zip codes for patients available under the limited dataset
  10. Dates - Available under the limited dataset
  11. Hospital identifiers - Synthetic data partner ID
  12. Physician identifiers - Synthetic provider ID
  13. Longitudinal tracking - Patients can be tracked within and across participating hospitals at the inpatient, outpatient, ED, and office levels. Providers (across the 45 hospitals that currently supply the provider ID) and hospitals can be tracked using occurrence records (e.g., visit, procedure) and the IDs mentioned above
  14. Clinical areas of interest - all
  15. Number of records - N3C contained information on approximately 22.3 million anonymized persons as of Sep 8th, 2022. As of April 4th, 2024, the Enclave has 32.3 billion total rows, covering more than 8.5 million COVID+ cases, 3.2 billion clinical observations, 15.6 billion lab results, 5.1 billion medication records, 1.2 billion procedures, and 1.7 billion visits
  16. Variables that are uniquely present in this dataset - COVID variables, inpatient medications, drugs in general, labs, zip codes and dates for patients in the limited dataset, and connection among inpatient, outpatient, ED, and office data
  17. Database caveats and limitations - (1) Hospitals cannot be identified, as defined by the Data Use Agreement (DUA); (2) it is restricted to patients who have undergone a COVID test; (3) instances of the same condition or procedure might be mapped to different OMOP-CDM concepts; and (4) lab result values are not necessarily consistent across hospitals.
  18. Other - There are three levels of data available for analysis, each with specific user eligibility and access requirements: limited (patient data retain PHI such as dates and zip codes), de-identified (PHI are changed to protect patients' privacy), and synthetic (data derived from the limited dataset that statistically resemble patient information but are not real patient data).
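
The longitudinal tracking described in item 13 can be illustrated with a minimal sketch. It assumes visit rows shaped loosely like an OMOP visit_occurrence extract. The field names (`person_id`, `data_partner_id`, `visit_type`, `visit_start_date`) and all values are illustrative assumptions, not the actual N3C Enclave schema or data.

```python
from collections import defaultdict
from datetime import date

# Toy visit rows; every value here is invented for illustration.
visits = [
    {"person_id": 1, "data_partner_id": 101, "visit_type": "outpatient",
     "visit_start_date": date(2021, 3, 2)},
    {"person_id": 1, "data_partner_id": 101, "visit_type": "inpatient",
     "visit_start_date": date(2021, 3, 9)},
    {"person_id": 2, "data_partner_id": 205, "visit_type": "ED",
     "visit_start_date": date(2020, 11, 20)},
    {"person_id": 1, "data_partner_id": 101, "visit_type": "ED",
     "visit_start_date": date(2021, 3, 8)},
]


def build_timelines(rows):
    """Group visit rows by patient and order each group by start date."""
    timelines = defaultdict(list)
    for row in rows:
        timelines[row["person_id"]].append(row)
    for person_rows in timelines.values():
        person_rows.sort(key=lambda r: r["visit_start_date"])
    return dict(timelines)


timelines = build_timelines(visits)

# Sequence of care settings each patient passed through, in date order.
settings_seen = {
    pid: [v["visit_type"] for v in rows] for pid, rows in timelines.items()
}
```

In this toy data, patient 1 moves from an outpatient visit to the ED and then to an inpatient stay. The same grouping could be keyed on a provider ID where a hospital supplies one.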

Applicable methods

  1. Association, such as logistic regression models (1, 2)
  2. Machine learning (3, 2, 4)
  3. Propensity score (5, 6)
  4. Sensitivity analysis (7, 8)

High-impact designs

  • Evaluate COVID-19 severity and risk factors (3, 9)

  • Examine the characteristics associated with COVID among children (10)

  • Characterize the use of different drugs (11, 12)

  • Evaluate the association between post-recovery COVID-19 and incident heart failure (13)

Data dictionary

To access the N3C data dictionary, click here

Variable categories

  1. Patient demographics (e.g., year of birth, sex, race, ethnicity, location)
  2. Biological samples (e.g., specimen type, collection date)
  3. Death (e.g., date, type, cause)
  4. Visit [e.g., occurrence, concept (i.e., inpatient, outpatient, ED, long-term care), start and end dates/time, type]
  5. Procedure (e.g., concept, type, quantity, date, both for diagnostic and/or therapeutic purposes)
  6. Drug exposure (e.g., concept, type, exposure dates, quantity, dose, reason it was stopped)
  7. Device exposure (e.g., concept, exposure dates, quantity)
  8. Condition occurrence (e.g., concept, type, occurrence dates)
  9. Measurements (e.g., laboratory results, vital signs, quantitative findings from pathology reports)
  10. Observation (i.e., capture of data not represented by other domains, including unstructured measurements, medical history, and family history)
  11. COVID-19 test

Linkage to other datasets

  • Linkages can be established with any dataset that contains zip code information.
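
As a rough illustration of zip-code linkage, the sketch below attaches area-level attributes to patient rows by matching on zip code. The external dataset, the `area_index` field, and all values are hypothetical. Zip codes are kept as strings so that leading zeros are preserved.

```python
# Toy patient rows from a limited-dataset-style extract (invented values).
patients = [
    {"person_id": 1, "zip": "02139"},
    {"person_id": 2, "zip": "27701"},
    {"person_id": 3, "zip": "99999"},
]

# Hypothetical external area-level dataset keyed by zip code.
area_data = {
    "02139": {"area_index": 0.42},
    "27701": {"area_index": 0.77},
}


def link_by_zip(patient_rows, area_lookup):
    """Left-join patient rows to area-level data on zip code.

    Unmatched patients are kept and flagged with area_matched=False.
    """
    linked = []
    for row in patient_rows:
        area = area_lookup.get(row["zip"])  # None when the zip code is absent
        merged = dict(row)
        merged["area_matched"] = area is not None
        if area is not None:
            merged.update(area)
        linked.append(merged)
    return linked


linked = link_by_zip(patients, area_data)
```

Keeping unmatched patients, rather than dropping them, makes it easy to check how much of the cohort the external dataset actually covers before analysis.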