Dev Environment and Google BigQuery Database Access Setup

Primary tech stack includes:

  • Python (scipy, pandas, scikit-learn, matplotlib)
  • R for certain statistical analyses (e.g., propensity score matching)
  • Google BigQuery database (mostly replacing local PostgreSQL 9.6 instances)
  • HTML/JavaScript for web applications
  • Git/GitHub for source control, issue and project tracking, and wiki notes

Basic dev skills recommended

Windows dev environment notes

Google BigQuery Database Access

DevWorkshop

See workshop tutorial notes for guidance

See also the workshop tutorial on practicing clinical data SQL queries

Access Control

There are multiple tiers of access security:

  • Publicly available Google Cloud Platform / BigQuery projects that anyone can be granted access to
  • Access restricted to those with a Stanford.edu account (still not secure enough for PHI / high-risk data)
  • Access restricted to those connected to the Stanford VPN with the respective access controls (This last tier is supported by Nero / Stanford Research Computing Center and is deemed acceptable for high-risk data, i.e., protected health information, whereas the former tiers are not.)

Non-PHI / Non-High Risk Data

The Google Cloud Platform project mining-clinical-decisions is a Stanford.edu-restricted project for the group.

PHI / High Risk Data

The Google Cloud Platform project som-nero-phi-jonc101 is a Stanford Nero-PHI compliant, restricted project for the group.

API and Script / Programmatic Access

Setting up GCP Virtual Machine Environment

  1. Navigate to GCP -> Compute Engine -> VM Instances.
  2. Click "Create Instance" and enter the necessary specifications (e.g., required compute power).
  3. Click "Start VM"; an "SSH" button on the side copies the CLI command for your local machine (requires the Stanford Full Tunnel VPN).
  4. Use the authentication instructions below for GCP, including creating the JSON key for logging into a specific project.
  5. Install RStudio Server or whatever other language/software you need.

Authenticating

Source: https://cloud.google.com/bigquery/docs/authentication/

Google Cloud SDK (requires Python): https://cloud.google.com/sdk/install

  • Install, then run:

    gcloud init

You need to create an authentication key JSON file:

gcloud auth application-default login

This should prompt a web login and then create a local JSON file that can be used to simulate login via your stanford.edu account to a specific project. (The key appears to be tied to one user login and one GCP project.)

Next, set an environment variable so your code knows where to find this key file.

  • Mac/Linux: export GOOGLE_APPLICATION_CREDENTIALS="[PathToKeyFile]"
  • Windows (cmd): set GOOGLE_APPLICATION_CREDENTIALS=[PathToKeyFile] (Alternatively, Python scripts in the medinfo code tree will likely import LocalEnv.py, which can set it.)
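
For a quick check that the key file and environment variable are picked up, here is a minimal Python sketch (it requires the google-cloud-bigquery module installed below; the key file path is a placeholder, and the project ID is the group's non-PHI project mentioned above):

    # Minimal sketch: point the BigQuery client at the JSON key created above.
    # The key file path is a placeholder -- substitute your own.
    import os
    from google.cloud import bigquery

    # Equivalent to the export/set commands above, but only for this process.
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/key-file.json"

    # The client picks up the credentials from the environment variable.
    client = bigquery.Client(project="mining-clinical-decisions")
    print(client.project)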

BigQuery Python API reference: https://googleapis.github.io/google-cloud-python/latest/index.html

Install Python modules for connection

pip install --upgrade google-cloud-bigquery
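
After installing, a small query makes a useful smoke test that the module and credentials work. The sketch below queries a Google public dataset (so it does not depend on any group project data) and assumes the mining-clinical-decisions project above is used as the billing project:

    # Smoke test sketch: run a tiny query against a Google public dataset.
    from google.cloud import bigquery

    client = bigquery.Client(project="mining-clinical-decisions")  # billing project

    sql = """
        SELECT name, SUM(number) AS total
        FROM `bigquery-public-data.usa_names.usa_1910_2013`
        WHERE state = 'CA'
        GROUP BY name
        ORDER BY total DESC
        LIMIT 5
    """

    # client.query() starts the job; .result() blocks until it completes.
    for row in client.query(sql).result():
        print(row.name, row.total)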

STARR Electronic Health Record Database Notes

STRIDE (Inpatient)

Much of our research is based on the Stanford Clinical Data Warehouse (aka STARR, formerly STRIDE). Our primary data source within STARR is the Epic "Clarity" database, from which STARR has provided us data for all Stanford-affiliated patients from 2008 onward. For more information about the STARR data warehouse, see this overview. You can look online for a data dictionary, though ultimately much of it requires browsing through the data tables and reconstructing what everything means.

As of 2019, we have access to a version of the database that includes both inpatient and outpatient cases (essentially the "STARR Tahoe Datalake"), accessible via a Google BigQuery interface (functionally it looks like querying a SQL database). See the Data Dictionary.
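
Since much of the work is browsing tables to reconstruct what they mean, it can help to list what is available programmatically. The sketch below assumes the som-nero-phi-jonc101 project; the dataset name is a hypothetical placeholder (check the GCP console or the data dictionary for the real names), and whether order_proc (discussed in the "gotchas" below) lives in that dataset is likewise an assumption:

    # Sketch: browse datasets, tables, and one table schema in the project.
    from google.cloud import bigquery

    client = bigquery.Client(project="som-nero-phi-jonc101")

    # List the datasets visible in this project.
    for dataset in client.list_datasets():
        print(dataset.dataset_id)

    # List tables in one dataset and inspect one table's schema.
    dataset_id = "starr_datalake"  # hypothetical name -- replace with a real dataset ID
    for table in client.list_tables(dataset_id):
        print(table.table_id)

    table = client.get_table(f"{dataset_id}.order_proc")  # table name per the notes below
    for field in table.schema:
        print(field.name, field.field_type)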

If you are not familiar with SQL, please refer to the following resources.

Data "gotchas"

As with any data set, STRIDE can be misleading. In particular, modeling the complexities of inpatient hospital operations in a set of SQL tables inevitably results in information loss. As a result, simple questions may have surprising (i.e., likely incorrect) answers simply by virtue of quirks in the data collection process.

That being said, we still need to do research. Therefore, our approach is simply to try to understand the limitations in our data set. Here is a (non-exhaustive) list of known limitations.

  • Incomplete data at time boundaries: While we describe the database as covering the years "2008 – ", we know that coverage is incomplete at the boundaries. For example, we have data for some patients in 2008 but not all, likely because Epic was rolled out slowly to the various departments in the hospital, so there is a period during which only a fraction of patient encounters yielded EMR data.
  • Inconsistent mortality rates: The mortality rate declines precipitously from ~2000 per year to ~1400 between 2011 and 2012 (without a corresponding drop in admissions). We suspect that this is caused by changes in how deaths are reported. In particular, the hospital cannot track deaths occurring soon after discharge, and instead relies on the Social Security Death Index (SSDI) to retroactively update its records. We suspect that a change in the SSDI's recording practices/efficacy affected STRIDE's records downstream.
  • Shifting diagnosis frequency: The top 10 admit and problem list diagnoses shift over the years. While in theory it's possible that these shifts are explained by changes in the patient population, we believe it's more likely the result of changes in medical coding patterns (though it's very difficult to disentangle the two). In particular, diagnoses shift from less specific diagnoses (e.g. general symptoms, abdominal pain, chest symptoms) to more specific ones (e.g. shortness of breath, dyspnea, fever). In addition, after 2014 there is a spike in admissions for chemotherapy and immunotherapy. Similarly, kidney failure starts to dominate problem lists after 2014. We believe part of this change in coding patterns may be driven by the shift from ICD9 to ICD10 codes.
  • Inconsistent proc_id:proc_code mapping: order_proc models all procedures ordered at the hospital. Each unique procedure type is defined by a proc_id:proc_code pair. However, this mapping changes in 2014 (e.g. proc_code LABA1C maps to proc_id 472505 in 2010 and 1036 in 2016).
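
One way to see the proc_id:proc_code drift for yourself is to tabulate the mapping by year. The sketch below is illustrative only: the dataset name and the order timestamp column are assumptions (check the data dictionary for the real names), and the LABA1C code comes from the example in the note above.

    # Sketch: tabulate proc_id per proc_code per year to spot mapping drift.
    from google.cloud import bigquery

    client = bigquery.Client(project="som-nero-phi-jonc101")

    sql = """
        SELECT
          EXTRACT(YEAR FROM order_time) AS order_year,  -- column name is an assumption
          proc_code,
          proc_id,
          COUNT(*) AS n_orders
        FROM `som-nero-phi-jonc101.starr_datalake.order_proc`  -- dataset name is a placeholder
        WHERE proc_code = 'LABA1C'
        GROUP BY order_year, proc_code, proc_id
        ORDER BY order_year
    """

    for row in client.query(sql).result():
        print(row.order_year, row.proc_code, row.proc_id, row.n_orders)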

How do I query that?

For examples of how to answer specific clinical research questions, see the SQL Dev Workshop, which walks through directed examples of querying information out of the database.