Dev Environment and Google BigQuery Database Access Setup

Primary tech stack includes:

  • Python (scipy, pandas, scikit-learn, matplotlib)
  • R for certain statistical analyses (e.g., propensity score matching)
  • Google BigQuery database (mostly replacing local PostgreSQL 9.6 instances)
  • HTML/JavaScript for web applications
  • Git/GitHub for source control, issue and project tracking, and wiki notes

Basic dev skills recommended

Windows dev environment notes

Google BigQuery Database Access

DevWorkshop

See workshop tutorial notes for guidance

See also the workshop tutorial on practicing clinical data SQL queries

Access Control

There are multiple tiers of access security:

  • Publicly available Google Cloud Platform / BigQuery projects that anyone can be granted access to
  • Access restricted to those with a Stanford.edu account (still not secure enough for PHI / high-risk data)
  • Access restricted to those connected to the Stanford VPN with the respective access controls (This last tier is supported by Nero / Stanford Research Computing Center and is deemed acceptable for high-risk data, i.e., protected health information, whereas the former tiers are not.)

Non-PHI / Non-High Risk Data

The Google Cloud Platform project mining-clinical-decisions is a Stanford.edu-restricted project for the group.

PHI / High Risk Data

The Google Cloud Platform project som-nero-phi-jonc101 is a Stanford Nero-PHI compliant, restricted project for the group.

API and Script / Programmatic Access

Setting up GCP Virtual Machine Environment

  1. Navigate to GCP -> Compute Engine -> VM Instances.
  2. Click "Create Instance" and enter the necessary specifications (e.g., required compute power).
  3. Click "Start VM"; an "SSH" button on the side copies the CLI command for your local machine (requires the Stanford Full Tunnel VPN).
  4. Use the authentication instructions below for GCP, including creating the JSON key for logging into a specific project.
  5. Install RStudio Server or whatever other language/software you need.

Authenticating

Source: https://cloud.google.com/bigquery/docs/authentication/

Google Cloud SDK (requires Python): https://cloud.google.com/sdk/install

  • Install, then run:

    gcloud init

You need to create an authentication key JSON file:

gcloud auth application-default login

This should prompt a web login and then create a local JSON file that can be used to simulate login via your stanford.edu account to a specific project. (The key appears to be tied to one user login and one GCP project.)

Next, set an environment variable so your code knows where to find this key file.

  • Mac/Linux: export GOOGLE_APPLICATION_CREDENTIALS="[PathToKeyFile]"
  • Windows (cmd): set GOOGLE_APPLICATION_CREDENTIALS=[PathToKeyFile] (Alternatively, Python scripts in the medinfo code tree will likely import LocalEnv.py, which can set it.)
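
For a quick check that the key file and environment variable are picked up, here is a minimal Python sketch (it requires the google-cloud-bigquery module installed below; the key file path is a placeholder, and the project ID is the group's non-PHI project mentioned above):

    # Minimal sketch: point the BigQuery client at the JSON key created above.
    # The key file path is a placeholder -- substitute your own.
    import os
    from google.cloud import bigquery

    # Equivalent to the export/set commands above, but only for this process.
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/key-file.json"

    # The client picks up the credentials from the environment variable.
    client = bigquery.Client(project="mining-clinical-decisions")
    print(client.project)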

BigQuery Python API reference: https://googleapis.github.io/google-cloud-python/latest/index.html

Install Python modules for connection

pip install --upgrade google-cloud-bigquery
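
After installing, a small query makes a useful smoke test that the module and credentials work. The sketch below queries a Google public dataset (so it does not depend on any group project data) and assumes the mining-clinical-decisions project above is used as the billing project:

    # Smoke test sketch: run a tiny query against a Google public dataset.
    from google.cloud import bigquery

    client = bigquery.Client(project="mining-clinical-decisions")  # billing project

    sql = """
        SELECT name, SUM(number) AS total
        FROM `bigquery-public-data.usa_names.usa_1910_2013`
        WHERE state = 'CA'
        GROUP BY name
        ORDER BY total DESC
        LIMIT 5
    """

    # client.query() starts the job; .result() blocks until it completes.
    for row in client.query(sql).result():
        print(row.name, row.total)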

STARR Electronic Health Record Database Notes

STRIDE (Inpatient)

Much of our research is based on the Stanford Clinical Data Warehouse (aka STARR, formerly STRIDE). Our primary data source within STARR is the Epic "Clarity" database, from which STARR has provided us data for all Stanford-affiliated patients from 2008 onward. For more information about the STARR data warehouse, see this overview. You can look online for a data dictionary, though ultimately much of it requires browsing through the data tables and reconstructing what everything means.

As of 2019, we have access to a version of the database that includes both inpatient and outpatient cases (essentially the "STARR Tahoe Datalake"), accessible via a Google BigQuery interface (functionally it looks like querying a SQL database). See the Data Dictionary.
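
Since much of the work is browsing tables to reconstruct what they mean, it can help to list what is available programmatically. The sketch below assumes the som-nero-phi-jonc101 project; the dataset name is a hypothetical placeholder (check the GCP console or the data dictionary for the real names), and whether order_proc (discussed in the "gotchas" below) lives in that dataset is likewise an assumption:

    # Sketch: browse datasets, tables, and one table schema in the project.
    from google.cloud import bigquery

    client = bigquery.Client(project="som-nero-phi-jonc101")

    # List the datasets visible in this project.
    for dataset in client.list_datasets():
        print(dataset.dataset_id)

    # List tables in one dataset and inspect one table's schema.
    dataset_id = "starr_datalake"  # hypothetical name -- replace with a real dataset ID
    for table in client.list_tables(dataset_id):
        print(table.table_id)

    table = client.get_table(f"{dataset_id}.order_proc")  # table name per the notes below
    for field in table.schema:
        print(field.name, field.field_type)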

If you are not familiar with SQL, please refer to the following resources.

Data "gotchas"

As with any data set, STRIDE can be misleading. In particular, modeling the complexities of inpatient hospital operations in a set of SQL tables inevitably results in information loss. As a result, simple questions may have surprising (i.e., likely incorrect) answers simply by virtue of quirks in the data collection process.

That being said, we still need to do research. Therefore, our approach is simply to try to understand the limitations in our data set. Here is a (non-exhaustive) list of known limitations.

  • Incomplete data at time boundaries: While we describe the database as covering the years "2008 – ", we know that coverage is incomplete at the boundaries. For example, we have data for some patients in 2008 but not all, likely because Epic was rolled out slowly to the various departments in the hospital, so there is a period during which only a fraction of patient encounters yielded EMR data.
  • Inconsistent mortality rates: The mortality rate declines precipitously from ~2000 per year to ~1400 between 2011 and 2012 (without a corresponding drop in admissions). We suspect that this is caused by changes in how deaths are reported. In particular, the hospital cannot track deaths occurring soon after discharge, and instead relies on the Social Security Death Index (SSDI) to retroactively update its records. We suspect that a change in the SSDI's recording practices/efficacy affected STRIDE's records downstream.
  • Shifting diagnosis frequency: The top 10 admit and problem list diagnoses shift over the years. While in theory it's possible that these shifts are explained by changes in the patient population, we believe it's more likely the result of changes in medical coding patterns (though it's very difficult to disentangle the two). In particular, diagnoses shift from less specific diagnoses (e.g. general symptoms, abdominal pain, chest symptoms) to more specific ones (e.g. shortness of breath, dyspnea, fever). In addition, after 2014 there is a spike in admissions for chemotherapy and immunotherapy. Similarly, kidney failure starts to dominate problem lists after 2014. We believe part of this change in coding patterns may be driven by the shift from ICD9 to ICD10 codes.
  • Inconsistent proc_id:proc_code mapping: order_proc models all procedures ordered at the hospital. Each unique procedure type is defined by a proc_id:proc_code pair. However, this mapping changes in 2014 (e.g. proc_code LABA1C maps to proc_id 472505 in 2010 and 1036 in 2016).
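
One way to see the proc_id:proc_code drift for yourself is to tabulate the mapping by year. The sketch below is illustrative only: the dataset name and the order timestamp column are assumptions (check the data dictionary for the real names), and the LABA1C code comes from the example in the note above.

    # Sketch: tabulate proc_id per proc_code per year to spot mapping drift.
    from google.cloud import bigquery

    client = bigquery.Client(project="som-nero-phi-jonc101")

    sql = """
        SELECT
          EXTRACT(YEAR FROM order_time) AS order_year,  -- column name is an assumption
          proc_code,
          proc_id,
          COUNT(*) AS n_orders
        FROM `som-nero-phi-jonc101.starr_datalake.order_proc`  -- dataset name is a placeholder
        WHERE proc_code = 'LABA1C'
        GROUP BY order_year, proc_code, proc_id
        ORDER BY order_year
    """

    for row in client.query(sql).result():
        print(row.order_year, row.proc_code, row.proc_id, row.n_orders)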

How do I query that?

For examples of how to answer specific clinical research questions, see the SQL Dev Workshop, which walks through directed examples of querying information out of the database.