Home - gpcnetwork/grouse-cms GitHub Wiki

Welcome to the GROUSE wiki!

Please review the main README page for the following essential information about GROUSE:

The figure below illustrates the system architecture of the GROUSE environment, which is composed of a data lake, a data warehouse, and analytic workbenches.

[Figure: grouse-architecture]

  • A) Data Lake: data (including data from GDIT physical media) are loaded into secure S3 buckets via Secure Shell File Transfer Protocol (SFTP) or the Transport Layer Security (TLS) 1.2 protocol.
  • B) Data Warehouse: data are extracted and loaded into Snowflake, where they are transformed into the PCORnet Common Data Model and de-identified.
  • C) Analytic Workbench: to minimize the burden on researchers of learning to navigate the cloud environment, we adopted an AWS solution, Service Workbench, in which approved users can deploy, on a self-service basis, either Windows or Linux analytic “workspaces” with a choice of analytical applications (e.g., R, Python, SAS) and computing power suited to their needs. From each analytical “workspace”, a dedicated connection can be created to the backend GROUSE database, where researchers have full visibility into multiple schemas and can choose to query either the original CMS schema or a transformed CDM schema.

Key Stakeholders

  • Principal Investigator: Dr. Lemuel R. Waitman (DUA Custodian)
  • Research Lead: Dr. Xing Song (Lead Faculty, Point of Contact)
  • Technical Lead: Shaun Ferguson (Lead DevOps)
  • Security Lead: Ernest Anye (Senior InfoSec)

GROUSE Roles and Responsibilities

We define three primary roles, distinguished by the key separation factor of "accessibility to raw CMS data":

  • Role A (Data Provider): Project staff designated in Role A generate the finder file sent to GDIT and the EMR datasets sent to MU; they will not be granted access to CMS data.
  • Role B (Administrator): Project staff designated in Role B will function as GROUSE database administrators with full access to raw CMS data and user accounts, upon completion of extensive training.
  • Role C (Analyst): Project staff designated in Role C will be granted access to part of the CMS data, or to a limited and de-identified version of it, in accordance with the CMS DUA, upon completion of the necessary training.
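
The role separation above can be sketched as a small lookup table. This is only an illustrative model, not how GROUSE enforces access: the names `CmsAccess`, `Role`, `ROLES`, and `can_read_raw_cms` are hypothetical, and the sketch assumes that training must be complete before any raw-data access is granted.

```python
from dataclasses import dataclass
from enum import Enum


class CmsAccess(Enum):
    """Level of access to CMS data (labels are illustrative, not official)."""
    NONE = "no access to CMS data"
    PARTIAL_OR_DEIDENTIFIED = "part of CMS data, or a limited/de-identified version"
    FULL_RAW = "full access to raw CMS data"


@dataclass(frozen=True)
class Role:
    code: str
    title: str
    cms_access: CmsAccess


# Hypothetical encoding of the three roles described in the list above.
ROLES = {
    "A": Role("A", "Data Provider", CmsAccess.NONE),
    "B": Role("B", "Administrator", CmsAccess.FULL_RAW),
    "C": Role("C", "Analyst", CmsAccess.PARTIAL_OR_DEIDENTIFIED),
}


def can_read_raw_cms(role_code: str, training_complete: bool) -> bool:
    """Return True only for a full-raw-access role whose training is complete.

    Assumption: training is a precondition for access, as the role
    descriptions suggest ("upon completion of ... training").
    """
    role = ROLES[role_code]
    return training_complete and role.cms_access is CmsAccess.FULL_RAW
```

For example, `can_read_raw_cms("B", training_complete=True)` is true, while the same call for Role A or Role C is false regardless of training status.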

User Manuals

We have put together a set of reference and training materials, organized into the following wiki pages:

Technical Documents

We will continue to publish technical documents that demystify data provenance and lineage (e.g., the extract, load, transformation, linkage, and de-identification processes) to better serve research purposes:

Frequent Q&A

  • Source Data Catalog: You may be given access to only the subset of the listed data schemas that has been approved (DROC) for use in the study, following the "minimally necessary dataset" principle.
  • Other Frequent Q&A
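
As a rough illustration of the "minimally necessary dataset" principle mentioned under Source Data Catalog, the sketch below narrows a catalog of schemas down to the subset approved for one study. The function name `visible_schemas` and all schema names are hypothetical placeholders, not actual GROUSE schema names, and the real approval process is not modeled here.

```python
from typing import Iterable, List


def visible_schemas(catalog: Iterable[str], approved: Iterable[str]) -> List[str]:
    """Return the approved schemas that exist in the catalog, sorted by name.

    Raises ValueError if an approved name is not in the catalog, so that a
    typo in an approval list is caught rather than silently ignored.
    """
    catalog_set = set(catalog)
    approved_set = set(approved)
    unknown = approved_set - catalog_set
    if unknown:
        raise ValueError(f"approved schemas not in catalog: {sorted(unknown)}")
    return sorted(catalog_set & approved_set)


# Hypothetical catalog; placeholder names only.
CATALOG = ["CMS_SCHEMA_EXAMPLE", "CDM_SCHEMA_EXAMPLE", "LINKAGE_SCHEMA_EXAMPLE"]

# A study approved for a single schema sees only that schema.
study_view = visible_schemas(CATALOG, ["CDM_SCHEMA_EXAMPLE"])
```

Raising on unknown names is a design choice for this sketch: it keeps the visible set strictly within what was both listed and approved.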