Home - gpcnetwork/grouse-cms GitHub Wiki
Welcome to the GROUSE wiki!
Please review the main README page for essential information about GROUSE.
The figure below illustrates the system architecture of the GROUSE environment, which is composed of a data lake, a data warehouse, and analytic workbenches.
- A) Data Lake: data (including GDIT physical media) are loaded into secure S3 buckets via Secure Shell File Transfer Protocol (SFTP) or Transport Layer Security (TLS) 1.2.
- B) Data Warehouse: data are extracted and loaded into Snowflake for transformation into the PCORnet Common Data Model (CDM) and de-identification.
- C) Analytic Workbench: to minimize the burden on researchers of learning to navigate the cloud environment, we adopted an AWS solution, Service Workbench, in which approved users can self-service deploy Windows or Linux analytic “workspaces” with a choice of analytical applications (e.g., R, Python, SAS) and computing power based on their needs. From each analytic “workspace”, a dedicated connection can be created to the backend GROUSE database, where researchers have full visibility into multiple schemas and can choose to query either the original CMS schema or a transformed CDM schema.
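As a rough illustration of the choice between the original CMS schema and the transformed CDM schema, the sketch below builds a fully qualified table name for whichever layout a researcher picks. The database name, both schema names, and the helper functions are hypothetical placeholders, not the actual GROUSE identifiers; the real names are listed in the Source Data Catalog. The sketch only builds SQL text and does not open a database connection.

```python
import re

# Hypothetical schema names for illustration only; consult the
# Source Data Catalog for the schemas you have been approved to use.
SCHEMAS = {
    "cms": "CMS_SOURCE_SCHEMA",   # original CMS layout (placeholder name)
    "cdm": "PCORNET_CDM_SCHEMA",  # transformed PCORnet CDM layout (placeholder name)
}

# Accept only plain unquoted SQL identifiers so that user input
# cannot smuggle extra SQL into the generated statement.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_$]*")


def qualified_table(database: str, schema_key: str, table: str) -> str:
    """Return DATABASE.SCHEMA.TABLE for the chosen layout ('cms' or 'cdm')."""
    if schema_key not in SCHEMAS:
        raise ValueError(f"unknown schema key: {schema_key!r}")
    for name in (database, table):
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"not a plain identifier: {name!r}")
    return f"{database}.{SCHEMAS[schema_key]}.{table}"


def count_query(database: str, schema_key: str, table: str) -> str:
    """Build a row-count query against the chosen schema."""
    return f"SELECT COUNT(*) AS n FROM {qualified_table(database, schema_key, table)}"


if __name__ == "__main__":
    # ENCOUNTER is a PCORnet CDM table; GROUSE_DB is a made-up database name.
    print(count_query("GROUSE_DB", "cdm", "ENCOUNTER"))
```

Inside a workspace, the resulting string would be handed to whichever Snowflake client is installed there; the connection details depend on the workspace configuration and are not shown here.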
Key Stakeholders
- Principal Investigator: Dr. Lemuel R. Waitman (DUA Custodian)
- Research Lead: Dr. Xing Song (Lead Faculty, Point of Contact)
- Technical Lead: Shaun Ferguson (Lead DevOps)
- Security Lead: Ernest Anye (Senior InfoSec)
GROUSE Roles and Responsibilities
We define three primary roles, distinguished by the key separation factor of "accessibility to raw CMS data":
- Role A (Data Provider): Project staff designated in Role A generate the finder file sent to GDIT and the EMR datasets sent to MU, and will not be granted access to CMS data.
- Role B (Administrator): Project staff designated in Role B will function as GROUSE database administrators with full access to raw CMS data and user accounts, upon completion of extensive training.
- Role C (Analyst): Project staff designated in Role C will be granted access to part of the CMS data, or to a limited and de-identified version of it, in accordance with the CMS DUA, upon completion of the necessary training.
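The three roles above can be summarized as a small lookup table. This is only a sketch of the separation described in the list; the enum names, access levels, and helper functions are illustrative assumptions and do not represent how GROUSE actually provisions accounts or enforces permissions.

```python
from enum import Enum


class CmsAccess(Enum):
    """Illustrative access levels, derived from the role descriptions."""
    NONE = "no access to CMS data"
    PARTIAL = "part of the CMS data, or a limited and de-identified version"
    FULL = "full access to raw CMS data and user accounts"


class Role(Enum):
    A = "Data Provider"
    B = "Administrator"
    C = "Analyst"


# Hypothetical mapping that mirrors the role list; actual grants are
# governed by the CMS DUA and completion of the required training.
ROLE_ACCESS = {
    Role.A: CmsAccess.NONE,
    Role.B: CmsAccess.FULL,
    Role.C: CmsAccess.PARTIAL,
}


def has_cms_access(role: Role) -> bool:
    """True if the role can see any CMS data at all."""
    return ROLE_ACCESS[role] is not CmsAccess.NONE


def has_full_raw_access(role: Role) -> bool:
    """True only for the role with full access to raw CMS data."""
    return ROLE_ACCESS[role] is CmsAccess.FULL


if __name__ == "__main__":
    for role in Role:
        print(f"Role {role.name} ({role.value}): {ROLE_ACCESS[role].value}")
```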
User Manuals
We have put together a set of reference and training materials, organized into the following wiki pages:
Technical Documents
We will continue to publish technical documents that demystify data provenance and lineage (e.g., the extract, load, transformation, linkage, and de-identification processes) to better serve research purposes:
Frequent Q&A
- Source Data Catalog: You may be given access to only the subset of the listed data schemas that has been approved (DROC) for use in your study, following the "minimally necessary dataset" principle.
- Other Frequent Q&A