Service Workbench User Manual - gpcnetwork/grouse-cms GitHub Wiki

What is Service Workbench (SWB)

Service Workbench is a self-service, web-based analytical platform that provides a collaborative research and data democratization solution for researchers. Researchers in an organization can access the portal, quickly find the data they are interested in, and, with a few clicks, analyze that data using their favorite analytical tools as well as machine learning services such as Amazon SageMaker notebooks. The platform can also be used to manage and facilitate virtual classrooms.


Account Initialization

Upon approval of your account request, an AWS user account will be created for you. A verification email will be sent to the registered email address for account initialization, which includes a) verifying your email address, b) setting up a password, and c) setting up multi-factor authentication (MFA) on your chosen device. Once the initial setup is complete, go to the AWS SWB login portal:

aws.nextgenbmi.umsystem.edu

You should see the following landing page; log into the Service Workbench portal with your AWS credentials and MFA. Make sure to bookmark the above URL for recurring visits.

landing-page

Note: you will be redirected to the AWS SSO log-in page to sign into your Service Workbench page. If you are not redirected, that usually means your current AWS session has not yet expired.

After successfully logging in, the home page of your AWS SWB space looks like the following, with 4 sub-sections found on the sidebar:

swb-home-page

  • Dashboard: default home page showing your computational spending over the past 30 days
  • SSH Keys: SSH key generation for Linux workspace log-in
  • Studies: navigate to create or access a My Study, Organization Study, or Open Data study
  • Workspaces: navigate to create stand-alone workspaces or access existing workspaces

Studies and Workspaces

What is a Study?

A Study is a mechanism for creating a shared and persistent storage space (i.e., an S3 bucket) accessible by multiple workspaces and/or multiple users. You may skip the study creation step and start a stand-alone workspace, but note that a stand-alone workspace will only have non-persistent (i.e., ephemeral) storage, which cannot be recovered if your workspace gets terminated, whether on purpose or due to a system issue. There are two types of study: My Study and Organization Study.

My Study creates an individual study data storage space which is only visible to its creator. Follow the steps shown in the figure below to create a My Study.

my-study

Organization Study creates a shared study data storage space accessible by multiple approved users for collaboration. Most of the time, SWB administrators will have pre-created an Organization Study for you based on your request. However, you may still create your own organization study if needed, which may require additional parameters as shown below:

org-study

There are three pre-defined roles that can be used to manage an organization study: “admin”, “read-write”, and “read-only”. Please note that only existing and active SWB users can be added from a pre-populated list. If someone you want to add cannot be found, please email [email protected] for support.

org-study-role

What is a Workspace?

A Workspace is a virtual machine or computing instance, where you can deploy a Linux, Windows, or SageMaker instance and use your favorite analytic tool to analyze the data. There are two ways to create your workspace: a) create a study-associated workspace (highly recommended): associating a workspace with a study enables the creation of a shared and persistent storage space; or b) create a stand-alone workspace: such a workspace will not have persistent storage, and you risk losing all of your data if the workspace gets terminated for any reason.

We have pre-configured the following 5 templated workspace types for research use: 3 standard workspaces for general use and 2 specialized workspaces with licensed software:

workspace-type-overview

  • Standard workspace types include:
    • Windows Virtual Machine ("EC2-Windows")
    • Web-based Jupyter Notebook ("AWS-Sagemaker-Notebook")
    • Linux Virtual Machine ("EC2-Linux")
  • Specialized workspace types include:
    • Windows Virtual Machine with SAS 9.4 (enterprise license) preinstalled ("EC2-Special-SAS")
    • Windows Virtual Machine with CLAMP Software preinstalled ("EC2-Special-CLAMP")

A complete list of configuration details and software installed for each workspace type can be found in the Service Workbench Workspace Types wiki page.

Launch a Workspace

There are currently three types of workspace that can be launched for any study. For each workspace type, we have provisioned multiple instance types with different levels of memory and computational power. It is always highly recommended to deploy a workspace under a study for persistent storage:

workspace1

workspace2


Launch a Windows Workspace

Windows workspaces are deployed using the remote desktop protocol (RDP) under Service Workbench. If you are using a Windows machine, RDP comes with the operating system. If your operating system (OS) is macOS, you will need to install an RDP client.

Workspace Parameters

Once a windows workspace has been successfully provisioned for your study, you will be provided with RDP launching parameters as follows:

workspace-windows-param

Remote Desktop Connection

If you are a Windows user, type “RDP” in the search box on the taskbar and open a new Remote Desktop Connection session as shown in the figures below:

workspace-windows-rdp1

workspace-windows-rdp2

It takes around 20 minutes for the remote desktop to launch. When you first launch the remote desktop session, it will ask a network question, “Do you want to allow your PC to be discoverable by other PCs and devices on this network?” - choose “Yes” to make it discoverable, so that you will be able to log back in next time.

workspace-windows-discoverable


Launch a SageMaker Notebook

The Python workspace is deployed as a Jupyter notebook fully managed by the AWS SageMaker service. For more information, see the SageMaker Getting Started Guide.


Launch a Linux Workspace

Linux CLI

Please refer to the AWS documentation for full instructions on launching a Linux instance and connecting via SSH with the baseline command line interface (CLI).

Linux GUI - MobaXterm

Assumptions: these instructions assume you have not created a key pair yet.

  1. Download and Install MobaXterm - University of Missouri users should use this link to download the educationally licensed MobaXterm installer. Other users can use the private version.

When downloading from the SharePoint link, download both files individually by right-clicking each file and selecting Download.

mobaxterm

Make sure both files have fully downloaded and are in the same location, then double-click the .msi to start the installation wizard. Use the default settings for the installation.

mobaxterm-install

  2. Configure MobaXterm User Environment - Once installation is complete, launch MobaXterm to have it set up your user environment, and click “Start Local Terminal.”

mobaxterm-start-terminal

Enter the following bash commands to create your SSH key folder for use with these instructions:

mkdir .ssh
chmod 700 .ssh
  3. Connect to SWB Linux workspace - Confirm the <path-to-home>\.ssh folder used by MobaXterm. The path is likely to be similar to the following:

C:\Users\(Your Username)\(OneDrive-University of Missouri)\Documents\MobaXterm\home\.ssh

But if you are uncertain, you may also navigate to MobaXterm -> Settings -> General -> Persistent home directory to confirm the path to the MobaXterm default home directory:

mobaxterm-dir

Go to Service Workbench (SWB), navigate to the Linux workspace you launched, and click the Connections button. Then click Create Key as indicated, and follow the prompts. Download your private key and, if prompted for a location, navigate and save to the <path-to-home>\.ssh folder to make it accessible to MobaXterm. If not prompted, open your Downloads folder on your local PC, and cut and paste the <keyname>.pem file you downloaded into the <path-to-home>\.ssh folder. If you copy it anywhere else, make sure to store this private key file only in a location you know to be secure, preferably encrypted!

swb-linux-connect

Return to SWB and click Done in the download prompt, then click Use This SSH Key next to the key you just created. Retrieve the public-host information by clicking the Copy icon to the right of the first Host (public) line in the directions that appear, for later use in these instructions. Note that you only have 60 seconds to complete the connection:

swb-linux-ssh

Return to your MobaXterm local terminal, and first enter the following command to secure your private key and enable its use from within MobaXterm:

chmod 600 .ssh/*

Now you are ready to connect! Make sure the workspace connection timer hasn't expired in Service Workbench (renew it if it has), then enter the following command in the MobaXterm local terminal, pasting the Host you copied above where indicated:

ssh -CY -i "[path-to-ssh-key].pem" ec2-user@[public-host]

Note that [path-to-ssh-key] defaults to /home/mobaxterm/.ssh/(keyname), but can differ if you saved the SSH key in a different directory. This command securely connects you to your AWS Linux workspace over an SSH tunnel with X11 forwarding configured, so any graphical applications you run will display in your local MobaXterm environment.
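If ssh later refuses the key with a "permissions are too open" error, you can sanity-check the file mode programmatically. A minimal Python sketch (the helper name is ours; on Windows, POSIX modes are emulated by MobaXterm, so the chmod commands in the local terminal remain authoritative):

```python
import os
import stat

def key_is_private(path: str) -> bool:
    """Return True if the key file is readable/writable by its owner only (mode 600),
    which is what ssh expects for a private key."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode == 0o600

# Example (placeholder path): key_is_private("/home/mobaxterm/.ssh/mykey.pem")
```

If the check returns False, re-run `chmod 600 .ssh/*` from the MobaXterm local terminal.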


Allowlist IP Addresses

If you want to work on the same workspace from a different location (and thus a different IP address), you will need to allowlist your new IP address using the Edit CIDRs option. To minimize risk, we highly recommend that you always clear out old IP addresses and add your current one before launching any workspace.

add-ip

!!Please follow best practice and only access your workspace from “Domain networks” (such as a workplace network) or “Private networks” (such as your home network)!!

Stop a Workspace

Go to the Workspaces page from the sidebar and stop unused workspaces to minimize costs. You can always restart a workspace, which takes less than a minute.

stop-workspace

Terminate a Workspace

Go to the Workspaces page from the sidebar and terminate any workspace that you will never use again or want to destroy. Please note that a terminated workspace cannot be recovered, except for data saved on the D: (data) drive when the workspace was created with an association to a study.

terminate-workspace

Study Data Sync Verification

For workspaces that are "fully synced", a background application keeps the data on the workspace study drive backed up to the corresponding study S3 bucket/folder. For very large files this can take an extended time, and a sync may be interrupted by a workspace shutdown before the disk-to-bucket sync completes.

To compare and verify study data between the cloud bucket and a workspace sync drive:

  1. Within your Service Workbench workspace, log in to https://aws.nextgenbmi.umsystem.edu
  2. Navigate to your study entry under the Studies sidebar entry
  3. Expand the Files list under your study, to review the file entries and sizes, and compare them to the files under the study drive (typically D:)
  4. If there are any discrepancies, you can use the Upload Files button within the SWB web interface to manually back up any files that have not synchronized.
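The comparison in step 3 can also be sketched programmatically. A minimal Python sketch, assuming you have already exported both listings as name-to-size mappings (the function name, file names, and sizes here are purely illustrative):

```python
def find_unsynced(drive_files: dict, bucket_files: dict) -> list:
    """Return files present on the study drive whose size differs from (or is
    missing in) the bucket listing -- candidates for a manual Upload Files."""
    return sorted(
        name for name, size in drive_files.items()
        if bucket_files.get(name) != size
    )

# Illustrative listings: file name -> size in bytes
drive = {"cohort.csv": 1_204_233, "notes.txt": 5_120, "model.rds": 88_400_001}
bucket = {"cohort.csv": 1_204_233, "notes.txt": 5_120}

print(find_unsynced(drive, bucket))  # the large file has not finished syncing
```

Any names this returns are the files to back up manually through the SWB Upload Files button.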

Standard Operating Procedure (SOP)

  • No cell (e.g. admittances, discharges, patients, services) of 10 or less may be displayed outside of the provisioned AWS computing environment. Likewise, no percentages or other mathematical formulas may be used if they would result in the display of a cell of 10 or less.

  • Researchers should not download, copy and paste, or transmit any raw data (i.e. patient-level or encounter-level identifiers in conjunction with medical records) off of the provisioned AWS computing environment. Patient-level or encounter-level identifiers include: patient number, encounter number, or a combination of any characteristics that could uniquely identify an individual subject.

  • Researchers should not install any unvetted applications without seeking approval from the system administrators. Any high-risk system vulnerability caused by such an installation will result in immediate account suspension.

  • Researchers should not post any sensitive infrastructure information (e.g. server names, credentials) to external newsgroups, social media, other third parties, websites, applications, or public forums without authorization.

  • Always stop an instance when not using it, for cost optimization. SageMaker instances have an auto-shutdown capability, but the other workspace types (i.e., Linux and Windows) do not. An auto-stop feature has been developed to stop a workspace if CPU usage stays below 5% for 1 hour.

  • Avoid accessing workspace from “Public networks” such as those in airports and coffee shops, because these networks often have little or no security.
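For illustration, the auto-stop rule mentioned above amounts to the following check (a sketch only; the real feature runs server-side, and the sampling interval and window here are our assumptions):

```python
def should_auto_stop(cpu_samples, threshold=5.0, window=60):
    """Stop the workspace when every CPU-utilization sample (percent, assumed
    one per minute) in the trailing one-hour window is below the threshold."""
    if len(cpu_samples) < window:
        return False  # not enough history yet
    return all(s < threshold for s in cpu_samples[-window:])

idle = [2.0] * 60           # an hour of <5% utilization -> stop
busy = [2.0] * 59 + [40.0]  # a burst within the hour -> keep running
print(should_auto_stop(idle), should_auto_stop(busy))  # True False
```

The practical takeaway is unchanged: do not rely on auto-stop, and stop idle Linux and Windows workspaces yourself.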


Use R/RStudio on Windows Workspace

It is as simple as running R/RStudio on your own Windows machine: just double-click the application icon on your virtual machine desktop.

Set up ODBC Connector on Windows Workspace

Step 1: Validate ODBC Driver Installation

An ODBC driver has been pre-installed in your Windows system. Click the “Start” button and type “ODBC”; you should see two "ODBC Data Source Administrator" applications as shown below. Please make sure to select the "64-bit" version (the 32-bit version doesn't support the Snowflake driver).

odbc-app

Step 2: Configure ODBC Driver

To configure the ODBC driver in a Windows environment, follow the steps described in this post to create the ODBC DSN, and test that the ODBC DSN works with the following parameters:

1.  Data Source: [arbitrary DSN name] (e.g., snowflake_cdm)
2.  User: [your snowflake username]
3.  Password: [your snowflake password]
4.  Server: [snowflake_account_locator].us-east-2.aws.snowflakecomputing.com (the snowflake_account_locator can be found in your snowflake account)
5.  Database: [snowflake database you want to download data from] (e.g., GROUSE_DEID_ANALYTICS_DB_DROC#)
6.  Schema: (optional) [snowflake schema you want to download data from]
7.  Warehouse: [snowflake warehouse for downloading data] (e.g., GROUSE_WH)
8.  Role: [your designated snowflake role] (e.g., GROUSE_ROLE_C_ANALYTICS_DROC#)
9.  Tracing: [any number between 1 and 6] (e.g., 6, which retains the most detail in logging)

Note: once your Snowflake account has been activated, the [snowflake_account_locator] can be found in the URL of the Snowflake log-in page. These key parameters can also be found in the "ODBC_Connector_Snowflake.md" file.

Database and Schema are optional: you may have visibility into all other databases and schemas once the connection is established. However, you may not be able to query all databases, depending on your role privileges on the Snowflake side.
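For reference, the same parameters can be expressed as a driver-manager connection string, which is how programmatic clients (e.g. pyodbc) pass them instead of the GUI dialog. A minimal Python sketch; the helper is ours, the driver name assumes the Snowflake driver is registered as "SnowflakeDSIIDriver", and all values are placeholders:

```python
def build_odbc_conn_str(params: dict) -> str:
    """Assemble ODBC-style 'Key=Value' pairs, separated by semicolons."""
    return ";".join(f"{k}={v}" for k, v in params.items())

# Placeholder values -- substitute your own account locator, role, etc.
conn_str = build_odbc_conn_str({
    "Driver": "SnowflakeDSIIDriver",
    "Server": "<account_locator>.us-east-2.aws.snowflakecomputing.com",
    "Database": "GROUSE_DEID_ANALYTICS_DB_DROC#",
    "Warehouse": "GROUSE_WH",
    "Role": "GROUSE_ROLE_C_ANALYTICS_DROC#",
    "UID": "<snowflake_username>",
})
print(conn_str.split(";")[0])  # Driver=SnowflakeDSIIDriver
```

For the workflow in this manual, configuring the DSN through the 64-bit ODBC Data Source Administrator dialog is sufficient.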

Step 3: Connect to Snowflake with ODBC driver

You will need to install the DBI and odbc packages before making the database connection. After the ODBC connector is properly configured, you can simply run the following R command to establish the Snowflake database connection:

myconn <- DBI::dbConnect(
    drv = odbc::odbc(),
    dsn = "<your DSN name>",
    uid = "<your snowflake username>",
    pwd = "<your snowflake password>"
)

Alternatively, as a security best practice, you never want any of your credentials inline in your code. To save your credentials as environment variables in R, navigate to C:\Program Files\R\R-4.1.2\etc, open the file Rprofile.site in any text editor or in R, and add your credential information with proper comments, such as:

# add snowflake log-in credentials as environment variables
Sys.setenv(
    ODBC_DSN_NAME = '<the data source name given to the ODBC driver>',
    SNOWFLAKE_USER = '<your snowflake user name>',
    SNOWFLAKE_PWD = '<your snowflake password>'
)

After saving the file, restart R (.rs.restartR()), and run Sys.getenv() to make sure that the environment variables have loaded successfully. You can then make the database connection call by implicitly referencing the credentials saved in the environment:

# make database connection
myconn <- DBI::dbConnect(
    drv = odbc::odbc(),
    dsn = Sys.getenv("ODBC_DSN_NAME"),
    uid = Sys.getenv("SNOWFLAKE_USER"),
    pwd = Sys.getenv("SNOWFLAKE_PWD")
)

Export data from Snowflake into R

The following are a few code samples for querying permitted Snowflake tables/views into R, which require two additional R packages: dplyr and dbplyr.

Load data in using R function by embedding SQL codes

The classic method for loading data from a SQL database is to send SQL queries directly to the database engine, run them, and return the result as a data.frame.

dat1a<-dbGetQuery(
    myconn, 
    "select * from ALS.ALS_PT_TABLE"
)

Load data in using R function interfacing with SQL

There are existing R packages with wrapper functions that interface with SQL. In other words, instead of writing out the SQL code, you can call an R function to perform simple SQL operations, such as reading a single table into R or writing a single table to the SQL database.

dat1b<-tbl(myconn,in_schema("ALS","ALS_PT_TABLE")) %>% 
  collect() # the collect statement to materialize the result

OR

dat1c<-tbl(myconn,sql("select * from ALS.ALS_PT_TABLE")) %>%
  collect() # the collect statement to materialize the result

Load data in using SQL chunk supported by Rmd

R Markdown further simplifies the above code for embedding SQL queries via "SQL code chunks". A SQL code chunk is opened with a chunk header carrying the parameters sql, connection, and output.var (for example, {sql, connection=myconn, output.var="dat1d"}), followed by the query itself:

select * from ALS.ALS_PT_TABLE

Integrate Git with Rstudio

  1. Double-click the "Install GitHub Desktop" icon to install the pre-loaded GitHub Desktop, which automatically installs the most up-to-date version of git
  2. Open RStudio
  3. Select Tools > Global Options > Git/SVN
  4. Check "Enable version control interface for RStudio projects"
  5. To "Browse" to the git executable location, you may enter the following path in the prompted file explorer and select git.exe:
C:\Users\Administrator\AppData\Local\GitHubDesktop\app-3.3.5\resources\app\git\cmd\
  6. Restart RStudio and open a "Terminal"
  7. Type git --version to confirm that git is successfully integrated

Use SAS on Windows Workspace

Connect to Snowflake with SAS/ACCESS on Windows

SAS provides the SAS/ACCESS interface to Snowflake databases. Using this functionality when running SAS queries can reduce or eliminate the need to create copies of the PCORnet CDM as SAS files. A brief introduction to the SAS concepts needed to create these connections in the EDC report is provided in the initial sections of this document. The final two sections provide SAS code to use as templates.

Understanding &variables, %let, and %includes in SAS

A %let statement in SAS allows a value to be assigned to an &variable for use later in the program. When SAS compiles the program, it replaces occurrences of &variable in the code with the value of that variable. Similarly, the %include command places any code in the included file into the SAS program as though it had been typed at the location of the %include command.

For example, consider a file named “login.sas”, which contains three %let statements. The contents of this file look like:

%let myuser = %str(myusername);
%let mypw = %str(mypassword);
%let myserver = %str(myservername); 

These statements create three variables, &myuser, &mypw, and &myserver, which can be used later. Using the %str() syntax causes SAS to treat everything inside the parentheses as part of the variable's value, even special characters. The statement

%include "C:\login.sas";

at the beginning of a program creates the three variables within that program. Later in the program these variables can be used by the command

libname snowflake snow server="&myserver" ….

This command acts as if the code read

libname snowflake snow server="myservername" ….

This has two desirable properties: first, it masks the actual values of the username and password when this code is shared with other users; second, it allows changes to these values to be made in a single location, rather than each time the information is required. Using login files and &variables in this manner even permits different users to share code, simply by having each user create their own login file.

Use SAS Libname Statement to Connect to Snowflake Database

The SAS libname statement tells SAS where to find data. The general syntax for this statement is the same for all supported data sources, but there are different options depending on the data type. In the

%let dpath = ;
libname indata “&dpath”;

statement, libname is the SAS command, while indata is the name given to this data library. A SAS program can have many libname statements, each defining a different data library with a different name. When the data source is a set of SAS files, the libname statement has the form

libname indata “C:\MyDir\FileLocation\”;

Note that the double quotation marks around the directory are needed to ensure that SAS correctly interprets the entire string as a file system location and does not misinterpret any special characters.

When connecting to another data source, a keyword is required in the libname statement to indicate the type of data in the library. SAS has interfaces to many different types of data, including Excel files and Microsoft SQL Server, Oracle, PostgreSQL, and Snowflake databases, among many others. You can run the following libname statement to connect to Snowflake and point to the target schema you want to query from:

libname indata snow server="&myserverID" user="&mysnowuser" password="&mysnowpw" 
warehouse=<your snowflake data warehouse> database=<your snowflake database> 
role=<your snowflake role> SCHEMA=<your snowflake target schema> 
ACCESS=READONLY readbuff=32000 insertbuff=32000 dbcommit=0;

proc sql;
    ...;
quit;

Use Sagemaker Jupyter Notebook

Connect to Snowflake with Python Connector

Install Dependencies

You will need to install the following dependencies in order to use the Snowflake Python connector:

###### Requirements without Version Specifiers ######
snowflake-connector-python

###### Requirements with Version Specifiers ######
cffi > 1.10.1

Snowflake Python Connector

The following is the Python code for establishing a Snowflake connection:

import snowflake.connector

# Connect to Snowflake using the default authenticator
ctx = snowflake.connector.connect(
  user="<snowflake_username>",
  password="<snowflake_password>",
  account="<snowflake_account>"  # e.g. XXXXXXX.us-east-2.aws
)

With MFA set up on the Snowflake side, a Duo push will be triggered automatically. Please make sure to accept the connection.
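The account value above is simply your Snowflake log-in URL with the snowflakecomputing.com suffix removed. A small helper sketch (the function name is ours, and the URL below is a placeholder, not a real account):

```python
from urllib.parse import urlparse

def account_from_url(login_url: str) -> str:
    """Derive the snowflake.connector 'account' parameter from the log-in URL,
    e.g. https://xy12345.us-east-2.aws.snowflakecomputing.com -> xy12345.us-east-2.aws"""
    host = urlparse(login_url).netloc or login_url  # tolerate bare host names
    suffix = ".snowflakecomputing.com"
    return host[:-len(suffix)] if host.endswith(suffix) else host

print(account_from_url("https://xy12345.us-east-2.aws.snowflakecomputing.com"))
# xy12345.us-east-2.aws
```

The resulting string is what you pass as the account argument to snowflake.connector.connect.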

