Applied Modeling Project Components and Interactions - EpiModel/EpiModeling GitHub Wiki

An Applied Modeling Project usually involves the following elements:

Software Repositories:

  • a ResearchProj git repository created from the EpiModelHIV-Template
  • a project-specific EpiModelHIV-p@ResearchProj branch

Places:

  • Local Computer / Locally: your computer where you have RStudio installed
  • GitHub / GH
  • High Performance Computing systems (HPC)

The rest of this page describes the purpose of each element, what should be done at each place, and how data and code should be transferred between them.

Software Repositories

The code for an applied project includes the EpiModel framework, which is pulled in as a dependency, and two custom repositories that you (the researcher) are in charge of editing to fit your project.

The ResearchProj Repository

The easiest way to reason about this repository is to think of it as the repository that another researcher would clone to reproduce your analysis once your paper is published. Therefore, it should contain the minimal set of elements necessary to reproduce your analysis and compare your results.

The Applied Modeling Project Structure page describes in detail the contents of this repo as well as the artifacts produced by the code (things that are not saved on GitHub because they are simply byproducts of the computations).

The EpiModelHIV-p@ResearchProj Branch

You already know that EpiModel is a modular modeling framework. The modules for your analysis will live in another repository. EpiModelHIV-p is the shared repository we use for such modules when modeling the HIV epidemic.

The usual way to proceed is to make a branch on this repository with the same name as your project repo. In this example: EpiModelHIV-p@ResearchProj.

This setup allows you to tailor the modules to your needs as well as easily pull in the changes made to the upstream EpiModelHIV-p@main branch.
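The branch-and-merge cycle described above can be sketched with plain git commands. This is a minimal illustration only: it runs in a throwaway local repository standing in for EpiModelHIV-p, and the file names, commit messages, and the helper `g` are hypothetical. In a real project you would work in your clone of EpiModelHIV-p instead.

```shell
#!/bin/sh
set -e

# Throwaway repository standing in for EpiModelHIV-p (hypothetical content).
repo=$(mktemp -d)
cd "$repo"
git init --quiet
git symbolic-ref HEAD refs/heads/main   # name the initial branch "main"

# Helper that supplies a commit identity so the demo runs anywhere.
g() { git -c user.name="Demo" -c user.email="demo@example.org" "$@"; }

echo "shared module" > mod-shared.R
g add mod-shared.R
g commit --quiet -m "Shared module on main"

# Create the project branch, named after the project repository.
git checkout --quiet -b ResearchProj
echo "project tweak" > mod-project.R
g add mod-project.R
g commit --quiet -m "Tailor a module for ResearchProj"

# Meanwhile, main receives an upstream change.
git checkout --quiet main
echo "upstream fix" >> mod-shared.R
g commit --quiet -am "Upstream fix on main"

# Bring the upstream change into the project branch.
git checkout --quiet ResearchProj
g merge --quiet --no-edit main

git --no-pager log --oneline --graph
cat mod-shared.R
```

After the merge, the ResearchProj branch contains both the project-specific file and the upstream fix, while main is left untouched by the project changes.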

Note: EpiModelCOVID works in a similar fashion for the COVID epidemic. In general, splitting the "analysis code" from the modules is the recommended way to proceed.

Places

Due to the computational cost of running many simulations, the analysis cannot be performed entirely on one's computer. At some point an HPC has to be used. Also, the code is versioned using git with a hosting service, like GitHub.

In this section, we explore the role of each location.

Local Computer

Your local computer is simply the computer you usually run RStudio on. This is where you will write code and experiment.

On it you would have both the ResearchProj repository and the EpiModelHIV-p@ResearchProj branch checked out.

The recommended workflow is to develop everything locally and test it on small versions of the networks. For EpiModelHIV-p we use 5k-node networks locally, simply to test that the code behaves as expected. The full analysis is done later on 100k-node networks with many replications, only once everything has been validated locally.

The EpiModelHPC project and the EpiModelHIV-Template offer a way to harmonize the code so that if it works locally on your computer, the transition to an HPC system should be seamless.

GitHub

GitHub is a hosting service we use to share the code we write on our local computer. Most of the time, the code is changed on the local computer, then pushed to GitHub to save and share it. Sometimes, especially for EpiModelHIV-p@ResearchProj, we can pull in changes from another branch (usually main) directly on GitHub. In this case, the changes also need to be pulled in locally to avoid divergence.
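One way to check whether your local branch has fallen behind the hosted one is to fetch and count the commits you are missing. The sketch below is an illustration under stated assumptions: a local bare repository stands in for GitHub, a second clone plays the role of a change made directly on GitHub, and all paths, file names, and the helper `g` are hypothetical.

```shell
#!/bin/sh
set -e

sandbox=$(mktemp -d)

# Helper that supplies a commit identity so the demo runs anywhere.
g() { git -c user.name="Demo" -c user.email="demo@example.org" "$@"; }

# "github.git" is a local bare repository standing in for GitHub.
git init --quiet --bare "$sandbox/github.git"

# "local" plays the role of your local computer.
git init --quiet "$sandbox/local"
cd "$sandbox/local"
git symbolic-ref HEAD refs/heads/ResearchProj
git remote add origin "$sandbox/github.git"
echo "module v1" > module.R
g add module.R
g commit --quiet -m "Initial module"
git push --quiet -u origin ResearchProj

# "web" simulates a change made directly on GitHub.
git clone --quiet --branch ResearchProj "$sandbox/github.git" "$sandbox/web"
cd "$sandbox/web"
echo "module v2" > module.R
g commit --quiet -am "Change made on GitHub"
git push --quiet origin ResearchProj

# Back on the local computer: fetch, measure the gap, then fast-forward.
cd "$sandbox/local"
git fetch --quiet origin
behind=$(git rev-list --count ResearchProj..origin/ResearchProj)
echo "local is behind by $behind commit(s)"
git merge --quiet --ff-only origin/ResearchProj
cat module.R
```

Using `--ff-only` makes the update fail loudly if the local branch has commits of its own that were never pushed, which is exactly the divergence this step is meant to catch.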

HPC

An HPC is a remote computer with massively more computational power than your machine, on which we will run many (thousands of) replications of our modeling scenarios.

You interact with the HPC on the command line (CLI) using Secure Shell (SSH). This software lets you open a remote session on the HPC from a terminal emulator on your local computer. In practice, you have a terminal on your local computer (I recommend using the built-in RStudio terminal) which runs the SSH program. This program then shows, inside that terminal, a prompt that executes commands on the HPC. Therefore, you should always be aware of "where you are" in the terminal: are you running a command on your local computer or on the HPC? The best way to know is to look at the prompt. On an HPC it should look like this: [username@hpc_name current/working/directory/].
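If the prompt is ambiguous, you can ask the session itself where it is running. The following is a small sketch using only standard POSIX commands; the function name `where_am_i` is hypothetical and not part of any EpiModel tooling.

```shell
#!/bin/sh
# Print "user@host:directory" for the session this command runs in.
where_am_i() {
  printf '%s@%s:%s\n' "$(id -un)" "$(uname -n)" "$(pwd)"
}

where_am_i
```

Run it before launching anything expensive or destructive: if the host name shown is your own machine's, the command that follows will run locally and not on the HPC.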

When working with an HPC, it is important to make sure that the code run on it and the data stored there are correct. You should NEVER edit code on the HPC directly. Always make the changes on your local computer, push them to GitHub, and let the "workflows" update the code on the HPC. For the data, EpiModelHIV-Template describes, for each workflow, which data should be moved where.

Furthermore, an HPC requires a workload manager like Slurm to operate. To ease the transition from local computations to remote ones on the HPC, the EpiModelHPC project provides utilities to create "workflows" to be run on the HPC using the slurmworkflow package; see the vignette for details.

Interactions

The following chart gives an overview of how data should flow between the different components.

[Figure: layout chart of the data flow between components (image not included)]