Applied Modeling Project Components and Interactions - EpiModel/EpiModeling GitHub Wiki
An Applied Modeling Project usually implies the following elements:
Software Repositories:
- a ResearchProj git repository created from the EpiModelHIV-Template
- a project-specific EpiModelHIV-p@ResearchProj branch.
Places:
- Local Computer / Locally: your computer where you have RStudio installed
- GitHub / GH
- High Performance Computing systems (HPC)
The rest of this page will describe the purpose of each element, what should be done at each place and how the data and code should be transferred between them.
Software Repositories
The code for an applied project includes the EpiModel framework, which is pulled in as a dependency, and two custom repositories that you (the researcher) are in charge of editing to fit your project.
The researchProj Repository
The easiest way to reason about this repository is to think of it as the repository that another researcher would clone to reproduce your analysis once your paper is published. Therefore, it should contain the minimal set of elements necessary to reproduce your analysis and compare your results.
The Applied Modeling Project Structure page describes in detail the content of this repo as well as the artifacts produced by the code (things that are not saved on GitHub, as they are simply byproducts of the computations).
The EpiModelHIV-p@researchProj Branch
You already know that EpiModel is a modular modeling framework. The modules for your analysis will live in another repository. EpiModelHIV-p is the shared repository we use for such modules when modeling the HIV epidemic.
The usual way to proceed is to make a branch on this repository with the same name as your project repo. In this example: EpiModelHIV-p@ResearchProj.
This setup allows you to tailor the modules to your needs as well as easily pull in the changes made to the upstream EpiModelHIV-p@main branch.
Note: EpiModelCOVID works in a similar fashion for the COVID epidemic. In general, splitting the "analysis code" from the modules is the recommended way to proceed.
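The branch setup described above can be sketched offline. This is a minimal illustration, assuming a throwaway local repository in place of the real EpiModelHIV-p; the file names, commit messages, and identity are hypothetical placeholders, and in a real project you would clone the shared repository from GitHub rather than run `git init`.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Throwaway repository standing in for EpiModelHIV-p (assumption: in a real
# project you would clone the shared repository from GitHub instead).
sandbox="$(mktemp -d)"
cd "$sandbox"
git init --quiet
git symbolic-ref HEAD refs/heads/main          # name the first branch "main"
git config user.name "Example Researcher"      # placeholder identity
git config user.email "researcher@example.org"

# First commit on main: the shared modules.
echo "module v1" > modules.txt                 # hypothetical file
git add modules.txt
git commit --quiet -m "Shared modules"

# Project branch named after the project repository.
git checkout --quiet -b ResearchProj
echo "project tweak" > project.txt             # hypothetical file
git add project.txt
git commit --quiet -m "Tailor modules for ResearchProj"

# Meanwhile, main receives an upstream change.
git checkout --quiet main
echo "module v2" > modules.txt
git commit --quiet -am "Upstream improvement"

# Pull the upstream change into the project branch.
git checkout --quiet ResearchProj
git merge --quiet --no-edit main

# Show the resulting history: two branches joined by a merge commit.
git --no-pager log --oneline --graph --all
```

After the merge, the project branch contains both its own tailoring and the latest upstream modules, which is the property the branch-per-project setup is meant to provide.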
Places
Due to the computational cost of running many simulations, the analysis cannot be performed entirely on one's computer. At some point an HPC has to be used. Also, the code is versioned using git with a hosting service, like GitHub.
In this section, we explore the role of each location.
Local Computer
Your local computer is simply the computer you usually run RStudio on. This is where you will write code and experiment.
On it you would have both the researchProj repository and the EpiModelHIV-p@researchProj branch checked out.
The recommended workflow is to develop everything locally and test it on small versions of the networks. For EpiModelHIV-p we use 5k-node networks locally, simply to test that the code behaves as expected. The full analysis is done later on 100k-node networks with many replications, only when everything has been validated locally.
The EpiModelHPC project and the EpiModelHIV-Template offer a way to harmonize the code so that if it works locally on your computer, the transition to an HPC system should be seamless.
GitHub
GitHub is a hosting service we use to share the code we write on our local computer. Most of the time, the code is changed on the local computer, then pushed to GitHub to save and share it. Sometimes, especially for EpiModelHIV-p@researchProj, we can pull in changes from another branch (usually main) directly on GitHub. In this case, the changes also need to be pulled in locally to avoid divergence.
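The sketch below illustrates the "pull in locally to avoid divergence" step. It runs entirely offline under stated assumptions: a bare repository stands in for GitHub, a second clone named `web` simulates a change made directly on GitHub, and the file `module.R` is hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

work="$(mktemp -d)"
cd "$work"

# A bare repository stands in for GitHub (assumption: offline demo only).
git init --quiet --bare github.git
git -C github.git symbolic-ref HEAD refs/heads/main

# The "local computer" clone: write code, commit, push.
git clone --quiet github.git local 2>/dev/null   # hides the empty-repo warning
cd local
git config user.name "Example Researcher"        # placeholder identity
git config user.email "researcher@example.org"
git symbolic-ref HEAD refs/heads/main
echo "v1" > module.R                             # hypothetical file
git add module.R
git commit --quiet -m "First version"
git push --quiet origin main
cd ..

# A second clone simulates a change made directly on GitHub.
git clone --quiet github.git web
cd web
git config user.name "Example Researcher"
git config user.email "researcher@example.org"
echo "v2" > module.R
git commit --quiet -am "Change made on GitHub"
git push --quiet origin main
cd ..

# Back on the local computer: fetch, check how far behind we are, then
# fast-forward. --ff-only refuses to merge if the histories have diverged.
cd local
git fetch --quiet origin
behind="$(git rev-list --count HEAD..origin/main)"
echo "Local branch is behind by $behind commit(s)"
git pull --quiet --ff-only origin main
```

Using `--ff-only` is a design choice rather than a requirement: it makes the pull fail loudly if local and remote histories have already diverged, instead of silently creating a merge.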
HPC
An HPC is a remote computer with massively more computational power than your machine, on which we will run many (thousands of) replications of our modeling scenarios.
The HPC is accessed from the command line (CLI) using Secure Shell (SSH). This software allows you to connect to a remote session on the HPC using a terminal emulator on your local computer. This means that on your local computer you have a terminal (I recommend using the built-in RStudio terminal) which runs the SSH program. This program then shows you, inside the terminal, a prompt that will execute commands on the HPC. Therefore, you should always be aware of "where you are" in the terminal: are you running a command on your local computer or on the HPC? The best way to know is to look at the prompt. On an HPC it should look like this: [username@hpc_name current/working/directory/].
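Besides reading the prompt, a small shell helper can answer the "where am I" question. The function below is a hypothetical sketch, not part of any EpiModel tooling: the name `where_am_i` and the default pattern `*hpc*` are assumptions, and the pattern should be adapted to the real host names of your cluster.

```shell
# Hypothetical helper: classify a host name as "HPC" or "local".
# $1: host name to test (defaults to the current machine's host name)
# $2: glob pattern matching HPC host names (the default "*hpc*" is an assumption)
where_am_i() {
  local host="${1:-$(hostname)}"
  local hpc_pattern="${2:-*hpc*}"
  case "$host" in
    $hpc_pattern) echo "HPC" ;;    # unquoted on purpose: treated as a glob
    *)            echo "local" ;;
  esac
}

# Example calls with made-up host names.
where_am_i "hpc_name"
where_am_i "my-laptop"
where_am_i "node42.cluster.example.org" "*.cluster.*"
```

Called without arguments, the function tests the host name of the machine running the command, so it reports on whichever side of the SSH session you are typing in.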
When working with an HPC, it is important to make sure that the code run on it and the data stored there are correct. For code, you should NEVER edit code on the HPC directly. Always make the changes on your local computer, push them to GitHub, and let the "workflows" update them on the HPC. For the data, EpiModelHIV-Template describes which data should be moved where for each workflow.
Furthermore, an HPC requires a workload manager like slurm to operate. In order to ease the transition from local computations to remote ones on the HPC, the EpiModelHPC project provides utilities to create "workflows" to be run on the HPC using the slurmworkflow package (see the vignette for details).
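To give a sense of what a workload manager expects, the sketch below writes a minimal hand-written Slurm batch script to a temporary file. It is not the output of slurmworkflow, which generates its own job files; the job name, resource values, and the R script path are hypothetical placeholders, and the right values depend on your HPC.

```shell
# Write a minimal Slurm batch script to a temporary file (illustration only).
job_file="$(mktemp)"
cat > "$job_file" <<'EOF'
#!/bin/bash
#SBATCH --job-name=researchproj_sim   # hypothetical job name
#SBATCH --ntasks=1                    # one task
#SBATCH --cpus-per-task=4             # placeholder CPU count
#SBATCH --mem=8G                      # placeholder memory request
#SBATCH --time=01:00:00               # placeholder wall-time limit (HH:MM:SS)
#SBATCH --output=slurm-%j.out         # %j expands to the job ID

Rscript path/to/simulation_script.R   # hypothetical script path
EOF

# On the HPC, such a file would be submitted with sbatch.
echo "Submit on the HPC with: sbatch $job_file"
```

The `#SBATCH` lines are comments to the shell but directives to Slurm, which reads them to decide what resources to reserve before running the commands that follow.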
Interactions
The following chart gives an overview of how data should flow between the different components.