Running the pipeline - HopkinsIDD/cholera-mapping-pipeline GitHub Wiki

This page gives a short description of how to run the pipeline. Currently the pipeline can only be run on our local cluster idmodeling, or using the Docker image as described in the section on writing tests. Before starting work on the pipeline, you will also want to review the Best Practices page for information about the project workflows.

The pipeline is contained in two repositories: HopkinsIDD/cholera-mapping-pipeline (public), which contains scripts and the taxdat package, and HopkinsIDD/cholera-covariates (private), which contains the raw raster covariates used in the pipeline. This page assumes you have access to the latter.

A general workflow for running the pipeline consists of:

  • writing configuration files (see the section on config files)
  • running the pipeline by calling the main script Analysis/R/set_parameters.R
  • producing run reports with scripts in Analysis/output

Writing a config

The pipeline works with a configuration file to specify the parameters of the run. A description of the file is given in the section on config files. We will use the example in Analysis/configs/example_config.yml.

Running the pipeline

The pipeline is run by calling:

Rscript Analysis/R/set_parameters.R -c Analysis/configs/example_config.yml

Data and output are written to Analysis/data.

An example SLURM script to run the model on idmodeling is:

#!/bin/bash
#SBATCH --job-name=run_setup
#SBATCH --output=logs/run_setup_%A_%a.log
#SBATCH --time=7-00:00
#SBATCH --mem=15G
#SBATCH -c 4

echo "Beginning of script"
date
TAXDIR=<path_to_cholera-mapping-pipeline>
CONFIG=$1
RSCRIPT=/opt/R/4.0.3/bin/Rscript

cd $TAXDIR

$RSCRIPT $TAXDIR/Analysis/R/set_parameters.R -c $TAXDIR/$CONFIG  || exit 1
echo "End of script"
date

and can be run with:

sbatch run_map.slurm Analysis/configs/example_config.yml

Environment variables

In the old pipeline, environment variables (envvars) may be set at runtime to specify run-specific options. The following envvars are available:

set_parameters.R

  • CHOLERA_CONFIG: (default: "config.yml")
  • CHOLERA_ON_MARCC: (default: FALSE)
  • PRODUCTION_RUN: (default: TRUE)
  • REINSTALL_TAXDAT: (default: TRUE)
  • CHOLERA_SKIP_STAN: stops running after prepare_initial_values.R (default: FALSE)

prepare_map_data_revised.R

  • CHOLERA_API_USERNAME: (default: "NONE", which has the effect of pulling the value from the file "Analysis/R/database_api_key.R")
  • CHOLERA_API_KEY: (default: "NONE", which has the effect of pulling the value from the file "Analysis/R/database_api_key.R")
  • CHOLERA_API_WEBSITE: Only one option is available (default: "", which has the effect of pulling from the middle-distance website).
  • CHOLERA_SQL_USERNAME: (default: "NONE", which has the effect of pulling the value from the file "Analysis/R/database_api_key.R")
  • CHOLERA_SQL_PASSWORD: (default: "NONE", which has the effect of pulling the value from the file "Analysis/R/database_api_key.R")
  • CHOLERA_SQL_WEBSITE: Specifies the website from which the SQL data pull should be made. Enables pulling data from the database snapshot on idmodeling when the value is 'localhost' (default: "", which has the effect of pulling from the middle-distance website).
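The defaults above follow the usual "use the fallback when the variable is unset" pattern. The pipeline reads these variables in R, but the semantics can be sketched in pure shell; variable names and defaults are taken from the list above:

```shell
# Default-value semantics for the envvars listed above, sketched with
# standard shell parameter expansion ("${VAR:-fallback}").
unset CHOLERA_CONFIG
echo "${CHOLERA_CONFIG:-config.yml}"    # unset: falls back to "config.yml"

export CHOLERA_CONFIG=Analysis/configs/example_config.yml
echo "${CHOLERA_CONFIG:-config.yml}"    # set: the exported value wins
```

Exporting a variable before invoking set_parameters.R, as the Segmented Job scripts below do with CHOLERA_SKIP_STAN, is how these options are set for a run.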

Build run report

Reports can be built with Analysis/output/country_data.Rmd. This report is currently set up for single-country runs.

Segmented Job

You may want to submit jobs so that the data preparation is done in one job and the Stan submission in another (these stages have different space/core requirements). You could use two scripts, stan_not_included.sh (run first) and stan_included.sh (run second, submitted automatically by the first):

#!/bin/bash
#SBATCH --job-name=run_setup
#SBATCH --nodelist=idmodeling2
#SBATCH --output=logs/run_setup_%A_%a_preparation.log
#SBATCH --time=1-00:00
#SBATCH --mem=15G

echo "Beginning of script"
date
TAXDIR=<path_to_cholera-mapping-pipeline>
CONFIG=$1
RSCRIPT=/opt/R/4.0.3/bin/Rscript

cd $TAXDIR
export CHOLERA_SKIP_STAN=TRUE
$RSCRIPT $TAXDIR/Analysis/R/set_parameters.R -c $TAXDIR/$CONFIG  || exit 1
sbatch stan_included.sh
echo "End of script"
date

and

#!/bin/bash
#SBATCH --job-name=run_setup
#SBATCH --output=logs/run_setup_%A_%a_stan.log
#SBATCH --time=7-00:00
#SBATCH --mem=15G
#SBATCH -c 4

echo "Beginning of script"
date
TAXDIR=<path_to_cholera-mapping-pipeline>
CONFIG=$1
RSCRIPT=/opt/R/4.0.3/bin/Rscript

cd $TAXDIR
export CHOLERA_SKIP_STAN=FALSE
$RSCRIPT $TAXDIR/Analysis/R/set_parameters.R -c $TAXDIR/$CONFIG  || exit 1
echo "End of script"
date
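An alternative to calling sbatch from inside stan_not_included.sh is to chain the two jobs with SLURM's job dependencies; this is a sketch of that alternative, not what the scripts above do. `sbatch --parsable` prints only the job ID, and `--dependency=afterok` starts the second job only once the first has completed successfully. A mock `sbatch` is defined so the sketch runs outside the cluster; delete that line on idmodeling.

```shell
# Sketch: chain the preparation and Stan jobs with a SLURM dependency.
# Mock sbatch so this runs anywhere; remove the next line on the cluster.
sbatch() { echo "1001"; }   # real `sbatch --parsable` prints just the job ID

CONFIG=Analysis/configs/example_config.yml
PREP_ID=$(sbatch --parsable stan_not_included.sh "$CONFIG")
# Start the Stan job only after the preparation job exits successfully.
sbatch --dependency=afterok:"$PREP_ID" stan_included.sh "$CONFIG"
```

With this approach, the `sbatch stan_included.sh` line inside stan_not_included.sh would be removed, since the dependency replaces it.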

Postprocessing results

The following steps postprocess results of interest and produce output figures.

  1. Run Analysis/postprocess_results.R: this script extracts and postprocesses results for all configs in a user-specified directory. The user needs to specify where the data objects are located and where intermediate and final outputs should be saved.
  2. Run Analysis/postprocess_results_figures_and_tables.R: this script produces output figures and tables for a given user-specified config directory. It assumes that Analysis/postprocess_results.R has already been run on that config directory. The aim of these figures and tables is mostly to verify the validity of the postprocessing steps.
  3. Run Analysis/make_final_figures_and_tables.R: this script takes two distinct output filename prefixes and produces "final" figures and tables intended for publication. The script is meant to compare different time periods. It assumes that Analysis/postprocess_results.R has already been run on both config directories.