Processing Scenarios Tibbles - EpiModel/EpiModeling GitHub Wiki

Introduction

In this tutorial we cover the creation of a tibble containing one row per simulation and one column per outcome of interest:

| scenario_name | batch_number | sim | outcome1 | outcome2 | ... |
|---------------|--------------|-----|----------|----------|-----|
| sc_1          | 1            | 1   | 10       | 73       | ... |
| ...           | ...          | ... | ...      | ...      | ... |
| sc_14         | 2            | 15  | 1024     | 39       | ... |

We will first tackle the interactive exploration and definition of the scenario outcomes, then see how to automate the code to process many scenarios in parallel.

Here we assume that the scenarios have already been run and that they were combined into one tibble per scenario using either EpiModelHPC::merge_netsim_scenarios_tibble or the corresponding slurmworkflow step template.

Local and interactive processing

Setup

The merged tibbles are stored in the sc_dir directory.

library("dplyr")

sc_dir <- "data/intermediate/scenarios/merged_tibbles"
sc_infos_tbl <- EpiModelHPC::get_scenarios_tibble_infos(sc_dir)

EpiModelHPC::get_scenarios_tibble_infos is a helper function returning the file path and scenario name for each of the tibbles present in this directory.
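As a sketch, sc_infos_tbl has the shape below. The scenario_name and file_path columns are the ones we use later in this tutorial; the values shown are purely illustrative.

```r
library(tibble)

# Illustrative shape of sc_infos_tbl: one row per merged tibble in sc_dir
sc_infos_example <- tibble(
  scenario_name = c("sc_1", "sc_2"),
  file_path = c(
    "data/intermediate/scenarios/merged_tibbles/sc_1.rds",
    "data/intermediate/scenarios/merged_tibbles/sc_2.rds"
  )
)

sc_infos_example
```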

Loading the first scenario

For our interactive exploration, we focus on the first scenario tibble in the list. For this we subset the first line of sc_infos_tbl:

sc_info <- sc_infos_tbl[1, ]

With it we can read in the file and add a scenario_name column to the loaded tibble. I like to reorder the columns to get the identifying elements first.

d_sc <- readRDS(sc_info$file_path)
d_sc <- mutate(d_sc, scenario_name = sc_info$scenario_name) |>
  select(scenario_name, batch_number, sim, time, everything())

glimpse(d_sc)
#>
#> Rows: 1,272
#> Columns: 121
#> $ scenario_name             <chr> "0_no_interv", "0_no_interv", "0_no_interv", "0_…
#> $ batch_number              <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1", "1…
#> $ sim                       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ time                      <int> 5409, 5410, 5411, 5412, 5413, 5414, 5415, 5416,…
#> $ disease.mr100             <dbl> 0.2493048, 0.2492929, 0.9970759, 0.2493048, 0.2…
#> $ incid.gc                  <dbl> 210, 199, 202, 203, 205, 192, 191, 204, 207, 20…
#> $ incid.ct                  <dbl> 301, 285, 304, 322, 304, 280, 310, 311, 316, 28…
#> $ s.num                     <dbl> 95206, 95199, 95205, 95212, 95214, 95212, 95210…
#> $ i.num                     <dbl> 20859, 20861, 20858, 20856, 20840, 20831, 20823…
#> ...

Note: to uniquely identify a simulation, we need all three of scenario_name, batch_number and sim.
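A quick way to verify this on loaded data is to count rows per identifier combination; the toy tibble below stands in for d_sc (any n > 1 would mean the identifiers are not unique):

```r
library(dplyr)

# Toy stand-in for d_sc: 2 batches x 2 sims x 2 time steps
toy <- tibble(
  scenario_name = "sc_1",
  batch_number  = rep(c("1", "2"), each = 4),
  sim           = rep(rep(1:2, each = 2), times = 2),
  time          = rep(1:2, times = 4)
)

# With time included, each combination should appear exactly once
dups <- toy |>
  count(scenario_name, batch_number, sim, time) |>
  filter(n > 1)

nrow(dups) # 0: the identifiers are unique
```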

Adding columns

Before creating the per-simulation outcomes, we sometimes want to calculate columns directly from the loaded data. This allows us to reduce the amount of computation and storage needed during the simulation. It's especially useful when the outcomes are multiple permutations of raw data: store the raw data in the simulation and calculate the permutations later on.

Here is a very simple example for learning purposes.

d_sc <- d_sc |>
  mutate(
    prev = i.num / (i.num + s.num),
    incid.sti = incid.gc + incid.ct
  )

We add two columns: the prevalence, prev, and the combined incidence of both STIs, incid.sti.

Calculating last year summaries

We often want to see the value of some outcomes at the end of the simulation. Here we will calculate the prevalence and the HIV-related deaths per 100 HIV-infected person-years (disease.mr100).

d_sc_ly <- d_sc |>
  filter(time > max(time) - 52) |>
  group_by(scenario_name, batch_number, sim) |>
  summarise(
    across(
      c(prev, disease.mr100),
      ~ mean(.x, na.rm = TRUE),
      .names = "{.col}_ly"
    ),
    .groups = "drop" # ungroup the tibble after the summary
  )

This block does the following processing:

  1. Keep only the last 52 time steps (a year if a time step is a week)
  2. Group the rows so that each individual simulation forms its own group
  3. Take the mean of each group for the columns prev and disease.mr100

Note: we use across to apply the same function to each of the selected columns. They are renamed prev_ly and disease.mr100_ly respectively by the .names argument.

Note: the .groups = "drop" argument to summarise removes all grouping at the end of the call. It's a good idea to always add it to your summarise calls.
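Both notes can be seen on a tiny toy example (data made up for illustration):

```r
library(dplyr)

toy <- tibble(
  g = c("a", "a", "b"),
  x = c(1, 3, 5),
  y = c(2, 4, 6)
)

res <- toy |>
  group_by(g) |>
  summarise(
    # same function applied to both columns, renamed via .names
    across(c(x, y), ~ mean(.x, na.rm = TRUE), .names = "{.col}_ly"),
    .groups = "drop"
  )

res
#> g  x_ly  y_ly
#> a     2     3
#> b     5     6

inherits(res, "grouped_df") # FALSE: .groups = "drop" removed the grouping
```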

glimpse(d_sc_ly)
#>
#> Rows: 24
#> Columns: 5
#> $ scenario_name    <chr> "0_no_interv", "0_no_interv", "0_no_interv", "0_inte…
#> $ batch_number     <chr> "1", "1", "1", "1", "1", "1", "1", "1", "2", "2", "2…
#> $ sim              <int> 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2…
#> $ prev_ly          <dbl> 0.1791361, 0.1801060, 0.1839084, 0.1806637, 0.179752…
#> $ disease.mr100_ly <dbl> 0.2789982, 0.2487883, 0.2060229, 0.2487799, 0.253226…

Calculating cumulative summaries

The other type of summaries we usually have are cumulative ones. Here we will calculate the cumulative incidence over the 10 years of intervention for gonorrhea (gc), chlamydia (ct) and the combined STI.

d_sc_cml <- d_sc |>
  filter(time > max(time) - 10 * 52) |>
  group_by(scenario_name, batch_number, sim) |>
  summarise(
    across(
      starts_with("incid."),
      ~ sum(.x, na.rm = TRUE),
      .names = "{.col}_cml"
    ),
    .groups = "drop" # ungroup the tibble after the summary
  )

The code is very similar to the previous one.

  • We take 10 years instead of 1.
  • The column names are suffixed with _cml instead of _ly
  • We use starts_with("incid.") to select all the columns whose name starts with incid..
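Note that starts_with() matches a literal prefix, not a regular expression, so the dot in "incid." is matched as-is. A quick toy check:

```r
library(dplyr)

toy <- tibble(incid.gc = 1:2, incid.ct = 3:4, prev = c(0.1, 0.2))

# Only the columns whose name starts with the literal prefix are kept
toy |>
  select(starts_with("incid.")) |>
  names()
#> [1] "incid.gc" "incid.ct"
```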

glimpse(d_sc_cml)
#>
#> Rows: 24
#> Columns: 6
#> $ scenario_name    <chr> "0_no_interv", "0_no_interv", "0_no_interv", "0_inte…
#> $ batch_number           <chr> "1", "1", "1", "1", "1", "1", "1", "1", "2", "…
#> $ sim                    <int> 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8…
#> $ incid.gc_cml           <dbl> 11010, 14219, 10077, 10462, 14770, 11156, 8397…
#> $ incid.ct_cml           <dbl> 15142, 14041, 11982, 13223, 16685, 11017, 1311…
#> $ incid.sti_cml          <dbl> 26152, 28260, 22059, 23685, 31455, 22173, 2151…

Combining everything

Finally, we use left_join to combine the two tibbles into one:

d_cmb <- left_join(
  d_sc_ly, d_sc_cml,
  by = c("scenario_name", "batch_number", "sim")
)

glimpse(d_cmb)
#>
#> Rows: 24
#> Columns: 8
#> $ scenario_name    <chr> "0_no_interv", "0_no_interv", "0_no_interv", "0_inte…
#> $ batch_number     <chr> "1", "1", "1", "1", "1", "1", "1", "1", "2", "2", "2", "…
#> $ sim              <int> 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3,…
#> $ prev_ly          <dbl> 0.1791361, 0.1801060, 0.1839084, 0.1806637, 0.1797525, 0…
#> $ disease.mr100_ly <dbl> 0.2789982, 0.2487883, 0.2060229, 0.2487799, 0.2532263, 0…
#> $ incid.sti_cml    <dbl> 25641, 27683, 21641, 23229, 30852, 21742, 21123, 25366, …
#> $ incid.gc_cml     <dbl> 10800, 13925, 9882, 10273, 14500, 10957, 8252, 11768, 13…
#> $ incid.ct_cml     <dbl> 14841, 13758, 11759, 12956, 16352, 10785, 12871, 13598, …

Conclusion

This section was the most important one. We should spend quite a lot of time here defining all the outcomes we need on a single scenario. This allows us to iterate quickly over our tests without burdening ourselves with many files.

Once we are done with it, we can move to the next task, automation and parallelization.

Automating and parallelizing the process

Creating a process_one_scenario_tibble function

Now that all the processing steps are defined, we can create a function to encapsulate them.

process_one_scenario_tibble will take the scenario info, as a one-row tibble, as its argument.

process_one_scenario_tibble <- function(sc_info) {
  # load the file and add the scenario name
  d_sc <- readRDS(sc_info$file_path)
  d_sc <- mutate(d_sc, scenario_name = sc_info$scenario_name) |>
    select(scenario_name, batch_number, sim, time, everything())

  # global mutate
  d_sc <- d_sc |>
    mutate(
      prev = i.num / (i.num + s.num),
      incid.sti = incid.gc + incid.ct
    )

  # last year summaries
  d_sc_ly <- d_sc |>
    filter(time > max(time) - 52) |>
    group_by(scenario_name, batch_number, sim) |>
    summarise(
      across(
        c(prev, disease.mr100),
        ~ mean(.x, na.rm = TRUE),
        .names = "{.col}_ly"
      ),
      .groups = "drop" # ungroup the tibble after the summary
    )

  # cumulative summaries
  d_sc_cml <- d_sc |>
    filter(time > max(time) - 10 * 52) |>
    group_by(scenario_name, batch_number, sim) |>
    summarise(
      across(
        starts_with("incid."),
        ~ sum(.x, na.rm = TRUE),
        .names = "{.col}_cml"
      ),
      .groups = "drop" # ungroup the tibble after the summary
    )

  # join the two summary tibbles, one row per simulation
  left_join(
    d_sc_ly, d_sc_cml,
    by = c("scenario_name", "batch_number", "sim")
  )
}

Running the function over all files

We will use lapply to sequentially process all tibbles.

d_ls <- lapply(
  seq_len(nrow(sc_infos_tbl)),
  \(i) process_one_scenario_tibble(sc_infos_tbl[i, ])
)

Here we iterate over the sequence of values 1:nrow(sc_infos_tbl), generated with the seq_len function.

We define an anonymous function to call process_one_scenario_tibble on each row of sc_infos_tbl. (The \(i) syntax was introduced in R 4.1 alongside the native pipe |>.)
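The row-wise pattern itself can be checked on a toy tibble, independently of the real processing function (the names below are made up):

```r
library(dplyr)

toy_infos <- tibble(
  scenario_name = c("sc_1", "sc_2"),
  file_path = c("sc_1.rds", "sc_2.rds") # illustrative paths, never read here
)

# Each toy_infos[i, ] is a one-row tibble, as expected by our function
res <- lapply(
  seq_len(nrow(toy_infos)),
  \(i) toy_infos[i, ]$scenario_name
)

unlist(res)
#> [1] "sc_1" "sc_2"
```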

d_ls is now a list of tibbles that we merge together with the bind_rows function.

d_sc_raw <- bind_rows(d_ls)
readr::write_csv(d_sc_raw, "sc_raw.csv")

Finally we save the result into sc_raw.csv. We call it "raw" as we still have one row per simulation and not a single row per scenario.

Making it parallel

On the HPC we may want to parallelize this processing. It's useful when working with hundreds of scenarios.

The future.apply package makes the transition very smooth:

future::plan("multisession", workers = 4)

d_ls <- future.apply::future_lapply(
  seq_len(nrow(sc_infos_tbl)),
  \(i) process_one_scenario_tibble(sc_infos_tbl[i, ])
)

d_sc_raw <- bind_rows(d_ls)
readr::write_csv(d_sc_raw, "sc_raw.csv")

We can see that the only two changes are:

  1. we replaced lapply with future.apply::future_lapply
  2. a call to future::plan to define how we want to parallelize

In this case, future::plan("multisession", workers = 4) says that the code should be run across 4 separate background R sessions. This plan should work almost everywhere.
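Other plans exist, and the processing code never has to change. For instance, switching to sequential execution (handy when debugging) is a one-line change; the small sketch below uses a trivial computation in place of our processing function:

```r
library(future.apply)

# Sequential plan: same code path as multisession, but in the current session
future::plan("sequential")

res <- future_lapply(1:3, \(i) i^2)
unlist(res)
#> [1] 1 4 9
```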

Conclusion

In this tutorial we saw how to efficiently process scenarios tibbles down to a single tibble with one row per simulation. In another tutorial we will explore how to efficiently and automatically process this further down into a tibble with one line per scenario, clean names and formatted results.
