Processing Scenarios Tibbles - EpiModel/EpiModeling GitHub Wiki
In this tutorial we cover the creation of a tibble
containing one row per simulation and one column per outcome of interest:
scenario_name | batch_number | sim | outcome1 | outcome2 | ... |
---|---|---|---|---|---|
sc_1 | 1 | 1 | 10 | 73 | ... |
... | ... | ... | ... | ... | ... |
sc_14 | 2 | 15 | 1024 | 39 | ... |
We will first tackle the interactive exploration and definition of the scenarios outcomes then we will see how to automate the code to process many scenarios in parallel.
Here we assume that the scenarios have been run already and that they were combined into one tibble
per scenario using either EpiModelHPC::merge_netsim_scenarios_tibble
or the corresponding slurmworkflow
step template.
The merged tibbles
are stored in the sc_dir
directory.
library("dplyr")
sc_dir <- "data/intermediate/scenarios/merged_tibbles"
sc_infos_tbl <- EpiModelHPC::get_scenarios_tibble_infos(sc_dir)
The EpiModelHPC::get_scenarios_tibble_infos
is a helper function returning the file path and scenario name for each of the tibble
present in this directory.
For our interactive exploration, we focus on the first scenario tibble
in the list. For this we subset the first line of sc_infos_tbl
:
sc_info <- sc_infos_tbl[1, ]
With it we can read in the file and add a scenario_name
column to the loaded tibble
. I like to reorder the columns to get the identifying elements first.
d_sc <- readRDS(sc_info$file_path)
d_sc <- mutate(d_sc, scenario_name = sc_info$scenario_name) |>
select(scenario_name, batch_number, sim, time, everything())
glimpse(d_sc)
#>
#> Rows: 1,272
#> Columns: 121
#> $ scenario_name <chr> "0_no_interv", "0_no_interv", "0_no_interv", "0_…
#> $ batch_number <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1", "1…
#> $ sim <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ time <int> 5409, 5410, 5411, 5412, 5413, 5414, 5415, 5416,…
#> $ disease.mr100 <dbl> 0.2493048, 0.2492929, 0.9970759, 0.2493048, 0.2…
#> $ incid.gc <dbl> 210, 199, 202, 203, 205, 192, 191, 204, 207, 20…
#> $ incid.ct <dbl> 301, 285, 304, 322, 304, 280, 310, 311, 316, 28…
#> $ s.num <dbl> 95206, 95199, 95205, 95212, 95214, 95212, 95210…
#> $ i.num <dbl> 20859, 20861, 20858, 20856, 20840, 20831, 20823…
#> ...
#
Note: To uniquely identify a simulation, we need: scenario_name
, batch_number
and sim
Before creating the per simulation outcomes, we sometimes want to calculate columns directly from the loaded data. This allow us to reduce the amount of calculation and storage needed during the simulation. It's especially useful when the outcomes are multiple permutations of raw data. Store the raw data in the simulation and calculate the permutations later on.
Here is a very simple example for learning purposes.
d_sc <- d_sc |>
mutate(
prev = i.num / (i.num + s.num),
incid.sti = incid.gc + incid.ct
)
We add 2 columns, the prevalence prev
and the combined incidence of both STIs incid.sti
.
We often want to see the value of some outcomes at the end of the simulation. Here we will calculate the prevalence and the HIV-related deaths per 100 HIV-infected person-years disease.mr100
.
d_sc_ly <- d_sc |>
filter(time > max(time) - 52) |>
group_by(scenario_name, batch_number, sim) |>
summarise(
across(
c(prev, disease.mr100),
~ mean(.x, na.rm = TRUE),
.names = "{.col}_ly"
),
.groups = "drop" # ungroup the tibble after the summary
)
This block does the following processing:
- Keep only the last 52 time steps (a year if a time step is a week)
- Group uniquely each individual simulation
- Take the
mean
of each group for the columnsprev
anddisease.mr100
Note: we use across
to apply the same function to each columns selected. They are renamed respectively prev_ly
and disease.mr100_ly
by the .names
argument
Note: the .groups = "drop"
argument to summarise removes all grouping at the end of the call. It's a good idea to always add it to your summarise
calls.
glimpse(d_sc_ly)
#>
#> Rows: 24
#> Columns: 5
#> $ scenario_name <chr> "0_no_interv", "0_no_interv", "0_no_interv", "0_inte…
#> $ batch_number <chr> "1", "1", "1", "1", "1", "1", "1", "1", "2", "2", "2…
#> $ sim <int> 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2…
#> $ prev_ly <dbl> 0.1791361, 0.1801060, 0.1839084, 0.1806637, 0.179752…
#> $ disease.mr100_ly <dbl> 0.2789982, 0.2487883, 0.2060229, 0.2487799, 0.253226…
The other type of summaries we usually have are cumulative ones. Here we will calculate the cumulative incidence over the 10 years of intervention for gonorrhea (gc
), chlamydia (ct
) and the combined STI.
d_sc_cml <- d_sc |>
filter(time > max(time) - 10 * 52) |>
group_by(scenario_name, batch_number, sim) |>
summarise(
across(
starts_with("incid."),
~ sum(.x, na.rm = TRUE),
.names = "{.col}_cml"
),
.groups = "drop" # ungroup the tibble after the summary
)
The code is very similar to the previous one.
- We take 10 years instead of 1.
- The columns are prepended with
_cml
instead of_ly
- We use
starts_with("incid.")
to select all the columns whose name starts withincid.
.
glimpse(d_sc_cml)
#>
#> Rows: 24
#> Columns: 6
#> $ scenario_name <chr> "0_no_interv", "0_no_interv", "0_no_interv", "0_inte…
#> $ batch_number <chr> "1", "1", "1", "1", "1", "1", "1", "1", "2", "…
#> $ sim <int> 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8…
#> $ incid.gc_cml <dbl> 11010, 14219, 10077, 10462, 14770, 11156, 8397…
#> $ incid.ct_cml <dbl> 15142, 14041, 11982, 13223, 16685, 11017, 1311…
#> $ incid.sti_cml <dbl> 26152, 28260, 22059, 23685, 31455, 22173, 2151…
Finally, we use left_join
to combine the two tibbles
into one:
d_cmb <- left_join(
d_sc_ly, d_sc_cml,
by = c("scenario_name", "batch_number", "sim")
)
glimpse(d_cmb)
#>
#> Rows: 24
#> Columns: 8
#> $ scenario_name <chr> "0_no_interv", "0_no_interv", "0_no_interv", "0_inte…
#> $ batch_number <chr> "1", "1", "1", "1", "1", "1", "1", "1", "2", "2", "2", "…
#> $ sim <int> 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3,…
#> $ prev_ly <dbl> 0.1791361, 0.1801060, 0.1839084, 0.1806637, 0.1797525, 0…
#> $ disease.mr100_ly <dbl> 0.2789982, 0.2487883, 0.2060229, 0.2487799, 0.2532263, 0…
#> $ incid.sti_cml <dbl> 25641, 27683, 21641, 23229, 30852, 21742, 21123, 25366, …
#> $ incid.gc_cml <dbl> 10800, 13925, 9882, 10273, 14500, 10957, 8252, 11768, 13…
#> $ incid.ct_cml <dbl> 14841, 13758, 11759, 12956, 16352, 10785, 12871, 13598, …
This section was the most important. We should spend quite a lot of time here to define all the outcomes we need on a single scenario. This allow us to quickly iterate over our tests without burdening ourselves with many files.
Once we are done with it, we can move to the next task, automation and parallelization.
Now that all the processing step are defined, we can create a function to encapsulate it.
process_one_scenario_tibble
will take the scenario info in a one row tibble
format as argument.
process_one_scenario_tibble <- function(sc_info) {
# loading the file
d_sc <- readRDS(sc_info$file_path)
d_sc <- mutate(d_sc, scenario_name = sc_info$scenario_name) |>
select(scenario_name, batch_number, sim, time, everything())
# global mutate
d_sc <- d_sc |>
mutate(
prev = i.num / (i.num + s.num),
incid.sti = incid.gc + incid.ct
)
# last year summaries
d_sc_ly <- d_sc |>
filter(time > max(time) - 52) |>
group_by(scenario_name, batch_number, sim) |>
summarise(
across(
c(prev, disease.mr100),
~ mean(.x, na.rm = TRUE),
.names = "{.col}_ly"
),
.groups = "drop" # ungroup the tibble after the summary
)
# cummulative summaries
d_sc_cml <- d_sc |>
filter(time > max(time) - 10 * 52) |>
group_by(scenario_name, batch_number, sim) |>
summarise(
across(
starts_with("incid."),
~ sum(.x, na.rm = TRUE),
.names = "{.col}_cml"
),
.groups = "drop" # ungroup the tibble after the summary
)
# joining
d_cmb <- left_join(
d_sc_ly, d_sc_cml,
by = c("scenario_name", "batch_number", "sim")
)
return(d_cmb)
}
We will use lapply
to sequentially process all tibbles
.
d_ls <- lapply(
seq_len(nrow(sc_infos_tbl)),
\(i) process_one_scenario_tibble(sc_infos_tbl[i, ])
)
Here iterate over a sequence of values 1:nrow(sc_infos_tbl)
with the seq_len
function.
We define an anonymous function to call process_one_scenario_tibble
on each row of sc_infos_tbl
. (the \(i)
syntax was introduced in R 4.1 along the native pipe |>
).
d_ls
is now a list
of tibbles
that we merge together with the bind_rows
function.
d_sc_raw <- bind_rows(d_ls)
readr::write_csv(d_sc_raw, "sc_raw.csv")
Finally we save the result into sc_raw.csv
. We call it "raw" as we still have one row per simulation and not a single row per scenario.
On the HPC we may want to parallelize this processing. It's useful when working with hundreds of scenarios.
The future.apply
package makes the transition very smooth:
future::plan("multisession", workers = 4)
d_ls <- future.apply::future_lapply(
seq_len(nrow(sc_infos_tbl)),
\(i) process_one_scenario_tibble(sc_infos_tbl[i, ])
)
d_sc_raw <- bind_rows(d_ls)
readr::write_csv(d_sc_raw, "sc_raw.csv")
We can see that the only 2 changes are:
- we replaced
lapply
withfuture.apply::future_lapply
- a call to
future::plan
to define how we want to parallelize
In this case, the future::plan("multisession", workers = 4)
say that the code should be run across 4 separate R sessions in the background. This plan should work almost everywhere.
In this tutorial we saw how to efficiently process scenarios tibbles
down to a single tibble
with one row per simulation. In another tutorial we will explore how to efficiently and automatically process this further down into a tibble
with one line per scenario, clean names and formatted results.