Postprocessing model estimates
Once we have death rates from a model, we want to convert them into usable quantities (e.g. the probability of dying, life expectancy).
Extracting rates
The first step is to extract the model quantities that we need to recreate the death rate. There are two ways of doing this:

- Monitor the rates in the model using `numpyro.deterministic("rates", latent_rate)`, which is expensive and memory-intensive but fine for small models.
- Remake the latent rates from the sampled variables. An example is below. The basic idea is to get the data into an array in R, which makes it easy to calculate summary statistics along different axes using `apply`. Note that for really large model outputs, you might want to extract only the variables required for the paper (e.g. just the first and last year).
```r
library(RNetCDF)
library(tidyverse)

nc <- open.nc("output/model_age_time_interaction_samples.nc")

vars <- c("slope", "age_drift", "age_time_drift")

N_t <- 31
time <- 1990:2020
N_age <- 18
ages <- seq(0, 85, 5)
N_draws <- 500
N_chains <- 2

posterior <- list()
for (var in vars) {
  # dim(var.get.nc(nc, var)) = (..., draws, chains)
  posterior[[var]] <- var.get.nc(nc, var)
  print(dim(var.get.nc(nc, var)))
}

# relevant line from the model file:
# latent_rate = slope_cum + age_effect + age_time_effect
posterior$age_effect <- apply(X = posterior$age_drift, MARGIN = c(2, 3), FUN = cumsum)
posterior$age_time_effect <- apply(X = posterior$age_time_drift, MARGIN = c(2, 3, 4), FUN = cumsum)

latent_rate <- array(
  data = NA,
  dim = c(N_age, N_t, N_draws, N_chains),
  dimnames = list(
    ages,
    time,
    1:N_draws,
    1:N_chains
  )
)
for (a in 1:N_age) {
  # first year: the slope and interaction contributions are zero
  latent_rate[a, 1, , ] <- posterior$age_effect[a, , ]
  # all other years: sum all three components of the model equation
  for (t in 2:N_t) {
    latent_rate[a, t, , ] <- posterior$slope * (t - 1) +
      posterior$age_effect[a, , ] +
      posterior$age_time_effect[t - 1, a, , ]
  }
}

write_rds(
  # optionally transform back to outcome space
  plogis(latent_rate),
  "output/model_age_time_interaction_rate.rds"
)
```
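A quick sanity check that the reconstruction produced the expected dimensions (using the constants defined above):

```r
stopifnot(all(dim(latent_rate) == c(N_age, N_t, N_draws, N_chains)))
```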
If you are more comfortable in Python, you can do exactly the same steps:
```python
import arviz as az
import numpy as np
import xarray as xr
from scipy.special import expit

idata = az.from_netcdf("output/model_age_time_interaction_samples.nc")

# check convergence
summary = az.summary(
    idata,
    round_to=2,
    var_names=["sigma_rw_age", "sigma_rw_age_time", "slope", "age_drift", "age_time_drift"],
    filter_vars="regex",
)
print(summary["r_hat"].max())


def calculate_mx(idata):
    # assumed posterior dims: slope (chain, draw), age_drift (chain, draw, age),
    # age_time_drift (chain, draw, time - 1, age)
    N_t = idata.posterior.sizes["age_time_drift_dim_0"] + 1
    # cumulative slope: zero in the first year, slope * (t - 1) afterwards
    slope_cum = idata.posterior.slope.to_numpy()[..., None, None] * np.arange(N_t)
    # cumulative sum of age drifts -> age effect, trailing axis added for broadcasting over time
    age_effect = np.cumsum(idata.posterior.age_drift.to_numpy(), axis=-1)[..., None]
    # cumulative sum of interaction drifts over time, padded with zeros for the first year
    age_time_effect = np.pad(
        np.cumsum(idata.posterior.age_time_drift.to_numpy(), axis=-2),
        [(0, 0), (0, 0), (1, 0), (0, 0)],
    )
    # reorder to (chain, draw, age, time) and sum the three components
    latent_rate = slope_cum + age_effect + np.swapaxes(age_time_effect, -1, -2)
    return expit(latent_rate)


latent_rate = calculate_mx(idata)
latent_rate = latent_rate.reshape(-1, *latent_rate.shape[-2:])  # stack draws and chains on top of each other

samples = xr.DataArray(
    latent_rate,
    coords={
        "sample": np.arange(latent_rate.shape[0]),
        "age": np.arange(0, 85 + 1, 5),
        "time": np.arange(1990, 2020 + 1),
    },
    dims=["sample", "age", "time"],
)
samples.to_netcdf("output/model_age_time_interaction_rate.nc")
```
Calculating health metrics
When presenting age-specific rates, we would not present all the samples. Instead, we would summarise over the samples to get a summary measure and credible intervals.
```r
posterior_median <- apply(X = plogis(latent_rate), MARGIN = c(1, 2), FUN = median)
posterior_q025 <- apply(X = plogis(latent_rate), MARGIN = c(1, 2), FUN = quantile, probs = 0.025)
posterior_q975 <- apply(X = plogis(latent_rate), MARGIN = c(1, 2), FUN = quantile, probs = 0.975)
```
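Equivalently, a single `apply` call can compute all three summaries in one pass; because `quantile` returns a vector, the quantiles land on the first dimension of the result:

```r
posterior_summary <- apply(
  X = plogis(latent_rate),
  MARGIN = c(1, 2),
  FUN = quantile,
  probs = c(0.025, 0.5, 0.975)
)
# dim(posterior_summary) = (3, N_age, N_t)
```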
Often, we don't present age-specific rates, and instead want to collapse over the age dimension for ease of presentation.
This is where arrays really come into play.
We can `apply` functions over the age dimension and preserve all other dimensions, before summarising into quantiles.
Below is an example for the probability of dying:
```r
library(future.apply)

probs <- future_apply(
  X = death_rates,
  MARGIN = c(2, 3, 4), # apply over the age dimension (1), preserving the other dimensions
  FUN = nqx,
  age = seq(0, 85, 5),
  ax = rep(NA, 18),
  n = 80,
  x = 0
)
# name the dimensions for ease of access later
# dimnames(probs) <- list(...)
```
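`nqx` here is the repository's probability-of-dying function (see the links at the bottom of this page). For orientation only, a minimal non-parallelised sketch of the standard life-table calculation behind it might look like the following; the signature matches the call above, but the `ax` handling is illustrative, not necessarily the repository's implementation:

```r
# Sketch: probability of dying between exact ages x and x + n,
# given age-specific death rates mx for the closed age groups.
nqx <- function(mx, age, ax, n, x) {
  width <- diff(age)                    # widths of the closed age groups
  keep <- which(age >= x & age < x + n) # groups covering [x, x + n)
  a <- ifelse(is.na(ax[keep]), width[keep] / 2, ax[keep]) # mid-interval default
  qx <- (width[keep] * mx[keep]) / (1 + (width[keep] - a) * mx[keep])
  1 - prod(1 - qx)                      # probability of dying in [x, x + n)
}
```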
Oftentimes, this is computationally intensive.
We can parallelise the operation by setting up a parallel backend and replacing `apply` with `future_apply`.
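A minimal setup might look like this (the backend and worker count are illustrative; choose whatever suits your machine or cluster):

```r
library(future.apply)

# run future_apply calls across four background R processes
plan(multisession, workers = 4)
```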
There are other vectorised functions which work well with arrays.
For example, below is how to use `sweep` to calculate age-standardised rates.
```r
# standard population of the same length as the age dimension
population

# age-standardised death rates
# sweep multiplies the rates along the age dimension (1) by the vector population
deaths <- sweep(death_rates, 1, population, "*")
asdr <- apply(deaths, MARGIN = c(2, 3, 4), FUN = sum) / sum(population)
```
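The resulting `asdr` array keeps the time, draw, and chain dimensions, so it can be summarised exactly like the age-specific rates above. For example, a posterior median time series:

```r
asdr_median <- apply(asdr, MARGIN = 1, FUN = median) # one value per year
```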
Here are links to parallelised examples for calculating life expectancy (function), probability of dying (function), and mean age at death.