Postprocessing model estimates
Once we have death rates from a model, we want to convert them into usable quantities (e.g. the probability of dying, life expectancy).
Extracting rates
The first step is to extract the model quantities that we need to recreate the death rate. There are two ways of doing this:

- Monitor the rates in the model using `numpyro.deterministic("rates", latent_rate)`, which is expensive and memory-intensive but fine for small models.
- Remake the latent rates from the sampled variables. An example is below. The basic idea is to get the data into an array in R, which makes it easy to calculate summary statistics along different axes using `apply`. Note that for really large model outputs, you might want to extract only the variables required for the paper (e.g. just the first and last year).
```r
library(RNetCDF)
library(tidyverse)

nc <- open.nc("output/model_age_time_interaction_samples.nc")

vars <- c("slope", "age_drift", "age_time_drift")

N_t <- 31
time <- 1990:2020
N_age <- 18
ages <- seq(0, 85, 5)
N_draws <- 500
N_chains <- 2

posterior <- list()
for (var in vars) {
  # dim(var.get.nc(nc, var)) = (..., draws, chains)
  posterior[[var]] <- var.get.nc(nc, var)
  print(dim(var.get.nc(nc, var)))
}

# relevant line from the model file:
# latent_rate = slope_cum + age_effect + age_time_effect
posterior$age_effect <- apply(X = posterior$age_drift, MARGIN = c(2, 3), FUN = cumsum)
posterior$age_time_effect <- apply(X = posterior$age_time_drift, MARGIN = c(2, 3, 4), FUN = cumsum)

latent_rate <- array(
  data = NA,
  dim = c(N_age, N_t, N_draws, N_chains),
  dimnames = list(
    ages,
    time,
    1:N_draws,
    1:N_chains
  )
)
for (a in 1:N_age) {
  # first year: the slope and interaction contributions are zero
  latent_rate[a, 1, , ] <- posterior$age_effect[a, , ]
  # all other years: sum all three components of the model equation
  for (t in 2:N_t) {
    latent_rate[a, t, , ] <- posterior$slope * (t - 1) +
      posterior$age_effect[a, , ] +
      posterior$age_time_effect[t - 1, a, , ]
  }
}

write_rds(
  # optionally transform back to outcome space
  plogis(latent_rate),
  "output/model_age_time_interaction_rate.rds"
)
```
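A quick sanity check that the reconstruction produced the expected dimensions (using the constants defined above):

```r
stopifnot(all(dim(latent_rate) == c(N_age, N_t, N_draws, N_chains)))
```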
If you are more comfortable in Python, you can do exactly the same steps:
```python
import arviz as az
import numpy as np
import xarray as xr
from scipy.special import expit

idata = az.from_netcdf("output/model_age_time_interaction_samples.nc")

# check convergence
summary = az.summary(
    idata,
    round_to=2,
    var_names=["sigma_rw_age", "sigma_rw_age_time", "slope", "age_drift", "age_time_drift"],
    filter_vars="regex",
)
print(summary["r_hat"].max())


def calculate_mx(idata):
    # assumed posterior dims: slope (chain, draw), age_drift (chain, draw, age),
    # age_time_drift (chain, draw, time - 1, age)
    N_t = idata.posterior.sizes["age_time_drift_dim_0"] + 1
    # cumulative slope: zero in the first year, slope * (t - 1) afterwards
    slope_cum = idata.posterior.slope.to_numpy()[..., None, None] * np.arange(N_t)
    # cumulative sum of age drifts -> age effect, trailing axis added for broadcasting over time
    age_effect = np.cumsum(idata.posterior.age_drift.to_numpy(), axis=-1)[..., None]
    # cumulative sum of interaction drifts over time, padded with zeros for the first year
    age_time_effect = np.pad(
        np.cumsum(idata.posterior.age_time_drift.to_numpy(), axis=-2),
        [(0, 0), (0, 0), (1, 0), (0, 0)],
    )
    # reorder to (chain, draw, age, time) and sum the three components
    latent_rate = slope_cum + age_effect + np.swapaxes(age_time_effect, -1, -2)
    return expit(latent_rate)


latent_rate = calculate_mx(idata)
latent_rate = latent_rate.reshape(-1, *latent_rate.shape[-2:])  # stack draws and chains on top of each other

samples = xr.DataArray(
    latent_rate,
    coords={
        "sample": np.arange(latent_rate.shape[0]),
        "age": np.arange(0, 85 + 1, 5),
        "time": np.arange(1990, 2020 + 1),
    },
    dims=["sample", "age", "time"],
)
samples.to_netcdf("output/model_age_time_interaction_rate.nc")
```
Calculating health metrics
When presenting age-specific rates, we would not present all the samples. Instead, we would summarise over the samples to get a summary measure and credible intervals.
```r
posterior_median <- apply(X = plogis(latent_rate), MARGIN = c(1, 2), FUN = median)
posterior_q025 <- apply(X = plogis(latent_rate), MARGIN = c(1, 2), FUN = quantile, probs = 0.025)
posterior_q975 <- apply(X = plogis(latent_rate), MARGIN = c(1, 2), FUN = quantile, probs = 0.975)
```
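Equivalently, a single `apply` call can compute all three summaries in one pass; because `quantile` returns a vector, the quantiles land on the first dimension of the result:

```r
posterior_summary <- apply(
  X = plogis(latent_rate),
  MARGIN = c(1, 2),
  FUN = quantile,
  probs = c(0.025, 0.5, 0.975)
)
# dim(posterior_summary) = (3, N_age, N_t)
```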
Often, we don't present age-specific rates, and instead want to collapse over the age dimension for ease of presentation.
This is where arrays really come into play.
We can `apply` functions over the age dimension and preserve all other dimensions, before summarising into quantiles.
Below is an example for the probability of dying:
```r
library(future.apply)

probs <- future_apply(
  X = death_rates,
  MARGIN = c(2, 3, 4), # apply over the age dimension (1), preserving the other dimensions
  FUN = nqx,
  age = seq(0, 85, 5),
  ax = rep(NA, 18),
  n = 80,
  x = 0
)
# name the dimensions for ease of access later
# dimnames(probs) <- list(...)
```
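`nqx` here is the repository's probability-of-dying function (see the links at the bottom of this page). For orientation only, a minimal non-parallelised sketch of the standard life-table calculation behind it might look like the following; the signature matches the call above, but the `ax` handling is illustrative, not necessarily the repository's implementation:

```r
# Sketch: probability of dying between exact ages x and x + n,
# given age-specific death rates mx for the closed age groups.
nqx <- function(mx, age, ax, n, x) {
  width <- diff(age)                    # widths of the closed age groups
  keep <- which(age >= x & age < x + n) # groups covering [x, x + n)
  a <- ifelse(is.na(ax[keep]), width[keep] / 2, ax[keep]) # mid-interval default
  qx <- (width[keep] * mx[keep]) / (1 + (width[keep] - a) * mx[keep])
  1 - prod(1 - qx)                      # probability of dying in [x, x + n)
}
```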
Oftentimes, this is computationally intensive.
We can parallelise the operation by setting up a parallel backend and replacing `apply` with `future_apply`.
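A minimal setup might look like this (the backend and worker count are illustrative; choose whatever suits your machine or cluster):

```r
library(future.apply)

# run future_apply calls across four background R processes
plan(multisession, workers = 4)
```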
There are other vectorised functions which work well with arrays.
For example, below is how to use `sweep` to calculate age-standardised rates.
```r
# standard population of the same length as the age dimension
population

# age-standardised death rates
# sweep multiplies the rates along the age dimension (1) by the vector population
deaths <- sweep(death_rates, 1, population, "*")
asdr <- apply(deaths, MARGIN = c(2, 3, 4), FUN = sum) / sum(population)
```
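The resulting `asdr` array keeps the time, draw, and chain dimensions, so it can be summarised exactly like the age-specific rates above. For example, a posterior median time series:

```r
asdr_median <- apply(asdr, MARGIN = 1, FUN = median) # one value per year
```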
Here are links to parallelised examples for calculating life expectancy (function), probability of dying (function), and mean age at death.