Configuration file - HopkinsIDD/cholera-mapping-pipeline GitHub Wiki

Here is an example configuration file with the baseline model settings. It will be kept up to date with the version of the pipeline in dev. Last updated 6 Nov 2023.

name: custom_name
countries: ['405']
countries_name: ['KEN']
aoi: raw
res_space: 20
res_time: '1 years'
grid_rand_effects_N: 1
case_definition: 'suspected'
start_time: '2016-01-01'
end_time: '2020-12-31'
data_source: 'sql'
ovrt_metadata_table: no
OCs: ~
summary_admin_levels:
- 0.0
- 1.0
- 2.0
covariate_choices: ~
adjust_pop_UN: yes
obs_model: 3
inv_od_sd_adm0: 0.01
inv_od_sd_nopool: 1.0
mu_sd_w: 10.0
sd_sd_w: 3.0
ncpus_parallel_prep: 2
do_parallel_prep: yes
drop_multiyear_adm0: yes
drop_censored_adm0: yes
drop_censored_adm0_thresh: 2.0
time_effect: yes
time_effect_autocorr: no
spatial_effect: yes
do_sd_w_mixture: yes
use_intercept: no
beta_sigma_scale: 1.0
sigma_eta_scale: 1.0
mu_alpha: 0.0
sd_alpha: 1.0
exp_prior: no
do_infer_sd_eta: no
do_zerosum_cnst: yes
use_weights: no
use_rho_prior: yes
covar_warmup: yes
warmup: yes
aggregate: yes
tfrac_thresh: 0
censoring: yes
censoring_thresh: 1
set_tfrac: yes
snap_tol: 0.0191781
use_pop_weight: yes
sfrac_thresh_border: 0.3
sfrac_thresh_conn: 0.05
ingest_covariates: no
ingest_new_covariates: no
drop_low_pop_lps: yes
drop_low_pop_lps_thresh: 15
stan:
  ncores: 4
  model: 'mapping_model_inference.stan'
  genquant: 'mapping_model_generate.stan'
  iter_warmup: 1000
  iter_sampling: 1000
  recompile: yes
file_names:
  observations_filename: KEN_2016-01-01_2020-12-31_custom_name_<hash>.preprocess.rdata
  covariate_filename: KEN_2016-01-01_2020-12-31_custom_name_<hash>.covar.rdata
  stan_input_filename: KEN_2016-01-01_2020-12-31_custom_name_<hash>.stan_input.rdata
  initial_values_filename: KEN_2016-01-01_2020-12-31_custom_name_<hash>.initial_values.rdata
  stan_output_filename: KEN_2016-01-01_2020-12-31_custom_name_<hash>.stan_output.rdata
  stan_genquant_filename: KEN_2016-01-01_2020-12-31_custom_name_<hash>.stan_genquant.rds
  country_data_report_filename: KEN_2016-01-01_2020-12-31_custom_name_<hash>.country-data-report.html
  data_comparison_report_filename: KEN_2016-01-01_2020-12-31_custom_name_<hash>.data-comparison-reports.html

In the file names above, <hash> refers to an md5sum hash of the entire config above, which is created automatically when using the config writer script in Analysis/R/write_batch_mapping_config_general.R.
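
As a rough illustration of how a file name of this shape could be assembled, the Python sketch below hashes a config's text with md5 and fills in the hash slot. This is not the logic of write_batch_mapping_config_general.R: the exact text the writer script hashes is not documented here, so hashing the UTF-8 config text is an assumption, and build_filename is a hypothetical helper.

```python
import hashlib


def build_filename(config_text: str, country: str, start: str, end: str,
                   name: str, suffix: str) -> str:
    """Return '<country>_<start>_<end>_<name>_<md5>.<suffix>'.

    Assumption: the hash is the md5 hex digest of the config text (UTF-8).
    """
    digest = hashlib.md5(config_text.encode("utf-8")).hexdigest()
    return f"{country}_{start}_{end}_{name}_{digest}.{suffix}"


# Toy config text; a real config would contain every setting listed above.
config_text = "name: custom_name\ncountries_name: ['KEN']\n"
fname = build_filename(config_text, "KEN", "2016-01-01", "2020-12-31",
                       "custom_name", "preprocess.rdata")
print(fname)
```

Because the digest depends on every character of the hashed text, any change to a setting would yield a different hash and therefore a different set of file names.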

Here is an example of how to specify a model with covariates and covariate transformations.

covariate_choices: ['dist_to_water', 'dist_to_coast', 'water_access', 'san_access']
covariate_transformations: 
 - name: 'pop_1_years_20_20'
   transform_name: 'log_pop'
   transform_function: !expr |
    .Primitive("log")
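
To make the intent of a transformation entry concrete, here is a minimal Python sketch of what applying it amounts to conceptually: take the covariate called name, apply the function, and store the result under transform_name. The pipeline itself does this in R; the dictionary standing in for the covar_cube, the helper apply_transformation, and the choice to keep the original column alongside the new one are all assumptions made for illustration.

```python
import math

# Hypothetical stand-in for one covariate column of the covar_cube.
covariates = {"pop_1_years_20_20": [1200.0, 35000.0, 480.0]}

# Mirrors the YAML entry above: source name, new name, and function.
transformation = {
    "name": "pop_1_years_20_20",
    "transform_name": "log_pop",
    "transform_function": math.log,  # natural log, like R's log()
}


def apply_transformation(covars, spec):
    """Return a copy of covars with the transformed covariate added."""
    out = dict(covars)
    func = spec["transform_function"]
    out[spec["transform_name"]] = [func(v) for v in covars[spec["name"]]]
    return out


transformed = apply_transformation(covariates, transformation)
print(sorted(transformed))
```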

Argument dictionary

Basic model specification

  • name: custom name for model run; does not need to include the country ISO code as countries_name will have this information
  • countries: location ID in the middle distance database
  • countries_name: country ISO code
  • aoi: area of interest; change from the default value only for testing purposes. At baseline, this is "raw".
  • res_space: spatial resolution of the model in km
  • res_time: temporal resolution of the model. At baseline, this is "1 years".
  • grid_rand_effects_N: number of time slices of the spatial random effects; formerly named smoothing_period. Values different than 1 are not currently supported. (default: 1)
  • case_definition: suspected means sCh will be used for modeling. At baseline, this is "suspected" and no other options are currently valid.
  • start_time: modeling start date, format 'yyyy-mm-dd'
  • end_time: modeling end date, format 'yyyy-mm-dd'
  • data_source: At baseline, this is 'sql'. The API is not currently functional for mapping pipeline purposes.
  • ovrt_metadata_table: yes/no overwrite metadata table; should always be no except for integration tests (default: no)
  • summary_admin_levels: Vector of summary admin levels for genquant output shapefiles (default: [0,1,2])
  • adjust_pop_UN: yes/no Should we adjust the total country population to match the UN WPP 2022 estimates for that year (default: yes)

Optional arguments

  • OCs: list of OCs that should be used as model input data; if none are listed, all available OCs are used (no default)
  • taxonomy: 1/11/23 I believe this is a vestigial argument that no longer needs to be included in configs. Formerly, "taxonomy-working/working-entry1"

Model Structure

  • covariate_choices: A list of covariates to include in the model. If no values are provided, it is a no-covariate model; refer to https://github.com/HopkinsIDD/cholera-covariates/blob/main/covariate_dictionary.yml
  • obs_model: Designate which observation model should be used in the Stan model; 1: poisson, 2: quasipoisson, 3: negative binomial (default: 1)
  • inv_od_sd_adm0: the sd of the prior of the inverse dispersion on the national level observations
  • inv_od_sd_nopool: the sd of the prior of the inverse dispersion on the subnational level observations when there is no pooling
  • h_mu_sd_inv_od: the sd of the prior of the mean of the hierarchical inverse dispersion on the subnational level observations when there is pooling
  • h_sd_sd_inv_od: the sd of the prior of the sd of the hierarchical inverse dispersion on the subnational level observations when there is pooling
  • spatial_effect: yes/no Should we include a spatial random effect in the model (default: yes)
  • mu_sd_w: the mean of the hyperprior for the standard deviation of w, the spatial random effect (default: 10)
  • sd_sd_w: the standard deviation of the hyperprior for the standard deviation of w (default: 3)
  • time_effect: yes/no Is there a random effect for each time slice? At baseline, this should be yes. (default: no)
  • time_effect_autocorr: yes/no Is there assumed autocorrelation between time slices, modeled with a zero-sum constraint on etas? At baseline, this should be no. (default: no)
  • use_intercept: 0/1 Should a global intercept be used (1) or not (0) in the model? (default: 0)

Optional arguments

  • covariate_transformations: a list of covariates that need to be transformed (e.g., to log scale). 'name' is the name of the covariate to be transformed, as it appears in the covar_cube. 'transform_name' is the name of the newly transformed covariate. 'transform_function' specifies the transformation function for that covariate; the format is "!expr |" followed by the actual function. At baseline, no covariate transformations are applied.

Priors

  • beta_sigma_scale: precision for the prior on regression coefficients (default: 1)
  • sigma_eta_scale: value for scaling the time slice random effects (default: 1) The baseline value was initially 5. 11/2022: considering moving to 2 (narrower eta prior)
  • mu_alpha: the mean of the prior for the intercept parameter Alpha (default: 0)
  • sd_alpha: the standard deviation of the prior for the intercept parameter Alpha (default: 1)
  • use_rho_prior: whether or not to use the rho prior (default: no/F)
  • exp_prior: Should a double exponential prior be applied to covariate coefficients? Useful when considering many covariates at once and want a shrinkage prior (default: no)
  • do_infer_sd_eta: Should the SD of the prior of the etas be inferred? If this is set to 0, the config parameter sigma_eta_scale is used as SD of the prior. If this is set to 1, the config parameter is inferred. (default: 0)
  • do_zerosum_cnst: Should there be a soft zero-sum constraint on etas for the non-temporal auto-correlation model? 0 means no. 1 means yes. (default: 0)
  • do_sd_w_mixture: yes/no Should we use a mixture prior on std dev of w? If yes, the sigmas on std dev of w are estimated. If no, the prior on std dev of w is normal(5, 0.5) and std dev of w is fixed (default: yes)

Optional arguments

  • use_weights: Apply alternate weights to observations contributing to the likelihood. 8/2022 This option is obsolete as we do not anticipate using this method. (default: no)

GAM Warmup

  • covar_warmup: yes/no Should the GAM warmup be run with covariates? (default: yes)
  • warmup: yes/no should the GAM warmup be run (default: yes)

Observation Data Processing

The operations below are processed in the order listed in the old pipeline (where applicable). For example, since tfrac_thresh is applied after aggregate, the threshold is checked against aggregated observations, and only those with a tfrac below tfrac_thresh are dropped from modeling.

  • aggregate: yes/no Is there an attempt to aggregate observations to res_time? At baseline, this should be yes. (default: yes)
  • tfrac_thresh: All observations with a tfrac below tfrac_thresh are dropped from the model. Do not include this argument unless you want to use it. 08/2022: We decided to use 0.25 at baseline in our models (i.e., drop observations with tfrac below 0.25), but the default value was not changed in the code. 12/2022: We changed our mind and decided to try and use 0 at baseline in models moving forward. (default: 0)
  • drop_multiyear_adm0: yes/no Should we drop multi-year admin 0 observations (default: no)
  • drop_censored_adm0: yes/no Should we drop censored admin 0 observations if there are other full observations that exist? This applies only to non-multi-year admin 0 observations. (default: yes)
  • drop_censored_adm0_thresh: must be a value greater than 1. A censored observation is dropped if its case count multiplied by drop_censored_adm0_thresh is still less than the case count of the maximum full observation. E.g., the max full observation has 1000 cases, drop_censored_adm0_thresh is 5, and the censored observation has 199 cases --> 199*5 < 1000 --> drop the censored observation (default: 2)
  • censoring: yes/no Should observations below the censoring_thresh contribute to the model likelihood as censored observations? (default: no)
  • censoring_thresh: All observations with a tfrac below censoring_thresh are considered censored observations by the model. (default: 0.95)
  • set_tfrac: yes/no Should all tfrac values be overwritten with the value of 1? When specified in conjunction with a model with censoring, this value is assigned only to observations above censoring_thresh. (default: no)
  • snap_tol: Two-year observations have two tfracs in a model with yearly time slices, one for each year it spans. If the smaller tfrac for a multi-year observation falls below the snap_tol value, the observation's TL or TR will be changed such that the new, modified observation time range will fall solely within the year with the larger tfrac. If aggregation is on, tfrac "snapping" occurs twice - both before and after aggregation. For example, with a snap_tol = 7/365, an observation from 2020-12-30 to 2021-01-05 (left tfrac = 2/365, right tfrac = 5/365) will be changed to have a time range of 2021-01-01 to 2021-01-05. (default: 7/365)
  • ncpus_parallel_prep: Number of cpus that should be assigned to data processing parallelization. The argument here should match the SLURM shell script --cpus-per-task (-c).
  • do_parallel_prep: yes/no Toggle data processing parallelization on and off
  • drop_low_pop_lps: TRUE / FALSE Drop location periods with a total population below drop_low_pop_lps_thresh (default: FALSE)
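
The drop_censored_adm0_thresh rule in the list above can be written as a one-line comparison. The Python sketch below only restates the worked example from that bullet; the function name should_drop_censored is hypothetical, and the pipeline's own implementation is in R.

```python
def should_drop_censored(censored_cases: float, max_full_cases: float,
                         thresh: float = 2.0) -> bool:
    """True if a censored admin 0 observation is small enough to drop.

    The observation is dropped when censored_cases * thresh is strictly
    less than the case count of the largest full observation.
    """
    if thresh <= 1:
        raise ValueError("drop_censored_adm0_thresh must be greater than 1")
    return censored_cases * thresh < max_full_cases


# Worked example from the text: 199 * 5 = 995 < 1000, so drop.
print(should_drop_censored(199, 1000, thresh=5))
# One more case and 200 * 5 = 1000 is no longer below 1000, so keep.
print(should_drop_censored(200, 1000, thresh=5))
```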

Optional arguments

  • drop_low_pop_lps_thresh: Population threshold below which a location period will get dropped from the model. Dependent on drop_low_pop_lps = TRUE (default: 15)
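
The snap_tol behaviour described earlier in this section is easier to follow with dates in hand. The Python sketch below reproduces the worked example for yearly time slices under several assumptions: tfracs use a fixed 365-day denominator (as in the example), only observations spanning exactly two calendar years are handled, and ties go to the left year. The function snap_observation is hypothetical and is not the pipeline's R code.

```python
from datetime import date

DAYS_PER_YEAR = 365  # assumption: fixed denominator, as in the worked example


def snap_observation(tl: date, tr: date, snap_tol: float = 7 / 365):
    """Snap a two-year observation into the year with the larger tfrac."""
    if tr.year - tl.year != 1:
        return tl, tr  # only observations spanning two years are considered
    # Days covered in each year, counting both end points.
    left_days = (date(tl.year, 12, 31) - tl).days + 1
    right_days = (tr - date(tr.year, 1, 1)).days + 1
    left_tfrac = left_days / DAYS_PER_YEAR
    right_tfrac = right_days / DAYS_PER_YEAR
    if min(left_tfrac, right_tfrac) >= snap_tol:
        return tl, tr  # both tfracs are large enough; leave unchanged
    if left_tfrac < right_tfrac:
        return date(tr.year, 1, 1), tr  # move TL into the right-hand year
    return tl, date(tl.year, 12, 31)  # move TR into the left-hand year


# Example from the text: left tfrac 2/365 is below 7/365, so TL is moved.
print(snap_observation(date(2020, 12, 30), date(2021, 1, 5)))
```

With snap_tol = 7/365 the example observation ends up with a time range of 2021-01-01 to 2021-01-05, matching the description in the snap_tol bullet.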

Spatial Grid Settings

  • use_pop_weight: Weight 20 x 20 km grid cells according to their 1 x 1 km subgridded population weight. A weight of 0.9 means that 90% of the population in the 20 x 20 km grid cell should be used as the denominator in the incidence rate calculation for that grid cell (i.e., 10% of the 1 x 1 km subgrid cells fall outside of the observation's location period border). 12/2022: While this option is still configurable in the R data processing pipeline, it has been removed from the Stan model. That is to say, the population weight feature is always on and cannot be turned off in the Stan model code. We should treat this option as obsolete. (default: yes)
  • sfrac_thresh_border: This parameter is used to drop grid cells from the master grid if the proportion of spatial overlap (based on pop-weighted area) with a location period does not exceed the threshold. (default: 1e-3)
  • sfrac_thresh_conn: This parameter is used to unlink cells from location periods if the proportion of spatial overlap (based on pop-weighted area) does not exceed the threshold. (default: 1e-3)
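
To illustrate the population weight described under use_pop_weight, the Python sketch below computes the share of a grid cell's population that falls inside a location period and the resulting incidence-rate denominator. It assumes the weight is the population of the 1 x 1 km subcells inside the border divided by the total population of the cell; the helper pop_weight is hypothetical, and only 10 subcells are used instead of the 400 in a real 20 x 20 km cell.

```python
def pop_weight(subcell_pops, inside_flags):
    """Fraction of a grid cell's population lying inside the location period."""
    total = sum(subcell_pops)
    if total == 0:
        return 0.0  # assumption: an unpopulated cell gets weight 0
    inside = sum(p for p, is_inside in zip(subcell_pops, inside_flags)
                 if is_inside)
    return inside / total


# Ten equally populated subcells, one of which falls outside the border.
subcell_pops = [100.0] * 10
inside_flags = [True] * 9 + [False]

weight = pop_weight(subcell_pops, inside_flags)
denominator = weight * sum(subcell_pops)  # population used for incidence
print(weight, denominator)
```

With equally populated subcells, weighting by population and weighting by the count of subcells give the same answer; they differ only when population is unevenly spread within the cell.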

Covariate Ingestion

  • ingest_covariates: ingest covariates that are not previously ingested, including new aggregations of existing covariates; used only for covariate ingestion runs (default: no)
  • ingest_new_covariates: creates metadata tables for covariates that have not been previously ingested in any form; used only for covariate ingestion runs (default: no)

Stan

All Stan arguments should be specified under a stan section of the config.

  • ncores: Number of cores to use when running Stan. At baseline, 4
  • model: Name of the Stan model file for basic modeling. At baseline, "mapping_model_inference.stan"
  • genquant: Name of the Stan model file with the generated quantities section. At baseline, "mapping_model_generate.stan"
  • iter_warmup: Number of warmup iterations to run (default: 1100)
  • iter_sampling: Number of sampling iterations to run for the official Stan model (default: 1000)
  • recompile: Indicates whether the Stan model code should be recompiled at runtime. At baseline, "yes"

Optional arguments

  • debug: Turn on Stan debugging (default: no)

Custom filenames

The user may specify custom filenames; otherwise, default names will be constructed according to the model settings. All filenames should be listed without the path but with the file extension. If using the config writer ...

Optional arguments

  • observations_filename: Name of the preprocess rdata file
  • covariate_filename: Name of the covar rdata file
  • stan_input_filename: Name of the stan input rdata file
  • initial_values_filename: Name of the initial values rdata file
  • stan_output_filename: Name of the stan output rdata file
  • stan_genquant_filename: Name of the generated quantities rds file
  • country_data_report_filename: Name of the country data report html file
  • data_comparison_report_filename: Name of the data comparison report html file

Writing a config file

The write_batch_mapping_config_general.R script, under the Analysis/R folder in the cholera mapping pipeline directory, is used to generate config files automatically. Configs can be generated as a batch using this script. Cholera_directory, config path, countries, and cholera start and end dates must be specified by the user. The remaining parameters need to be set for each run.
