Covariate database maintenance - HopkinsIDD/cholera-mapping-pipeline GitHub Wiki

Ingesting covariates into the database

System setup

Covariate ingestion requires the following additional R packages to be installed:

  • ncdf4
  • doParallel
  • rts
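
One way to install only the packages that are still missing is sketched below. It assumes the first package in the list is ncdf4 (the NetCDF interface on CRAN) and that all three packages can be installed from a CRAN mirror; the helper names r_vector and install_expr are hypothetical. The block only prints the R expression, so nothing is installed until it is passed to Rscript.

```shell
# Sketch: build an R expression that installs only the missing packages.
# Assumption: all three packages are available from a CRAN mirror.
PACKAGES="ncdf4 doParallel rts"

# Turn the arguments a b c into the R vector literal c("a","b","c").
r_vector() {
  printf 'c('
  sep=''
  for pkg in "$@"; do
    printf '%s"%s"' "$sep" "$pkg"
    sep=','
  done
  printf ')'
}

# Print the R code; setdiff() keeps only the packages that are not installed yet.
install_expr() {
  printf 'todo <- setdiff(%s, rownames(installed.packages())); ' "$(r_vector "$@")"
  printf 'if (length(todo) > 0) install.packages(todo, repos = "https://cloud.r-project.org")\n'
}

# Show the expression. To actually install, run:
#   Rscript -e "$(install_expr $PACKAGES)"
install_expr $PACKAGES
```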

Preparation

  • Copy the covariate files from branch dev_jk of the cholera-covariates repo into a new folder called Layers, placed in the main pipeline folder. The subfolders of the copied Layers folder match the covariate names defined in covariate_dictionary.yml. These files are the source for covariate ingestion. Currently, the 2020 raw population data is also used for 2021, 2022, 2023 and 2024.
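
After copying, it can help to confirm that every expected covariate subfolder is present. The sketch below is one possible check; the function name missing_layers and the covariate names in the usage comment are placeholders, and the real names should be taken from covariate_dictionary.yml.

```shell
# Sketch: print each expected covariate subfolder that is absent from Layers.
# Usage: missing_layers <layers_dir> <name> [<name> ...]
missing_layers() {
  layers_dir=$1
  shift
  for name in "$@"; do
    [ -d "$layers_dir/$name" ] || printf '%s\n' "$name"
  done
}

# Example with placeholder covariate names (replace with the dictionary entries):
#   missing_layers ./Layers pop elevation
```

No output means that every listed subfolder exists.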

Overview of ingestion procedure

  1. For covariate data whose raw resolution is the 1 km grid scale, first create the 1 km x 1 km grid table public.grid_1_1.
     Rscript Analysis/R/prepare_grid_cmd.R -d ./ -u database_user_name -r 1 -i TRUE

  2. Then ingest the covariate at the 1 km x 1 km scale for all years available in the Layers folder.
     Rscript .../cholera-mapping-pipeline/Analysis/R/prepare_covariates_cmd.R -u database_user_name -d .../cholera-mapping-pipeline/Layers -r 1 -t 1 years -i TRUE -o TRUE -x TRUE -m TRUE -c p -g public.grid_1_1 -p FALSE

  3. For covariate data whose raw resolution is the 5 km grid scale, repeat the two steps above to create public.grid_5_5 and ingest the data at the 5 km scale.

  4. Because cholera maps are generated at the 20 km grid scale, always create the 20 km x 20 km grid table public.grid_20_20.
     Rscript .../Analysis/R/prepare_grid_cmd.R -u database_user_name -d .../Layers -r 20 -i TRUE

  5. Aggregate the 1 km or 5 km gridded covariate to the 20 km x 20 km scale for all years in the Layers folder. Note that the spatial aggregation function (mean or sum) is already defined in the covariate_dictionary.yml file.
     Rscript .../cholera-mapping-pipeline/Analysis/R/prepare_covariates_cmd.R -u database_user_name -d .../cholera-mapping-pipeline/Layers -r 1 -t 1 years -i TRUE -o TRUE -x TRUE -m TRUE -c p -g public.grid_1_1 -p FALSE
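
Steps 1 and 2 can be chained so that ingestion starts only if grid creation succeeded. The following is a minimal sketch, not the pipeline's own tooling: it assumes the commands are run from the main pipeline folder, and DB_USER, DRY_RUN, run and grid_and_ingest are hypothetical names. The Rscript flags are copied from the steps above. By default it only prints the commands.

```shell
# Sketch of steps 1 and 2 for one raw resolution (1 or 5 km).
# Assumption: the working directory is the main pipeline folder.
DB_USER=${DB_USER:-database_user_name}
DRY_RUN=${DRY_RUN:-1}   # 1 = only print the commands, 0 = execute them

run() {
  if [ "$DRY_RUN" = 1 ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

grid_and_ingest() {
  res=$1
  grid="public.grid_${res}_${res}"
  # Step 1: create the grid table; stop here if that fails.
  run Rscript Analysis/R/prepare_grid_cmd.R -d ./ -u "$DB_USER" -r "$res" -i TRUE || return 1
  # Step 2: ingest the covariate on that grid (flags as in step 2 above).
  run Rscript Analysis/R/prepare_covariates_cmd.R -u "$DB_USER" -d ./Layers \
    -r "$res" -t 1 years -i TRUE -o TRUE -x TRUE -m TRUE -c p -g "$grid" -p FALSE
}

grid_and_ingest 1
```

Calling grid_and_ingest 5 covers step 3 in the same way. Set DRY_RUN=0 once the printed commands look right.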

For population covariate ingestion only

Population is a critical variable for modeling incidence. The cholera mapping pipeline uses both 1 km and 20 km gridded population estimates and produces its modeling estimates on the 20 km grid. Consequently, population covariates need an extra ingestion step, which ensures that the population of each 20 km grid cell equals the sum of the populations of its corresponding 1 km grid cells. This validation is performed by the cholera-covariates/population_aggregation.R script in the dev_jk branch.

  1. Step 1, extraction:
     Rscript $TAXDIR/Analysis/R/population_aggregation.R -b 1 -e TRUE -s FALSE
     The database currently holds the population covariate for 2010-2024. The value of -b ranges from 1 to 15. These commands can run in parallel.
  2. Step 2, set:
     Rscript $TAXDIR/Analysis/R/population_aggregation.R -b 1 -e FALSE -s TRUE
     Run these commands sequentially, so that several processes do not write to the same table at the same time.
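
The two steps could be driven by a small loop such as the sketch below. It assumes that both steps are repeated for every -b value from 1 to 15 and that TAXDIR points at the pipeline checkout; DRY_RUN, N_YEARS, pop_step and aggregate_population are hypothetical names. Extraction jobs are started in the background and awaited, then the set jobs run one at a time.

```shell
# Sketch: parallel extraction followed by sequential set.
TAXDIR=${TAXDIR:-.}     # assumed to point at the pipeline checkout
DRY_RUN=${DRY_RUN:-1}   # 1 = only print the commands, 0 = execute them
N_YEARS=15              # -b runs from 1 to 15

# usage: pop_step <b> <extract TRUE|FALSE> <set TRUE|FALSE>
pop_step() {
  if [ "$DRY_RUN" = 1 ]; then
    printf '%s\n' "Rscript $TAXDIR/Analysis/R/population_aggregation.R -b $1 -e $2 -s $3"
  else
    Rscript "$TAXDIR/Analysis/R/population_aggregation.R" -b "$1" -e "$2" -s "$3"
  fi
}

aggregate_population() {
  # Step 1: extraction, one background job per -b value.
  pids=''
  b=1
  while [ "$b" -le "$N_YEARS" ]; do
    pop_step "$b" TRUE FALSE &
    pids="$pids $!"
    b=$((b + 1))
  done
  # Wait for each job and abort if any extraction failed.
  for pid in $pids; do
    wait "$pid" || return 1
  done

  # Step 2: set, strictly one job at a time so no two processes write the same table.
  b=1
  while [ "$b" -le "$N_YEARS" ]; do
    pop_step "$b" FALSE TRUE || return 1
    b=$((b + 1))
  done
}

aggregate_population
```

Starting all 15 extraction jobs at once may be too heavy for a shared server; on a cluster, a job array with one task per -b value would be a natural alternative.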

Metadata checks?

<Were there any special instructions about ingesting population past 2020 where the raw data was available?>

Launching an ingestion script

  1. Use Analysis/R/prepare_covariates_cmd.R to start the covariate ingestion.
    • -u User name for the cholera_covariates database.
    • -d Path to the covariate folder Layers.
    • -r Spatial resolution in km. The default is 20, meaning 20 km x 20 km.
    • -t Temporal resolution. The default is 1 years.
    • -i Flag to run the ingestion (TRUE or FALSE).
    • -x Flag to overwrite the covariates metadata table in the database (TRUE or FALSE).
    • -m Flag to re-extract metadata information.
    • -c List of covariates to use, given as comma-separated abbreviations. The abbreviations are defined in covariate_dictionary.yml.
    • -g Name of the full grid.
    • -p Flag to enable parallel processing.

For other command arguments, please check the file Analysis/R/prepare_covariates_cmd.R for more details.

  2. Set the environment variable COVARIATE_DATABASE_PASSWORD in the Slurm script. This is the user's password for the cholera_covariates database running on idmodeling2. Consult the server administrator if you do not yet have a password for this database.

  3. The user launching any covariate ingestion script must have been granted write permissions on the cholera covariate database.
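
To avoid hard-coding the password in a job script, it could be read from a private file. The sketch below shows one way to do that; the function name load_db_password and the file location in the usage comment are hypothetical, and only the variable name COVARIATE_DATABASE_PASSWORD comes from the text above.

```shell
# Sketch: export COVARIATE_DATABASE_PASSWORD from a file readable only by the user.
# Usage: load_db_password <password_file>
load_db_password() {
  pw_file=$1
  if [ ! -r "$pw_file" ]; then
    echo "cannot read password file: $pw_file" >&2
    return 1
  fi
  # Command substitution strips the trailing newline of the file.
  COVARIATE_DATABASE_PASSWORD=$(cat "$pw_file")
  export COVARIATE_DATABASE_PASSWORD
}

# In a Slurm script, before the Rscript call (file path is a placeholder):
#   load_db_password "$HOME/.covariate_db_password" || exit 1
#   Rscript Analysis/R/prepare_covariates_cmd.R -u database_user_name -d ./Layers ...
```

Keeping the file mode at 600 prevents other users on the server from reading it.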

Ingesting shapefiles into the database

The ingestion code is in Analysis/R/ingest_shapefiles_to_db.R. Use the parameter -d or --config_dir to specify the folder containing the config files. The script automatically extracts the country name from the config file. If it runs successfully, the shapefiles are saved in public.shapefiles.
Rscript .../Analysis/R/ingest_shapefiles_to_db.R -d config_folder_path
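
A small wrapper can catch a mistyped config folder before R is started. This is only a sketch: PIPELINE_DIR, DRY_RUN and ingest_shapefiles are hypothetical names, and PIPELINE_DIR stands in for the elided path prefix of the command above.

```shell
# Sketch: check that the config folder exists, then print or run the command.
# Usage: ingest_shapefiles <config_dir>
ingest_shapefiles() {
  config_dir=$1
  if [ ! -d "$config_dir" ]; then
    echo "config folder not found: $config_dir" >&2
    return 1
  fi
  set -- Rscript "${PIPELINE_DIR:-.}/Analysis/R/ingest_shapefiles_to_db.R" -d "$config_dir"
  if [ "${DRY_RUN:-1}" = 1 ]; then
    printf '%s\n' "$*"   # dry run: only show the command
  else
    "$@"
  fi
}

# Example: DRY_RUN=0 ingest_shapefiles config_folder_path
```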

For database administrators

The following permissions should be granted to users who run the mapping pipeline or ingest new covariates into the covariate database.

  • For schema covariates, users should have USAGE permission.
  • For table covariates.pop_1_years_1_1 and covariates.pop_1_years, users should have READ permission. If additional covariates are required for running the mapping pipeline, then the READ permission should be assigned for these tables too.
  • For schema public, users should have USAGE permission.
  • For schema grids, users should have USAGE permission.
  • For the table grids.master_grid, users should have READ permission.
  • For tables in schema public, users should have READ permission.
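
The list above could be turned into GRANT statements as sketched below. This assumes the covariate database is PostgreSQL, where the READ permission corresponds to the SELECT privilege; the function name grant_sql and the role name pipeline_user are placeholders. The function only prints SQL, so the statements can be reviewed before an administrator applies them.

```shell
# Sketch: print the GRANT statements for one database role.
# Assumption: PostgreSQL, where READ maps to the SELECT privilege.
# Usage: grant_sql <role>
grant_sql() {
  role=$1
  cat <<EOF
GRANT USAGE ON SCHEMA covariates, public, grids TO "$role";
GRANT SELECT ON TABLE covariates.pop_1_years_1_1, covariates.pop_1_years TO "$role";
GRANT SELECT ON TABLE grids.master_grid TO "$role";
GRANT SELECT ON ALL TABLES IN SCHEMA public TO "$role";
EOF
}

grant_sql pipeline_user

# To apply, an administrator could pipe the output into psql, for example:
#   grant_sql pipeline_user | psql -h idmodeling2 -d cholera_covariates
```

In PostgreSQL, GRANT ... ON ALL TABLES IN SCHEMA covers only tables that exist when it runs, so it would need to be repeated after new tables are added to the public schema.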