Covariate database maintenance - HopkinsIDD/cholera-mapping-pipeline GitHub Wiki
Ingesting covariates into the database
System setup
Covariate ingestion requires the following additional R packages to be installed:
- ncdf4
- doParallel
- rts
Preparation
- Copy the covariate files in branch dev_jk of the cholera-covariates repo to a new folder called Layers, which should be placed in the main pipeline folder. The subfolders in the copied Layers folder will match the covariate names defined in covariate_dictionary.yml. These files serve as the source for covariate ingestion. Currently, the raw 2020 population data is reused for 2021, 2022, 2023, and 2024.
Overview of ingestion procedure
- For covariate data available raw at the 1 km grid scale, first create the 1 km x 1 km grid table public.grid_1_1.
  Rscript Analysis/R/prepare_grid_cmd.R -d ./ -u database_user_name -r 1 -i TRUE
- Then ingest the covariate at the 1 km x 1 km scale for all years available in the Layers folder.
  Rscript .../cholera-mapping-pipeline/Analysis/R/prepare_covariates_cmd.R -u database_user_name -d .../cholera-mapping-pipeline/Layers -r 1 -t 1 years -i TRUE -o TRUE -x TRUE -m TRUE -c p -g public.grid_1_1 -p FALSE
- For covariate data available raw at the 5 km grid scale, repeat the above two steps to create public.grid_5_5 and ingest data at the 5 km scale.
- Because cholera maps are generated at the 20 km grid scale, always create the 20 km x 20 km grid table public.grid_20_20.
  Rscript .../Analysis/R/prepare_grid_cmd.R -u database_user_name -d .../Layers -r 20 -i TRUE
- Aggregate the 1 km or 5 km gridded covariate to the 20 km x 20 km scale for all years in the Layers folder. Note that the spatial aggregation function (mean or sum) is already defined in the covariates_dictionary.yml file.
  Rscript .../cholera-mapping-pipeline/Analysis/R/prepare_covariates_cmd.R -u database_user_name -d .../cholera-mapping-pipeline/Layers -r 1 -t 1 years -i TRUE -o TRUE -x TRUE -m TRUE -c p -g public.grid_1_1 -p FALSE
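The first two steps above (grid creation, then ingestion at the native resolution) differ between the 1 km and 5 km cases only in the resolution and the grid name. The following is a minimal sketch that prints both commands for a given resolution so they can be reviewed before being run. It assumes the commands are launched from the main pipeline folder, so that Layers is reachable as ./Layers. The function name native_ingest_cmds and its arguments are hypothetical helpers, not part of the pipeline. The flags are copied from the commands above, including the unquoted `-t 1 years`.

```shell
#!/usr/bin/env bash
# Sketch only: prints the two commands for one native resolution.
# native_ingest_cmds is a hypothetical helper, not a pipeline script.
# Arguments: resolution in km (1 or 5), database user name,
#            covariate abbreviations (defaults to "p" as in the wiki commands).
native_ingest_cmds() {
  local res="$1"
  local db_user="$2"
  local covs="${3:-p}"
  local grid="public.grid_${res}_${res}"

  # Step 1: create the grid table for this resolution.
  echo "Rscript Analysis/R/prepare_grid_cmd.R -d ./ -u $db_user -r $res -i TRUE"
  # Step 2: ingest the covariates at this resolution for all years in Layers.
  echo "Rscript Analysis/R/prepare_covariates_cmd.R -u $db_user -d ./Layers -r $res -t 1 years -i TRUE -o TRUE -x TRUE -m TRUE -c $covs -g $grid -p FALSE"
}

native_ingest_cmds 1 database_user_name
native_ingest_cmds 5 database_user_name
```

Printing rather than executing keeps the sketch safe to try on a machine without database access; the printed lines can be pasted into a terminal or a job script once they look right.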
For population covariate ingestion only
Population is a critical variable for modeling incidence. The cholera mapping pipeline uses both 1 km and 20 km gridded population estimates, and produces its modeling estimates on the 20 km grid. Consequently, there is an extra ingestion step for population covariates, which ensures that the population of each 20 km grid cell equals the sum of all corresponding 1 km grid cells. This validation procedure is performed by the cholera-covariates/population_aggregation.R script in the dev_jk branch.
- Step 1 Extraction:
  Rscript $TAXDIR/Analysis/R/population_aggregation.R -b 1 -e TRUE -s FALSE
  Currently, population covariates for 2010-2024 are stored in the database, so the value of -b ranges from 1 to 15. This command can be run in parallel.
- Step 2 Set:
  Rscript $TAXDIR/Analysis/R/population_aggregation.R -b 1 -e FALSE -s TRUE
  This step must be run sequentially, so that multiple processes never write to the same table at the same time.
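The two steps above can be sketched as a small driver script: extraction is launched for every value of -b in the background, and the set step then runs one value at a time. This is only an illustration. DRY_RUN, MAX_B, and the function names are hypothetical helpers introduced here; TAXDIR and the Rscript flags are taken from the commands above. By default the script only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Sketch only: DRY_RUN, MAX_B and the function names are hypothetical.
TAXDIR="${TAXDIR:-.}"      # pipeline folder, as used in the commands above
DRY_RUN="${DRY_RUN:-1}"    # 1 = print commands only; 0 = actually run them
MAX_B="${MAX_B:-15}"       # -b ranges from 1 to 15 for 2010-2024

run_cmd() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Step 1: extraction may run in parallel, so start each -b value in the
# background and wait for all of them to finish.
extract_all() {
  local b
  for b in $(seq 1 "$MAX_B"); do
    run_cmd Rscript "$TAXDIR/Analysis/R/population_aggregation.R" -b "$b" -e TRUE -s FALSE &
  done
  wait
}

# Step 2: the set step writes to shared tables, so run it strictly one
# value at a time and stop at the first failure.
set_all() {
  local b
  for b in $(seq 1 "$MAX_B"); do
    run_cmd Rscript "$TAXDIR/Analysis/R/population_aggregation.R" -b "$b" -e FALSE -s TRUE || return 1
  done
}

extract_all && set_all
```

Note that a bare `wait` does not report failures of individual background jobs, so the extraction logs should still be checked before relying on the set step. Launching all 15 extractions at once may also be too much for a shared server; lowering MAX_B or batching the loop is a reasonable adjustment.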
Metadata checks?
<Were there any special instructions about ingesting population past 2020 where the raw data was available?>
Launching an ingestion script
- Use Analysis/R/prepare_covariates_cmd.R to start the covariate ingestion.
  -u: user name for the database cholera_covariates.
  -d: path of the covariate folder Layers.
  -r: spatial resolution. The default value is 20, meaning 20 km x 20 km.
  -t: temporal resolution. The default value is 1 years.
  -i: flag to do the ingestion. Candidate values are TRUE or FALSE.
  -x: flag to overwrite the covariates metadata table in the database. Candidate values are TRUE or FALSE.
  -m: flag to re-extract metadata information.
  -c: list of covariates to use, specified as abbreviations separated by commas. The abbreviation for each covariate is defined in covariate_dictionary.yml.
  -g: name of the full grid.
  -p: flag to do parallel processing.
For other command arguments, please check the file Analysis/R/prepare_covariates_cmd.R for more details.
- Set the environment variable COVARIATE_DATABASE_PASSWORD in the slurm script. This is the user password for the database cholera_covariates running on idmodeling2. Consult the server administrator if you do not yet have a password for the cholera_covariates database.
- Please note that the user launching any covariate ingestion script must be granted write permissions for the cholera covariate database.
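As an illustration of the password requirement above, a job script can check that COVARIATE_DATABASE_PASSWORD is set before any ingestion command is launched. The sketch below is an assumption about how such a script might look, not the project's actual slurm script: the job name and the require_db_password helper are hypothetical, and the password itself is deliberately not shown.

```shell
#!/bin/bash
#SBATCH --job-name=covariate-ingest   # hypothetical job name

# Hypothetical helper: succeeds only if the password variable is non-empty.
require_db_password() {
  if [ -z "${COVARIATE_DATABASE_PASSWORD:-}" ]; then
    echo "COVARIATE_DATABASE_PASSWORD is not set; export it before launching ingestion." >&2
    return 1
  fi
}

# The variable is expected to be exported earlier in this script or in the
# submitting environment. A real job script would typically exit with a
# non-zero status here instead of continuing when the check fails.
if require_db_password; then
  echo "COVARIATE_DATABASE_PASSWORD is set; ready to launch prepare_covariates_cmd.R"
fi
```

Checking up front avoids a job that waits in the queue only to fail on its first database connection.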
Ingesting shapefiles into the database
The ingestion code is in Analysis/R/ingest_shapefiles_to_db.R. To run it, use the parameter -d or --config_dir to specify the folder containing the config files. The script automatically extracts the country name from each config file. If it runs successfully, the shapefiles are saved to public.shapefiles.
Rscript .../Analysis/R/ingest_shapefiles_to_db.R -d config_folder_path
For database administrators
The following permissions should be granted for users that are running the mapping pipeline or ingesting new covariates into the covariate database.
- For schema covariates, users should have USAGE permission.
- For table covariates.pop_1_years_1_1 and covariates.pop_1_years, users should have READ permission. If additional covariates are required for running the mapping pipeline, then the READ permission should be assigned for these tables too.
- For schema public, users should have USAGE permission.
- For schema grids, users should have USAGE permission.
- For the table grids.master_grid, users should have READ permission.
- For tables in schema public, users should have READ permission.
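The permissions listed above could be expressed as SQL along the following lines. This is a sketch under two assumptions that the wiki does not state: that the covariate database is PostgreSQL, and that READ permission corresponds to the SELECT privilege. The grant_sql helper and the role name pipeline_user are hypothetical. The function only prints the statements, so an administrator can review them before applying anything.

```shell
#!/usr/bin/env bash
# Sketch only: assumes PostgreSQL and that READ means SELECT.
# grant_sql is a hypothetical helper; it prints SQL and changes nothing.
grant_sql() {
  local db_user="$1"
  cat <<SQL
GRANT USAGE ON SCHEMA covariates, public, grids TO "$db_user";
GRANT SELECT ON covariates.pop_1_years_1_1, covariates.pop_1_years TO "$db_user";
GRANT SELECT ON grids.master_grid TO "$db_user";
GRANT SELECT ON ALL TABLES IN SCHEMA public TO "$db_user";
SQL
}

grant_sql pipeline_user
```

If the assumptions hold, the output could be piped to a client such as psql connected to the cholera_covariates database. Any additional covariate tables the pipeline needs would require their own SELECT grants, as noted in the list above.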