Census file - dime-worldbank/Disease-Modelling-SSA GitHub Wiki

This data comes from the 5 percent sample of the 2012 Zimbabwean census, publicly available from IPUMS International (https://international.ipums.org/international/). After selecting a wide set of variables and downloading them in Stata format, we take the following steps (steps 1 and 2 in Stata, the remainder in Python):

  1. Create a fully expanded version of the population by scaling the 5 percent sample up by a factor of 20, giving a population of 15 million individuals.

  2. Create multiple subsample datasets so that we end up with 5, 20, 50, 75 and 100 percent samples, ensuring that every district is represented in each sample.

  3. Store each of these subsamples in data/raw. Then, for each one, create a slimmed-down version of the census file by running scripts/swise_scripts/censusBuilder_sa.py. The script cleans the age variables and produces the following variables, which are the essential agent characteristics of the model population:

     a. person_id - individual identifier for each agent
     b. age
     c. sex
     d. household_id - to enable grouping of households
     e. district_id - in numeric format 1-60 (without the d_ prefix)
     f. economic_status - a subselection of occupations grouped from the census
     g. economic_activity_location_id
     h. school_goers - a dummy identifying those who go to school, for reopening simulations
     i. manufacturing_workers - a dummy identifying manufacturing workers, for a reopening simulation

  4. Running the script described in step 3 writes a file named 'census_sample_Xperc_070921.csv' to the data/preprocessed/census folder. Folders should be created accordingly for each file type. Note that these data files are not available on GitHub because they are too large.
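As an illustration of steps 1 and 2, here is a minimal sketch in plain Python. It is not the project's code: the authors perform these steps in Stata, and the function names `expand_sample` and `stratified_subsample` are hypothetical. It also assumes that each record is a dict carrying `household_id` and `district_id` keys, and that replicas of a household should receive distinct household identifiers. For a population of this size a dataframe library would be the more practical choice; the logic stays the same.

```python
import random
from collections import defaultdict


def expand_sample(records, factor=20):
    """Step 1 (sketch): replicate every sampled record `factor` times."""
    expanded = []
    for record in records:
        for copy in range(factor):
            clone = dict(record)
            # Assumption: each replica gets its own household id so that
            # copies of the same person do not end up in one household.
            clone["household_id"] = f'{record["household_id"]}_{copy}'
            # Fresh, unique identifier for every agent.
            clone["person_id"] = len(expanded)
            expanded.append(clone)
    return expanded


def stratified_subsample(population, fraction, seed=0):
    """Step 2 (sketch): draw `fraction` of the people within every district."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)
    by_district = defaultdict(list)
    for person in population:
        by_district[person["district_id"]].append(person)
    sample = []
    for district in sorted(by_district):
        people = by_district[district]
        # Keep at least one person so that no district drops out of a
        # small sample.
        n = max(1, round(len(people) * fraction))
        sample.extend(rng.sample(people, n))
    return sample


# Toy data standing in for the IPUMS extract (values are made up).
toy = [
    {"household_id": h, "district_id": d, "age": 30}
    for h, d in [(1, 1), (1, 1), (2, 2), (3, 3)]
]
population = expand_sample(toy)  # 4 records -> 80 people
samples = {
    pct: stratified_subsample(population, pct / 100)
    for pct in (5, 20, 50, 75, 100)
}
```

Note that this sketch samples individuals within each district. If household composition matters for the model, sampling whole households per district may be preferable.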

The output file 'census_sample_Xperc_070921.csv' then forms the basis for your agent population. We recognise that processing data across multiple programs is not ideal and that many of the steps above could be consolidated to make the pipeline more efficient. We will do so in due course!
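Before using a preprocessed file as the agent population, it can help to confirm that it contains the variables listed in step 3. The sketch below is a hypothetical helper, not part of the repository. It assumes that the file is comma-delimited with a header row using exactly those variable names, and that the X in the file name stands for the sample percentage.

```python
import csv
from pathlib import Path

# Variables listed in step 3 of this page.
EXPECTED_COLUMNS = [
    "person_id",
    "age",
    "sex",
    "household_id",
    "district_id",
    "economic_status",
    "economic_activity_location_id",
    "school_goers",
    "manufacturing_workers",
]


def census_path(percent, root="data/preprocessed/census"):
    """Build the path of a preprocessed census file.

    Assumption: the X in 'census_sample_Xperc_070921.csv' is the
    sample percentage (5, 20, 50, 75 or 100).
    """
    return Path(root) / f"census_sample_{percent}perc_070921.csv"


def missing_columns(path):
    """Return the expected columns that are absent from the CSV header."""
    with open(path, newline="") as handle:
        header = next(csv.reader(handle), [])
    return [name for name in EXPECTED_COLUMNS if name not in header]
```

For example, `missing_columns(census_path(5))` returns an empty list when the 5 percent file has every expected column.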