Dengue Cases Database

As described in Import-Process, the final import deliverable is the unique_case_data table in the dengue_cases database served by Postgres running on the secure UMass/Amherst server documented in Servers-Databases-And-Code-Repos. This page documents the database's namespaces, functions, and tables.

Namespaces/Schemas

Following are descriptions of the schemas in the database. Note: in this documentation, a table is in the public schema if no schema is given explicitly.

original_data

  • Data is read as text using R and written as one data table per file in this schema. Currently this uses read.csv with colClasses='character', stringsAsFactors=FALSE, and check.names=FALSE to avoid type conversions and column name mangling (see the sketch after this list).
  • We add two columns: delivery_file_row, an integer indicating which row of the data file the data came from, and import_stamp, a timestamp indicating when the data was most recently imported. These values are primarily for consistency checks and debugging and are not relevant to data analysis.
  • The instructions in file_specific_code/<file_name>.R are used to modify column names only as needed to ensure that column names are unique. The most common issue is that several columns will have the name date but contain a variety of date formats.
  • The historical count data covering 1968-2005 is loaded into one table named dengue_counts_1968_to_2005.
  • The individual case data covering 1999-2010 is loaded into tables named dengue_cases_YYYY.
  • Starting with week 44 of 2013, we began to receive individual case data in a single file approximately every two weeks. These files are loaded into tables named dengue_cases_YYYY_wk_WW, with YYYY being the year and WW being the week of the year as indicated by the original data file sent to us. The week numbers appear to correspond to the most recent week of data sent to us.
  • For the years 2013 to the present we also have a set of final year numbers, which are placed in tables named dengue_cases_YYYY_final. In years where there was confusion about the final numbers, the final files are numbered, as in dengue_cases_2013_2.
  • At this stage, all columns have names that are standardized across tables.
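
A minimal sketch of the read-and-load step for one delivery file, assuming a connection via the DBI and RPostgres packages; the file name, table name, and connection details are placeholders rather than the actual import code:

```r
library(DBI)

# Hypothetical delivery file; the real import iterates over all files
file_name <- "dengue_cases_2014_wk_06.csv"

# Read everything as character to avoid type conversion and name mangling
raw <- read.csv(file_name, colClasses = "character",
                stringsAsFactors = FALSE, check.names = FALSE)

# Add the two bookkeeping columns described above
raw$delivery_file_row <- seq_len(nrow(raw))
raw$import_stamp <- Sys.time()

# Write one data table per file into the original_data schema;
# connection details are placeholders
con <- dbConnect(RPostgres::Postgres(), dbname = "dengue_cases")
dbWriteTable(con,
             DBI::Id(schema = "original_data",
                     table = "dengue_cases_2014_wk_06"),
             raw, overwrite = TRUE)
dbDisconnect(con)
```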

standardized_data

  • Each table from original_data has a matching table in this namespace. In standardized_data we add columns such that all case data tables have the same columns; columns of missing data are filled with NA values. We also calculate a uid from the uid_fields defined in shared_data.R, as well as a uid_by_file from the same uid_fields plus the delivery file name.
  • This brings to an end the file-by-file processing pipeline and allows the file-specific tables to be merged.
  • In this namespace we also create a merged table of all case data (called case_data).
  • The UID values are SHA-512 hashes, calculated within Postgres using a native function and stored as a bytea data type (see the sketch after this list). They allow us to identify individual cases, given that we are not provided a unique id.
  • For each table in this namespace there is a matching table (<name>_replicate_counts) with a UID column and the corresponding number of times that UID appears in the original table.
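
A hedged sketch of the UID and replicate-count steps. The fields sex, age, date_sick, and geocode are hypothetical stand-ins for the uid_fields defined in shared_data.R, the table name is a placeholder, and digest() assumes the pgcrypto extension (on Postgres 11+ the built-in sha512() would also work):

```r
library(DBI)
con <- dbConnect(RPostgres::Postgres(), dbname = "dengue_cases")

# Compute the SHA-512 uid; the field list is a hypothetical stand-in
# for the uid_fields in shared_data.R
dbExecute(con, "
  UPDATE standardized_data.dengue_cases_2014_wk_06
     SET uid = digest(concat_ws('|', sex, age, date_sick, geocode),
                      'sha512')")

# Matching <name>_replicate_counts table: how often each uid appears
dbExecute(con, "
  CREATE TABLE standardized_data.dengue_cases_2014_wk_06_replicate_counts AS
  SELECT uid, count(*) AS replicate_count
    FROM standardized_data.dengue_cases_2014_wk_06
   GROUP BY uid")

dbDisconnect(con)
```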

public

  • This schema contains all general use tables including country data, map data, case data, and aggregated case count data.
  • The main case data table is unique_case_data. This table takes standardized_data.case_data, removes duplicated rows (of which there are many, since each case appears in multiple data deliveries), and produces rows with the most complete data from all the times a case was delivered.
  • The table unique_case_data_by_delivery does the same except that it maintains the duplicates across deliveries and allows us to see what the data set looked like at each delivery.
  • There is a set of aggregated count tables at the province, district, and sub-district spatial levels and the weekly, biweekly, and monthly time scales. These are named <spatial>_level_<time>_case_counts and are currently aggregated at these levels only; we can construct more tables on request, so get in touch. These tables omit zero counts, as they only aggregate case data (see the example query after this list).
  • There is also a set of aggregated tables which fill in zero counts, but these involve some touchy SQL and potentially miss filling in some zeros. This is a problem we are currently working on.
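
An example query against one of these count tables. The column names are assumed to match the return columns of aggregate_table() documented below; that is an assumption, since the table definitions are not shown on this page:

```r
library(DBI)
con <- dbConnect(RPostgres::Postgres(), dbname = "dengue_cases")

# Biweekly province-level counts for 2014; zero counts are absent
counts <- dbGetQuery(con, "
  SELECT date_sick_year, time_in_year, geocode, cases
    FROM province_level_biweek_case_counts
   WHERE date_sick_year = 2014
   ORDER BY time_in_year, geocode")

dbDisconnect(con)
```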

summary_data

TBD

Functions

More detailed documentation is TBD. The current functions, with example calls from R after the list:

  • aggregate_table(access_date, spatial_aggregation = 'province', temporal_aggregation = 'biweek'): Creates aggregate counts at various spatial/temporal scales.
    • arguments
      • access_date: Only look at reports delivered before this date. Defaults to now.
      • spatial_aggregation: Which spatial scale to aggregate over. Options are province, district, subdistrict, village.
      • temporal_aggregation: Which temporal scale to aggregate over. Options are week, biweek, month, year.
    • return A table with these columns
      • date_sick_year: The year.
      • time_in_year: Depends on temporal_aggregation: the number of <temporal_aggregation>s that have passed since the beginning of the year.
      • geocode: The geocode of a particular <spatial_aggregation>.
      • cases: How many reported cases there are for this date_sick_year, time_in_year, geocode combination.
      • spatial_level: Always the same as spatial_aggregation.
      • temporal_level: Always the same as temporal_aggregation.
  • most_recent_deliveries(access_date): Returns the most recent report for each delivery year.
    • arguments
      • access_date: Only look at reports delivered before this date. Defaults to now.
    • return A table with these columns
      • delivery_year: The year the report covers. Since reports start from scratch each year, we need one most recent report for each year. delivery_year is not necessarily the same as year(date_delivered).
      • date_delivered: The date we received this report.
      • file_name: The name of the file containing the report.
  • unique_case_data(access_date): Returns how the table unique_case_data would have looked if you accessed it on the given date.
    • arguments
      • access_date: Only look at reports delivered before this date. Defaults to now.
    • return A table with the same columns as unique_case_data.
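
Example calls from R via DBI (connection details are placeholders):

```r
library(DBI)
con <- dbConnect(RPostgres::Postgres(), dbname = "dengue_cases")

# Biweekly province-level counts as of the start of 2015
agg <- dbGetQuery(con,
  "SELECT * FROM aggregate_table('2015-01-01', 'province', 'biweek')")

# The most recent report for each delivery year, as of now
recent <- dbGetQuery(con, "SELECT * FROM most_recent_deliveries(now())")

# unique_case_data as it would have looked on 2014-06-01
snapshot <- dbGetQuery(con,
  "SELECT * FROM unique_case_data('2014-06-01')")

dbDisconnect(con)
```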

Tables

There is a table called unique_case_data. Constructing this table is the main purpose of the database.

It includes all cases which are currently reported, and various information about them: information present in the original file, derivatives of it, and, in addition, the first date that each case was reported. This first-report date is the only piece of information not extracted from the original .csv files; instead, we maintain a table which keeps track of when reports were first observed.

There are two proposed systems for this. The first is represented by the table first_reports: we look at each reported case and try to determine which cases are most likely the first report, and from there we planned to aggregate the delivery dates weighted by those probabilities. The second is represented by the table first_deliveries: we look at unique identifiers and use those to extract the first reported date. We currently define unique ids based on some of the fields, although we know these fields are unstable. We are currently using the second system, though we had been working on the first. There is a lot of room for improvement here, since the first system is both unfinished and incorrect.
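A hedged sketch of the second system, assuming standardized_data.case_data carries a date_delivered column; that is an assumption based on this page's description, and the delivery date may in fact live in a separate deliveries table:

```r
library(DBI)
con <- dbConnect(RPostgres::Postgres(), dbname = "dengue_cases")

# Earliest delivery date per uid; column names are assumptions
dbExecute(con, "
  CREATE TABLE IF NOT EXISTS first_deliveries AS
  SELECT uid, min(date_delivered) AS first_delivery_date
    FROM standardized_data.case_data
   GROUP BY uid")

dbDisconnect(con)
```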

There is another important aspect of the database, which is related to unique_case_data: we want to be able to simulate the information that was available in the database at a previous time. This helps us do model fitting on simulated partial information, and should also eventually serve as a check that everything is stable. This functionality is encapsulated in the function unique_case_data(access_date), which returns how the table unique_case_data would have looked if you had accessed it on the given date; the unique_case_data table is the same as unique_case_data(), but is saved for convenience. To pull the current cases, we take all cases reported in the most recent reports. Reports are cumulative, so we do not need to look at more than one report per year; the function most_recent_deliveries(access_date) extracts those reports. We extract all cases from those reports in the correct year from the table standardized_data.case_data, and join with first_deliveries on UID. standardized_data.case_data is a UNION ALL of every other table in the standardized_data schema. A sketch of this reconstruction follows.
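A hedged sketch of that reconstruction; the join columns (file_name, uid) and the first_delivery_date column are assumptions based on the description above, not the actual SQL:

```r
library(DBI)
con <- dbConnect(RPostgres::Postgres(), dbname = "dengue_cases")

# Cases from the most recent report of each year, with first-report dates
cases <- dbGetQuery(con, "
  SELECT cd.*, fd.first_delivery_date
    FROM standardized_data.case_data AS cd
    JOIN most_recent_deliveries(now()) AS mrd
      ON cd.file_name = mrd.file_name
    JOIN first_deliveries AS fd
      ON cd.uid = fd.uid")

dbDisconnect(con)
```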

There is also functionality to create aggregate counts at various spatial/temporal scales, encapsulated in the aggregate_table(access_date, spatial_aggregation, temporal_aggregation) function described above. There is a plan to use this to automatically keep updated aggregate tables in Subversion, but that plan has not been migrated to the server yet.

There is also older data going back to 1968, which is kept separate and is pre-aggregated at the level of months and provinces. We interpolate this data to the week and biweek scales (the functions for this are also part of the GitHub repository) and use it to produce aggregate tables.
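
For illustration only (the repository's actual interpolation functions may work differently), a naive sketch that spreads a monthly count evenly over that month's days and re-sums by biweek:

```r
# Spread one month's count over its days, then sum days by biweek.
# The biweek convention (14-day blocks from January 1) is an assumption.
monthly_to_biweek <- function(year, month, count) {
  days <- seq(as.Date(sprintf("%d-%02d-01", year, month)),
              by = "day", length.out = 31)
  days <- days[format(days, "%m") == sprintf("%02d", month)]
  per_day <- count / length(days)
  biweek <- (as.integer(days - as.Date(sprintf("%d-01-01", year))) %/% 14) + 1
  tapply(rep(per_day, length(days)), biweek, sum)
}

monthly_to_biweek(1975, 2, 280)  # hypothetical: 280 cases in Feb 1975
```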
