Dengue Cases Database
As described in Import-Process, the final import deliverable is the `unique_case_data` table in the `dengue_cases` database served by Postgres running on the secure UMass/Amherst server documented in Servers-Databases-And-Code-Repos. This page documents the database's namespaces, functions, and tables.
Following are descriptions of the five schemas in the database. Note: In this documentation, a table's schema is `public` if a schema is not explicit.

original_data

- Data is read as text using R and written as one data table per file in this schema. Currently this uses `read.csv` with `colClasses='character'`, `stringsAsFactors=FALSE`, and `check.names=FALSE` to avoid column name mangling and type conversions (see the sketch after this list).
- We add two columns: `delivery_file_row`, an integer indicating which row of a data file the data came from, and `import_stamp`, a timestamp indicating when the data was most recently imported. These values are primarily for consistency checks or debugging and are not relevant to data analysis.
- The instructions in `file_specific_code/<file_name>.R` are used to modify column names, only to assure that column names are unique. The most common issue is that columns will have the name `date` but contain a variety of date formats.
- The historical count data covering 1968-2005 is loaded into one table named `dengue_counts_1968_to_2005`.
- The individual case data covering 1999-2010 is loaded into tables named `dengue_cases_YYYY`.
- Starting with 2013 week 44, we began to receive individual case data in a single file approximately every two weeks. These files are loaded into tables named `dengue_cases_YYYY_wk_WW`, with YYYY being the year and WW being the week of the year as indicated by the original data file sent to us*. They appear to be associated with the most recent week of data sent to us.
- For the years 2013-current we also have a set of final year numbers, which are placed in tables named `dengue_cases_YYYY_final`. In years with confusion about the final numbers, the final files are numbered as in `dengue_cases_2013_2`.
- At this stage all columns have a name that is standardized among tables.
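As a rough illustration of the per-file import read described above, here is a minimal R sketch. Only the `read.csv` arguments and the two added columns come from this page; the function name and file name are hypothetical.

```r
# Minimal sketch of the per-file import described above. The read.csv
# arguments and the two added columns are the ones named in this section;
# the function name and file name are hypothetical.
read_delivery_file <- function(path) {
  df <- read.csv(path,
                 colClasses = "character",  # read everything as text
                 stringsAsFactors = FALSE,  # avoid factor conversion
                 check.names = FALSE)       # avoid column name mangling
  df$delivery_file_row <- seq_len(nrow(df)) # which row of the file each record came from
  df$import_stamp <- Sys.time()             # when the data was imported
  df
}

cases <- read_delivery_file("dengue_cases_2013_wk_44.csv")
```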
standardized_data

- Each table from `original_data` matches a table from this namespace. In `standardized_data` we add columns such that all case data tables have the same columns. Columns of missing data are filled with `NA` values. We also calculate a `uid` from the `uid_fields` defined in `shared_data.R`, as well as a `uid_by_file` from the same `uid_fields` plus the delivery file name.
- This brings the file-by-file processing pipeline to an end and allows the file-specific tables to be merged.
- In this namespace we also create a merged table of all case data (called `case_data`).
- The UID values are SHA512 hashes, calculated within Postgres using a native function and stored as a `bytea` data type. They allow us to identify individual cases, given that we are not provided a unique id (see the sketch after this list).
- For each table in this namespace there is a matching table (`<name>_replicate_counts`) with a UID column and the corresponding number of times that UID appears in the original table.
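A hedged sketch of the UID calculation follows. The real `uid_fields` are defined in `shared_data.R`; the field and table names below are made up, and this assumes the SHA512 comes from the pgcrypto extension's `digest()` function.

```r
library(DBI)
con <- dbConnect(RPostgres::Postgres(), dbname = "dengue_cases")  # hypothetical connection details

# Sketch of the uid calculation: hash the concatenated uid_fields with SHA512.
# digest(..., 'sha512') returns bytea, assuming the pgcrypto extension is
# installed; province/date_sick/age/sex merely stand in for the real uid_fields.
dbGetQuery(con, "
  SELECT digest(concat_ws('|', province, date_sick, age, sex), 'sha512') AS uid
  FROM standardized_data.dengue_cases_2005
")

# A <name>_replicate_counts table pairs each uid with its frequency:
dbGetQuery(con, "
  SELECT uid, count(*) AS replicate_count
  FROM standardized_data.dengue_cases_2005
  GROUP BY uid
")
```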
public

- This schema contains all general use tables, including country data, map data, case data, and aggregated case count data.
- The main case data table is `unique_case_data`. This table takes the table `standardized_data.case_data`, removes duplicated rows (of which there are many, due to each case appearing in multiple data deliveries), and produces rows with the most complete data from all the times that a case was delivered.
- The table `unique_case_data_by_delivery` does the same, except that it maintains the duplicates across deliveries and allows us to see what the data set looked like at each delivery.
- There is a set of aggregated count tables at the province, district, and sub-district spatial levels and weekly, biweekly, and monthly time scales. These are named `<spatial>_level_<time>_case_counts` and are currently aggregated by these levels only; we can construct more tables on request, so get in touch. These tables omit zero counts, as they only aggregate case data.
- There is also a set of aggregated tables which fill in zero counts, but these involve some touchy SQL and potentially miss filling in some zeros. This is a problem we are currently working on (see the sketch after this list).
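The usual shape of that zero-filling SQL is a cross join of all geocodes with all time points, left joined against the observed counts. A hedged sketch, reusing the connection from above; the table name instantiates the `<spatial>_level_<time>_case_counts` pattern and may not match the real names:

```r
# Sketch of zero-filling: enumerate every geocode x (year, biweek) combination
# seen in the data, left join the observed counts, and coalesce gaps to zero.
# Note this still misses time points in which *no* geocode had a case; a full
# calendar (e.g. via generate_series) would be needed for those, which is part
# of what makes this SQL touchy.
DBI::dbGetQuery(con, "
  SELECT g.geocode, t.date_sick_year, t.time_in_year, coalesce(c.cases, 0) AS cases
  FROM (SELECT DISTINCT geocode FROM province_level_biweekly_case_counts) g
  CROSS JOIN (SELECT DISTINCT date_sick_year, time_in_year
              FROM province_level_biweekly_case_counts) t
  LEFT JOIN province_level_biweekly_case_counts c
         ON c.geocode = g.geocode
        AND c.date_sick_year = t.date_sick_year
        AND c.time_in_year = t.time_in_year
")
```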
TBD

TBD

More detailed documentation and example calls:
- `aggregate_table(access_date, spatial_aggregation = 'province', temporal_aggregation = 'biweek')`: Creates aggregate counts at various spatial/temporal scales.
  - arguments
    - `access_date`: Only look at reports delivered before this date. Defaults to now.
    - `spatial_aggregation`: Which spatial scale to aggregate over. Options are province, district, subdistrict, village.
    - `temporal_aggregation`: Which temporal scale to aggregate over. Options are week, biweek, month, year.
  - return: A table with these columns
    - `date_sick_year`: The year.
    - `time_in_year`: Depends on temporal_aggregation: the number of <temporal_aggregation>s that have passed since the beginning of the year.
    - `geocode`: The geocode of a particular <spatial_aggregation>.
    - `cases`: How many reported cases there are for this date_sick_year, time_in_year, geocode combination.
    - `spatial_level`: Always the same as spatial_aggregation.
    - `temporal_level`: Always the same as temporal_aggregation.
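For example, called from R via DBI (a sketch, reusing the connection from above and assuming `aggregate_table` is exposed as a set-returning Postgres function):

```r
# Biweekly counts per province, using all reports delivered up to now:
counts <- DBI::dbGetQuery(con, "SELECT * FROM aggregate_table(now(), 'province', 'biweek')")
```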
- `most_recent_deliveries(access_date)`: Returns the most recent reports, from which all currently reported cases are taken.
  - arguments
    - `access_date`: Only look at reports delivered before this date. Defaults to now.
  - return: A table with these columns
    - `delivery_year`: The year the report covers. Since reports start from scratch each year, we need one most recent report for each year. delivery_year is not the same as year(date_delivered).
    - `date_delivered`: The date we received this report.
    - `file_name`: The name of the file containing the report.
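For example (a sketch; the date is arbitrary):

```r
# One most recent report per delivery year, as of 2015-01-01:
reports <- DBI::dbGetQuery(con, "SELECT * FROM most_recent_deliveries('2015-01-01')")
```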
- `unique_case_data(access_date)`: Returns how the table unique_case_data would have looked if you had accessed it on the given date.
  - arguments
    - `access_date`: Only look at reports delivered before this date. Defaults to now.
  - return: A table with the same columns as unique_case_data.
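For example (again a sketch; the date is arbitrary):

```r
# unique_case_data as it would have looked on 2014-06-01:
snapshot <- DBI::dbGetQuery(con, "SELECT * FROM unique_case_data('2014-06-01')")
```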
There is a table called `unique_case_data`. Constructing this table is the main purpose of the database. It includes all cases which are currently reported, and various information about them: information present in the original file, and derivatives of it. In addition, it includes the first date that each case was reported. This is the only piece of information not extracted from the original .csv files. Instead, we maintain a table which keeps track of when reports were first observed. There are two proposed systems for this. One is represented by the table `first_reports`: in this scenario, we look at each reported case and try to determine which cases are most likely the first report. From there, we planned to aggregate the delivery dates over probability. The second scenario is represented by the table `first_deliveries`: in this scenario, we look at unique identifiers and use them to extract the first reported date. We currently define unique ids based on some of the fields, though we know these fields are unstable. We are currently in scenario 2, but we were working on scenario 1. There is a lot of room for improvement here, since scenario 1 is unfinished and incorrect. A sketch of the scenario-2 idea follows.
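This is a minimal sketch only, assuming the per-delivery rows carry a uid and a delivery date; the real `first_deliveries` construction may differ, and the column names are assumptions.

```r
# Scenario 2 sketch: the first reported date for a case is the earliest
# delivery date among all rows sharing its uid. Column names are assumptions.
first_deliveries_sketch <- DBI::dbGetQuery(con, "
  SELECT uid, min(date_delivered) AS first_delivery_date
  FROM unique_case_data_by_delivery
  GROUP BY uid
")
```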
Another important aspect of the database is related to `unique_case_data`: we want to be able to simulate the information available in the database at a previous time. This helps us do model fitting on simulated partial information, and will hopefully later also serve as a check that everything is stable. This functionality is encapsulated in the function `unique_case_data(access_date)`. The `unique_case_data` table is the same as `unique_case_data()`, but is saved for convenience. This function returns how the table unique_case_data would have looked if you had accessed it on the given date. To pull the current cases, we take all cases reported in the most recent reports. Reports are cumulative, so we do not need to look at more than one report per year. There is a function `most_recent_deliveries(access_date)` which extracts those reports (see the sketch below). We extract all cases from those reports in the correct year from the table `standardized_data.case_data`, and join them with `first_deliveries` on UID. `standardized_data.case_data` is a union all of every other table in the `standardized_data` schema.
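A hedged sketch of the "one report per year" selection that `most_recent_deliveries` performs, assuming a bookkeeping table with the same three columns the function returns (the real table name is not documented here):

```r
# Pick the single most recent report per delivery year, as of an access date.
# 'deliveries' is a hypothetical bookkeeping table with the columns that
# most_recent_deliveries returns.
DBI::dbGetQuery(con, "
  SELECT DISTINCT ON (delivery_year) delivery_year, date_delivered, file_name
  FROM deliveries
  WHERE date_delivered <= '2015-01-01'
  ORDER BY delivery_year, date_delivered DESC
")
```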
There is also functionality to create aggregate counts at various spatial/temporal scales. This is encapsulated in the `aggregate_table(access_date, spatial_aggregation, temporal_aggregation)` function. There is a plan to use this to automatically keep updated aggregate tables in subversion, but that plan has not migrated to the server yet.

There is also older data going back to 1968, which is separate and pre-aggregated at the level of months and provinces. We interpolate this data to the week and biweek scales (the functions for this are also part of the GitHub repository) and use it to produce aggregate tables.
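As an illustration only: one plausible way to interpolate monthly counts to the biweek scale is through the cumulative count curve. The repository's actual interpolation functions may use a different method; the numbers below are made up.

```r
# Sketch: interpolate monthly counts to biweeks via the cumulative curve.
monthly_end_day <- c(31, 59, 90)            # day of year each month ends (Jan-Mar)
monthly_cases   <- c(120, 80, 150)          # made-up monthly counts
biweek_end_day  <- seq(14, 84, by = 14)     # day of year each biweek ends

# Linearly interpolate the cumulative count, then difference it back down to
# per-biweek counts.
cum_interp   <- approx(x = c(0, monthly_end_day),
                       y = c(0, cumsum(monthly_cases)),
                       xout = biweek_end_day)$y
biweek_cases <- diff(c(0, cum_interp))
```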