# Data Architecture - stlrda/211Dashboard-Workflows GitHub Wiki
The RDA 211 dashboard encompasses the following segments of data:
- Census data
- 211 calls data
- COVID-19 data
- Unemployment claims data
- Unemployment statistics (a subset taken from the Bureau of Labor Statistics)
- Regional funding data
The RDA data are hosted in a PostgreSQL RDBMS.
## Data Flow
There are no data processing needs for the dashboard beyond basic data aggregation and cleansing where necessary. The data are downloaded into CSV files from their respective sources (the sources are indicated in the comments section of each data object). [The attributes in the CSV files are pipe ('|') delimited, to avoid confusion with any use of commas (',') inside character/text attributes.] These CSV files reside in the AWS S3 bucket (s3://uw211dashboard-workbucket).
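A staging load of one of these pipe-delimited files can be sketched as follows; the table and file names here are illustrative, not the project's actual objects, and blank fields are read as NULLs:

```sql
-- Illustrative only: load a pipe-delimited CSV into a staging table
-- from a psql client. Table and file names are hypothetical.
\copy stg_unemployment_claims FROM 'unemployment_claims.csv'
    WITH (FORMAT csv, HEADER true, DELIMITER '|', NULL '');
```

On a managed AWS database, a server-side import from the S3 bucket (e.g., via the RDS `aws_s3` extension) could replace the client-side copy; the pipe delimiter and empty-string-as-NULL options stay the same.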
These csv files in the S3 bucket are loaded into corresponding staging tables (the files and staging tables have one-to-one correspondence). The data from staging tables are then moved into core tables of the respective data segments. The dashboard tools (e.g., Tableau), used for analytics reporting on the data, use core tables. While the staging tables go through a full refresh, the transfer of data from staging into core tables happens incrementally (for a specific set of days, weeks, or months).
## Lookup objects
Lookup objects (tables and views) are used as reference data and to associate standardized data element values (e.g., GeoID is used instead of County-Name when comparing data facts across different data segments).
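For example, a comparison of facts across two data segments might join through a GeoID lookup rather than matching on free-text county names. The object and column names below are hypothetical sketches of the pattern:

```sql
-- Illustrative only: compare two data segments on a standardized GeoID
-- instead of county names. Object names are hypothetical.
SELECT l.county_name,
       c.case_count,
       u.claims_count
FROM   lkup_county_geoid        l
JOIN   cre_covid_cases          c ON c.geo_id = l.geo_id
JOIN   cre_unemployment_claims  u ON u.geo_id = l.geo_id;
```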
## Staging objects
Individual data segments - as listed above - are sourced from their respective sources (web page or otherwise) and put into staging tables. [COVID data are taken from disparate sources and flow into 3 different staging tables that are appropriately named.] No data manipulation occurs while downloading data into staging tables from the corresponding data source(s) - except, of course, translating blank and NaN values as NULLs.
Staging tables undergo a complete refresh (all data are wiped out using TRUNCATE statements and a fresh dataset is loaded). At present, a full reload offers a simpler solution, as the data volume is not too prohibitive and is not detrimental to performance.
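A full staging refresh then amounts to a truncate-and-reload inside one transaction; the table and file names below are hypothetical:

```sql
-- Illustrative only: full refresh of a staging table.
-- Wipe the prior dataset, then reload the fresh CSV extract.
BEGIN;
TRUNCATE TABLE stg_unemployment_claims;
\copy stg_unemployment_claims FROM 'unemployment_claims.csv'
    WITH (FORMAT csv, HEADER true, DELIMITER '|', NULL '');
COMMIT;
```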
All staging tables have names with the "stg_" prefix.
## Core objects
Core objects are standardized and refined versions of the staging data objects. Unlike staging tables, core tables are fed incremental data. For each data load cycle, a reference date is maintained in the table cre_last_success_run_dt that indicates when the last chunk of data was successfully loaded into the core table(s) from the corresponding staging table(s).
At the start of the run cycle (daily, weekly, or monthly), for a given core table, records on or after the last successful run date (the lst_success_dt attribute of table cre_last_success_run_dt) are DELETEd. All records with a reference date stamp equal to or greater than the last successful run date are then transferred from the corresponding staging table(s) to the core table being processed (every staging table has an attribute of date or timestamp data type that is part of its natural primary key). [An attribute buffer_cnt in table cre_last_success_run_dt may be used to wipe data from the core going back one or more refresh cycles beyond the last successful run date, ensuring that no records are missed from the most recent refresh of the staging data and that any revisions of the staging data are picked up (e.g., the weekly unemployment claims of prior weeks are sometimes revised/updated).]
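One incremental staging-to-core cycle can be sketched as below. The core/staging table names and columns are hypothetical, and the sketch assumes cre_last_success_run_dt keys its dates by a table-name column, which is an assumption about that table's layout:

```sql
-- Illustrative only: one incremental staging-to-core load cycle.
BEGIN;

-- 1. Remove core rows on/after the last successful run date;
--    they will be re-loaded from the refreshed staging data.
DELETE FROM cre_unemployment_claims c
USING  cre_last_success_run_dt r
WHERE  r.table_name = 'cre_unemployment_claims'
AND    c.week_ending_dt >= r.lst_success_dt;

-- 2. Transfer those rows from the freshly refreshed staging table.
INSERT INTO cre_unemployment_claims (geo_id, week_ending_dt, claims_count)
SELECT s.geo_id, s.week_ending_dt, s.claims_count
FROM   stg_unemployment_claims s
WHERE  s.week_ending_dt >= (SELECT lst_success_dt
                            FROM   cre_last_success_run_dt
                            WHERE  table_name = 'cre_unemployment_claims');

-- 3. Record the new last-successful-run date for the next cycle.
UPDATE cre_last_success_run_dt
SET    lst_success_dt = CURRENT_DATE
WHERE  table_name = 'cre_unemployment_claims';

COMMIT;
```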
## Data Objects
Data objects are named according to the purpose or data segment they contribute to, and also to reflect the type of the object. As indicated earlier, staging, core, and lookup objects are recognized by their respective prefixes. Also, a view name contains vu_ immediately following the stg_, cre_, or lkup_ prefix. [All data objects with any of these three prefixes (stg_, cre_, or lkup_) are tables, unless indicated otherwise by the presence of vu_.]
Individual data objects (tables and views) are created with corresponding comments that succinctly describe the function/purpose of the object and the source of its data. [Comments on individual objects can be checked in their "Properties" dialog (right-click the object and click the 'Properties...' submenu item at the end of the list).]
## Scripts
All these objects are created, maintained, and managed using the SQL scripts described below. Every SQL script has comments in its file-header section (at the top), on individual SQL statements, or both. There are often comments on individual lines of SQL statements to explain or justify their existence. [Comment text is preceded by "--" or enclosed in a matching pair of "/*" and "*/".]
### DDL Scripts
Data Definition Language (DDL) scripts apply to the creation of data objects (CREATE statements) and changes to their structures (ALTER statements). ALTER statements may be used to change the structure of an object (e.g., a table) to add a column or an index.
The file names of DDL scripts are prefixed with crTbl_ (for table creation) or crVu_ (for view creation). [At present, only tables and views are created in the DB for RDA data.]
Each object created in DDL has a corresponding COMMENT ON statement following its creation. The COMMENT describes the object's purpose, its data source, and any other relevant detail. Object comments can be seen in the "General" tab of an object's "Properties". [In PostgreSQL (pgAdmin), right-click the object name in the left pane and click "Properties" in the resulting menu.]
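The CREATE-then-COMMENT pattern looks like the following; this is a sketch, and the table, its columns, and the comment text are hypothetical:

```sql
-- Illustrative only: a DDL pattern of CREATE followed by COMMENT ON.
CREATE TABLE stg_unemployment_claims (
    geo_id          VARCHAR(11)  NOT NULL,
    week_ending_dt  DATE         NOT NULL,
    claims_count    INTEGER,
    PRIMARY KEY (geo_id, week_ending_dt)
);

COMMENT ON TABLE stg_unemployment_claims IS
    'Staging table for weekly unemployment claims; sourced from a pipe-delimited CSV in the project S3 bucket.';
```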
### DML Scripts
Data Manipulation Language (DML) scripts apply to inserting (INSERT), updating (UPDATE), or deleting (DELETE) data in any object (table). Most of these are used for data migration from staging tables into the core tables. Other DML scripts appropriately set the "last-successful-execution" timestamp (the lst_success_dt attribute of table cre_last_success_run_dt).
Some DML statements may be included in DDL scripts, to manually populate the data in lookup tables (e.g., in the DDL script for the zip-code lookup table, LKUP_ZIP_TRACT_GEOID). Such DML statements may be commented out (enclosed between a matching pair of "/*" and "*/") because they need to be executed only once. Appropriate comments have been added in that section to explain the operation.
### Admin Scripts
Admin scripts are files containing SQL statements for administering the DB for appropriate usage by the user community.
At the current juncture, there is only one file in this category: the one that lists the GRANT privileges for DB object usage. These GRANT statements are contained in the file "usr_dataviz_grnts.sql".
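The GRANT pattern in that file would look something like the following; the role name and schema are assumptions, not taken from the actual script, and only read access (what a dashboard tool such as Tableau needs) is shown:

```sql
-- Illustrative only: read-only access for the dashboard/analytics role.
-- Role name and schema are hypothetical.
GRANT USAGE  ON SCHEMA public               TO dataviz_user;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO dataviz_user;
```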