Data Library

Data Library is a tool essential to the operation of the Data Engineering team (also abbreviated as EDM below) at the Department of City Planning. It is used to manage and archive the diverse body of datasets that Data Engineering then transforms into the data products in its portfolio.

Functionalities Overview

The purpose of Data Library can be succinctly summarized: it is an ETL tool that moves a foreign or external dataset into EDM's data environment, an s3 bucket hosted on DigitalOcean. The main entry point for this primary use is library archive. That action not only triggers the ETL process that loads the dataset into the s3 bucket, it also acts as a version control mechanism for the dataset being archived. For more details on the inner workings of the archive itself, see the Archive wiki page.
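
For example, archiving a single dataset and outputting it as a csv looks like the command below. The dataset name here is a hypothetical placeholder; the -n and -o flags are the ones used in the QAQC instructions later on this page.

```sh
# Run the archive process for one dataset, producing a csv.
# "dcp_example_dataset" is a hypothetical placeholder; use the name of
# an actual template defined in this repo.
library archive -n dcp_example_dataset -o csv
```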

Before a dataset is ready to be archived, it is first ingested into the pipeline. The ingestion functionality mainly handles three classes of source data. The first class is a flat file that lives on a remote server, usually given in the form of a URL. The second class is a dataset curated on the NYC Open Data portal and accessed through the Socrata API. The third class is a local, non-publicly available file, usually provided to EDM through FTP or another data transfer protocol.

YAML Template

A yaml file provides a template, or recipe, that informs the program how an individual dataset should be handled; there is one template for each unique dataset imported into Data Library. These templates are processed by Config, which also handles any additional processing a dataset might need before it is ingested into the data environment.
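
As a rough illustration, a template might look something like the sketch below, covering the three source classes described above. The field names here are illustrative assumptions, not the library's confirmed schema; consult the templates in the repo for the authoritative format.

```yaml
# Hypothetical template sketch -- field names are illustrative
# assumptions, not the library's confirmed schema.
dataset:
  name: dcp_example_dataset          # unique name of the dataset
  acl: public-read                   # access level in the s3 bucket
  source:
    url:                             # class 1: a remote flat file
      path: https://example.com/data/example.csv
    # socrata:                       # class 2: an NYC Open Data dataset
    #   uid: abcd-1234               #   accessed via the Socrata API
    # class 3 (local files) would point at a path on disk instead
  destination:
    geometry:
      SRS: EPSG:4326                 # spatial reference for the output
      type: POINT
  info:
    description: An example dataset archived by Data Library.
```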

Wrappers

The use of wrapper functions in the ingestion process keeps those functions succinct and efficient. It also provides a way for the requested output format to determine what the ingestion process should return to the archive object.
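
The repo's actual wrappers are internal to the codebase, but the general pattern looks something like this hypothetical Python sketch, in which a decorator reads the requested output format and shapes what the wrapped ingestion function hands back to the archiver. All names here are illustrative, not the library's API.

```python
import functools

def convert(path: str, output_format: str) -> str:
    """Stand-in for a format conversion step (e.g., via GDAL tooling)."""
    return f"{path}.{output_format}"

def output_format_aware(func):
    """Hypothetical wrapper: the requested output format determines what
    the wrapped ingestion function returns to the archive object."""
    @functools.wraps(func)
    def wrapper(*args, output_format="csv", **kwargs):
        path = func(*args, **kwargs)   # ingest the source, get a local file
        if output_format == "csv":
            return path                # a flat file can be handed back as-is
        return convert(path, output_format)  # other formats need conversion

    return wrapper

@output_format_aware
def ingest_url(url: str) -> str:
    """Toy ingestion function: pretend we downloaded the file."""
    return "/tmp/example.csv"

# usage: ingest_url("https://example.com/data.csv", output_format="pgdump")
```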

Usage & Other Applications

The repo is useful for anyone who wants to understand Data Engineering's catalogue of sources, which are the ingredients fed into the production pipelines that create EDM's wide range of data products. Users outside the Data Engineering team at DCP might also use the repo to establish their own archival process for these datasets. Researchers who regularly work with NYC public datasets and want a version control process can certainly benefit from using, or building on top of, Data Library.

Instructions for Uploading Datasets to DigitalOcean

Steps to QAQC Datasets before uploading to s3

  • Typically, you will want to verify that the data is correct locally before pushing to s3. To do this, run library archive -n [name_of_dataset] -o csv and inspect the output, making sure that the geometry columns contain lat/long coordinates that make sense for NYC geography (see the sketch below). The transformation from source geometry to destination geometry can go wrong for a variety of reasons, and this check helps catch problems before they impact our data products downstream. Once the engineer has confirmed that the geometry columns and the rest of the data look good, they can upload to s3.
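
A minimal local check might look like the following. The output file name and location are assumptions (check where the tool writes output in your setup), and the coordinate ranges are simply the rough bounding box of NYC.

```sh
# Build the dataset locally as a csv.
library archive -n dcp_example_dataset -o csv

# Eyeball the first few rows; the output path is an assumption here.
# For NYC, longitudes should fall roughly in [-74.3, -73.7] and
# latitudes roughly in [40.5, 40.9].
head -n 5 dcp_example_dataset.csv
```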

Upload to s3

  • Datasets in edm-recipes are used as source data for our data products and are generally read in as .sql files in the dataloading step, which is why you will often see dataset folders populated with only .sql files. Best practice, however, is to upload an .sql file AND a .csv (along with any other file types present in previous versions), since tabular data is more human-readable and makes any future debugging easier for engineers; see the sketch below.
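
Assuming the -o flag shown above accepts other output formats, producing both artifacts could be as simple as running the archive twice. The format name pgdump is an assumption based on the .sql files described above, not a flag value confirmed by this page.

```sh
# Hypothetical: produce both a .sql dump and a human-readable .csv for
# the same dataset. "pgdump" as a format name is an assumption.
library archive -n dcp_example_dataset -o pgdump
library archive -n dcp_example_dataset -o csv
```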