# Developer Conventions
## GitHub use
### Repos
Previously, every data product the team built had its own repo. We now use:

- `data-engineering` to build all data products, the QA app, and to extract source data to S3 file storage
- `edm-data-operations` to alert ITD's QA team of certain data updates
- `db-gru-qaqc` to run QAQC checks for GRU-maintained products
- `python-geosupport` to create Python bindings for DCP's geocoding software, Geosupport
### Branches
We have a single `main` branch and many feature branches. Feature branches are merged to `main` via pull requests.
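A minimal sketch of starting a feature branch (the branch name below is hypothetical):

```bash
git switch main && git pull                  # start from an up-to-date main
git switch -c feature/new-qaqc-checks        # create the feature branch
git push -u origin feature/new-qaqc-checks   # publish it for the eventual PR
```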
### Commits
In the interest of more atomic and meaningful commits on `main`, we strongly encourage:
- cleaning up commits before seeking review
- rebasing or squashing commits when merging a PR
To clean up commits on a dev branch:

- Casually make commits to your dev branch
- Locally rewrite your dev branch's history using `git rebase -i main` or `git reset --soft main`
- Force push the new commits via `git push --force-with-lease`
> [!NOTE]
> See these tips to split up a commit on a dev branch.
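As a concrete sketch of that cleanup flow (the commit message below is hypothetical):

```bash
# on your dev branch, with a few work-in-progress commits ahead of main
git rebase -i main            # mark commits as squash/fixup/reword in the editor

# or collapse everything into a single commit and re-craft it by hand
git reset --soft main
git commit -m "Add new QAQC checks"

# rewrite the remote branch; fails if someone else pushed in the meantime
git push --force-with-lease
```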
### Pull requests
Open a draft PR early in the development process. A descriptive title, links to relevant issues, and a short description all help reviewers get oriented. Draft PRs also help the author keep track of what's been done.
PRs must include a description of the changes.
For especially significant code, try to explain it for reviewers or request input during development:
- It's easiest to do this by commenting on specific lines of code, or on an entire file, in the PR diff.
- For higher-level brainstorming, starting a conversation in an issue or in Teams may be easier, both to keep the conversation organized and to avoid cluttering the PR.
Mark the PR as Ready for Review and request the reviewer(s) when development on the branch is finished.
Implement changes as needed, re-request final review after significant changes, and merge to main when the PR is approved! 🎉
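For those who prefer the terminal, the same lifecycle can be driven with GitHub's `gh` CLI (an assumption of this sketch; the web UI works just as well, and the title is hypothetical):

```bash
gh pr create --draft --title "Add new QAQC checks"   # open a draft PR early
gh pr ready                                          # mark it Ready for Review
gh pr merge --squash                                 # squash-merge once approved
```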
## Code formatting
We use `black` and `sqlfluff` to lint and format our Python and SQL code.
```bash
# check Python formatting without changing files
black --diff --color --check directory/file.py
# apply Python formatting
black directory/file.py
# lint SQL
sqlfluff lint products/.../file.sql
# fix SQL lint violations
sqlfluff fix products/.../file.sql
```
For code comments, we recommend using consistent tags (inspired by Better Comments):
```python
# TODO refactor this function to make it faster
# ! Deprecated function, do not use
# * This is an important note
# ? Should this variable be renamed for consistency?
```
## dbt
See dbt's own style guide for reference.
We use `dbt-checkpoint` to validate our dbt project conventions.
```bash
# Environment variables required by the product's profiles.yml file must be set
dbt deps --profiles-dir products/product_directory --project-dir products/product_directory
dbt seed --profiles-dir products/product_directory --project-dir products/product_directory
pre-commit run --all-files
```
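To have these checks run automatically on every commit rather than on demand, `pre-commit` can register them as a git hook (assuming the repo's pre-commit config lists the dbt-checkpoint hooks):

```bash
pre-commit install   # registers the hooks; they then run on each `git commit`
```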
### Model Folders/File Structure
We largely follow dbt's conventions but don't love the term "marts" for product/output tables, so we have:
- staging
- intermediate
- product
#### product
product models are output tables. Their columns are often renamed for business users. Every table that is exported and packaged as part of a build should be defined here.
Product tables do not need a prefix.
#### staging
staging models largely follow dbt's idea: any preprocessing step that does not join other tables and performs relatively simple operations without fundamentally changing the structure of the data can be a staging table.
All staging tables should have the `stg__` prefix.
We're still deciding whether every data source needs a staging table; for now, we're tentatively saying intermediate tables can reference source tables directly. But if you find yourself renaming columns, padding strings, etc. on a source table in an intermediate script, you should probably create a staging table. There is one exception, in which what could be a staging table is better suited to an intermediate folder. But before getting into that...
#### intermediate
intermediate models are everything else, i.e. the actual "transformation" logic of the pipeline.
All intermediate tables should have the `int__` prefix. It's perfectly fine to have a bunch of intermediate files all at the root level of this folder if they're logically named. However, we encourage grouping them into subfolders by the entities represented.
In the case of `green_fast_track`, many different data sources had transformations applied and were then buffered, creating many tables of buffered geometries to be used in the logic of flagging PLUTO lots. These can all go in `intermediate/buffers` (with filename prefixes `int_buffers__`). This is where the exception to the staging logic above applies. Some buffers were inherently complex and required joining tables to calculate; they do not belong in staging, so the `buffers` folder is in intermediate rather than staging. However, other buffers were created by performing relatively basic operations on a single data source and could very much live in staging. But since other buffers are already grouped in an intermediate subfolder, any would-be staging buffers can be put there instead.
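Putting the conventions above together, a model folder might look like the following sketch (all file names are hypothetical, not actual `green_fast_track` models):

```
models/
├── staging/
│   └── stg__parks.sql
├── intermediate/
│   ├── buffers/
│   │   └── int_buffers__parks.sql
│   └── int__flagged_lots.sql
└── product/
    └── lot_flags.sql
```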
## Style guides
### Code
### Data
## Learning resources
### Links
- python
- sql database connections (Gitlab orchestration_utils example)
- sql
- bash
- raising errors with `set -e` (see the sketch after this list)
- GitHub Emoji-Cheat-Sheet
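As a quick illustration of the `set -e` item above: it makes a script exit as soon as any command fails, rather than continuing in a bad state (the file name below is hypothetical):

```bash
#!/bin/bash
set -e                       # exit immediately if any command returns non-zero

cp missing_file.csv /tmp/    # this command fails...
echo "never reached"         # ...so this line never runs
```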