# Developer Conventions
## GitHub use
### Repos
Previously, every data product the team built had its own repo. We now use:

- `data-engineering` to build all data products, the QA app, and to extract source data to S3 file storage
- `edm-data-operations` to alert ITD's QA team of certain data updates
- `db-gru-qaqc` to run QAQC checks for GRU-maintained products
- `python-geosupport` to create Python bindings for DCP's geocoding software, Geosupport
### Branches
We have a single `main` branch and many feature branches. Feature branches are merged to `main` via pull requests.
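A minimal sketch of starting a feature branch (the branch name below is hypothetical):

```bash
git switch main && git pull                  # start from an up-to-date main
git switch -c feature/new-qaqc-checks        # create the feature branch
git push -u origin feature/new-qaqc-checks   # publish it for the eventual PR
```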
### Commits
In the interest of more atomic and meaningful commits on `main`, we strongly encourage:
- cleaning up commits before seeking review
- rebasing or squashing commits when merging a PR
To clean up commits on a dev branch:

- Casually make commits to your dev branch
- Locally rewrite your dev branch's history using `git rebase -i main` or `git reset --soft main`
- Force push the new commits via `git push --force-with-lease`
> [!NOTE]
> See these tips to split up a commit on a dev branch.
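As a concrete sketch of that cleanup flow (the commit message below is hypothetical):

```bash
# on your dev branch, with a few work-in-progress commits ahead of main
git rebase -i main            # mark commits as squash/fixup/reword in the editor

# or collapse everything into a single commit and re-craft it by hand
git reset --soft main
git commit -m "Add new QAQC checks"

# rewrite the remote branch; fails if someone else pushed in the meantime
git push --force-with-lease
```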
### Pull requests
Open a draft PR early in the development process. A descriptive title, links to relevant issues, and a short description all help reviewers get oriented. Draft PRs also help the author keep track of what's been done.
PRs must include a description of the changes.
For especially significant code, try to explain it for reviewers or request input during development:
- It's easiest to do this by commenting on specific lines of code, or on an entire file, in the PR diff.
- For higher-level brainstorming, starting a conversation in an issue or in Teams may be easier, both to keep the conversation organized and to avoid cluttering the PR.
Mark the PR as Ready for Review and request the reviewer(s) when development on the branch is finished.
Implement changes as needed, re-request final review after significant changes, and merge to main when the PR is approved! 🎉
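For those who prefer the terminal, the same lifecycle can be driven with GitHub's `gh` CLI (an assumption of this sketch; the web UI works just as well, and the title is hypothetical):

```bash
gh pr create --draft --title "Add new QAQC checks"   # open a draft PR early
gh pr ready                                          # mark it Ready for Review
gh pr merge --squash                                 # squash-merge once approved
```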
## Code formatting
We use `black` and `sqlfluff` to lint and format our Python and SQL code.
```bash
# check Python formatting without changing files
black --diff --color --check directory/file.py
# apply Python formatting
black directory/file.py
# lint SQL
sqlfluff lint products/.../file.sql
# fix SQL lint violations
sqlfluff fix products/.../file.sql
```
For code comments, we recommend using consistent tags (inspired by Better Comments):
```python
# TODO refactor this function to make it faster
# ! Deprecated function, do not use
# * This is an important note
# ? Should this variable be renamed for consistency?
```
## dbt
See dbt's own style guide for reference.
We use `dbt-checkpoint` to validate our dbt project conventions.
```bash
# Environment variables required by the product's profiles.yml file must be set
dbt deps --profiles-dir products/product_directory --project-dir products/product_directory
dbt seed --profiles-dir products/product_directory --project-dir products/product_directory
pre-commit run --all-files
```
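To have these checks run automatically on every commit rather than on demand, `pre-commit` can register them as a git hook (assuming the repo's pre-commit config lists the dbt-checkpoint hooks):

```bash
pre-commit install   # registers the hooks; they then run on each `git commit`
```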
### Model Folders/File Structure
We largely follow dbt's conventions but don't love the term "marts" for product/output tables, so we have:
- staging
- intermediate
- product
#### product
product models are output tables. Their columns are often renamed for business users. Every table that is exported and packaged as part of a build should be defined here.
Product tables do not need a prefix.
#### staging
staging models largely follow dbt's idea: any preprocessing step that does not join other tables and performs relatively simple operations without fundamentally changing the structure of the data can be a staging table.
All staging tables should have the `stg__` prefix.
We're still deciding whether every data source needs a staging table; for now, we're tentatively saying intermediate tables can reference source tables directly. But if you find yourself renaming columns, padding strings, etc. on a source table in an intermediate script, you should probably create a staging table. There is one exception, in which what could be a staging table is better suited to an intermediate folder. But before getting into that...
#### intermediate
intermediate models are everything else, i.e. the actual "transformation" logic of the pipeline.
All intermediate tables should have the `int__` prefix. It's perfectly fine to have a bunch of intermediate files all at the root level of this folder if they're logically named. However, we encourage grouping them into subfolders by the entities represented.
In the case of `green_fast_track`, many different data sources had transformations applied and were then buffered, creating many tables of buffered geometries to be used in the logic of flagging PLUTO lots. These can all go in `intermediate/buffers` (with filename prefixes `int_buffers__`). This is where the exception to the staging logic above applies. Some buffers were inherently complex and required joining tables to calculate; they do not belong in staging, so the `buffers` folder is in intermediate rather than staging. However, other buffers were created by performing relatively basic operations on a single data source and could very much live in staging. But since other buffers are already grouped in an intermediate subfolder, any would-be staging buffers can be put there instead.
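Putting the conventions above together, a model folder might look like the following sketch (all file names are hypothetical, not actual `green_fast_track` models):

```
models/
├── staging/
│   └── stg__parks.sql
├── intermediate/
│   ├── buffers/
│   │   └── int_buffers__parks.sql
│   └── int__flagged_lots.sql
└── product/
    └── lot_flags.sql
```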
## Style guides
### Code
### Data
## Learning resources
### Links
- python
- sql database connections (Gitlab orchestration_utils example)
- sql
- bash
- raising errors with `set -e` (see the sketch after this list)
- GitHub Emoji-Cheat-Sheet
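As a quick illustration of the `set -e` item above: it makes a script exit as soon as any command fails, rather than continuing in a bad state (the file name below is hypothetical):

```bash
#!/bin/bash
set -e                       # exit immediately if any command returns non-zero

cp missing_file.csv /tmp/    # this command fails...
echo "never reached"         # ...so this line never runs
```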