Developer Setup - NYCPlanning/data-engineering GitHub Wiki
Welcome to Data Engineering at DCP! This guide is intended to help you get set up to contribute to our codebase.
## Code
This repository is our primary location for code, issues, and automated workflows.
- If you don't already have a GitHub account, create one, and have a team member add you to the NYCPlanning organization. You can either link a personal account to the organization or make one for DCP purposes (some of us on the team do each).
- Generate SSH keys and add your public key to your GitHub account.
- Create a `.env` file in the local `data-engineering` directory. Add environment variables to the `.env` file; they will be used when creating a Docker `dev` container (see docker). A few others are included, but the basic ones needed for most of our pipelines are `BUILD_ENGINE`, `AWS_S3_ENDPOINT`, `AWS_SECRET_ACCESS_KEY`, and `AWS_ACCESS_KEY_ID`. Most of the relevant secrets can be found in 1Password.
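As a sketch of what the tooling expects from that file, the snippet below parses simple `KEY=VALUE` lines with the standard library and checks that the basic variables are present. The parser, sample values, and check are illustrative only, not part of our codebase; real values live in 1Password.

```python
REQUIRED = {"BUILD_ENGINE", "AWS_S3_ENDPOINT", "AWS_SECRET_ACCESS_KEY", "AWS_ACCESS_KEY_ID"}

def parse_env(text: str) -> dict:
    """Parse simple KEY=VALUE lines, ignoring blanks and comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env

sample = """
# placeholder values - real ones live in 1Password
BUILD_ENGINE=postgresql://user:password@host:5432/db
AWS_S3_ENDPOINT=https://example-endpoint
AWS_SECRET_ACCESS_KEY=xxx
AWS_ACCESS_KEY_ID=yyy
"""

env = parse_env(sample)
missing = REQUIRED - env.keys()
print("missing:", sorted(missing))  # -> missing: []
```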
## Tools

### VSCode

Most developers at DCP use VSCode.

Essential extensions to install:
- Python
- Dev Containers
- Pylance
- Docker
Other potentially useful ones:
- Jupyter
- GitLens
- CodeRunner
- Rainbow CSV
- Data Wrangler
- Power User for dbt
### 1Password

We store secrets, credentials, etc. in 1Password. Talk to a teammate to get set up.
### Other Tools

#### Required

#### Recommended
- Homebrew (if on Mac)
- IPython
- If running notebooks in VSCode, extensions can take care of install/setup
- Postgres or Postgres.app
- QGIS
#### Tools that are less used or being phased out
- R
- Poetry (Python package manager)
## Environment

This section describes the general workflow for running code (QA app, data pipeline, etc.) locally.
### Option 1 - Docker

The simplest way to develop and run pipelines locally is using a dev container. This is a dockerized environment that VSCode can connect to. While it's an effective way to simply set up a complete, production-ready environment (and ensure that code runs the same locally as it does on the cloud), it's also often less performant than running locally outside of a container. For now though, it's certainly still the best place to start (and generally we try to avoid running computationally expensive jobs on our own machines anyway).
All files needed for the container are stored in the `data-engineering/.devcontainer/` directory:

- `Dockerfile` describes how to build the "initial" image for our container. It's largely setting variables that VS Code expects to run in the container properly.
- `docker-compose.yml` describes how to set up our `dev` container. It also specifies that we need to build from the `Dockerfile` prior to initiating the container. We used to specify a postgres service as well, but have moved in favor of using a lighter-weight container and connecting to our persisted cloud dbs even when running locally. Now, this mainly exposes a port for running streamlit from inside the container and makes sure volumes are properly mounted.
- `devcontainer.json` is specifically used to create our `dev` container in VSCode. We don't need this file if we create the `dev` container from a terminal. It handles things like expected extensions for VSCode while running in the container, commands that should be run before or after starting the container, etc.
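For orientation, a `devcontainer.json` along these lines might look roughly like the sketch below. All values here are illustrative placeholders (the service name, workspace folder, and extension list are assumptions); the real file lives in `data-engineering/.devcontainer/`.

```json
{
    "name": "de",
    "dockerComposeFile": "docker-compose.yml",
    "service": "de",
    "workspaceFolder": "/workspace",
    "customizations": {
        "vscode": {
            "extensions": ["ms-python.python", "ms-python.vscode-pylance"]
        }
    },
    "postCreateCommand": "python3 -m pip install --requirement ./admin/run_environment/requirements.txt"
}
```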
There are (at least) 2 ways to spin up the container:

1. From VSCode (which will also then run VS Code within the container):
   - Open VSCode.
   - Open the cloned `data-engineering` directory.
   - VSCode will detect an existing container config file. Click on "Reopen in Container".
   - VSCode may ask for the passphrase associated with your GitHub SSH key. If you don't remember the passphrase but saved it in the Keychain Access app upon creation, you can find the password in the app.
   - If the container was started successfully, VSCode will reopen with your workspace running inside the container.
2. From the terminal:
   - Navigate to the `data-engineering/.devcontainer/` directory.
   - Run `docker-compose up (-d)`. This command will use the existing `.yml` file to set up the container. With `-d`, it will keep running in the background.
   - If you go this route, you can run VS Code outside of the container or within it. Both have advantages: inside the container, you can see performance issues; outside the container, you need just a little more decoration around running commands inside the container.
   - Running `docker exec -ti de bash` opens a terminal prompt in it.
### Option 2 - Manual setup

Outside of a dev container, we use tools like homebrew and python virtual environments. There are many ways to do this; we typically use `venv` or `pyenv` (pyenv repo, usage, tutorial). If you're familiar with conda, it would probably work fine as well; most of us just don't use it.
#### Mac

With homebrew, install:

- `gdal` - if possible, the same version as in `admin/run_environment/requirements.txt`
- postgres (latest version)
To install our python packages, you will need a virtualenv of your choice set up and activated:

```shell
python3 -m venv venv
source venv/bin/activate
```
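The creation step above can be sketched with the standard library's `venv` module, which is what `python3 -m venv` invokes under the hood (the temporary directory here is just for illustration):

```python
import os
import tempfile
import venv

# Create a throwaway virtual environment, as `python3 -m venv venv` would,
# and inspect the layout it produces.
with tempfile.TemporaryDirectory() as tmp:
    env_dir = os.path.join(tmp, "venv")
    venv.EnvBuilder(with_pip=False).create(env_dir)  # with_pip=False keeps it fast
    # The layout includes pyvenv.cfg plus bin/ (Scripts\ on Windows)
    print(sorted(os.listdir(env_dir)))
```

`source venv/bin/activate` then simply puts that environment's `bin/` directory at the front of your `PATH`.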
To install this repo's python packages and the `dcpy` package:

```shell
python3 -m pip install --requirement ./admin/run_environment/requirements.txt
python3 -m pip install --editable . --constraint ./admin/run_environment/constraints.txt
```
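To sanity-check the editable install, you can ask Python's packaging metadata for the package. This check is a suggestion of ours, not part of the repo's tooling:

```python
from importlib import metadata

# After `pip install --editable .`, the dcpy distribution should be registered.
try:
    print("dcpy version:", metadata.version("dcpy"))
except metadata.PackageNotFoundError:
    print("dcpy not installed - re-run the pip commands above")
```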
#### Windows

Create and activate a Python 3.13 environment:

```shell
conda create -n myenv python=3.13
conda activate myenv
```

With Conda (Miniconda/Anaconda) in a Git Bash terminal, install `gdal`, `libgdal`, and `libgdal-pg` from the conda-forge channel; run the following to ensure PostGIS support:

```shell
conda install -c conda-forge gdal libgdal libgdal-pg
```
- PostgreSQL (latest Windows installer)

(Optional but handy) expose the PostgreSQL CLI tools to your shell:

```shell
echo 'export PATH="$PATH:/c/Program Files/PostgreSQL/17/bin"' >> ~/.bashrc
source ~/.bashrc
```
Fix psql encoding (UTF-8) - the Windows psql client defaults to WIN1252, but our dumps are UTF-8. Add one of the following:

```shell
# Option 1 - shell-level (add to ~/.bashrc or ~/.bash_profile)
export PGCLIENTENCODING=UTF8
```

```sql
-- Option 2 - psql-level (add to ~/.psqlrc)
SET client_encoding = 'UTF8';
```
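To see why this matters, here is the kind of garbling a mismatched client produces: UTF-8 bytes interpreted as WIN1252 (`cp1252` in Python) turn accented characters into mojibake.

```python
# UTF-8 bytes for "Café", read back with the wrong (WIN1252/cp1252) encoding
utf8_bytes = "Café".encode("utf-8")   # b'Caf\xc3\xa9'
garbled = utf8_bytes.decode("cp1252")
print(garbled)  # -> CafÃ©
```

Setting the client encoding to UTF-8 makes psql decode the dump's bytes the same way they were written.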
To install this repo's Python packages and the `dcpy` package:

```shell
python -m pip install --upgrade pip
python -m pip install --requirement ./admin/run_environment/requirements.txt
python -m pip install --editable . --constraint ./admin/run_environment/constraints.txt
```