Jupyter Notebook Innovation at Netflix - BKJackson/BKJackson_Wiki GitHub Wiki
Netflix Notebook Pipeline
Netflix Tech Blog
Part 1: Notebook Innovation at Netflix
Part 2: Scheduling Notebooks at Netflix
Scaling Time Series Data Storage — Part II
The Evolution of Container Usage at Netflix
Netflix Conductor: A microservices orchestrator
Meson: Workflow Orchestration for Netflix Recommendations
Guiding Principles for Notebook Development
At Netflix we’ve adopted notebooks as an integration tool, not as a library replacement.
What this means is that we’re not using notebooks as code libraries and consequently aren’t pressing for unit level tests on our notebooks, as those should be encapsulated by the underlying libraries. Instead, we promote guiding principles for notebook development:
-
Low Branching Factor: Keep your notebooks fairly linear. If you have many conditionals or potential execution paths, it becomes hard to ensure end-to-end tests are covering the desired use cases well.
-
Library Functions in Libraries: If you do end up with complex functions which you might reuse or refactor independently, these are good candidates for a coding library rather than in a notebook. Providing your notebooks in git repositories means you can position shared unit-tested code in that same repository as your notebooks, rather than trying to unit test complex notebooks.
-
Short and Simple is Better: A notebook which generates lots of useful outputs and visuals with a few simple cells is better than a ten page manual. This makes your notebooks more shareable, understandable, and maintainable.
Most Common Ways of Using Notebooks at Netflix
- Data access
"From our users’ perspective, notebooks offer a convenient interface for iteratively running code, exploring output, and visualizing data — all from a single cloud-based development environment. We also maintain a Python library that consolidates access to platform APIs. This means users have programmatic access to virtually the entire platform from within a notebook. Because of this combination of versatility, power, and ease of use, we’ve seen rapid organic adoption for all user types across our entire platform." Source - Notebook templates
"A parameterized notebook is exactly what it sounds like: a notebook which allows you to specify parameters in your code and accept input values at runtime. This provides an excellent mechanism for users to define notebooks as reusable templates." - Scheduling notebooks
"Many users construct an entire workflow in a notebook, only to have to copy/paste it into separate files for scheduling when they’re ready to deploy it. By treating notebooks as a logical workflow, we can easily schedule it the same as any other workflow."
"We can schedule other types of work through notebooks, too. When a Spark or Presto job executes from the scheduler, the source code is injected into a newly-created notebook and executed. That notebook then becomes an immutable historical record, containing all related artifacts — including source code, parameters, runtime config, execution logs, error messages, and so on. When troubleshooting failures, this offers a quick entry point for investigation, as all relevant information is colocated and the notebook can be launched for interactive debugging."
Papermill - a tool for parameterizing, executing, and analyzing Jupyter Notebooks
- parameterize notebooks
- execute and collect metrics across the notebooks
- summarize collections of notebooks
Since Papermill controls its own runtime processes, we don’t need any notebook server or other infrastructure to execute against notebook kernels. This eliminates some of the complexities that come with hosted notebook services as we’re executing in a simpler context.
With Papermill we can compute runtime parameters and inject them into a notebook, run the notebook, and store the outcomes to our data warehouse. This architecture decouples parameterized notebooks from scheduling, providing flexibility in choosing a scheduler. Thus just about any cron string and/or event consuming tool can enable running the work we’ve setup so far.
Scheduling a notebook with Papermill
Example
papermill s3://etlbucket/jobs/templates/vid_agg.ipynb s3://etlbucket/jobs/outputs/${timestamp}_vid_agg_fr.ipynb -p region_code fr
Scheduling Notebooks on the Cloud
How Netflix does it
When the user schedules a notebook, the scheduler copies the user’s notebook from Amazon EFS to a common directory on Amazon S3. The notebook on S3 becomes the source of truth for the scheduler, or source notebook.
Each time the scheduler runs a notebook, it instantiates a new notebook from the source notebook. This new notebook is what actually executes and becomes an immutable record of that execution, containing the code, output, and logs from each cell. We refer to this as the output notebook.
Containerization and Notebooks
Netflix employs a highly-scalable containerized architecture on AWS. All jobs on the Data Platform run on containers — including queries, pipelines, and notebooks. Naturally, we wanted to abstract away as much of this complexity as possible.
A container is provisioned when a user launches a notebook server. We provide reasonable defaults for container resources, which works for ~87.3% of execution patterns. When that’s not enough, users can request more resources using a simple interface.
We also provide a unified execution environment with a prepared container image. The image has common libraries and an array of default kernels preinstalled. Not everything in the image is static — our kernels pull the most recent versions of Spark and the latest cluster configurations for our platform. This reduces the friction and setup time for new notebooks and generally keeps us to a single execution environment.
Under the hood we’re managing the orchestration and environments with Titus, our Docker container management service. We further wrap that service by managing the user’s particular server configuration and image. The image also includes user security groups and roles, as well as common environment variables for identity within included libraries. This means our users can spend less time on infrastructure and more time on data.
Commuter for Notebook Collaboration at Netflix
Collaboration is fundamental to how we work at Netflix. It came as no surprise then when users started sharing notebook URLs. As this practice grew, we ran into frequent problems with accidental overwrites caused by multiple people concurrently accessing the same notebook . Our users wanted a way to share their active notebook in a read-only state. This led to the creation of Commuter. Behind the scenes, Commuter surfaces the Jupyter APIs for /files and /api/contents to list directories, view file contents, and access file metadata. This means users can safely view notebooks without affecting production jobs or live-running notebooks.