Reproducible Data Science Resources - hackalog/bus_number GitHub Wiki

What do other people recommend for reproducible data science?

5 Easy Steps to Make Your Data Science Project Reproducible:

A Good Project Structure: Here they actually recommend cookiecutter-data-science, which was the original basis of cookiecutter-easydata
Make Use of Virtual Environments: Done. See make create_environment, make requirements
Follow Best Practices while Coding: They mention PEP 8 and the logging module. We try to make use of both, though we don't mention them specifically
Documentation, Documentation, Documentation: Yep. The hooks are there, though they haven't made it into the tutorial yet. This will be part of the Reproducible Results section
Automation: They specifically mention using a Makefile, which we do. We also make use of a workflow module, though we don't yet get into the more advanced workflow libraries
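
As a rough illustration of the "best practices" step, here is a minimal sketch of PEP 8-style code that reports through the standard logging module instead of print. The function names (configure_logging, drop_missing) are made up for this example and are not part of this project's code.

```python
import logging

# Module-level logger, named after the module as the logging docs suggest.
logger = logging.getLogger(__name__)


def configure_logging(level=logging.INFO):
    """Send log records to the console with a timestamped format."""
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(name)s %(levelname)s: %(message)s",
    )


def drop_missing(rows):
    """Return only the rows (dicts) with no None values, logging the count removed.

    Hypothetical helper, used here only to show a log call in context.
    """
    kept = [row for row in rows if None not in row.values()]
    logger.info("Dropped %d of %d rows", len(rows) - len(kept), len(rows))
    return kept
```

Calling configure_logging() once at the start of a script or notebook is enough; every later drop_missing call then leaves a record of how many rows it discarded.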

A Quick Guide to Organizing [Data Science] Projects (updated for 2018)

This article is based loosely on A quick guide to organizing computational biology projects. Here's how we hold up to their approach:

"Someone unfamiliar with your project should be able to look at your computer files and understand in detail what you did and why."
Everything you do, you will probably have to do over again.
To this end, we use:
- Self-documenting makefiles
- Virtual (conda) Environments for reproducibility
File and directory organization
- To this end, we use a cookiecutter and a standard, well-documented filesystem layout
- We also insist that raw data is read-only. This was the reason for the DataSource object
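
The read-only rule can also be backed up at the filesystem level. The sketch below is not how the DataSource object works; it is a separate, hypothetical helper (make_read_only) that clears the write permission bits on a raw data file so that accidental edits fail.

```python
import stat
from pathlib import Path


def make_read_only(path: Path) -> None:
    """Clear the user, group, and other write bits on a single file.

    Hypothetical helper for illustration; the read bits are left untouched.
    """
    mode = path.stat().st_mode
    write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    path.chmod(mode & ~write_bits)
```

This guards only against accidental writes: a user who owns the file can restore the write bit, and on Windows chmod only toggles the read-only flag.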
The Lab notebook
- We recommend using Jupyter Notebooks for self-documenting code and EDA
Carrying out a single experiment
- Workflows
  - We use self-documenting Makefiles and some magic in the workflow module, but this article also mentions Luigi and Snakemake, which are certainly worth looking at
- Experiments
  - Our Dataset objects were designed to encapsulate processed data and associated metadata
  - Our Model object was designed to encapsulate the metadata about a single experiment
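
To make the idea of "data plus metadata in one object" concrete, here is a small sketch. SketchDataset is a hypothetical stand-in written for this page, not the project's actual Dataset class, and its fields and fingerprint method are assumptions.

```python
import hashlib
import json
from dataclasses import dataclass, field


@dataclass
class SketchDataset:
    """Hypothetical stand-in: processed data bundled with its metadata."""

    data: list
    metadata: dict = field(default_factory=dict)

    def fingerprint(self) -> str:
        """Hash data and metadata together so a change in either is detectable."""
        payload = json.dumps(
            {"data": self.data, "metadata": self.metadata}, sort_keys=True
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Because the fingerprint covers the metadata as well as the data, two runs that produce the same values from different sources or parameters can still be told apart.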
Handling and Preventing Errors
- We incorporate various forms of testing (doctests, property-based tests, unit tests) into our framework out of the gate
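
As one example of the kinds of tests listed above, a doctest keeps the usage example and the check in the same place. The normalize function is invented for this sketch and does not come from the project.

```python
import doctest


def normalize(values):
    """Scale values so that they sum to 1.

    >>> normalize([1, 1, 2])
    [0.25, 0.25, 0.5]
    >>> normalize([])
    []
    """
    total = sum(values)
    if total == 0:
        # Nothing to scale (empty input or all zeros); return a copy unchanged.
        return list(values)
    return [v / total for v in values]


if __name__ == "__main__":
    # Run every docstring example in this module; silent when all pass.
    doctest.testmod()
```

The same examples can be collected by a test runner, for instance with pytest's --doctest-modules option, so they run alongside unit tests.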
Command Lines versus Scripts versus Programs
- We encourage exploratory development in notebooks, but then working code should be moved to the editable module (src by default). We include infrastructure to make this easy to do
The Value of Version Control
- We include basic git workflows, and talk about using git from the beginning (Reproducible Environments)

Building a Repeatable Data Analysis Process with Jupyter Notebooks

Directory Structures
- The raw/interim/processed data directory layout is virtually identical to what we do.
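
A minimal sketch of setting up that layout with pathlib follows. The parent directory name "data" and the helper create_data_dirs are assumptions made for this example; only the raw, interim, and processed stage names come from the text above.

```python
from pathlib import Path

# Stage names taken from the raw/interim/processed convention.
DATA_STAGES = ("raw", "interim", "processed")


def create_data_dirs(project_root: Path, data_dir: str = "data") -> dict:
    """Create one subdirectory per stage and return a stage-to-path mapping.

    Hypothetical helper; safe to call repeatedly because existing
    directories are left alone.
    """
    paths = {}
    for stage in DATA_STAGES:
        stage_path = project_root / data_dir / stage
        stage_path.mkdir(parents=True, exist_ok=True)
        paths[stage] = stage_path
    return paths
```

For example, create_data_dirs(Path(".")) creates data/raw, data/interim, and data/processed under the current directory.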
Notebook Structure
Operationalizing & Customizing This Approach
- They also use cookiecutter

Other Projects

resumableds also appears to be inspired by cookiecutter-data-science. It is worth a look to see what they are doing.