Reproducible Data Science Resources - hackalog/bus_number GitHub Wiki
What do other people recommend for reproducible data science?
5 Easy Steps to Make Your Data Science Project Reproducible:
- A Good Project Structure: Here they actually recommend cookiecutter-data-science, which was the original basis of cookiecutter-easydata
- Make Use of Virtual Environments: Done. See `make create_environment`, `make requirements`
- Follow Best Practices while Coding: They mention pep8 and the `logging` module. We do try to make use of both, though we don't mention them specifically
- Documentation, Documentation, Documentation: Yep. Hooks are there, though it hasn't made it to the tutorial yet. This will be part of the Reproducible Results section
- Automation: They specifically mention using a `Makefile`, which we do. We also make use of a `workflow` module, though we don't yet get into the more advanced workflow libraries
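
As a rough illustration of the `logging` recommendation above, here is a minimal sketch using only the Python standard library. The logger name `bus_number_example` and the function `load_rows` are hypothetical and are not taken from this project's code.

```python
import logging

# Hypothetical logger name, used only for this illustration.
logger = logging.getLogger("bus_number_example")


def load_rows(rows):
    """Return the non-empty rows, logging how many were dropped."""
    kept = [r for r in rows if r]
    dropped = len(rows) - len(kept)
    if dropped:
        logger.warning("Dropped %d empty rows", dropped)
    logger.info("Loaded %d rows", len(kept))
    return kept


if __name__ == "__main__":
    # Configure logging once, at the entry point, rather than inside library code.
    logging.basicConfig(
        level=logging.INFO,
        format="%(levelname)s %(name)s: %(message)s",
    )
    load_rows(["a", "", "b"])
```

Using a named logger instead of `print` lets a notebook or a `make` target decide how verbose the output should be without touching the function itself.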
A Quick Guide to Organizing [Data Science] Projects (updated for 2018)
This article is based loosely on A quick guide to organizing computational biology projects. Here's how we hold up to their approach:
- "Someone unfamiliar with your project should be able to look at your computer files and understand in detail what you did and why."
- Everything you do, you will probably have to do over again.
- To this end, we use:
  - Self-documenting makefiles
  - Virtual (conda) environments for reproducibility
- File and directory organization
  - To this end, we use a cookiecutter and a standard, well-documented filesystem layout
  - We also insist that raw data is read-only: this was the reason for the `DataSource` object
- The Lab notebook
  - We recommend using Jupyter Notebooks for self-documenting code and EDA
- Carrying out a single experiment
  - Workflows
  - Experiments
    - Our `Dataset` objects were designed to encapsulate processed data and associated metadata
    - Our `Model` object was designed to encapsulate the metadata about a single experiment
- Handling and Preventing Errors
  - We incorporate various forms of testing (doctests, property-based tests, unit tests) into our framework out of the gate
- Command Lines versus Scripts versus Programs
  - We encourage exploratory development in notebooks, but then working code should be moved to the editable module (`src` by default). We include infrastructure to make this easy to do
- The Value of Version Control
  - We include basic git workflows, and talk about using git from the beginning (Reproducible Environments)
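
To make the doctest idea from the list above concrete, here is a minimal sketch of a function as it might look after being moved from a notebook into a module. The function `normalize` is a hypothetical example and does not come from this project; only the standard-library `doctest` module is used.

```python
import doctest


def normalize(values):
    """Scale a non-empty list of numbers so that they sum to 1.

    >>> normalize([1, 1, 2])
    [0.25, 0.25, 0.5]
    >>> normalize([0, 0])
    Traceback (most recent call last):
        ...
    ValueError: values must have a non-zero sum
    """
    total = sum(values)
    if total == 0:
        raise ValueError("values must have a non-zero sum")
    return [v / total for v in values]


if __name__ == "__main__":
    # Runs every docstring example in this module and reports any failures.
    doctest.testmod()
```

The docstring doubles as documentation and as a test, which suits code that started life as a notebook cell with a visible output.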
Building a Repeatable Data Analysis Process with Jupyter Notebooks
- Directory Structures
  - The raw/interim/processed data directory layout is virtually identical to what we do.
- Notebook Structure
- Operationalizing & Customizing This Approach
  - They also use cookiecutter
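
For readers who have not seen a raw/interim/processed layout, here is a minimal sketch of setting one up with the standard library. The parent directory name `data` and the helper names `make_data_dirs` and `mark_read_only` are assumptions made for illustration; they are not this project's API, and the project's own tooling may create the layout differently.

```python
import os
import stat
import tempfile
from pathlib import Path


def make_data_dirs(project_root):
    """Create data/raw, data/interim and data/processed under project_root."""
    data_dir = Path(project_root) / "data"  # assumed parent directory name
    paths = {}
    for stage in ("raw", "interim", "processed"):
        path = data_dir / stage
        path.mkdir(parents=True, exist_ok=True)
        paths[stage] = path
    return paths


def mark_read_only(path):
    """Remove all write permission bits from a file."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        dirs = make_data_dirs(root)
        raw_file = dirs["raw"] / "example.csv"
        raw_file.write_text("a,b\n1,2\n")
        # Raw data is treated as read-only once it has been fetched.
        mark_read_only(raw_file)
        print(sorted(p.name for p in (Path(root) / "data").iterdir()))
```

Clearing the write bits is only a guard against accidental edits; it does not stop a determined user, so the real protection is the convention that derived data goes to the interim and processed directories.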
Other Projects
resumableds looks to also be inspired by cookiecutter-data-science. Worth a look to see what they are doing.