Day 2 - QCB-Collaboratory/W17.MachineLearning GitHub Wiki


Classification, performance and cross-validation

  • Slides are available here.

  • The video is available here.


Class materials

  • Here is a (static) Jupyter Notebook with the code from day 2

  • For a LIVE and editable version of the Jupyter Notebook: Binder

The live version above requires no account and virtually nothing installed on your own computer. It usually takes a few minutes (on average 3 min., certainly less than 10) for the notebook to become ready, but once it is on your screen, it runs smoothly.

Synthetic dataset to practice classification

  • Click here to download the dataset used with Decision Trees.

Because this dataset presents a very simple and obvious pattern, it is a great way to practice constructing classifiers. Here is a plot of what the dataset looks like:
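A minimal sketch of the kind of exercise this dataset supports: fitting a decision tree to simple 2-D data. Since the download link above points to the workshop's own CSV, this example instead generates a comparable toy dataset with scikit-learn so it runs on its own; the cluster parameters are illustrative assumptions, not the workshop's data.

```python
# Sketch: train a decision tree on a simple, well-separated 2-D dataset.
# make_blobs stands in for the workshop's synthetic CSV (assumption).
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Two well-separated clusters mimic a "simple and obvious pattern"
X, y = make_blobs(n_samples=200, centers=2, cluster_std=1.0, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# A shallow tree is enough for an obvious pattern
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X_train, y_train)

print(f"Test accuracy: {clf.score(X_test, y_test):.2f}")
```

On data this clean, even a depth-3 tree should classify the held-out points almost perfectly, which is exactly why the dataset is good for practicing the mechanics before worrying about harder problems.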

After-the-class practice


Resources to use Jupyter Notebooks with other languages


Non-technical reading about overfitting in biology

Overfitting is a classical problem in the Machine Learning literature, but is often overlooked in applications. Here's a famous quote from John von Neumann:

With four parameters I can fit an elephant and with five I can make him wiggle his trunk.

We provide below a list of papers that discuss overfitting and its impact on quantitative biology and medicine. These are, of course, just a few examples - if you know of any other interesting resource, please let us know.

What You See May Not Be What You Get: A Brief, Nontechnical Introduction to Overfitting in Regression-Type Models, by Dr. Babyak, published in Psychosomatic Medicine

Statistical models, such as linear or logistic regression or survival analysis, are frequently used as a means to answer scientific questions in psychosomatic research. Many who use these techniques, however, apparently fail to appreciate fully the problem of overfitting, ie, capitalizing on the idiosyncrasies of the sample at hand. Overfitted models will fail to replicate in future samples, thus creating considerable uncertainty about the scientific merit of the finding.

Rules of evidence for cancer molecular-marker discovery and validation, by Dr. Ransohoff, published in Nature Reviews Cancer.

This paper focuses in depth on the problem of overfitting that, in discovery-based research, could lead to promising, but non-reproducible, results. Other problems or challenges to validity, described briefly, should be considered in similar depth.

Applications of machine learning in animal behaviour studies, by Dr. Valletta et al., published in Animal Behaviour.

ML algorithms can deal with nonlinearities and interactions among variables because the models are flexible enough to fit the data (as opposed to rigid linear regression models, for example). However, this flexibility needs to be constrained to avoid fitting noise (overfitting). Hyperparameters, specific to the ML algorithm, are tuned by cross-validation to strike a balance between underfitting and overfitting, known as the bias–variance trade-off (see Fig. 2a).
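The tuning procedure described in that excerpt can be sketched in a few lines: score a flexibility hyperparameter (here, tree depth) by k-fold cross-validation and watch the underfitting/overfitting trade-off appear. The dataset and the depth grid are illustrative assumptions, not taken from any of the papers above.

```python
# Sketch: tune max_depth of a decision tree by 5-fold cross-validation.
# Dataset and depth grid are illustrative (assumption).
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           random_state=0)

depths = [1, 2, 3, 5, 8, 12, None]   # None = grow until pure (overfit-prone)
for depth in depths:
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    scores = cross_val_score(clf, X, y, cv=5)   # 5 held-out folds
    print(f"max_depth={depth}: mean CV accuracy = {scores.mean():.3f}")
```

Very shallow trees underfit (high bias) and very deep trees overfit (high variance); picking the depth with the best mean cross-validated score is the balance the bias–variance trade-off refers to.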


Finally, we also selected a list of papers that apply specific methods to avoid overfitting in biological and medical applications.


Beyond accuracy

Below you will find a list of recent publications that go beyond accuracy, exploring precision and recall as better descriptors of their models' performance. These papers were "randomly" selected from the literature.

One common approach to using precision and recall is the construction of ROC curves. Here are a few review papers that discuss how this is employed in bio-medical applications.
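To make the contrast between accuracy and these other metrics concrete, here is a small sketch computing precision, recall, and ROC AUC with scikit-learn on an imbalanced dataset. The classifier and class weights are illustrative assumptions, not drawn from the papers listed.

```python
# Sketch: accuracy vs. precision/recall/ROC AUC on imbalanced data
# (90% negative class, so accuracy alone can be misleading). Assumption:
# an illustrative logistic-regression classifier, not any paper's model.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_prob = clf.predict_proba(X_test)[:, 1]   # class scores needed for ROC

print("accuracy :", clf.score(X_test, y_test))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("ROC AUC  :", roc_auc_score(y_test, y_prob))
```

A classifier that simply predicted the majority class would already score 90% accuracy here, while its recall on the rare class would be zero; the ROC curve, built by sweeping a threshold over the class scores, summarizes performance across all operating points.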
