Day 2
- Slides are available here.
- The video is available here.
Here is a (static) Jupyter Notebook with the code from Day 2.
The live version above does not require an account or virtually any installation on your own computer. It usually takes a few minutes for the notebook to be ready (on average 3 min., certainly less than 10), but once it is on your screen, it runs smoothly.
- Click here to download the dataset used with Decision Trees.
Because this dataset presents a very simple and obvious pattern, it is a great way to practice building classifiers (see the short decision-tree sketch after the dataset links below). Here is a plot of what this dataset looks like:
- Click here for the Banana Dataset.
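As referenced above, here is a minimal sketch of how a small two-feature dataset like these could be used to practice building a classifier with scikit-learn. The file name and column layout below are hypothetical placeholders; adjust them to match the file you downloaded.

```python
# Minimal decision-tree sketch (not part of the workshop notebooks).
# "decision_tree_dataset.csv" and its column layout are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

data = pd.read_csv("decision_tree_dataset.csv")
X = data.iloc[:, :2].values   # first two columns assumed to be the features
y = data.iloc[:, 2].values    # third column assumed to be the class label

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```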
- Julia language for Jupyter
- R language for Jupyter
- Kernel connecting MATLAB with Jupyter
Overfitting is a classical problem in the Machine Learning literature, but is often overlooked in applications. Here's a famous quote from John von Neumann:
With four parameters I can fit an elephant and with five I can make him wiggle his trunk.
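To make the quote concrete, here is a small, self-contained sketch (not part of the workshop code) showing how adding parameters lets a polynomial chase the noise in a handful of points: training error keeps dropping while error on fresh data grows.

```python
# Toy illustration of overfitting: fit polynomials of increasing degree
# to a few noisy samples of a sine wave and compare train vs. test error.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 12)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)  # noisy training data

x_test = np.linspace(0, 1, 100)
y_test = np.sin(2 * np.pi * x_test)                              # noise-free "truth"

for degree in (1, 3, 11):
    coeffs = np.polyfit(x, y, degree)                 # fit polynomial of given degree
    train_err = np.mean((np.polyval(coeffs, x) - y) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE = {train_err:.3f}, test MSE = {test_err:.3f}")

# The degree-11 fit nearly interpolates the training points (training error
# close to zero) but typically generalizes worse than the degree-3 fit.
```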
We provide below a list of papers that discuss overfitting and its impact on quantitative biology and medicine. These are, of course, just a few examples; if you know of any other interesting resources, please let us know.
What You See May Not Be What You Get: A Brief, Nontechnical Introduction to Overfitting in Regression-Type Models, by Dr. Babyak, published in Psychosomatic Medicine.
Statistical models, such as linear or logistic regression or survival analysis, are frequently used as a means to answer scientific questions in psychosomatic research. Many who use these techniques, however, apparently fail to appreciate fully the problem of overfitting, ie, capitalizing on the idiosyncrasies of the sample at hand. Overfitted models will fail to replicate in future samples, thus creating considerable uncertainty about the scientific merit of the finding.
Rules of evidence for cancer molecular-marker discovery and validation, by Dr. Ransohoff, published in Nature Reviews Cancer.
This paper focuses in depth on the problem of **overfitting** that, in **discovery-based research**, could lead to promising but non-reproducible results. Other problems or challenges to validity, described briefly, should be considered in similar depth.
Applications of machine learning in animal behaviour studies, by Dr. Valleta et al., published in Animal Behaviour.
ML algorithms can deal with nonlinearities and interactions among variables because the models are flexible enough to fit the data (as opposed to rigid linear regression models, for example). However, this flexibility needs to be constrained to avoid fitting noise (overfitting). Hyperparameters, specific to the ML algorithm, are tuned by cross-validation to strike a balance between underfitting and overfitting, known as the bias–variance trade-off (see Fig. 2a).
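As a concrete illustration of the tuning procedure described in that excerpt, the sketch below uses scikit-learn's GridSearchCV to choose a decision tree's depth by 5-fold cross-validation. The dataset and parameter grid are illustrative choices, not taken from any of the papers above.

```python
# Hyperparameter tuning by cross-validation: the tree depth controls model
# flexibility, and k-fold CV picks a value balancing under- and overfitting.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [1, 2, 3, 5, 8, None]},  # None = fully grown tree
    cv=5,                                             # 5-fold cross-validation
    scoring="accuracy",
)
search.fit(X, y)

print("Best max_depth:", search.best_params_["max_depth"])
print("Cross-validated accuracy:", round(search.best_score_, 3))
```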
Finally, we also selected a list of papers that apply specific methods to avoid overfitting in biological and medical applications.
- Controlling the Overfitting of Heritability in Genomic Selection through Cross Validation, published in Scientific Reports.
- Structural biology: Proteins in dynamic equilibrium, published in Nature.
- Prediction of ovarian cancer prognosis and response to chemotherapy by a serum-based multiparametric biomarker panel, published in British Journal of Cancer.
- Improved estimation of SNP heritability using Bayesian multiple-phenotype models, published in European Journal of Human Genetics.
- Model-based analysis of multishell diffusion MR data for tractography: How to get over fitting problems, published in Magnetic Resonance in Medicine (2012).
- Prevention of overfitting in cryo-EM (electron cryomicroscopy) structure determination, by Dr. Scheres, published in Nature Methods.
- Automated pulse discrimination of two freely-swimming electric fish during dominance contest, published in Journal of Physiology Paris.
- Parameter estimation in systems biology models using spline approximation, published in BMC Systems Biology.
Below you will find a list of recent publications that go beyond accuracy, exploring precision and recall as better descriptors of the performance of their models. These papers were "randomly" selected from the literature (a short example of computing these metrics follows the list).
- A method to predict the impact of regulatory variants from DNA sequence, by Dongwon Lee et al. in Nature Genetics.
- Critical Assessment of Metagenome Interpretation—a benchmark of metagenomics software, by Alexander Sczyrba et al. in Nature Methods.
- A CRISPR-based screen for Hedgehog signaling provides insights into ciliary function and ciliopathies, by David K. Breslow in Nature Genetics.
- The genomic landscape of Neanderthal ancestry in present-day humans, by Sriram Sankararaman et al. in Nature.
- MetaMass, a tool for meta-analysis of subcellular proteomics data, by Fridtjof Lund-Johansen et al. in Nature Methods.
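For reference, this is roughly how precision and recall are computed with scikit-learn. The toy, imbalanced predictions below are invented to show why accuracy alone can be misleading.

```python
# Precision and recall alongside accuracy on a deliberately imbalanced example.
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical labels and predictions (1 = positive class, 10% of the data).
y_true = np.array([0] * 90 + [1] * 10)
y_pred = np.array([0] * 90 + [1] * 3 + [0] * 7)   # misses 7 of the 10 positives

print("Accuracy :", accuracy_score(y_true, y_pred))   # 0.93 -- looks good
print("Precision:", precision_score(y_true, y_pred))  # 1.00
print("Recall   :", recall_score(y_true, y_pred))     # 0.30 -- reveals the problem
```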
A common, closely related approach to evaluating classifiers beyond accuracy is the construction of ROC (receiver operating characteristic) curves, which plot the true positive rate against the false positive rate as the decision threshold varies. Here are a few review papers that discuss how this is employed in biomedical applications (a short sketch of building a ROC curve follows the list).
- Evaluating the added predictive ability of a new marker: From area under the ROC curve to reclassification and beyond, by Michael Pencina et al. in Statistics in Medicine.
- Receiver-operating characteristic (ROC) plots: a fundamental evaluation tool in clinical medicine, by M H Zweig and G Campbell in Clinical Chemistry.
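And here is the ROC-curve sketch referenced above: a probabilistic classifier is scored on a held-out set, and scikit-learn's roc_curve sweeps the decision threshold. The dataset and model are illustrative choices only.

```python
# Building a ROC curve: score a held-out set, then sweep the decision threshold.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]        # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, scores)  # one (FPR, TPR) point per threshold
print("Area under the ROC curve:", round(roc_auc_score(y_test, scores), 3))
```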