ML Noted Strengths

ML is probably way too susceptible to overfitting to be used in many applications. However, there are a lot of lessons that we can learn from it and from studying it.

  1. Representations - ML may be exceptional at representing data in efficient ways (e.g. via factorization or dimension reduction), such as with word embeddings (e.g. word2vec); see the first sketch after this list.
    • Constructive Distraction - Unsupervised pretraining (e.g. with stacked RBMs followed by discriminative fine-tuning) gets you into a part of the function space that you wouldn't otherwise visit if you were only doing supervised learning. I.e. it's very important to initialize your weights well, and good initialization has made a lot of previously unsuccessful techniques work.
  2. Overfitting - Since ML is so susceptible to overfitting, and since the outputs of ML models aren't very interpretable, a lot of focus is placed on avoiding overfitting, much more than with classic statistical techniques. For example, the realization that adding noise to weights/inputs/activations can be equivalent to a weight penalty (see Hinton's lecture notes, 9b?) is a key result; a numerical sketch of this equivalence for a linear model appears after this list.
    • Sampling over Global Fitting (aka Bayesian over Frequentist) - Sampling, in its many forms (mini-batches, training/validation/test splits, mixtures of experts, boosting, bagging, and corruption, e.g. skip-gram with negative sampling (SGNS), dropout (aka neuron sampling), denoising autoencoders, GANs), seems to have fantastic advantages with respect to overfitting; a dropout sketch appears after this list.
  3. Practical over theoretical - There really is no theory in ML, so practicality wins the day. This does result, however, in a lot of trial and error (e.g. over different network architectures). But again, techniques have been developed that are applicable elsewhere (e.g. cross-validation, sketched after this list). We aren't 'data scientists,' we're 'data engineers.'
    • A lesson: Mathematical/theoretical convenience is expensive. E.g. using flat-bottomed parabolas as a loss rather than the analytically convenient unmodified ones; see the loss sketch after this list.
  4. Optimization efficiency - A wider variety of cost functions is used than in classic statistical techniques. Also, GPUs.
  5. Noise is good - SGD. "Noisy [Hopfield] networks find better energy minima." Dropout. Regularization via noise. Markov Chain Monte Carlo (for drawing representative samples from a distribution).
  6. High-dimensional search - It's not realistic to construct a grid over all parameters or over all input dimensions. Dynamic programming is too dense (shared sub-problems only get you so far in terms of increased efficiency). Markov Chain Monte Carlo (MCMC) is a brilliant search method that, with high probability, explores only the "interesting" (high-probability, e.g. near-minimum) regions of the search space; see the Metropolis sketch after this list.
    • Also, the Frequentist view that you need to fit less complex models when you don't have much data is trumped by the Bayesian view that it's fine to fit complex models to small datasets, as long as you're not selecting a single (maximum-likelihood) model but averaging over many plausible ones (sketched after this list).
  7. Non-linear generalization - Linear techniques can easily be generalized to non-linear ones by changing the units in a NN. For example, PCA can be done (inefficiently) with a NN (a linear autoencoder), but PCA via SVD cannot easily be generalized to non-linear manifolds in the input space; see the autoencoder sketch after this list.
  8. DF <> Complexity - An autoencoder can have arbitrary complexity yet still, via its bottleneck, effectively limit a model's degrees of freedom. Typically model complexity and degrees of freedom are intricately tied together; when one increases so does the other.
    • Hobbling of models to prevent too many degrees of freedom (and overfitting): greedy layerwise pretraining, GANs (adversarial training).
  9. Distribution modeling and sampling bias - A linear regression fit to a biased sample won't reflect the underlying population unless you weight the points according to how representative they are; see the weighted-regression sketch after this list.
    • We are all fooled by sampling bias (== "fooled by randomness"?). We over-fear outrageous, low-probability outcomes because they are outrageous, i.e. we hear about them. We under-fear more common, less-improbable outcomes because we don't hear about them as much.
    • Properly estimating distributions is one of the major goals of machine learning (e.g. GANs, see here).
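
Below are some minimal sketches of the points above, all in Python/NumPy; they use invented data and names and are hedged illustrations, not canonical implementations. First, on item 1 (representations): word2vec itself learns embeddings by prediction, but skip-gram with negative sampling has been analyzed as implicitly factorizing a co-occurrence statistic, so a count-based stand-in (co-occurrence counts plus a truncated SVD) gives the flavor of "representation by factorization". With a toy corpus the similarity numbers are only suggestive.

```python
import numpy as np

# A toy corpus; in practice this would be millions of sentences.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
    "a dog and a cat are pets",
]
tokens = [s.split() for s in corpus]
vocab = sorted({w for sent in tokens for w in sent})
index = {w: i for i, w in enumerate(vocab)}

# Symmetric co-occurrence counts within a +/-2 word window.
C = np.zeros((len(vocab), len(vocab)))
for sent in tokens:
    for i, w in enumerate(sent):
        for j in range(max(0, i - 2), min(len(sent), i + 3)):
            if j != i:
                C[index[w], index[sent[j]]] += 1

# Factorize (truncated SVD) to get dense low-dimensional word vectors.
U, S, Vt = np.linalg.svd(np.log1p(C))
k = 3
vectors = U[:, :k] * S[:k]

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

print(cosine(vectors[index["cat"]], vectors[index["dog"]]))  # related words tend to end up closer
print(cosine(vectors[index["cat"]], vectors[index["on"]]))
```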
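
On item 2: for a linear model with squared error, adding Gaussian noise of variance sigma^2 to the inputs is, in expectation, equivalent to an L2 weight penalty of n * sigma^2 (the kind of result referenced from Hinton's lectures). A small numerical check on invented data:

```python
import numpy as np

rng = np.random.default_rng(7)
n, d = 200, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)

sigma2 = 0.5  # variance of the noise added to the inputs

# Closed-form ridge solution with penalty n * sigma2.
w_ridge = np.linalg.solve(X.T @ X + n * sigma2 * np.eye(d), X.T @ y)

# Minimize squared error over many noise-corrupted copies of the inputs.
K = 400
X_noisy = np.vstack([X + rng.normal(scale=np.sqrt(sigma2), size=X.shape) for _ in range(K)])
y_rep = np.tile(y, K)
w_noisy = np.linalg.lstsq(X_noisy, y_rep, rcond=None)[0]

print(w_ridge)
print(w_noisy)  # the two estimates should nearly coincide
```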
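
Also on item 2's sub-bullet: dropout is literally sampling. Each training-time forward pass samples a random "thinned" sub-network, and test time approximates an average over all of them. A sketch of a single layer with inverted dropout (layer shapes and names are made up):

```python
import numpy as np

rng = np.random.default_rng(6)

def dense_layer(x, W, b, drop_p=0.5, training=True):
    """One fully connected ReLU layer with (inverted) dropout on its outputs."""
    h = np.maximum(0.0, x @ W + b)
    if training and drop_p > 0:
        mask = rng.random(h.shape) >= drop_p   # sample a random sub-network
        h = h * mask / (1.0 - drop_p)          # rescale so the expected activation is unchanged
    return h

x = rng.normal(size=(4, 8))
W = 0.1 * rng.normal(size=(8, 16))
b = np.zeros(16)
h_train = dense_layer(x, W, b, training=True)    # a different "thinned" network on every call
h_test = dense_layer(x, W, b, training=False)    # test time approximates averaging over them
```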
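
On item 3: cross-validation is the kind of purely practical tool that transfers anywhere. A bare-bones k-fold sketch, used here to pick a ridge penalty on invented data:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(120, 20))
w = rng.normal(size=20)
y = X @ w + rng.normal(scale=2.0, size=120)

def kfold_mse(X, y, lam, k=5):
    """Mean held-out squared error of ridge regression with penalty `lam`."""
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    errs = []
    for f in folds:
        train = np.setdiff1d(idx, f)
        A = X[train].T @ X[train] + lam * np.eye(X.shape[1])
        w_hat = np.linalg.solve(A, X[train].T @ y[train])
        errs.append(np.mean((X[f] @ w_hat - y[f]) ** 2))
    return np.mean(errs)

for lam in [0.0, 0.1, 1.0, 10.0, 100.0]:
    print(lam, round(kfold_mse(X, y, lam), 3))   # pick the penalty with the lowest CV error
```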
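
On item 3's sub-bullet: reading "flat-bottomed parabola" as a squared epsilon-insensitive loss (an assumption on my part), the contrast with the analytically convenient unmodified parabola is just:

```python
import numpy as np

def squared_loss(r):
    """The analytically convenient choice: an ordinary parabola."""
    return r ** 2

def flat_bottomed_loss(r, eps=0.5):
    """Zero inside [-eps, eps] (the flat bottom), an ordinary parabola outside it."""
    return np.maximum(0.0, np.abs(r) - eps) ** 2

residuals = np.linspace(-2, 2, 9)
print(squared_loss(residuals))
print(flat_bottomed_loss(residuals))  # residuals inside the flat bottom are simply ignored
```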
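
On item 6: a random-walk Metropolis sampler needs only an unnormalized (log-)density, and its samples concentrate in high-probability regions instead of covering the space on a grid. A toy bimodal target:

```python
import numpy as np

rng = np.random.default_rng(4)

def log_p(x):
    """Unnormalized log-density: a two-mode mixture in 2-d (an invented target)."""
    return np.logaddexp(-0.5 * np.sum((x - 1.5) ** 2),
                        -0.5 * np.sum((x + 1.5) ** 2))

def metropolis(log_p, x0, steps=20_000, scale=2.0):
    """Random-walk Metropolis: accept a proposal with probability min(1, p'/p)."""
    x = np.asarray(x0, dtype=float)
    lp = log_p(x)
    samples = np.empty((steps, x.size))
    for t in range(steps):
        prop = x + scale * rng.normal(size=x.size)   # random-walk proposal
        lp_prop = log_p(prop)
        if np.log(rng.uniform()) < lp_prop - lp:
            x, lp = prop, lp_prop
        samples[t] = x
    return samples

samples = metropolis(log_p, [0.0, 0.0])
near_a = np.mean(np.linalg.norm(samples - 1.5, axis=1) < 2.0)
near_b = np.mean(np.linalg.norm(samples + 1.5, axis=1) < 2.0)
print(near_a, near_b)   # most samples sit near one of the two modes
```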
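
On item 6's sub-bullet: with very little data and a flexible model, the single maximum-likelihood fit is wild, whereas averaging predictions over the posterior behaves sensibly. Here Bayesian linear regression over polynomial features stands in for "not selecting a single model"; the precisions alpha and beta are assumed values:

```python
import numpy as np

rng = np.random.default_rng(1)

# Six data points, a 9-parameter polynomial model: far too flexible for a single fit.
x = rng.uniform(-1, 1, size=6)
y = np.sin(3 * x) + 0.1 * rng.normal(size=x.size)
Phi = np.vander(x, 9, increasing=True)

w_ml = np.linalg.lstsq(Phi, y, rcond=None)[0]      # single maximum-likelihood model

# Bayesian linear regression: prior w ~ N(0, (1/alpha) I), noise precision beta.
alpha, beta = 1.0, 100.0
S = np.linalg.inv(alpha * np.eye(9) + beta * Phi.T @ Phi)
S = (S + S.T) / 2                                  # keep the covariance numerically symmetric
m = beta * S @ Phi.T @ y                           # posterior mean

x_test = np.linspace(-1, 1, 5)
Phi_test = np.vander(x_test, 9, increasing=True)

# Average predictions over many posterior weight samples instead of trusting w_ml alone.
w_samples = rng.multivariate_normal(m, S, size=2000)
y_bayes = (Phi_test @ w_samples.T).mean(axis=1)
y_ml = Phi_test @ w_ml
print(np.c_[x_test, np.sin(3 * x_test), y_bayes, y_ml])  # the averaged fit is usually far tamer
```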
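
On items 7 and 8: a bottleneck autoencoder trained by plain gradient descent. With linear units it should land near the reconstruction error of rank-k PCA, and making it non-linear changes nothing but the unit, which is the easy non-linear generalization. Data, architecture, and hyperparameters are all invented:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data that is approximately 2-dimensional inside a 10-d space.
Z = rng.normal(size=(500, 2))
A = rng.normal(size=(2, 10))
X = Z @ A + 0.1 * rng.normal(size=(500, 10))
X -= X.mean(axis=0)                        # center, as PCA would

def train_autoencoder(X, k=2, nonlinear=False, lr=0.02, steps=8000):
    """Bottleneck autoencoder x -> h -> x_hat trained by plain gradient descent."""
    n, d = X.shape
    W1 = 0.1 * rng.normal(size=(d, k))     # encoder weights
    W2 = 0.1 * rng.normal(size=(k, d))     # decoder weights
    for _ in range(steps):
        pre = X @ W1
        H = np.tanh(pre) if nonlinear else pre   # the only change needed for non-linearity
        X_hat = H @ W2
        G = (X_hat - X) / n                # gradient of 0.5 * mean squared error w.r.t. X_hat
        dW2 = H.T @ G
        dH = G @ W2.T
        dpre = dH * (1 - H ** 2) if nonlinear else dH
        dW1 = X.T @ dpre
        W1 -= lr * dW1
        W2 -= lr * dW2
    return np.mean((X_hat - X) ** 2)

ae_err = train_autoencoder(X, k=2, nonlinear=False)

# Rank-2 PCA reconstruction via SVD, for comparison.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
X_pca = (X @ Vt[:2].T) @ Vt[:2]
print(ae_err, np.mean((X_pca - X) ** 2))   # the linear autoencoder should land near the PCA error
```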
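
On item 9: an unweighted regression fits the sample you saw, not the population you care about; weighting each point by (roughly) the inverse of its sampling probability corrects for the bias. The population, the bias, and the weights below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

# "Population": y is non-linear in x, so the best *linear* fit depends on
# which part of the x-range you get to see.
x_pop = rng.uniform(0, 10, size=100_000)
y_pop = x_pop ** 2 + rng.normal(scale=2.0, size=x_pop.size)

def linear_fit(x, y, w=None):
    """Least-squares fit of y ~ a + b*x, optionally weighted by w."""
    X = np.c_[np.ones_like(x), x]
    if w is not None:
        sw = np.sqrt(w)
        X, y = X * sw[:, None], y * sw
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_pop = linear_fit(x_pop, y_pop)                    # what we would like to estimate

# Biased sample: small x is heavily over-represented.
p = np.exp(-x_pop / 2)
idx = rng.choice(x_pop.size, size=5000, replace=False, p=p / p.sum())
x_s, y_s = x_pop[idx], y_pop[idx]

b_naive = linear_fit(x_s, y_s)                      # fooled by the sampling bias
b_weighted = linear_fit(x_s, y_s, w=1.0 / p[idx])   # inverse-probability weights

print(b_pop, b_naive, b_weighted)  # the weighted fit should sit much nearer the population fit
```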