Machine Learning Tutorial

Following the book 'Introduction to Machine Learning with Python' by Andreas C. Müller & Sarah Guido.

Libraries

  • NumPy One of the fundamental packages for scientific computing in Python. It contains functionality for multidimensional arrays, high-level mathematical functions such as linear algebra operations and the Fourier transform, and pseudorandom number generators.

  • SciPy Collection of functions for scientific computing in Python. It provides advanced linear algebra routines, mathematical function optimisation, signal processing, special mathematical functions, and statistical distributions.

  • Scikit-learn Open source project that contains a number of state-of-the-art machine learning algorithms. The most prominent Python library for ML. It is built on top of NumPy and SciPy.

  • Matplotlib Primary scientific plotting library in Python. It provides functions for visualising data.

  • Pandas Python library for data wrangling and analysis. It is built around a structure called the DataFrame, which works like a table. It allows SQL-like queries and joins of tables. In contrast with NumPy, pandas allows each column to have a separate type (integers, dates, ...). It can ingest data from a large variety of file formats and databases, such as SQL databases, Excel files, and CSV files.

The following imports will be assumed:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import mglearn # Custom library from the book

Toolkit

Useful Concepts

Supervised Learning

Supervised learning is used whenever we want to predict a certain outcome from a given input, and we have examples of input/output pairs. There are two major types: Classification and Regression.

  • In Classification the goal is to predict a class label (binary classification or multiclass classification). Examples: Is this email spam or not? Is this vehicle a motorcycle, a car, or a bus?

  • In Regression the goal is to predict a continuous number. Example: Predicting a person's annual income from their education, their age and where they live.

Characteristics:

  • Generalization: if a model is able to make accurate predictions on unseen data, we say it is able to generalize from the training set to the test set. (figure: complexity/generalization tradeoff)

  • Overfitting

Occurs when you fit a model too closely to the particularities of the training set and obtain a model that works well on the training set but is not able to generalize to new data.

  • Underfitting

Occurs when you choose a model that is too simple. Lower complexity generally means worse performance on the training set; past a certain point it also hurts generalization, because the model fails to capture the structure of the data.
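
To make these ideas concrete, here is a minimal sketch of how training and test accuracy are typically compared. It uses k-nearest neighbors (introduced in the next section) on the breast cancer dataset; the dataset and n_neighbors values are illustrative choices, not taken from this wiki.

# Compare training and test accuracy for models of different complexity.
# A large gap (high train, lower test score) suggests overfitting;
# low scores on both sets suggest underfitting.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=66)

for n_neighbors in [1, 6, 50]:  # few neighbors = complex model, many = simple model
    knn = KNeighborsClassifier(n_neighbors=n_neighbors).fit(X_train, y_train)
    print(n_neighbors, knn.score(X_train, y_train), knn.score(X_test, y_test))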

Algorithms for Supervised Learning

The simplest algorithm in machine learning is k-Nearest Neighbors. To make a prediction for a new data point, the algorithm finds the closest data points in the training dataset, its "nearest neighbors". Building the model consists only of storing the training dataset.
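
A minimal sketch of that workflow; the synthetic dataset and the n_neighbors value are just illustrative choices:

# k-nearest neighbors: "training" only stores the data; prediction is a
# majority vote among the closest training points.
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_blobs(centers=2, random_state=0)          # small synthetic dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=3)              # look at the 3 closest points
knn.fit(X_train, y_train)                              # just stores the training data
print(knn.predict(X_test[:5]))                         # vote among the 3 neighbors
print(knn.score(X_test, y_test))                       # accuracy on unseen data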

Linear Models

The general prediction formula for a linear model looks as follows:

ŷ = w[0] * x[0] + w[1] * x[1] + ... + w[p] * x[p] + b

where x[0] to x[p] denote the features of a single data point, w and b are parameters learned from the training data, and ŷ is the prediction the model makes.

Linear models are a group of machine learning algorithms that are fast to train and fast to predict. They also scale to very large datasets and work well with sparse data. They are often used on very large datasets, simply because it is not feasible to train other models. However, other models might yield better generalisation performance in lower-dimensional spaces.

These models perform well when the number of features is large compared to the number of samples. For one-dimensional datasets there is little danger of overfitting, but with higher-dimensional datasets linear models become more powerful and there is a higher chance of overfitting.

When features are highly correlated, the coefficients can be hard to interpret.

OBS: If your data consists of hundreds of thousands or millions of samples, you might want to investigate using solver='sag' in LogisticRegression and Ridge, which can be faster than the default. Other options are the SGDClassifier class and the SGDRegressor class, which implement even more scalable versions.
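
For illustration, a hedged sketch of what switching to these large-scale options might look like; the synthetic dataset and all parameter values are placeholder choices, not recommendations from the book:

# Large-dataset options: the 'sag' solver and plain stochastic gradient descent.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, SGDClassifier

X, y = make_classification(n_samples=50_000, n_features=50, random_state=0)

# Stochastic Average Gradient solver, often faster than the default on big data.
logreg = LogisticRegression(solver='sag', max_iter=1000).fit(X, y)

# Even more scalable: stochastic gradient descent on a linear classifier.
sgd = SGDClassifier(random_state=0).fit(X, y)

print(logreg.score(X, y), sgd.score(X, y))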

Algorithms for regression:

Algorithms for classification:

* Although the name contains "regression", logistic regression is a classification algorithm.

To understand some limitations of linear models, take a look at the image below. They stem from the fact that lines and hyperplanes have limited flexibility.

(figure: a bad case of using linear models)

Naive Bayes Classifiers

Naive Bayes classifiers are a family of classifiers that are quite similar to the linear models discussed in the previous section. However, they tend to be even faster in training. The price paid for this efficiency is that naive Bayes models often provide generalization performance that is slightly worse than that of linear classifiers like LogisticRegression and LinearSVC. The reason that naive Bayes models are so efficient is that they learn parameters by looking at each feature individually and collecting simple per-class statistics from each feature. There are three kinds of naive Bayes classifiers implemented in scikit-learn: GaussianNB, BernoulliNB, and MultinomialNB. BernoulliNB and MultinomialNB are mostly used in text data classification.

Naive Bayes models share many of the strengths and weaknesses of linear models. However, they are often used on very large datasets, where training even a linear model might take too long.
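
A minimal sketch of the per-class statistics idea using GaussianNB; the tiny hand-made dataset is only an example:

# Gaussian naive Bayes on continuous features: the model only stores the
# per-class mean and variance of each feature.
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[0.5, 1.2], [1.1, 0.9], [5.0, 6.1], [6.2, 5.8]])
y = np.array([0, 0, 1, 1])

gnb = GaussianNB().fit(X, y)
print(gnb.theta_)                               # per-class feature means
print(gnb.predict([[1.0, 1.0], [6.0, 6.0]]))    # predicted class labels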

Decision Trees

Decision trees are widely used models for classification and regression tasks. Essentially, they learn a hierarchy of if/else questions, leading to a decision. The goal is to get us to the true answer most quickly. In the machine learning setting, these questions are called tests. The tests that are used on continuous data are of the form “Is feature i larger than value a?”

A leaf of the tree that contains data points that all share the same target value is called pure.

The following images show graphically how the complexity of the model evolves as the depth of the tree is controlled.

(figures: decision tree decision boundaries at depth 1, depth 2, and depth 9)

To make a prediction for regression tasks, we traverse the tree based on the tests in each node and find the leaf the new data point falls into. The output for this data point is the mean target of the training points in this leaf.
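
A small sketch of that idea with a regression tree; the synthetic sine data and the depth are illustrative choices:

# Regression tree: the prediction for a point is the mean target value of the
# training points that fall into the same leaf, so predictions are piecewise constant.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 5, size=(80, 1)), axis=0)
y = np.sin(X.ravel()) + rng.normal(scale=0.1, size=80)

reg = DecisionTreeRegressor(max_depth=3).fit(X, y)
print(reg.predict([[1.0], [4.0]]))   # mean target of each point's leaf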

Typically, building a tree as described here and continuing until all leaves are pure leads to models that are very complex and highly overfit to the training data (like the depth 9 image). The presence of pure leaves means that a tree is 100% accurate on the training set.

There are two common strategies to prevent overfitting:

  • Stopping the creation of the tree early (also called pre-pruning)
  • Building the tree but then removing or collapsing nodes that contain little information (also called post-pruning or just pruning)

OBS: scikit-learn only implements pre-pruning, not post-pruning.
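
A minimal pre-pruning sketch, assuming the breast cancer dataset used in the book; the max_depth value is just an example, not a tuned setting:

# Pre-pruning: limit the depth of the tree while it is being built.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

unpruned = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)

# The unpruned tree is perfect on the training set (pure leaves) but usually
# generalizes worse than the pre-pruned tree.
print(unpruned.score(X_train, y_train), unpruned.score(X_test, y_test))
print(pruned.score(X_train, y_train), pruned.score(X_test, y_test))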

To see how to visualize the tree or to check the most important features in a model, check the Tree Analysis page.

Decision trees have two advantages over many of the algorithms we’ve discussed so far: the resulting model can easily be visualized and understood by nonexperts (at least for smaller trees), and the algorithms are completely invariant to scaling of the data. As each feature is processed separately, and the possible splits of the data don’t depend on scaling, no preprocessing like normalization or standardization of features is needed for decision tree algorithms.

The main downside of decision trees is that even with the use of pre-pruning, they tend to overfit and provide poor generalization performance.

Ensembles of Decision Trees

There are many models in the machine learning literature that belong to this category, but there are two ensemble models that have proven to be effective on a wide range of datasets: random forests and gradient boosted decision trees.

  • Random Forest

Random forests are one way to address the problem of overfitting in decision tree algorithms. They're among the most widely used machine learning methods. A random forest is essentially a collection of decision trees, where each tree is slightly different from the others. The idea behind them is that each tree might do a relatively good job of predicting, but will likely overfit on part of the data. If we build many trees, all of which work well and overfit in different ways, we can reduce the amount of overfitting by averaging their results.

There are two ways in which the trees in a random forest are randomized: by selecting the data points used to build a tree (Bootstrap Sample) and by selecting the features in each split test.

(TODO: Better separate the concepts, tips, and algorithms in the links below)

Random forests don’t tend to perform well on very high-dimensional, sparse data, such as text data. For this kind of data, linear models might be more appropriate. Random forests, however, usually work well even on very large datasets. A minimal sketch follows.
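
This sketch uses the two-moons toy dataset; the dataset and parameter values are illustrative, in the spirit of the toy examples the book uses:

# Random forest: many randomized trees, averaged. n_estimators controls how many
# trees are built; max_features controls the randomness of each split test.
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=100, noise=0.25, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

forest = RandomForestClassifier(n_estimators=100, max_features='sqrt',
                                n_jobs=-1, random_state=0)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))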

  • Gradient Boosted Regression Trees

The gradient boosted regression tree is another ensemble method that combines multiple decision trees to create a more powerful model. Despite the “regression” in the name, these models can be used for both regression and classification. In contrast to the random forest approach, gradient boosting works by building trees in a serial manner, where each tree tries to correct the mistakes of the previous one. The trees are also called weak learners because they are designed to be simple.

By default, there is no randomization in gradient boosted regression trees; instead, strong pre-pruning is used. Gradient boosted trees often use very shallow trees, of depth one to five, which makes the model smaller in terms of memory and makes predictions faster. They are generally a bit more sensitive to parameter settings than random forests, but can provide better accuracy if the parameters are set correctly. They are also frequently the winning entries in machine learning competitions, and are widely used in industry.

A common approach is to first try random forests, which work quite robustly. If random forests work well but prediction time is at a premium, or it is important to squeeze out the last percentage of accuracy from the machine learning model, moving to gradient boosting often helps.

If you want to apply gradient boosting to a large-scale problem, it might be worth looking into the xgboost package and its Python interface, which at the time of writing is faster (and sometimes easier to tune) than the scikit-learn implementation of gradient boosting on many datasets.

Their main drawback is that they require careful tuning of the parameters and may take a long time to train.
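
A minimal gradient boosting sketch; the dataset and the values of n_estimators, learning_rate, and max_depth are illustrative defaults, not tuned settings:

# Gradient boosted trees: shallow trees built sequentially, each correcting the
# mistakes of the previous ones. learning_rate and max_depth are the usual knobs.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbrt = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                  max_depth=3, random_state=0)
gbrt.fit(X_train, y_train)
print(gbrt.score(X_train, y_train), gbrt.score(X_test, y_test))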

Kernelized Support Vector Machines

Kernelized support vector machines are an extension of support vector machines that allows for more complex models, ones that are not defined simply by hyperplanes in the input space. By adding nonlinear features to the representation of the data, they make linear models much more powerful.

There is a clever mathematical trick that allows us to learn a classifier in a higher-dimensional space without actually computing the new, possibly very large representation. This is known as the kernel trick, and it works by directly computing the distance (more precisely, the scalar products) of the data points for the expanded feature representation, without ever actually computing the expansion.

To map the data into a higher-dimensional space in the context of support vector machines, there are two common ways:

  • The polynomial kernel, which computes all possible polynomials up to a certain degree of the original features (like feature1 ** 2 * feature2 ** 5);

  • The radial basis function kernel (RBF kernel), also known as the Gaussian kernel, which considers all possible polynomials of all degrees, but the importance of the features decreases for higher degrees.

  • Understanding SVMs

  • Kernelized SVM for Classification

While SVMs often perform quite well, they are very sensitive to the settings of the parameters and to the scaling of the data.

In particular, they require all the features to vary on a similar scale. A common way to achieve that is to rescale all features so they lie between 0 and 1 (check out the MinMaxScaler class).

Kernelized SVMs work well on low-dimensional and high-dimensional data, but they don’t scale very well with the number of samples. Running an SVM on data with up to 10,000 samples might work well, but working with datasets of size 100,000 or more can become challenging in terms of runtime and memory usage. Another downside of SVMs is that they require careful preprocessing of the data and tuning of the parameters. Furthermore, SVM models are hard to inspect; it can be difficult to understand why a particular prediction was made, and it might be tricky to explain the model to a nonexpert.

The important parameters in kernel SVMs are the regularization parameter C, the choice of the kernel, and the kernel-specific parameters.
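
A minimal kernelized SVM sketch with MinMaxScaler rescaling; the values for C and gamma are illustrative, not tuned values from the book:

# RBF-kernel SVM with features rescaled to [0, 1]; C and gamma are the two
# parameters that usually need careful tuning.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

svm = make_pipeline(MinMaxScaler(), SVC(kernel='rbf', C=10, gamma=0.1))
svm.fit(X_train, y_train)
print(svm.score(X_test, y_test))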

Neural Networks

Check the Neural Networks page to get a full explanation of the subject.