Machine Learning Algorithms

Introduction

This section of the wiki provides information on some of the algorithms used while carrying out the initial data analysis. The algorithms which produced the most promising results were the k-nearest neighbours algorithm and the decision tree algorithm, so these are the ones covered here.

Both are classification algorithms.

k-Nearest Neighbors

The k-nearest neighbours algorithm is also referred to as K-NN or kNN.

KNN is a non-parametric, supervised learning classifier which uses proximity to make classifications or predictions about the grouping of an individual data point. Although the KNN algorithm can be used for regression tasks, it is predominantly used for classification [1], and this is the context in which we will examine it. It is one of the simplest, yet most commonly used, classifiers in supervised learning [2].

The KNN algorithm does not technically train a model to make predictions; instead, an observation is predicted to belong to the class of the largest portion of its k nearest observations [2].

When looking at classification problems, each class is assigned a label, and a new data point is given the label that appears most commonly among the points around it. It is important to keep in mind that while this process is referred to as a majority vote, this is not technically true: a majority vote would suggest that greater than 50% is required to assign a label, but this is not the case. For example, suppose a grouping contained four labels, label1, label2, label3 and label4, with the following percentages: label1: 35%, label2: 15%, label3: 25%, label4: 25%. Here, label1 is the label assigned to the group, even though it accounts for less than half of the neighbours.

Algorithm Steps

The high-level steps for the KNN algorithm can be described as:

  • Decide on the value of k: k is a hyperparameter, which is picked by the user. This will be discussed more below.
  • Calculate the distance between each training data point and the input data point
  • Sort the distances of the training examples to the input data point. The training data that are closest to the new data will go to the top of the list
  • Select the k data points that are closest to the input data point
  • Use the labels from the k-nearest neighbours to predict the label of the new data point

The main steps listed above are shown in the diagram below [1]:
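
As an illustration of these steps, below is a minimal sketch in Python using NumPy and a made-up two-feature toy dataset; it is not the project's own implementation, just an example of how the listed steps could be coded.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the class of x_new from its k nearest training points."""
    # Step 2: Euclidean distance from the new point to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Steps 3-4: sort the distances and keep the indices of the k closest points
    nearest = np.argsort(distances)[:k]
    # Step 5: the most common label among the k neighbours is the prediction
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Hypothetical toy data: two features, two classes (0 and 1)
X_train = np.array([[1.0, 1.2], [0.9, 0.8], [5.0, 5.1], [5.2, 4.9]])
y_train = np.array([0, 0, 1, 1])

print(knn_predict(X_train, y_train, np.array([1.1, 1.0]), k=3))  # expected: 0
```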

Distances

The goal of KNN is to determine the group of nearest neighbours; to do this, the algorithm needs to measure distances. The process starts with a query point: the distance from the query point to each of its neighbours is calculated using one of several possible distance measures. A popular choice is the Euclidean distance, which uses the following equation:

$$d(x,y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

where $n$ is the number of features used to describe the data points $x$ and $y$.

The Euclidean distance is represented in the image below:

The image shows that the Euclidean distance is simply the measure of the straight line between two points (i.e. the new data point and the training data).

Other methods include Manhattan, Minkowski, and Hamming distances.
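
To make these distance measures concrete, the following sketch computes the Euclidean, Manhattan and an order-3 Minkowski distance for two hypothetical feature vectors using NumPy:

```python
import numpy as np

# Two hypothetical feature vectors
x = np.array([2.0, 3.0])
y = np.array([5.0, 7.0])

euclidean = np.sqrt(np.sum((x - y) ** 2))          # straight-line distance: 5.0
manhattan = np.sum(np.abs(x - y))                  # sum of absolute differences: 7.0
minkowski = np.sum(np.abs(x - y) ** 3) ** (1 / 3)  # order-3 Minkowski distance

print(euclidean, manhattan, minkowski)
```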

When looking at the Euclidean distance method, the distance between the chosen query point and the surrounding data points is calculated, and it is these distances that are used to form what are called decision boundaries. The decision boundaries are what separate the data points into various regions.

What is the k?

In the simplest terms, the k value in KNN is the number of neighbours considered when classifying a given data point. If we select k=1, a new data point is simply assigned the class of its single nearest neighbour; with k=5, its five nearest neighbours vote on the class, and so on.

One must be careful when setting a value for k, as a poor choice can lead to over-fitting or under-fitting. The best value of k will depend on the data, but odd values of k are often recommended so that the vote cannot end in a tie.
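
As a sketch of how a value of k might be chosen in practice, the snippet below compares a few odd values of k using scikit-learn's KNeighborsClassifier and 5-fold cross-validation; the iris dataset is used purely as a stand-in for the project's own data.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Compare a few odd values of k using 5-fold cross-validation
for k in [1, 3, 5, 7, 9]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(f"k={k}: mean accuracy = {scores.mean():.3f}")
```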

The KNN algorithm has several advantages. First, it is called a "lazy" learning algorithm, meaning it does not build a model during a separate training phase; instead, as discussed, it stores all of the training data and uses it directly to make predictions. This means that KNN can be useful for smaller datasets.

Algorithm Pseudo-code

Below is the pseudo-code for the KNN algorithm [3]:

  1. Calculate $d(x, y_i)$ for each training point $y_i$, $i = 1, 2, \ldots, n$
  2. Arrange the calculated $n$ distances in non-decreasing order
  3. Choose a positive integer $k$ and take the first $k$ distances from this sorted list
  4. Find the $k$ points corresponding to these $k$ distances
  5. Let $k_i$ denote the number of these points which are members of the $i^{th}$ class
  6. If $k_i > k_j \ \forall \ j \neq i$, assign the query point $x$ to class $i$

Decision Trees

Decision trees are a powerful machine learning algorithm which can perform both classification and regression tasks and are capable of fitting large, complex datasets.

Classification tasks can be broken into two steps: learning and prediction. In the learning step, the model is developed based on a provided dataset. In the prediction step, the trained model is used to make predictions on new data. Decision trees work by learning decision rules from prior data. These rules are then used to predict the class or value of a new data point [4].

As one might expect, when discussing decision trees, there are a lot of tree-related terminologies. Some common terms used when discussing decision trees can be found in the Explanation of Terms section.

When predicting a class label, we start at the root of the tree. The algorithm works by recursively splitting the data into subsets. This split is carried out based on the most significant feature at each node of the tree, and each split can be thought of as a new branch. The most significant feature is the one which best splits the data into two groups. This idea is shown below:

Decision trees themselves can be split into two groups:

  • Categorical variable decision trees
  • Continuous variable decision trees

We can see that the type of decision tree needed for a task is based on the type of target variable.
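
As a brief illustration (not the project's own code), the sketch below fits a scikit-learn DecisionTreeClassifier to the iris dataset, which has a categorical target; a DecisionTreeRegressor would be the equivalent choice for a continuous target.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Categorical target -> classifier; DecisionTreeRegressor would suit a continuous target
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)

print("Test accuracy:", tree.score(X_test, y_test))
```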

Advantages and Disadvantages

Decision trees have several advantages, which include [5]:

  • Simple to understand and interpret
  • Can be visualised
  • Requires little data preparation (i.e. normalisation, dummy variables etc.)
  • Can use both numerical and categorical data
  • Can be used for multi-output problems
  • Models can be validated using statistical tests and analysis

Some disadvantages would include [5]:

  • Trees can become overly complex and fail to generalise well (i.e. over-fitting)
  • Decision trees can be unstable, meaning that small changes in data can result in completely different trees
  • Some problems are not suited to analysis by decision trees
  • Decision tree learners can create biased trees if some classes within the dataset are dominant. For this reason, it is important to balance the datasets.

Making predictions

To try and understand how the decision tree algorithm makes predictions, we will use an example where weather conditions are used to decide whether kids should go out and play. To do this, we will use weather data gathered from the previous ten days, which contains information on the weather type (sunny, raining, etc.), the temperature, the humidity, and the wind conditions. Finally, information was gathered on whether the kids went out to play. The decision tree for this could look something like this:

As one would expect, we start from the first node, where we have to make a decision. This creates a split, sometimes called the decision boundary. The decision boundary is determined through an analysis process where all possible decision boundaries within the dataset are tested, and the one which minimises the weighted Gini impurity of the two resulting subsets is chosen.

Gini impurity is a measure of the probability of a randomly chosen element in the dataset being incorrectly classified. This can be described as a measure of how pure the node is. In this context, a pure node would be one where all the data is correctly classified (i.e. they belong to the same class). A dataset with a low Gini impurity is purer than a dataset with a high Gini impurity. The equation for the Gini impurity is:

$$\text{Gini} = 1 - \sum_{i=1}^{c}(p_i)^2$$

where $c$ is the number of classes and $p_i$ is the probability of picking a data point belonging to class $i$.
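
A minimal sketch of this calculation, assuming a node is represented simply as a list of class labels, is shown below:

```python
import numpy as np

def gini_impurity(labels):
    """Gini = 1 - sum(p_i^2) over the classes present in `labels`."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini_impurity([1, 1, 1, 1]))                    # pure node -> 0.0
print(gini_impurity([1, 1, 0, 0]))                    # 50/50 split -> 0.5
print(gini_impurity([1, 1, 1, 1, 1, 1, 1, 0, 0, 0]))  # 7/3 split -> 0.42
```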

Once the Gini impurity has been calculated for all candidate splits, and the lowest Gini impurity has been found, the split can be made. This process continues in an iterative fashion until it is complete or a stopping condition has been met (for example, a condition which stops the tree from growing past a certain depth) [7].

There are two important factors which must be considered when working with decision trees: entropy and information gain. These factors come into play when considering how to split the data.

Entropy is, in its simplest terms, a measure of the amount of disorder in a given dataset. We can get a better understanding of this if we imagine a dataset for the toy problem discussed above. One factor that was considered was whether or not it was sunny. We can imagine this dataset could look like this:

| Weather |
|---------|
| 1 |
| 0 |
| 1 |
| 0 |
| 1 |
| 1 |
| 1 |
| 1 |
| 0 |
| 1 |

where `1` indicates a sunny day, and `0` indicates a non-sunny day.

We can see that there were 7 sunny days, and 3 non-sunny days. One way to describe this data would be to say that about $\frac{2}{3}$ of the days were sunny. We could also say about $\frac{1}{3}$ of the days were not sunny. Entropy allows us to be more accurate.

If we had a dataset where half the data was 1 and half was 0, we would have an entropy value of 1. If the data indicated every day was sunny (or every day was not sunny), we would have an entropy value of 0.

We can use the Shannon entropy equation, given below, to determine the entropy value:

$$H(x) = -\sum_{i=1}^{n} \left[ P(x_i) \cdot \log_b P(x_i) \right]$$

We know that the probability of a sunny day is $\frac{7}{10}$, and the probability of a non-sunny day is $\frac{3}{10}$. With this in mind, and taking $b = 2$ so that the entropy is measured in bits, the above equation becomes:

$$H(x) = -\left( \frac{7}{10} \cdot \log_2 \frac{7}{10} + \frac{3}{10} \cdot \log_2 \frac{3}{10} \right)$$

This gives us an entropy value of $0.88129$. But what does this figure mean? First, we need to understand that the data we are using, 1's and 0's, can be thought of as bits. In this situation, a 1 is "information" and a 0 is "no information", so entropy is a measure of information.

But how does this relate to information? Imagine we are asked to flip a coin, and we are told the coin is rigged so that every flip lands on heads. In this situation, the entropy is 0: because we already know the outcome, there is no new information. However, if the coin is not rigged, there is a 50-50 chance of getting heads, so each flip gives us new information.

From the calculation above, we are getting a little less than 1 bit of information: 0.88129 bits. This is because there are more 1's in our data than 0's, and we already know that a sunny day is more likely than a non-sunny day.
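
The worked example above can be reproduced with a short snippet, assuming the ten-day weather data is encoded as the list of 1's and 0's shown earlier:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy in bits: H = -sum(p_i * log2(p_i))."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

weather = [1, 0, 1, 0, 1, 1, 1, 1, 0, 1]  # 7 sunny days, 3 non-sunny days

print(entropy(weather))     # ~0.88129 bits
print(entropy([1] * 10))    # every day sunny -> 0 bits (no disorder)
print(entropy([1, 0] * 5))  # 50/50 split -> 1 bit
```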

The use of entropy allows us to decide which paths to follow in a decision tree and where to make splits, because the goal is to lower the entropy of the target.

Information gain is a statistical property that measures how well a given attribute separates the training examples according to the target classification [4]. Coupling this with entropy, we can say that when working with decision trees, we want attributes with a high information gain value and a low entropy value.

A very simple IG equation is:

$$\text{Information Gain} = \text{Entropy}(Parent) - \text{Entropy}(Children)$$

where $\text{Entropy}(Parent)$ is the entropy of the node before splitting, and $\text{Entropy}(Children)$ is the weighted average entropy of the child nodes after splitting, with each child weighted by the fraction of samples it receives.

Information gain determines the reduction in uncertainty once the dataset has been split around a particular feature or attribute. As the information gain increases, and therefore entropy decreases, the related feature is more helpful to the classification process -- the feature with the highest information gain is the best feature on which to make a split [6].
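
As an illustration of this, the sketch below computes the information gain for a hypothetical binary split of the ten-day weather data, using the weighted-average form of the children's entropy described above:

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    """Parent entropy minus the weighted average entropy of the two child nodes."""
    n = len(parent)
    weighted_children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted_children

# Hypothetical split of the ten-day weather data by some candidate feature
parent = [1, 0, 1, 0, 1, 1, 1, 1, 0, 1]
left = [1, 1, 1, 1, 1, 1]  # days sent down the left branch
right = [0, 0, 0, 1]       # days sent down the right branch

print(information_gain(parent, left, right))  # ~0.56
```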

Cross Validation

Cross-validation is a statistical method used to estimate how well ML models perform.

k-Fold Cross Validation

Cross-validation is a resampling procedure used to evaluate an ML model on a limited data set. The k parameter refers to the number of groups that a given data sample is split into. Cross-validation uses a limited sample in order to estimate how the model is expected to perform in general when used to make predictions on data not used during the training of the model.

The algorithm used for k-fold cross-validation is:

  1. Randomly shuffle the data set
  2. Split the data set into k groups
  3. For each group:
    • Use the group as a hold-out, or test, data set
    • Take the remaining groups as a training data set
    • Fit the model on the training set and evaluate it on the hold-out set
    • Retain the evaluation score
  4. Summarise the performance of the model using the sample of model evaluation scores

An important thing to note is that in k-fold cross-validation, each group is used exactly once as the hold-out, or test, data set.
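
A minimal sketch of k-fold cross-validation using scikit-learn is shown below; the iris dataset and a KNN classifier are used here only as stand-ins for the project's own data and models.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: each fold is used exactly once as the test set
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=cv)

print("Fold accuracies:", scores)
print("Mean accuracy:  ", scores.mean())
```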

Sources

[1] IBM, "What is the k-nearest neighbours algorithm?," [Online]. Available: https://www.ibm.com/uk-en/topics/knn. [Accessed 12 2 2023].
[2] C. Albon, Machine Learning with Python Cookbook, O'Reilly Media, Inc., 2018.
[3] R. Saxena, "Dataaspirant," 23 12 2016. [Online]. Available: https://dataaspirant.com/k-nearest-neighbor-classifier-intro/. [Accessed 21 4 2023].
[4] N. S. Chauhan, "Decision Tree Algorithm, Explained," KD Nuggets, 9 2 2022. [Online]. Available: https://www.kdnuggets.com/2020/01/decision-tree-algorithm-explained.html. [Accessed 22 4 2023].
[5] Scikit-Learn, "1.10 Decision Trees," [Online]. Available: https://scikit-learn.org/stable/modules/tree.html. [Accessed 22 4 2023].
[6] N. Tyagi, "What is Information Gain and Gini Index in Decision Trees?," Analytic Steps, 22 3 2021. [Online]. Available: https://www.analyticssteps.com/blogs/what-gini-index-and-information-gain-decision-trees. [Accessed 23 4 2023].
[7] V. Richer, "Understanding Decision Trees (once and for all!)," Towards Data Science, 2 3 2019. [Online]. Available: https://towardsdatascience.com/understanding-decision-trees-once-and-for-all-2d891b1be579. [Accessed 22 4 2023].

[x] https://sebastianraschka.com/pdf/lecture-notes/stat479fs18/02_knn_notes.pdf
