Decision Trees - ofithcheallaigh/masters_project GitHub Wiki

Introduction

This section will detail background information on decision trees and k-fold cross validation, as well as analysis carried out using these techniques.

Decision Trees

Decision trees (DT) are a supervised learning technique used for classification and regression. The goal of a DT is to build a model that predicts an outcome by learning simple decision rules from the training data [1].

DTs are quite popular because they are relatively easy to understand, and the tree can be visualised, although the tree is not always simple. It is important to note that DT rules can become complex, in which case they do not generalise well, which can lead to over-fitting. There are methods, such as pruning, which can combat this problem.
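As an illustration, the sketch below fits a small tree with scikit-learn's DecisionTreeClassifier and limits its depth, which is a simple form of pre-pruning. The built-in iris data set is used purely as a stand-in for the project's own data.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt

# Small example data set, standing in for the project's data
X, y = load_iris(return_X_y=True)

# Limiting max_depth is a simple pre-pruning step that keeps the
# decision rules from becoming overly complex and over-fitting
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X, y)

# The fitted tree can be visualised directly
plot_tree(clf)
plt.show()
```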

Cross Validation

Cross validation is a statistical method used to estimate how well ML models perform on unseen data.

k-Fold Cross Validation

Cross validation is a resampling procedure used to evaluate an ML model on a limited data set. The k parameter refers to the number of groups that a given data sample is split into. Cross validation uses a limited sample to estimate how the model is expected to perform in general when making predictions on data not used during training.
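As a minimal sketch of this idea, scikit-learn's cross_val_score helper carries out the whole procedure in one call; the classifier and the iris data set here are illustrative assumptions rather than the project's actual setup.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)

# cv=5 gives 5-fold cross validation: the data is split into five
# groups and each group is used once as the hold-out test set
scores = cross_val_score(clf, X, y, cv=5)
print("Fold scores:", scores)
print("Mean accuracy:", scores.mean())
```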

The algorithm used for k-fold cross validation is:

  1. Randomly shuffle the data set
  2. Split the data set into k groups
  3. For each group:
    • Use the group as a hold-out, which will serve as the test data set
    • Take the remaining groups as the training data set
    • Fit the model on the training set and evaluate it on the hold-out set
    • Retain the evaluation score and discard the model
  4. Summarise the performance of the model using the sample of evaluation scores

An important thing to note is that in k-fold cross validation each group is used exactly once as the hold-out (test) data set.
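The same procedure can be written out explicitly. The sketch below mirrors the four steps above, again assuming scikit-learn and the iris data set purely for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Steps 1 and 2: shuffle the data set and split it into k groups
kf = KFold(n_splits=5, shuffle=True, random_state=42)

scores = []
for train_idx, test_idx in kf.split(X):
    # Step 3: one group is held out as the test set, the rest train
    clf = DecisionTreeClassifier(random_state=0)
    clf.fit(X[train_idx], y[train_idx])
    # Retain the evaluation score for this fold
    scores.append(clf.score(X[test_idx], y[test_idx]))

# Step 4: summarise performance across the folds
print(f"Mean accuracy: {np.mean(scores):.3f} (std {np.std(scores):.3f})")
```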

Sources

[1] Scikit Learn, "1.10. Decision Trees," [Online]. Available: https://scikit-learn.org/stable/modules/tree.html. [Accessed 13 May 2023].