Regulation Techniques for Overfitting in Machine Learning
Introduction
As artificial intelligence develops at a rapid pace, a machine learning model's ability to make accurate predictions on new data is fundamental to the reliability of automated systems.
Overfitting is a common behavior in machine learning where a model gives inaccurate predictions for new data despite having performed well on the dataset used to train it. This happens when the model fits too closely to its training data and fails to generalize. Many regulation techniques exist to tackle this phenomenon; however, applied incorrectly or excessively, they expose models to another undesired behavior: underfitting. Choosing the techniques that produce the most robust model comes down to balancing model complexity against the ability to generalize.
Why Does Overfitting Happen?
A sample dataset is used to train the algorithm when building a machine learning model. However, if the model is too complex relative to the data, or trains on it for too long, it may begin to learn irrelevant details, or "noise". This over-optimization produces a model that lacks the flexibility to generalize to new data, rendering it unreliable at its original classification and prediction tasks. On the opposite end, a training set that is too small cannot represent the full range of possible input values, again producing a model that learns the noise and carries it into later predictions.
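As a minimal illustration of this failure mode (a sketch added for this wiki, not taken from any cited source), the snippet below fits a degree-14 polynomial to 15 noisy samples of a sine curve. The model has enough capacity to memorize every training point, so its training error is near zero while its error on held-out points is far larger:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, 15)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 15)  # noisy training samples

X_test = np.linspace(0, 1, 100).reshape(-1, 1)
y_test = np.sin(2 * np.pi * X_test).ravel()  # the clean underlying pattern

# A degree-14 polynomial has enough parameters to memorize all 15 points.
model = make_pipeline(PolynomialFeatures(degree=14), LinearRegression())
model.fit(X, y)

print("train MSE:", mean_squared_error(y, model.predict(X)))            # near zero
print("test MSE: ", mean_squared_error(y_test, model.predict(X_test)))  # much larger
```

The degree and sample count are arbitrary choices; any model with far more capacity than data shows the same gap between training and test error.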
Key Techniques for Regulating Overfitting
Preventing overfitting relies heavily on sound data science strategy; the most common techniques are outlined below.
Early Stopping
Early stopping is a regulation strategy where training is halted before the model begins to learn the noise of the dataset. This avoids the "learning slow-down" phenomenon, in which an algorithm's accuracy stops improving, or even begins to decline, after some amount of training because of this noise-learning. In machine learning, an epoch is a single complete pass of the training dataset through the algorithm; the model learns from the data and updates its weights after each epoch, so epochs serve as a natural unit of training progress, as in Figure 1.
Figure 1: Validation Error vs. Training Error
Training and validation error both decrease as long as the model is generalizing the input dataset. However, Figure 1 shows the two errors diverging after some number of iterations: the training error continues to decrease steadily, while the validation error begins to increase. The key takeaway from this graph is that the epoch at which this divergence appears marks the optimal point to halt training. A model stopped at this stage has low variance and generalizes the data well.
Adding an early stopping criterion to model training not only prevents overfitting but also reduces training time with minimal loss of information, making it feasible to train the model on a CPU in less time.
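A minimal sketch of validation-based early stopping, using scikit-learn's built-in support (the dataset here is synthetic and the hyperparameters are illustrative assumptions, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# early_stopping=True holds out 10% of the training data as a validation set
# and halts once the validation score fails to improve for n_iter_no_change epochs.
clf = MLPClassifier(
    hidden_layer_sizes=(64,),
    max_iter=500,
    early_stopping=True,
    validation_fraction=0.1,
    n_iter_no_change=10,
    random_state=0,
)
clf.fit(X, y)
print("stopped after", clf.n_iter_, "epochs")
```

The same patience-based logic can be hand-rolled in any framework: track the best validation score seen so far and stop once it fails to improve for a fixed number of epochs.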
Pruning
Pruning is closely related to feature selection. As the name implies, the parameters with the most impact on the model's predictions are kept while less informative ones are removed, efficiently eliminating irrelevant information and letting the dominant trend in the data emerge.
Figure 2: Before and After Pruning of a Neural Network
Figure 2 shows how pruning shrinks a network by removing non-critical or redundant features, reducing the complexity of the final classifier and, in turn, improving the learning model's predictive accuracy. The figure combines the two different pruning methods for dealing with this noise: unstructured and structured.
Figure 3: Comparison Between Unstructured and Structured Pruning
The faded connections and/or nodes in Figure 3 represent the pruned values in the network. Structured pruning removes entire sections of a neural network, simplifying both calculations and connections. Unstructured pruning only removes individual weight connections by setting them to zero. While unstructured pruning allows more precise adjustments, it yields little computational saving compared with simply removing parts of the network, as structured pruning does, because the zeroed weights still occupy the network's structure. Conversely, while structured pruning speeds up models for exactly this reason, it tends to retain less accuracy than unstructured pruning.
Choosing the most suitable method depends on the type of neural network the learning model will work with: structured pruning benefits larger networks that need big changes quickly, while unstructured pruning helps smaller networks make precise changes with minimal change to their structure.
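The sketch below demonstrates both styles on a toy PyTorch model using the torch.nn.utils.prune module; the layer sizes and pruning fractions are arbitrary choices for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))

# Unstructured: zero out the 30% of weights with the smallest magnitude.
prune.l1_unstructured(model[0], name="weight", amount=0.3)

# Structured: remove the 25% of rows (whole output neurons) with the
# smallest L2 norm from the second linear layer.
prune.ln_structured(model[2], name="weight", amount=0.25, n=2, dim=0)

sparsity = (model[0].weight == 0).float().mean().item()
print(f"layer-0 weight sparsity: {sparsity:.0%}")
```

Note that the unstructured weights are only masked to zero; realizing an actual speedup from structured pruning requires exporting or rebuilding the smaller architecture.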
Regularization
Most models have numerous features that contribute to their complexity. Pruning, as mentioned previously, is an effective way of reducing the number of features. However, when it is unclear which inputs should be eliminated during feature selection, regularization methods may be more effective.
Regularization applies a "penalty" to features with minimal impact on the model rather than removing them directly. Several regularization methods exist:
- Lasso Regression - L1 Regularization: Lasso stands for Least Absolute Shrinkage and Selection Operator. Lasso regression adds a penalty to the loss function based on the coefficients. The penalty term is the sum of the absolute values of the coefficients, controlled by a tuning parameter. Because of this, weights can be penalized all the way to zero, yielding automatic feature selection that identifies and discards irrelevant or redundant variables. Such removable features often arise from multicollinearity, where two or more independent variables are correlated; multicollinearity makes it difficult to determine any one variable's individual impact on the output, preventing models from isolating predictors.
- Ridge Regression - L2 Regularization: Ridge regression is similar to Lasso in that it penalizes high-value coefficients in the loss function, but the penalty term is computed differently: it is the sum of the squared coefficients rather than the sum of their absolute values. It follows that ridge regression does not enable feature selection, since it only shrinks weights toward zero but never exactly to zero.
- Elastic Net Regularization: A combination of the L1 and L2 penalty terms, elastic net regularization inserts both types of penalty into the loss function, enabling feature selection while also addressing multicollinearity among the model's features.
Figure 4: Elastic Net Equation
In Figure 4, blue highlights the L1-norm penalty term and red highlights the L2-norm penalty term. Thanks to the feature selection enacted by the Lasso component, the L1 penalty generates a sparse model. Combining it with ridge regression removes the limit on the number of selected variables that pure L1 selection imposes and stabilizes the L1 regularization path. And because ridge regression shrinks correlated features rather than discarding them, elastic net encourages a grouping effect: correlated predictors receive similar weights and tend to be kept or dropped together, which aids downstream classification.
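A small comparison of the three penalties using scikit-learn (the dataset is synthetic and the alpha values are illustrative assumptions) makes the sparsity difference concrete:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge

# 50 features, but only 5 actually drive the target.
X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=10.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)  # blend of L1 and L2

for name, model in [("lasso", lasso), ("ridge", ridge), ("elastic net", enet)]:
    zeros = int(np.sum(model.coef_ == 0))
    print(f"{name:>12}: {zeros}/50 coefficients driven exactly to zero")
# Lasso and elastic net zero out many weights (feature selection);
# ridge only shrinks weights, so it reports no exact zeros.
```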
Ensembling
Ensembling is the method of combining multiple weaker learning models and aggregating their predictions, typically by identifying the most popular result. There are two well-known ensemble methods: bagging and boosting.
Bagging, also known as bootstrap aggregating, works by drawing multiple random subsets from the original dataset, training a model on each, and later combining their predictions into a single final prediction.
Figure 5: Diagram of the Bagging Process
A base model is trained on each of the derived subsets, as shown in Figure 5. These models run in parallel and are independent of each other, which maximizes the generality of the result. Bagging gives a fair picture of the dataset's distribution, since the models essentially "vote" on their separate classifications to produce the best prediction.
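A minimal bagging sketch with scikit-learn (synthetic data; the ensemble size is an arbitrary choice):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# 50 base learners (decision trees by default), each trained on a bootstrap
# sample of the data; the ensemble predicts by majority vote across trees.
bag = BaggingClassifier(n_estimators=50, bootstrap=True, random_state=0)
print("bagged accuracy:", cross_val_score(bag, X, y, cv=5).mean())
```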
Boosting, on the other hand, is a sequential process in which each subsequent model improves upon the previous one. Random samples are first fit by the early models and then analyzed for errors.
Figure 6: Diagram of the Boosting Process
When the model misclassifies an input, that data point's weight is increased so the next model is more likely to classify it correctly. Figure 6 shows the base model making a prediction on the original data points, which start with equal weights. Another model is then built on the dataset with increased weights for the previously misclassified observations. In the diagram, the subsequent data now contains points that count more or less heavily when the next model begins its classifications. This series of successively updated models culminates in a single learning model that makes the final prediction on the whole dataset. That model is called the strong learner and is the weighted mean of all the models before it (known as weak learners).
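AdaBoost is the classic implementation of this re-weighting scheme; a minimal sketch, again on synthetic data with illustrative hyperparameters:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Each round re-weights misclassified samples so the next weak learner
# (a shallow decision tree by default) focuses on them; the final strong
# learner is a weighted vote over all rounds.
boost = AdaBoostClassifier(n_estimators=100, random_state=0)
print("boosted accuracy:", cross_val_score(boost, X, y, cv=5).mean())
```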
Data Augmentation
All the regulation techniques so far emphasize the importance of injecting clean training data. Data augmentation, applied sparingly, takes the opposite approach to make a model more stable.
Figure 7: Example of Visual Data Augmentation
Figure 7 shows that data augmentation involves exposing machine learning models to a wider range of input patterns to increase the diversity of the dataset. Transformations applied to images such as the cat shown above include rotation, zooming, cropping, blurring (noise injection), and more. This enhances the model's ability to classify successfully under varying, non-ideal conditions.
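A typical training-time augmentation pipeline can be sketched with torchvision transforms; the transform parameters are illustrative, and "cat.jpg" is a hypothetical placeholder file:

```python
from PIL import Image
from torchvision import transforms

# Each epoch sees a slightly different version of every image,
# effectively enlarging the training dataset.
augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),
    transforms.RandomResizedCrop(size=224, scale=(0.8, 1.0)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.GaussianBlur(kernel_size=3),  # mild noise injection
    transforms.ToTensor(),
])

img = Image.open("cat.jpg")  # hypothetical input image
augmented_tensor = augment(img)
```

The randomness is re-sampled on every call, so applying the pipeline inside a data loader yields fresh variations each epoch.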
Underfitting: The Common Con of Regulation Techniques
The flip side of the techniques listed above is that, applied incorrectly, they can lead to the opposite issue: underfitting, another undesirable behavior in which a machine learning model has not sufficiently captured the relationship between the input and output data to make accurate predictions. An underfit model performs poorly on both the training set and unseen data because it is too simplistic to identify any patterns.
Early Stopping
Looking back at Figure 1, if early stopping occurs well before the validation error begins to increase, the model will not have had enough epochs to fully capture the training dataset's patterns, so it never reaches the complexity necessary to generalize to a new dataset. For example, in image recognition tasks, stopping early might produce a model that fails to differentiate a square from a circle, let alone more complex distinctions like cars versus trucks, rendering it useless for classification. Unlike overfitting, an underfit model shows poor performance on both training and validation data. The stopping point is therefore critical: the learning model must be trained just enough without fitting to noise.
Pruning
Similarly, pruning the network too aggressively can leave the model missing important features in the data, hurting performance. When key connections, nodes, or layers are pruned away, the model can no longer capture significant patterns in the dataset. For example, in decision tree-based models, excessive pruning may remove branches that represent rare but important patterns, such as those identifying fraudulent transactions in financial datasets. Optimal pruning means retaining enough parameters to learn the training data's distribution, which requires careful monitoring.
Regularization
If regularization is too aggressive, penalizing large weights and effectively deactivating features can produce a model that struggles to learn meaningful patterns. Too strong a penalty on certain weights, or too high a feature dropout rate, imposes restrictions that lead to underfitting. Overly sparse models ignore weak but meaningful features, which is especially harmful with healthcare data, where subtle symptom clusters indicative of a rare disease may be missed by predictive algorithms.
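This collapse is easy to demonstrate; the sketch below (synthetic data, arbitrary alpha values) increases the Lasso penalty until the model can no longer learn anything:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

for alpha in (0.1, 10.0, 1000.0):
    model = Lasso(alpha=alpha).fit(X, y)
    kept = int(np.sum(model.coef_ != 0))
    print(f"alpha={alpha:>7}: {kept}/20 features kept, "
          f"train R^2 = {model.score(X, y):.2f}")
# At a sufficiently extreme alpha, every coefficient is driven to zero and
# the model collapses to predicting the mean -- textbook underfitting.
```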
Ensembling
Ensembled models are only as good as their base models. For both ensembling methods, the combination of multiple models is what reduces variance and improves generalization. However, if the individual models in the ensemble have failed to learn a pattern due to misuse of the other regulation techniques, the ensemble will lack complexity. On the flip side, if the models are overly diversified, with little to no overlap in their understanding of the training data, the ensemble will make inaccurate predictions.
Data Augmentation
While applying noise and transformations to the original data can improve generalization by exposing the model to variation, overly aggressive augmentation can distort the training data until it no longer reflects the intended pattern. The learning model then struggles to capture the relevant features, making it difficult to generalize to actual, unaltered validation data.
Conclusion
Overfitting is a common, undesirable behavior in training machine learning models that results in incorrect predictions on new data. Many regulation techniques exist to minimize the risk of this error. However, developers must apply these techniques in balance to avoid underfitting, a different training error in which the model fails because it never learns the patterns in the dataset.
Comparison of Techniques
| Technique | Advantages | Disadvantages | Best for |
|---|---|---|---|
| Early Stopping | Prevents overfitting, reduces training time, improves efficiency | Requires a reliable validation set and careful monitoring | Models with sufficient training data and limited computational resources |
| Pruning | Simplifies models, removes redundant features | Risk of removing critical features, can lead to underfitting | Large networks, interpretable models |
| Regularization | Improves generalization, flexible adjustments, fixes over-reliance on specific features | Slows training, requires tuning | Sparse or dense datasets, feature-rich environments |
| Ensembling | Improves accuracy, reduces variance, combines strengths of multiple models | Computationally expensive, can still overfit | High-stakes tasks, such as medical diagnostics or financial forecasting |
| Data Augmentation | Enhances dataset diversity, makes models robust to varying inputs | Requires domain knowledge, risks distortion/unrealistic patterns | Image/audio data |
Data scientists must find the sweet spot between underfitting and overfitting to produce a model that can reliably identify trends in any dataset within its domain. By understanding these tradeoffs, developers can design models that perform well across a variety of real-world applications, ensuring both reliability and robustness.
[7] Lomte, Santosh S., et al. “Classifier Ensembling: Dataset Learning Using Bagging and Boosting.” Computing and Network Sustainability, vol. 75, Springer Singapore Pte. Limited, 2019, pp. 83–93, https://doi.org/10.1007/978-981-13-7150-9_9.