Neural Network Hyperparameters - ofithcheallaigh/masters_project GitHub Wiki

Introduction

This section of the wiki will present information on the various hyperparameters which can be tuned when working with neural networks.

Optimisers

This section will briefly overview the various optimisers investigated during this project.

Stochastic Gradient Descent (SGD)

Before getting into SGD, we will quickly review gradient descent. The gradient descent optimiser tries to minimise a model's cost function by moving in the direction of the negative gradient. Doing so allows the system to converge on an optimal set of parameters that best fits the data [1].

The cost function estimates how badly a model is performing. At its heart, it calculates the difference between the predicted and actual outputs for the training data and summarises these differences into a single value. Minimising the cost function therefore minimises the differences between the predicted and actual values. The cost function is minimised iteratively by adjusting the model's weights, using gradients calculated via backpropagation.
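As an illustration only, one common choice of cost function is the mean squared error over $n$ training samples, where $y_i$ is the actual output and $\hat{y}_i$ is the predicted output. This choice, and the symbols $J$, $w$ and $\eta$ (the learning rate), are assumptions made here for the sake of the example and are not taken from the project itself. Gradient descent then repeatedly moves the weights $w$ against the gradient of the cost:

$$J(w) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$

$$w \leftarrow w - \eta \, \nabla_w J(w)$$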

The idea of gradient descent is perhaps best understood with the following image:

The point where the gradient descent algorithm converges is the point at which further modification to the parameters provides little to no worthwhile changes to the loss figure. This is because when the algorithm has converged, the model has optimised weights [2].
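The convergence behaviour described above can be sketched in a few lines of Python. This is a minimal illustration, assuming a one-dimensional cost $J(w) = (w - 3)^2$ chosen purely for the example; the function name `gradient_descent` and its parameters are hypothetical and not part of the project code. The loop stops once an update becomes negligibly small, which is one simple way of deciding that the algorithm has converged.

```python
def gradient_descent(grad, w0, learning_rate=0.1, tol=1e-8, max_iters=10_000):
    """Minimise a 1-D function given its gradient.

    Stops when the size of an update falls below `tol`, i.e. when further
    changes to the parameter no longer make a worthwhile difference.
    """
    w = w0
    for step in range(max_iters):
        update = learning_rate * grad(w)
        w -= update  # move in the direction of the negative gradient
        if abs(update) < tol:
            break
    return w, step + 1


# Illustrative cost J(w) = (w - 3)^2 with gradient 2(w - 3); the minimum is at w = 3.
w_opt, n_steps = gradient_descent(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(f"converged to w = {w_opt:.6f} after {n_steps} steps")
```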

Calculating gradients on large datasets can be expensive, so standard gradient descent, which uses the whole dataset for every update, is not well suited to this type of work. This is where stochastic gradient descent comes in. The word stochastic means that the algorithm is based on an element of randomness [3]. SGD uses random subsets of the data to work out each parameter update. This means that each update is done on "new" data (i.e. data the algorithm has not previously seen). Because each update is based on only part of the data, SGD will require more iterations than gradient descent to find a local or global minimum. However, each iteration is much faster because its calculations are carried out on a much smaller subset of the data, so the overall computational cost is lower than that of gradient descent [3].

It is worth keeping in mind that the random nature of SGD means the updates can be "noisy", and this noise can prevent the algorithm from converging to the exact global minimum, meaning that SGD can be less accurate than gradient descent [4]. As with most things in neural networks, these trade-offs have to be considered and the best decision made for the application.
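The following is a minimal sketch of mini-batch SGD using only the Python standard library. It assumes a simple linear model `y = w*x + b` and a mean squared error cost, both chosen only for illustration. The function name `sgd_linear_fit`, its parameters and the synthetic data are hypothetical. Each update uses the gradient estimated from one small random batch rather than from the whole dataset.

```python
import random


def sgd_linear_fit(xs, ys, batch_size=8, learning_rate=0.05, epochs=200, seed=0):
    """Fit y = w*x + b by mini-batch SGD on the mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # the "stochastic" part: a fresh random order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the MSE, estimated from this batch only.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


# Synthetic, noise-free data drawn from y = 2x + 1 (illustrative only).
xs = [i / 50 for i in range(100)]
ys = [2 * x + 1 for x in xs]
w, b = sgd_linear_fit(xs, ys)
print(f"w = {w:.3f}, b = {b:.3f}")
```

With noisy data the fitted values would typically hover around the best solution rather than settle exactly on it, which is the "noisy update" behaviour described above.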

RMSProp

RMSProp stands for Root Mean Square Propagation. RMSProp maintains a running average of the squared gradients of mini-batches. This allows RMSProp to converge faster than the original optimiser (RProp) [5]. RMSProp is an extension of gradient descent and of the AdaGrad version of gradient descent. It uses a decaying average of partial gradients, allowing the algorithm to "forget" earlier gradients and focus on the most recent gradients observed during the search, overcoming shortfalls seen in other algorithms [6].
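The decaying average can be sketched as follows for a single parameter. This is an illustrative sketch rather than the implementation used by any particular library: the function name `rmsprop_step`, the default values for `learning_rate`, `decay` and `eps`, and the quadratic test cost are all assumptions made for the example.

```python
import math


def rmsprop_step(w, grad, avg_sq, learning_rate=0.01, decay=0.9, eps=1e-8):
    """One RMSProp update for a single parameter."""
    # Decaying average of squared gradients: older gradients are gradually "forgotten".
    avg_sq = decay * avg_sq + (1.0 - decay) * grad ** 2
    # Scale the step by the root of that average (the "root mean square").
    w = w - learning_rate * grad / (math.sqrt(avg_sq) + eps)
    return w, avg_sq


# Illustrative use: minimise J(w) = (w - 3)^2, whose gradient is 2(w - 3).
w, avg_sq = 0.0, 0.0
for _ in range(2000):
    w, avg_sq = rmsprop_step(w, 2.0 * (w - 3.0), avg_sq)
print(f"w = {w:.3f}")
```

Because the step is normalised by the recent gradient magnitude, the parameter moves at a roughly constant rate and ends up within about one step size of the minimum.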

RMSProp is relatively easy to implement and can be used on a range of neural network architectures; however, it can be computationally expensive and

Adam

Adam is short for Adaptive Moment Estimation and is an adaptive learning rate optimisation algorithm designed to work with deep learning networks. The optimiser dates back to 2014 and has become a standard tool in the designer's toolbox. Adam determines an individual learning rate for each parameter and adapts it during training. This approach differs from SGD, which holds a single learning rate for all parameters throughout.

Adam combines elements from SGD and RMSProp to make a computationally efficient algorithm that requires little memory. In addition, Adam was designed to work well on problems with very noisy or sparse gradients [7].
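A minimal single-parameter sketch of the Adam update is shown below. It keeps a decaying average of the gradients (a momentum-like term) alongside an RMSProp-style decaying average of the squared gradients. The function name `adam_step` and the quadratic test cost are hypothetical. The default values used for `beta1`, `beta2` and `eps` are the commonly quoted ones, and the learning rate is chosen only for this example.

```python
import math


def adam_step(w, grad, m, v, t, learning_rate=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single parameter at (1-based) time step t."""
    m = beta1 * m + (1.0 - beta1) * grad        # first moment: decaying mean of gradients
    v = beta2 * v + (1.0 - beta2) * grad ** 2   # second moment: decaying mean of squared gradients
    m_hat = m / (1.0 - beta1 ** t)              # bias correction for the zero initialisation
    v_hat = v / (1.0 - beta2 ** t)
    w = w - learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


# Illustrative use: minimise J(w) = (w - 3)^2, whose gradient is 2(w - 3).
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 5001):
    w, m, v = adam_step(w, 2.0 * (w - 3.0), m, v, t)
print(f"w = {w:.3f}")
```

In a real network each weight would carry its own `m` and `v`, which is what gives every parameter its own effective learning rate.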

Batch Size

The batch size is the hyperparameter that controls the number of samples the system processes before updating the model's internal parameters [8]. Therefore, batch size can have an impact on the overall accuracy of the model. While there is no one-size-fits-all solution for which batch size to use, there are several things to keep in mind:

A larger batch size will generally result in faster convergence
A smaller batch size will require less memory and computation but may take longer to converge
A larger batch size can be used as a form of filtering in that any noise present in a larger batch should be averaged out
A larger batch size can help with stability in neural networks because a larger batch size is less affected by outliers in the data

The best batch size depends on the neural network's architecture, the available resources and the data used to train the system. Ultimately, it is a balance between the desired convergence speed, the desire for stability, the acceptable tolerance to noise and the resources available for training.
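To make the link between batch size and parameter updates concrete, the short sketch below counts how many updates one pass over the data (one epoch) involves for a few batch sizes. The dataset size of 1000 samples and the helper names `updates_per_epoch` and `batches` are hypothetical and used only for illustration.

```python
import math


def updates_per_epoch(n_samples, batch_size):
    """Number of parameter updates in one pass over the data."""
    return math.ceil(n_samples / batch_size)


def batches(samples, batch_size):
    """Yield consecutive batches; the last one may be smaller than batch_size."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


n_samples = 1000  # hypothetical dataset size
for batch_size in (16, 64, 256):
    print(batch_size, updates_per_epoch(n_samples, batch_size))
```

A smaller batch size gives more, noisier updates per epoch, while a larger batch size gives fewer updates, each averaged over more samples.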

Activation Functions

The activation function is one of the most essential parts of the deep learning model. It will determine the output of a model, the model's accuracy, and how computationally efficient it is [9].

The most straightforward activation function is the linear activation function, where no transform is applied to the input. A network using only this type of function is straightforward to train but cannot learn complex relationships. For this reason, non-linear activation functions are preferred, because they allow nodes to learn more complex features in the input data. However, linear activation functions are still commonly found in the output layer of networks that predict a quantity, for example, in regression problems [9]. Three standard non-linear activation functions are the sigmoid function, the hyperbolic tangent (tanh) function and the relu function.

What is the purpose of activation functions? Activation functions allow a model to capture two things:

Interaction effects
Non-linear effects

An interaction effect is where the effect of one variable, var1, on a prediction depends on the value of another variable, var2. As an example, let's look at hypertension. One indicator of hypertension is a person's body weight. Someone may have a body weight which, taken alone, would suggest they are likely to have hypertension, but they could be relatively tall, in which case their body weight would be normal. So, in reality, the effect of weight on the predicted risk of hypertension depends on height: the two variables interact.

Non-linear effects can be seen if one plots the prediction on one axis and a variable on the other: the resulting relationship is not a straight line.

The sigmoid function is probably one of the best-known activation functions and can be seen in the plot below:

The sigmoid function is a mathematical function with an 'S'-shaped curve, also known as the sigmoid curve. The sigmoid function is monotonic, meaning it is either entirely non-increasing or entirely non-decreasing. The first derivative of a monotonic function does not change sign [10]. Another characteristic of the sigmoid function is that it is bounded by a pair of horizontal asymptotes, at 0 and 1, as $x \rightarrow \pm\infty$. The equation for the sigmoid function is:

$$s(x) = \frac{1}{1+e^{-x}}$$
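As a quick illustration, the standard logistic sigmoid $1/(1+e^{-x})$ can be written in plain Python as below. The function name `sigmoid` is chosen for this sketch, and the two-branch form is one common way of avoiding overflow for inputs of large magnitude rather than anything prescribed by the sources cited here.

```python
import math


def sigmoid(x):
    """Logistic sigmoid 1 / (1 + e^(-x)), written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # safe: x is negative, so exp(x) cannot overflow
    return z / (1.0 + z)


for x in (-10.0, 0.0, 10.0):
    print(x, sigmoid(x))
```

The output stays between the two asymptotes at 0 and 1 and passes through 0.5 at $x = 0$.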

The sigmoid function is a good choice in models where a probability is needed as the output, since probabilities lie between 0 and 1. The softmax function is a generalisation of the sigmoid function and is the activation function used in multi-class classification systems. It transforms a vector of numbers (int, float etc.) into a vector of probabilities, where the probability assigned to each value reflects its size relative to the other values in the vector [11]. The softmax activation function is best used in the output layer of a neural network that predicts a multinomial probability distribution. It can be used in hidden layers, but that is less common. The softmax activation will output one value for each node in the output layer.

When conducting a multi-class analysis, the target or response variable that holds the class labels must be encoded. In other words, if they are not already, the labels must be converted to integers representing each class from 0 to N-1, where N is the number of classes. Categorical data then needs to be one-hot encoded.
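The softmax transform and the label encoding described above can be sketched with the standard library alone. The function names `softmax` and `one_hot` and the example inputs are hypothetical. Subtracting the maximum before exponentiating is a common numerical safeguard and does not change the resulting probabilities.

```python
import math


def softmax(values):
    """Turn a list of numbers into probabilities that sum to 1."""
    shift = max(values)  # subtracting the max avoids overflow without changing the result
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def one_hot(label, n_classes):
    """Encode an integer class label (0 .. n_classes - 1) as a one-hot vector."""
    return [1 if i == label else 0 for i in range(n_classes)]


probs = softmax([1.0, 2.0, 3.0])
print(probs)          # larger inputs receive larger probabilities
print(one_hot(2, 3))  # class 2 of 3 classes
```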

The tanh activation function is similar to the sigmoid function but can offer better performance. The range of the tanh function is $(-1, 1)$. The tanh function (shown below) has a similar shape to the sigmoid function.

The advantage of this type of function is that the inputs will be mapped better. For example, negative inputs will be mapped to strongly negative outputs, inputs near zero will be mapped to outputs near zero, and positive inputs will be mapped to positive outputs, in line with the positive portion of the tanh graph.
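For reference, the tanh function can be written explicitly, and it is a rescaled and shifted logistic sigmoid. Writing $\sigma(x) = 1/(1+e^{-x})$ for the logistic sigmoid (notation introduced here only for this identity):

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} = 2\,\sigma(2x) - 1$$

This makes the similarity in shape explicit: the curve is stretched from the range $(0, 1)$ to $(-1, 1)$ and centred on zero, so that $\tanh(0) = 0$.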

Another activation function is the relu function, often stylised as ReLU. This stands for Rectified Linear Unit, which is a piecewise linear function. The relu function is defined as:

$$f(x) = \begin{cases} x, & \text{if}\ x>0 \\ 0, & \text{otherwise} \end{cases}$$

$$f^{'}(x) = \begin{cases} 1, & \text{if}\ x>0 \\ 0, & \text{if}\ x<0 \end{cases}$$

where $x$ is the input to the neuron. This type of function is also known as a ramp function, and for the electronics engineers out there, it can be thought of as similar to a half-wave rectifier. In very simple terms, the relu activation function is 0 for all negative inputs, but the output equals the input for all positive inputs. This is shown in the figure below:

The relu function is very efficient to compute, and it does not suffer from the vanishing gradient problem (where gradients become so small that they prevent the weights from being updated), because its gradient is always 1 for positive inputs. However, a relu function can die. A dying relu comes about when a lot of relu neurons only output values of 0. The diagram above shows that this can happen when the inputs are in the negative range: the outputs are 0, the gradients are also 0, and as a result the weights are not updated [12].
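A minimal sketch of the relu function and its gradient in plain Python is given below. The function names `relu` and `relu_grad` are chosen for this example. The derivative is undefined at exactly 0, and returning 0 there is a common convention that is assumed here.

```python
def relu(x):
    """Rectified Linear Unit: 0 for negative inputs, the input itself otherwise."""
    return max(0.0, x)


def relu_grad(x):
    """Gradient of relu: 1 for positive inputs, 0 for negative inputs.

    The derivative is undefined at x = 0; returning 0 there is a convention.
    """
    return 1.0 if x > 0 else 0.0


inputs = [-2.0, -0.5, 0.5, 2.0]
print([relu(x) for x in inputs])
print([relu_grad(x) for x in inputs])
```

A neuron whose inputs are always negative receives a gradient of 0 on every sample, so its weights stop changing, which is the dying relu situation described above.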

References

[1] Analytics Vidhya, "How Does the Gradient Descent Algorithm Work in Machine Learning?", 2 10 2020. [Online]. Available: https://www.analyticsvidhya.com/blog/2020/10/how-does-the-gradient-descent-algorithm-work-in-machine-learning/. [Accessed 10 5 2023].
[2] C. Msck, "Machine learning fundamentals (I): Cost functions and gradient descent," 27 11 2017. [Online]. Available: https://towardsdatascience.com/machine-learning-fundamentals-via-linear-regression-41a5d11f5220. [Accessed 10 5 2023].
[3] A. Gupta, "A comprehensive guide on deep learning optimisers," 7 10 2021. [Online]. Available: https://www.analyticsvidhya.com/blog/2021/10/a-comprehensive-guide-on-deep-learning-optimizers/. [Accessed 10 5 2023].
[4] R. Roy, "ML | Stochastic Gradient Descent (SGD)," 1 3 2023. [Online]. Available: https://www.geeksforgeeks.org/ml-stochastic-gradient-descent-sgd/. [Accessed 11 5 2023].
[5] J. Huang, "RMSProp," 25 8 2022. [Online]. Available: https://optimization.cbe.cornell.edu/index.php?title=RMSProp#. [Accessed 11 5 2023].
[6] J. Brownlee, "Gradient Descent With RMSProp from Scratch," 21 5 2021. [Online]. Available: https://machinelearningmastery.com/gradient-descent-with-rmsprop-from-scratch/. [Accessed 1 5 2023].
[7] T. Aremu, "Impact of Optimisers in Image Classifiers," Medium, 22 8 2022. [Online]. Available: https://pub.towardsai.net/impact-of-optimizers-in-image-classifiers-3b04ed20823a. [Accessed 29 4 2023].
[8] D. Devansh, "How does Batch Size impact your model learning," 22 1 2022. [Online]. Available: https://medium.com/geekculture/how-does-batch-size-impact-your-model-learning-2dd34d9fb1fa. [Accessed 7 5 2023].
[9] https://towardsdatascience.com/why-rectified-linear-unit-relu-in-deep-learning-and-the-best-practice-to-use-it-with-tensorflow-e9880933b7ef
[10] https://mathworld.wolfram.com/MonotonicFunction.html
[11] https://machinelearningmastery.com/softmax-activation-function-with-python/
[12] K. Leung, "The Dying ReLU Problem, Clearly Explained," Medium, 30 3 2021. [Online]. Available: https://towardsdatascience.com/the-dying-relu-problem-clearly-explained-42d0c54e0d24. [Accessed 11 5 2023].