01 Sigmoid, ReLU and Softmax
Sigmoid Function
The Sigmoid Function $\sigma(z)=\dfrac{1}{1+e^{-z}}$ was traditionally (in the 1960s) used as the activation function of neurons in neural networks, as it has the nice property that its output is naturally bounded between 0 and 1, making it ideal for representing probabilities or "firing" a neuron. However, the derivation below shows that the derivative of the sigmoid is the sigmoid times its complement. When the input is zero the derivative reaches its maximum of 0.25, but when the input is far from zero the derivative (gradient) becomes very close to zero. This means gradient descent takes a long time to converge to a minimum, or may even get stuck making no progress; this situation is called Gradient Saturation.
The effect of multiple (even slight) saturated gradients is magnified in a deep network, as the weak gradients are multiplied together while being carried back through the layers, causing the Vanishing Gradient problem.
$$
\begin{alignat*}{3}
\sigma(z) &= \dfrac{1}{1+e^{-z}} &&\\
\dfrac{d}{dz}\sigma(z) &= \dfrac{d}{dz}\left(\dfrac{1}{1+e^{-z}}\right) &&\\
&= \dfrac{d}{dz}(1+e^{-z})^{-1} &&\\
&= -(1+e^{-z})^{-2}\,\dfrac{d}{dz}(1+e^{-z}) &&\\
&= -(1+e^{-z})^{-2}(0-e^{-z}) &&\\
&= \dfrac{e^{-z}}{(1+e^{-z})^{2}} &&\\
&= \dfrac{1}{(1+e^{-z})}\cdot\dfrac{e^{-z}}{(1+e^{-z})} &&\\
&= \sigma(z)\,\sigma(z)\,e^{-z} &&\\
&= \sigma(z)\bigl(1-\sigma(z)\bigr) &&\quad\because \sigma(z)=\dfrac{1}{1+e^{-z}} \Rightarrow \sigma(z)\,e^{-z}=1-\sigma(z)
\end{alignat*}
$$
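A quick numerical check of the identity above (a minimal NumPy sketch; the function names are illustrative): the derivative peaks at 0.25 when $z=0$ and collapses toward zero as $|z|$ grows, which is the saturation described above.

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # d/dz sigma(z) = sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

for z in [0.0, 2.0, 5.0, 10.0]:
    print(f"z={z:5.1f}  sigma={sigmoid(z):.6f}  grad={sigmoid_grad(z):.6f}")
# z=  0.0  grad=0.250000  (maximum)
# z= 10.0  grad=0.000045  (saturated: almost no gradient signal)

# In a deep network these small factors multiply during backpropagation,
# e.g. 0.25**10 ≈ 9.5e-7, which is the vanishing gradient effect.
```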
The Sigmoid function is still used as the activation function of the output layer in binary classification, as its output is always between 0 and 1.
ReLU Function (Rectified Linear Unit)
In later neural network work, ReLU replaced Sigmoid as the activation function of hidden layers because ReLU has a better derivative shape, so gradient descent converges more quickly when the input is greater than zero. ReLU is defined as:
$$
\begin{alignat*}{3}
ReLU(z) &= \max(0,z) &&\\
&= \begin{cases} 0 & z\le 0 \\ z & z\gt 0 \end{cases} &&
\end{alignat*}
$$

$$
\begin{alignat*}{3}
\dfrac{d}{dz}ReLU(z) &= \begin{cases} 0 & z\le 0 \\ 1 & z\gt 0 \end{cases} &&
\end{alignat*}
$$
For ReLU, the gradient is always 1 when $z$ is greater than zero, so gradient descent will always make some progress there. However, we may run into the "Dying ReLU" problem when $z$ is negative: ReLU always outputs zero, so the gradient is zero and the weights never get updated. There are variants of ReLU that allow a small non-zero gradient for negative inputs, such as Leaky ReLU and ELU.
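A minimal NumPy sketch of ReLU, its gradient, and the Leaky ReLU variant mentioned above (the negative slope 0.01 is an illustrative choice, not a fixed part of the definition):

```python
import numpy as np

def relu(z):
    # max(0, z), applied element-wise
    return np.maximum(0.0, z)

def relu_grad(z):
    # gradient is 1 for z > 0, and 0 otherwise (the "dying ReLU" region)
    return (z > 0).astype(float)

def leaky_relu(z, alpha=0.01):
    # keeps a small slope alpha for negative inputs so the gradient never dies
    return np.where(z > 0, z, alpha * z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))        # [0.    0.     0.  0.5  2. ]
print(relu_grad(z))   # [0.    0.     0.  1.   1. ]
print(leaky_relu(z))  # [-0.02 -0.005 0.  0.5  2. ]
```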
Softmax Function
The Softmax Function is used as the activation function for the output layer of multi-class classification problems. It is a generalization of the Sigmoid function. While Sigmoid is used for 2 classes (0 or 1), Softmax handles $K$ classes.
It takes a vector of raw scores (logits) $z$ and transforms them into a vector of probabilities. It ensures two critical properties:
- All output values are positive.
- The sum of all output values equals 1.
$$ \widehat{y}_k = \text{Softmax}(z)_k = \dfrac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}} $$
Where:
- $z$ is the input vector (logits) of size $K$.
- $e^{z_k}$ is the standard exponential function applied to the $k$-th element.
- The denominator is the sum of exponentials of all elements in the vector.
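A minimal sketch of the formula in NumPy (subtracting the maximum logit before exponentiating is a common numerical-stability trick, not part of the definition itself):

```python
import numpy as np

def softmax(z):
    # subtract max(z) for numerical stability; it cancels out in the ratio
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

logits = np.array([2.0, 1.0, 0.1])   # raw scores for K = 3 classes
probs = softmax(logits)
print(probs)          # ≈ [0.659 0.242 0.099]
print(probs.sum())    # 1.0 -- all values positive and summing to one
```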