# Loss Function
Loss Function ($L$) measures prediction errors.
$$ L(y,\widehat{y})=\text{how wrong is $\widehat{y}$ when truth is $y$ ?} $$
In a Neural Net, we train the model to find the Weights (and Biases) that minimize the average loss over the training data.
$$ \min_{W,b} \dfrac{1}{n} \sum_{i=1}^{n} L(y_i,\widehat{y}_i) $$
In order for Gradient Descent to find the $W$ and $b$ that minimize the Loss, the Loss Function needs to be differentiable.
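To make the objective concrete, here is a minimal NumPy sketch of one Gradient Descent training loop, assuming a single-feature linear model $\widehat{y}=Wx+b$ and the squared-error loss defined in the next section (the data and learning rate are illustrative, not from the original):

```python
import numpy as np

# Toy single-feature dataset (illustrative): truth generated by y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

W, b = 0.0, 0.0   # parameters to learn
lr = 0.05         # learning rate

for step in range(500):
    y_hat = W * x + b                          # model prediction
    grad_y_hat = (2 / len(x)) * (y_hat - y)    # gradient of the average squared error
    # Chain rule: push the loss gradient back onto W and b, then step downhill.
    W -= lr * np.sum(grad_y_hat * x)
    b -= lr * np.sum(grad_y_hat)

print(round(W, 3), round(b, 3))   # converges towards W = 2, b = 1
```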
## Regression Loss - Mean Squared Error
For a Regression Task (predicting continuous values), Mean Squared Error (MSE) is typically chosen as the Loss Function.
$$ L_{MSE}=\dfrac{1}{n} \sum_{i=1}^{n} (y_i-\widehat{y}_i)^2 $$
The MSE Gradient (in vector form) is:
$$ \dfrac{\partial L}{\partial \widehat{y}} = -\dfrac{2}{n} (y-\widehat{y}) = \dfrac{2}{n} (\widehat{y}-y) $$
- if $\widehat{y}_i \gt y_i$ (gradient is +ve), Gradient Descent decreases $\widehat{y}_i$
- if $\widehat{y}_i \lt y_i$ (gradient is -ve), Gradient Descent increases $\widehat{y}_i$
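These sign rules can be checked with a short NumPy sketch of the loss and gradient above (the values are illustrative):

```python
import numpy as np

y     = np.array([1.0, 2.0, 3.0])
y_hat = np.array([2.5, 2.0, 1.0])    # over-predicted, exact, under-predicted

mse  = np.mean((y - y_hat) ** 2)     # L_MSE
grad = (2 / len(y)) * (y_hat - y)    # dL/d(y_hat)

print(mse)    # 2.083...
print(grad)   # [ 1.    0.   -1.33...]: +ve where y_hat > y, -ve where y_hat < y
```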
## Binary Classification Loss - Binary Cross Entropy
For Binary Classification, the Loss Function is Binary Cross Entropy (BCE).
$$ L_{BCE}=-\dfrac{1}{n} \sum_{i=1}^{n} \left( y_i \log(\widehat{y}_i) + (1-y_i) \log(1-\widehat{y}_i) \right) $$
Where
- the ground truth $y_i$ is 1 or 0 and acts as an indicator variable
- $\widehat{y}_i$ is the predicted probability that $y_i=1$; it lies between 0 and 1
The BCE Gradient (in vector form) is:
$$ \dfrac{\partial L}{\partial \widehat{y}} = -\dfrac{1}{n} \left( \dfrac{y}{\widehat{y}} - \dfrac{1-y}{1-\widehat{y}} \right) = \dfrac{1}{n} \left( \dfrac{1-y}{1-\widehat{y}} - \dfrac{y}{\widehat{y}} \right) $$
The Gradient of BCE is much steeper than that of MSE when the prediction is confidently wrong (e.g. $y=1$ but $\widehat{y}$ is close to zero). This is one of the reasons BCE is preferred over MSE as the Loss Function for Binary Classification.
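The steepness claim can be verified numerically. A minimal sketch, assuming a single sample (so $n=1$ and the $1/n$ factor drops out); the helper names `bce_grad` and `mse_grad` are illustrative:

```python
def bce_grad(y, y_hat):
    # dL_BCE/d(y_hat) for a single sample (n = 1)
    return (1 - y) / (1 - y_hat) - y / y_hat

def mse_grad(y, y_hat):
    # dL_MSE/d(y_hat) for a single sample (n = 1)
    return 2 * (y_hat - y)

# Confidently wrong prediction: truth is 1, model predicts ~0.
y, y_hat = 1.0, 0.01
print(bce_grad(y, y_hat))   # -100.0 : very steep, strong correction signal
print(mse_grad(y, y_hat))   # -1.98  : comparatively flat
```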
## Multi-Class Classification Loss - Categorical Cross Entropy
For Multi-Class Classification, the Loss Function is Categorical Cross Entropy (CCE). For a dataset of $n$ samples with one-hot encoded labels over $K$ classes, the CCE is:
$$L_{CCE}=- \dfrac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K} y_{i,k}\log(\widehat{y}_{i,k})$$
For example, if the prediction result of the $i$-th sample of a 3-class classification is:
- one-hot encoded truth label $y_{i}=[0,1,0]$
- The predicted probability $\widehat{y}_{i}=[0.2,0.7,0.1]$
- $Loss=-\log(0.7)\approx 0.357$
The CCE Gradient with respect to the predicted probability for class $k$ of sample $i$ is:
$$ \dfrac{\partial L}{\partial \widehat{y}_{i,k}} = -\dfrac{1}{n} \cdot \dfrac{y_{i,k}}{\widehat{y}_{i,k}} $$
(Note: $y_{i,k}$ is zero for all incorrect classes, so the gradient is non-zero only for the true class.)
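The worked example and the gradient above can be reproduced with a few lines of NumPy (a single sample, so $n=1$):

```python
import numpy as np

# The 3-class example above: truth is the 2nd class (one-hot).
y     = np.array([0.0, 1.0, 0.0])
y_hat = np.array([0.2, 0.7, 0.1])   # predicted probabilities (sum to 1)

cce = -np.sum(y * np.log(y_hat))    # L_CCE for a single sample (n = 1)
print(cce)                          # 0.3567 ≈ -log(0.7)

grad = -y / y_hat                   # dL/d(y_hat) per class (n = 1)
print(grad)                         # [-0. -1.4286 -0.]: non-zero only for the true class
```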