# Loss Function
Loss Function ($L$) measures prediction errors.
$$ L(y,\widehat{y})=\text{how wrong is $\widehat{y}$ when truth is $y$ ?} $$
In a Neural Net, we train the model to find the Weights (and Biases) that minimize the average loss over the training data.
$$ \min_{W,b} \dfrac{1}{n} \sum_{i=1}^{n} L(y_i,\widehat{y}_i) $$
In order for Gradient Descent to find the $W$ and $b$ that minimize the Loss, the Loss Function needs to be differentiable.
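To make the objective concrete, here is a minimal NumPy sketch of one Gradient Descent training loop, assuming a single-feature linear model $\widehat{y}=Wx+b$ and the squared-error loss defined in the next section (the data and learning rate are illustrative, not from the original):

```python
import numpy as np

# Toy single-feature dataset (illustrative): truth generated by y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

W, b = 0.0, 0.0   # parameters to learn
lr = 0.05         # learning rate

for step in range(500):
    y_hat = W * x + b                          # model prediction
    grad_y_hat = (2 / len(x)) * (y_hat - y)    # gradient of the average squared error
    # Chain rule: push the loss gradient back onto W and b, then step downhill.
    W -= lr * np.sum(grad_y_hat * x)
    b -= lr * np.sum(grad_y_hat)

print(round(W, 3), round(b, 3))   # converges towards W = 2, b = 1
```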
## Regression Loss - Mean Squared Error
For a Regression Task (predicting continuous values), Mean Squared Error (MSE) is typically chosen as the Loss Function.
$$ L_{MSE}=\dfrac{1}{n} \sum_{i=1}^{n} (y_i-\widehat{y}_i)^2 $$
The MSE Gradient (in vector form) is:
$$ \dfrac{\partial L}{\partial \widehat{y}} = -\dfrac{2}{n} (y-\widehat{y}) = \dfrac{2}{n} (\widehat{y}-y) $$
- if $\widehat{y}_i \gt y_i$ (gradient is +ve), Gradient Descent decreases $\widehat{y}_i$
- if $\widehat{y}_i \lt y_i$ (gradient is -ve), Gradient Descent increases $\widehat{y}_i$
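These sign rules can be checked with a short NumPy sketch of the loss and gradient above (the values are illustrative):

```python
import numpy as np

y     = np.array([1.0, 2.0, 3.0])
y_hat = np.array([2.5, 2.0, 1.0])    # over-predicted, exact, under-predicted

mse  = np.mean((y - y_hat) ** 2)     # L_MSE
grad = (2 / len(y)) * (y_hat - y)    # dL/d(y_hat)

print(mse)    # 2.083...
print(grad)   # [ 1.    0.   -1.33...]: +ve where y_hat > y, -ve where y_hat < y
```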
## Binary Classification Loss - Binary Cross Entropy
For Binary Classification, the Loss Function is Binary Cross Entropy (BCE).
$$ L_{BCE}=-\dfrac{1}{n} \sum_{i=1}^{n} \left( y_i \log(\widehat{y}_i) + (1-y_i) \log(1-\widehat{y}_i) \right) $$
Where
- the ground truth $y_i$ is 1 or 0 and acts as an indicator variable
- $\widehat{y}_i$ is the predicted probability that $y_i=1$; it lies between 0 and 1
The BCE Gradient (in vector form) is:
$$ \dfrac{\partial L}{\partial \widehat{y}} = -\dfrac{1}{n} \left( \dfrac{y}{\widehat{y}} - \dfrac{1-y}{1-\widehat{y}} \right) = \dfrac{1}{n} \left( \dfrac{1-y}{1-\widehat{y}} - \dfrac{y}{\widehat{y}} \right) $$
The Gradient of BCE is much steeper than that of MSE when the prediction is confidently wrong (e.g. $y=1$ but $\widehat{y}$ is close to zero). This is one of the reasons BCE is preferred over MSE as the Loss Function for Binary Classification.
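The steepness claim can be verified numerically. A minimal sketch, assuming a single sample (so $n=1$ and the $1/n$ factor drops out); the helper names `bce_grad` and `mse_grad` are illustrative:

```python
def bce_grad(y, y_hat):
    # dL_BCE/d(y_hat) for a single sample (n = 1)
    return (1 - y) / (1 - y_hat) - y / y_hat

def mse_grad(y, y_hat):
    # dL_MSE/d(y_hat) for a single sample (n = 1)
    return 2 * (y_hat - y)

# Confidently wrong prediction: truth is 1, model predicts ~0.
y, y_hat = 1.0, 0.01
print(bce_grad(y, y_hat))   # -100.0 : very steep, strong correction signal
print(mse_grad(y, y_hat))   # -1.98  : comparatively flat
```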
## Multi-Class Classification Loss - Categorical Cross Entropy
For Multi-Class Classification, the Loss Function is Categorical Cross Entropy (CCE). For a dataset of $n$ samples with one-hot encoded labels over $K$ classes, the CCE is:
$$L_{CCE}=- \dfrac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K} y_{i,k}\log(\widehat{y}_{i,k})$$
For example, if the prediction result of the $i$-th sample of a 3-class classification is:
- one-hot encoded truth label $y_{i}=[0,1,0]$
- The predicted probability $\widehat{y}_{i}=[0.2,0.7,0.1]$
- $Loss=-\log(0.7)\approx 0.357$
The CCE Gradient with respect to the predicted probability for class $k$ of sample $i$ is:
$$ \dfrac{\partial L}{\partial \widehat{y}_{i,k}} = -\dfrac{1}{n} \cdot \dfrac{y_{i,k}}{\widehat{y}_{i,k}} $$
(Note: $y_{i,k}$ is zero for all incorrect classes, so the gradient is non-zero only for the true class.)
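The worked example and the gradient above can be reproduced with a few lines of NumPy (a single sample, so $n=1$):

```python
import numpy as np

# The 3-class example above: truth is the 2nd class (one-hot).
y     = np.array([0.0, 1.0, 0.0])
y_hat = np.array([0.2, 0.7, 0.1])   # predicted probabilities (sum to 1)

cce = -np.sum(y * np.log(y_hat))    # L_CCE for a single sample (n = 1)
print(cce)                          # 0.3567 ≈ -log(0.7)

grad = -y / y_hat                   # dL/d(y_hat) per class (n = 1)
print(grad)                         # [-0. -1.4286 -0.]: non-zero only for the true class
```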