Loss functions
Softmax
Differentiable approximation of argmax: it maps a vector of scores to a probability distribution over indices.
Formula: softmax(x)_i = e^{x_i} / \sum_j e^{x_j}
If one of the x_j is much higher than the others, the output is close to 0 for every index except that j, for which the value is close to 1.
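A minimal NumPy sketch of the formula above (the `softmax` name and the max-subtraction trick are just an illustration, not from this wiki):

```python
import numpy as np

def softmax(x):
    # Subtracting the max avoids overflow in exp; it cancels out in the ratio.
    e = np.exp(x - np.max(x))
    return e / e.sum()

print(softmax(np.array([1.0, 2.0, 10.0])))  # ~[0.0001, 0.0003, 0.9995]
```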
Cross Entropy
See this friendly post, which starts from information entropy to explain cross-entropy as a choice of loss, and shows that minimizing the cross-entropy is equivalent to minimizing the negative log likelihood.
Entropy
H(y) = \sum_i y_i log(1/y_i) = -\sum_i y_i log(y_i)
Interpretation: expected number of bits needed to encode samples drawn from y, using an optimal code for y.
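A quick sketch of H(y) in NumPy (the `entropy` helper is hypothetical; note that `np.log` gives nats, use `np.log2` for bits):

```python
import numpy as np

def entropy(y, eps=1e-12):
    # H(y) = -sum_i y_i log(y_i); eps guards against log(0).
    return -np.sum(y * np.log(y + eps))

print(entropy(np.array([0.5, 0.5])))  # ~0.693 nats (= 1 bit)
print(entropy(np.array([1.0, 0.0])))  # ~0, deterministic distribution
```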
Cross-Entropy
Ground-truth distribution: y; estimated distribution: y_hat
C(y, y_hat) = - \sum_i y_i log(y_hat_i)
Interpretation: expected number of bits needed to encode samples drawn from y, using a code optimized for the wrong distribution y_hat.
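A matching sketch for C(y, y_hat) (the helper name and the example vectors are illustrative):

```python
import numpy as np

def cross_entropy(y, y_hat, eps=1e-12):
    # C(y, y_hat) = -sum_i y_i log(y_hat_i)
    return -np.sum(y * np.log(y_hat + eps))

y = np.array([0.0, 1.0, 0.0])      # one-hot ground truth
y_hat = np.array([0.1, 0.7, 0.2])  # predicted distribution (e.g. a softmax output)
print(cross_entropy(y, y_hat))     # = -log(0.7) ~ 0.357
```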
KL divergence
KL(y || y_hat) = C(y, y_hat) - H(y)
Minimizing KL with respect to y_hat is the same as minimizing the cross-entropy, since H(y) does not depend on y_hat.
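A self-contained numerical check of this identity, assuming y and y_hat are NumPy probability vectors (the `kl_divergence` helper is illustrative):

```python
import numpy as np

def kl_divergence(y, y_hat, eps=1e-12):
    # KL(y || y_hat) = sum_i y_i * (log y_i - log y_hat_i)
    return np.sum(y * (np.log(y + eps) - np.log(y_hat + eps)))

y = np.array([0.2, 0.8])
y_hat = np.array([0.5, 0.5])
cross_entropy = -np.sum(y * np.log(y_hat))  # C(y, y_hat)
entropy = -np.sum(y * np.log(y))            # H(y)
print(kl_divergence(y, y_hat))              # ~0.193
print(cross_entropy - entropy)              # same value
```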
Minimizing the cross-entropy is also the same as maximizing the likelihood (or minimizing the negative log likelihood) of the training data:
\prod_{(x, y)} P(x, y | \theta)
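A small sketch of this equivalence for one-hot labels (all names and numbers are illustrative): the negative log likelihood of the labels under the model equals the summed cross-entropy between the one-hot targets and the predicted distributions.

```python
import numpy as np

preds = np.array([[0.7, 0.2, 0.1],   # P(class | x, theta) for 3 samples
                  [0.1, 0.8, 0.1],
                  [0.2, 0.3, 0.5]])
labels = np.array([0, 1, 2])         # ground-truth class indices

nll = -np.sum(np.log(preds[np.arange(3), labels]))
one_hot = np.eye(3)[labels]
ce = -np.sum(one_hot * np.log(preds))
print(nll, ce)                       # identical values (~1.273)
```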