Loss Functions
Table of Contents
- Introduction
- Mean Squared Error (MSE)
- Mean Absolute Error (MAE)
- Huber Loss
- Binary Cross-Entropy Loss
- Categorical Cross-Entropy Loss
- Kullback-Leibler Divergence (KL Divergence)
- Hinge Loss
- Contrastive Loss
- Triplet Loss
- Conclusion
Introduction
Loss functions (also called cost functions or objective functions) quantify the difference between predicted outputs and true targets. During training, a neural network's weights are updated to minimize the loss using optimization algorithms such as gradient descent. A good choice of loss function promotes meaningful learning and faster convergence.
Given:
- Predictions: \(\hat{y}\)
- True values: \(y\)
- Loss: \(L(y, \hat{y})\)
The loss function is minimized during training.
Mean Squared Error (MSE)
Used in regression tasks.
Definition
\[
\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
\]
Properties
- Penalizes large errors more heavily than small errors.
- Sensitive to outliers.
- Smooth and differentiable.
Derivative
\[
\frac{\partial \, \text{MSE}}{\partial \hat{y}_i} = \frac{2}{n} (\hat{y}_i - y_i)
\]
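As a quick illustration, here is a minimal NumPy sketch of MSE and its gradient; the function and variable names are placeholders, not from any particular library.

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error averaged over n samples."""
    return np.mean((y - y_hat) ** 2)

def mse_grad(y, y_hat):
    """Gradient of MSE with respect to each prediction."""
    return 2.0 * (y_hat - y) / y.size

y = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.1, 1.8, 3.5])
print(mse(y, y_hat))       # ≈ 0.1
print(mse_grad(y, y_hat))  # larger errors get larger gradients
```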
Mean Absolute Error (MAE)
Also used in regression tasks.
Definition
\[
\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert
\]
Properties
- More robust to outliers than MSE.
- Not differentiable at \(y = \hat{y}\) (but subgradients are used).
- Slower convergence compared to MSE.
Derivative (subgradient)
\[
\frac{\partial \, \text{MAE}}{\partial \hat{y}_i} = \frac{1}{n} \operatorname{sign}(\hat{y}_i - y_i), \qquad \hat{y}_i \neq y_i
\]
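A minimal NumPy sketch of MAE and its subgradient, assuming the same placeholder array shapes as the MSE example above:

```python
import numpy as np

def mae(y, y_hat):
    """Mean absolute error averaged over n samples."""
    return np.mean(np.abs(y - y_hat))

def mae_subgrad(y, y_hat):
    """Subgradient of MAE w.r.t. the predictions; np.sign returns 0 at exact ties."""
    return np.sign(y_hat - y) / y.size
```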
Huber Loss
Combines the behavior of MSE and MAE: robust to outliers yet smooth near zero error.
Definition
\[
L_\delta(y, \hat{y}) =
\begin{cases}
\frac{1}{2} (y - \hat{y})^2 & \text{if } \lvert y - \hat{y} \rvert \le \delta \\
\delta \, \lvert y - \hat{y} \rvert - \frac{1}{2} \delta^2 & \text{otherwise}
\end{cases}
\]
Properties
- Quadratic for small errors, linear for large errors.
- Reduces sensitivity to outliers.
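A minimal NumPy sketch of the piecewise definition above; the threshold `delta` is a hypothetical default, not a prescribed value.

```python
import numpy as np

def huber(y, y_hat, delta=1.0):
    """Huber loss: quadratic for |error| <= delta, linear beyond it."""
    err = y - y_hat
    quadratic = 0.5 * err ** 2
    linear = delta * np.abs(err) - 0.5 * delta ** 2
    return np.mean(np.where(np.abs(err) <= delta, quadratic, linear))
```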
Binary Cross-Entropy Loss
Used in binary classification.
Definition
\[
L(y, \hat{y}) = -\big[\, y \log(\hat{y}) + (1 - y) \log(1 - \hat{y}) \,\big]
\]
Where \(y \in \{0, 1\}\) and \(\hat{y} \in (0, 1)\).
Properties
- Penalizes confident incorrect predictions.
- Assumes outputs are probabilities (usually after Sigmoid).
Derivative
\[
\frac{\partial L}{\partial \hat{y}} = \frac{\hat{y} - y}{\hat{y}\,(1 - \hat{y})}
\]
When \(\hat{y} = \sigma(z)\) comes from a sigmoid, \(\frac{\partial \hat{y}}{\partial z} = \hat{y}(1 - \hat{y})\), so the gradient with respect to the logit simplifies to \(\frac{\partial L}{\partial z} = \hat{y} - y\).
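A minimal NumPy sketch of binary cross-entropy; the `eps` clipping is an assumption added here to keep the logarithm numerically safe.

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Binary cross-entropy; clipping keeps log() away from 0 and 1."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -np.mean(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))
```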
Categorical Cross-Entropy Loss
Used in multi-class classification with one-hot encoded targets.
Definition
Given:
- True label vector \(y \in \{0, 1\}^C\) (one-hot encoded)
- Predicted probabilities \(\hat{y} = \text{softmax}(z)\)
the loss is
\[
L(y, \hat{y}) = -\sum_{i=1}^{C} y_i \log(\hat{y}_i)
\]
Where:
- \(C\) = number of classes
- \(y_i = 1\) for the correct class, \(0\) otherwise
With Softmax Output:
Let:
\[
\hat{y}_i = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}}
\]
Properties
- Used with Softmax outputs.
- Measures log loss over multiple categories.
Derivative of Loss w.r.t. Input \(z_k\)
We want to compute the derivative of the loss with respect to the logits \(z_k\), where \(\hat{y}_k = \text{softmax}(z)_k\).
Step-by-Step:
We already know:
\[
\frac{\partial L}{\partial \hat{y}_i} = -\frac{y_i}{\hat{y}_i}
\]
Now, for softmax:
\[
\frac{\partial \hat{y}_i}{\partial z_k} = \hat{y}_i \,(\delta_{ik} - \hat{y}_k)
\]
Putting this together (and using \(\sum_i y_i = 1\)):
\[
\frac{\partial L}{\partial z_k}
= \sum_{i} \frac{\partial L}{\partial \hat{y}_i} \frac{\partial \hat{y}_i}{\partial z_k}
= -\sum_{i} \frac{y_i}{\hat{y}_i}\, \hat{y}_i \,(\delta_{ik} - \hat{y}_k)
= \hat{y}_k - y_k
\]
When categorical cross-entropy is paired with a softmax output, the gradient with respect to the logits simplifies to \(\hat{y}_k - y_k\), which is both simpler and more numerically efficient. This is why the two are often combined in frameworks such as TensorFlow (SoftmaxCrossEntropyWithLogits) and PyTorch (CrossEntropyLoss).
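A minimal NumPy sketch of this combination, computing the loss from raw logits and returning the simplified gradient \(\hat{y} - y\); the function names and the small `1e-12` guard are placeholders, not framework APIs.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

def cce_from_logits(z, y):
    """Categorical cross-entropy computed from logits z and one-hot targets y."""
    y_hat = softmax(z)
    loss = -np.sum(y * np.log(y_hat + 1e-12))
    grad = y_hat - y  # the simplified gradient w.r.t. the logits
    return loss, grad

z = np.array([2.0, 1.0, 0.1])   # logits for 3 classes
y = np.array([1.0, 0.0, 0.0])   # one-hot target
loss, grad = cce_from_logits(z, y)
```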
Kullback-Leibler Divergence (KL Divergence)
Measures how one probability distribution diverges from a second expected distribution.
Definition
\[
D_{KL}(P \parallel Q) = \sum_{i} P(i) \log \frac{P(i)}{Q(i)}
\]
Where:
- \(P\): true distribution
- \(Q\): predicted distribution
Properties
- Asymmetric: \(D_{KL}(P \parallel Q) \neq D_{KL}(Q \parallel P)\)
- Used in VAEs (Variational Autoencoders), NLP models
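A minimal NumPy sketch for discrete distributions given as probability vectors; the `eps` clipping is an assumption added to avoid division by zero and log(0).

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) for discrete probability vectors p and q."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return np.sum(p * np.log(p / q))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])
print(kl_divergence(p, q), kl_divergence(q, p))  # the two values differ (asymmetry)
```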
Hinge Loss
Used in SVMs and “maximum-margin” classifiers.
Definition
\[
L(y, \hat{y}) = \max(0,\; 1 - y \cdot \hat{y})
\]
Where:
- \(y \in \{-1, +1\}\)
Properties
- Encourages correct classification with a margin.
- Only penalizes incorrect or borderline predictions.
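A minimal NumPy sketch, assuming `scores` holds the raw (unsquashed) classifier outputs \(\hat{y}\):

```python
import numpy as np

def hinge_loss(y, scores):
    """Hinge loss for labels y in {-1, +1} and raw model scores."""
    return np.mean(np.maximum(0.0, 1.0 - y * scores))
```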
Contrastive Loss
Used in Siamese Networks (e.g., face recognition).
Definition
\[
L = (1 - y)\,\tfrac{1}{2} D^2 + y\,\tfrac{1}{2} \big[\max(0,\; m - D)\big]^2
\]
Where:
- \(D\): Euclidean distance between feature embeddings
- \(y = 0\) for similar pairs, \(1\) for dissimilar
- \(m\): margin
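A minimal NumPy sketch following the convention above (\(y = 0\) for similar pairs, \(1\) for dissimilar); the margin default `m=1.0` is only illustrative.

```python
import numpy as np

def contrastive_loss(d, y, m=1.0):
    """Contrastive loss: d = pairwise embedding distance, y = 0 (similar) or 1 (dissimilar)."""
    return np.mean((1.0 - y) * 0.5 * d ** 2
                   + y * 0.5 * np.maximum(0.0, m - d) ** 2)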
Triplet Loss
Used in ranking tasks (e.g., face verification).
Definition
\[
L = \max\big(0,\; \lVert f(a) - f(p) \rVert^2 - \lVert f(a) - f(n) \rVert^2 + \alpha\big)
\]
Where:
- \(a\) = anchor
- \(p\) = positive example
- \(n\) = negative example
- \(\alpha\) = margin
- \(f(\cdot)\) = embedding function
Properties
- Encourages anchor-positive pairs to be closer than anchor-negative pairs by a margin \(\alpha\).
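A minimal NumPy sketch operating directly on batches of precomputed embeddings (the margin default `alpha=0.2` is illustrative):

```python
import numpy as np

def triplet_loss(a, p, n, alpha=0.2):
    """Triplet loss on batches of embeddings: anchor a, positive p, negative n."""
    d_ap = np.sum((a - p) ** 2, axis=-1)  # squared anchor-positive distance
    d_an = np.sum((a - n) ** 2, axis=-1)  # squared anchor-negative distance
    return np.mean(np.maximum(0.0, d_ap - d_an + alpha))
```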
Conclusion
Loss functions guide learning by quantifying prediction errors. Choosing the right loss function depends on the problem type:
| Task Type | Common Loss Functions |
|---|---|
| Regression | MSE, MAE, Huber |
| Binary Classification | Binary Cross-Entropy |
| Multi-class Classification | Categorical Cross-Entropy, Sparse Cross-Entropy |
| Embedding / Similarity | Contrastive, Triplet |
| Structured / Probabilistic | KL Divergence, Hinge |
Proper loss function choice can significantly improve convergence, performance, and generalization.