Introduction to Machine Learning
This documentation covers the fundamental concepts and theory behind key machine learning algorithms. It is designed to provide clear explanations and mathematical intuition to help deepen your understanding of how these algorithms work.
Use this as a starting point to explore the core ideas and techniques that power many machine learning applications today.
Table of Contents
- Introduction
- Linear Regression
- Linear Model Evaluation
- K-Fold Cross Validation
- Bias-Variance Decomposition
- Maximum Likelihood Estimation
- Examples: MLE Leading to Common Loss Functions
- Maximum A Posteriori (MAP) Estimation
- Gradient Descent from First-Order Taylor Approximation
- Gradient Descent (GD) Convergence
- Stochastic Gradient Descent (SGD)
- Mini-Batch Stochastic Gradient Descent
- Stochastic Gradient Descent (SGD) with Momentum
- Neural Network
Introduction
In supervised machine learning we usually start with a dataset given as: $$ \mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n} $$
Where \(x_i \in \mathcal{X}, y_i \in \mathcal{Y}, (x_i, y_i) \in \mathcal{X} \times \mathcal{Y}\) and \(n\) is the number of samples.
\(\mathcal{X}\) is known as the input space while \(\mathcal{Y}\) is the output space.
If \(\mathcal{Y}\) is a finite set of discrete labels, the task is Classification; if \(\mathcal{Y} \subseteq \mathbb{R}\) (continuous values), the task is Regression.
In ML we are looking for the best \(F\) such that:
\(F\) is known as a function, mapping, or hypothesis, and the space it comes from is known as the hypothesis class/space \(\mathcal{H}\), i.e., \(F \in \mathcal{H}\).
We also have an objective function (or cost function/criterion) where our main task is to optimize it.
Now we look at one of the common hypothesis spaces known as linear mapping, which leads us to linear regression.
Linear Regression
Here our hypothesis is given as: $$ F(x) = w^\top x = \sum_{j=1}^{d} w_j x_j $$
where \(x\) is known as the input data, \(w\) is the weights (coefficients) and \(d\) is number of features.
From the above linear mapping, we can have our objective function as:
Now we can find the analytic solution \(\hat{w}\) which minimizes our objective function \(L(w)\). To solve, we find the gradient of \(L(w)\) and set it to zero. Taking \(X \in \mathbb{R}^{n\times d},\quad Y \in \mathbb{R}^{n\times 1} \quad \text{and} \quad w \in \mathbb{R}^{d \times 1}\), let’s solve:
Now we find the gradient w.r.t \(w\):
Therefore, our gradient \(\nabla_w L(w)\) is given as:
Taking \(\nabla_w L(w) = 0\):
Therefore, the analytical solution \(\hat{w}\) that minimizes \(L(w)\) is: $$ \hat{w} = (X^\top X)^{-1} X^\top Y $$
But the above solution exists only if \(X^\top X\) is invertible, i.e., the inverse exists.
If the inverse doesn’t exist, we introduce a regularizer. Let’s use the \(L_2\) regularizer (also known as ridge), where we will have a new objective function given as:
Here \(\lambda > 0\): if \(\lambda\) were negative, making \(\left\| w \right\|_2^2\) arbitrarily large would drive \(L_{\text{ridge}}(w)\) to \(-\infty\), so the objective would not be bounded below.
Let’s find the analytic solution of \(L_{\text{ridge}}(w)\):
Now we find the gradient of \(L_{\text{ridge}}(w)\) w.r.t \(w\):
Therefore, our gradient \(\nabla_w L_{\text{ridge}}(w)\) is given as:
Taking \(\nabla_w L_{\text{ridge}}(w) = 0\):
Therefore, the analytical solution \(\hat{w}\) that minimizes \(L_{\text{ridge}}(w)\) is: $$ \hat{w} = (X^\top X + \lambda I)^{-1} X^\top Y $$
The above ridge regression solution always exists, since \(X^\top X + \lambda I\) is always invertible for \(\lambda > 0\) (\(X^\top X\) is positive semi-definite, so adding \(\lambda I\) makes it positive definite).
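To make these two formulas concrete, here is a minimal NumPy sketch (synthetic data; all variable names are our own, not from the text) that computes both closed-form solutions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data: n samples, d features.
n, d = 100, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=(d, 1))
Y = X @ w_true + 0.1 * rng.normal(size=(n, 1))

# Ordinary least squares: w_hat = (X^T X)^{-1} X^T Y.
w_ols = np.linalg.solve(X.T @ X, X.T @ Y)

# Ridge regression: w_hat = (X^T X + lambda * I)^{-1} X^T Y.
lam = 1.0
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

print("OLS weights:  ", w_ols.ravel())
print("Ridge weights:", w_ridge.ravel())
```

Note that `np.linalg.solve` is used instead of explicitly inverting \(X^\top X\); it solves the same normal equations but is numerically more stable.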
Linear Model Evaluation
For linear regression we most often evaluate the model using the mean squared error (MSE).
Given the dataset:
and a model \(f(x) = w^\top x\), the Mean Squared Error (MSE) is defined as:
Given the loss/criterion above, our aim is to find a function \(\hat{f}\) which minimizes it, i.e. $$ \hat{f} = \arg\min_{f \in \mathcal{H}} \; \frac{1}{n} \sum_{i=1}^{n} \left( f(x_i) - y_i \right)^2 $$ where:
- \(f\) - hypothesis
- \(\mathcal{H}\) - hypothesis class
- \(\hat{f}\) - best in \(\mathcal{H}\)
- \(\frac{1}{n} \sum_{i=1}^{n} \left( f(x_i) - y_i \right)^2\) - empirical risk
We have two types of risk:
1). Empirical Risk
We compute it from the training dataset; it is our training loss, i.e. $$ R_{\text{train}} = \frac{1}{n} \sum_{i=1}^{n} \left( f(x_i) - y_i \right)^2 $$
- This is also called training error or empirical risk \(\hat{R}(f)\).
- It’s computable since we have the training dataset \(\mathcal{D} = \{(x_i, y_i)\}_{i=1}^n\).
2). True Risk
$$ R = \mathbb{E}_{(x, y) \sim \tau} \left[ \left( f(x) - y \right)^2 \right] $$
- \(\tau\) is the true (unknown) data distribution.
- This is called expected risk or generalization error.
- We usually can’t compute this directly because we don’t know the true distribution \(\tau\), only samples from it (the dataset).
Why don’t we use Empirical Risk to evaluate our model performance?
This is because the Empirical (Training) Risk is typically a biased (optimistic) estimator of the True Risk: it underestimates the error the model will make on unseen data.
Proof
Let \(f_{\mathcal{H}}\) be the best-in-class predictor that minimizes the true risk, i.e.,
Let \(\hat{f}\) be the predictor in \(\mathcal{H}\) that minimizes the empirical risk (the empirical risk minimizer), i.e.,
By definition of \(\hat{f}\), we have:
Taking the expectation over training samples (drawn from distribution \(\tau\)):
However, we can go further to have
That is, the empirical risk of \(\hat{f}\) evaluated on the same data it was trained on is generally less than the true/generalization risk of \(\hat{f}\) on unseen data. Therefore:
Since we have seen that Empirical Risk is a biased estimate of the True Risk, we introduce the concept of the test dataset to evaluate our model’s generalization performance.
Given a test dataset:
Then the empirical test risk is defined as:
Proof:
Therefore, $$ \mathbb{E} \left[ R_{\text{test}} \right] = R $$
This shows that the empirical test risk is an unbiased estimate of the true risk, assuming the test data is sampled independently from the same distribution \(\tau\) as the training data and is not used during training.
Empirical Risk should not be used as a performance metric because it’s biased. It doesn’t reflect how well the model will generalize to new, unseen data — for that, we must estimate the true risk, typically using a validation or test set.
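To illustrate this optimism, the sketch below (an entirely synthetic setup of our own) fits a linear model on a small training set and compares the empirical (training) MSE with the MSE on a large independent test set drawn from the same distribution; the training error is typically the smaller of the two.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n, d=10, noise=0.5):
    """Draw samples from one fixed linear model plus Gaussian noise."""
    X = rng.normal(size=(n, d))
    w_true = np.arange(1, d + 1, dtype=float).reshape(-1, 1)
    Y = X @ w_true + noise * rng.normal(size=(n, 1))
    return X, Y

def mse(X, Y, w):
    return float(np.mean((X @ w - Y) ** 2))

# Fit on a small training set, evaluate on a large held-out test set.
X_train, Y_train = make_data(n=30)
X_test, Y_test = make_data(n=10_000)

w_hat = np.linalg.lstsq(X_train, Y_train, rcond=None)[0]

print("train MSE (empirical risk):      ", mse(X_train, Y_train, w_hat))  # optimistic
print("test MSE (estimate of true risk):", mse(X_test, Y_test, w_hat))    # usually larger
```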
K-Fold Cross Validation
K-Fold Cross Validation is a model evaluation and selection technique used to estimate how well a model generalizes to unseen data. It is also used to tune hyperparameters or select the best hypothesis \(f \in \mathcal{H}\).
Given a dataset:
And a learning algorithm that finds:
We partition the dataset into \(k\) folds:
Where:
- \(\mathcal{D}_i = \{ (x_j, y_j) \}_{j=1}^{n/k}\)
- \(\bigcup_{i = 1}^{k} \mathcal{D}_i = \mathcal{D}\)
- \(\mathcal{D}_i \cap \mathcal{D}_j = \emptyset \quad \forall i \neq j\)
How K-Fold Cross Validation Works
1. Split the dataset \(\mathcal{D}\) into \(k\) approximately equal-sized folds: $$ \mathcal{D}_1, \mathcal{D}_2, \dots, \mathcal{D}_k $$
2. For each fold \(i \in \{1, 2, \dots, k\}\):
   - Training set: \(\mathcal{D}_{\text{train}} = \mathcal{D} \setminus \mathcal{D}_i\)
   - Validation set: \(\mathcal{D}_i\)
   - Train the model \(f_i\) on \(\mathcal{D}_{\text{train}}\)
3. Evaluate the validation loss on \(\mathcal{D}_i\): \( L^{(i)} = \frac{1}{|\mathcal{D}_i|} \sum_{(x, y) \in \mathcal{D}_i} \ell(f_i(x), y) \)
4. Average the validation losses across all folds: \( \hat{L}_{\text{cv}} = \frac{1}{k} \sum_{i=1}^{k} L^{(i)} \)
How Cross Validation Helps Select Better \(f\):
- Provides a more reliable estimate of generalization performance on unseen data.
- Allows for fair comparison of different models or hyperparameters (e.g., regularization terms).
- Helps select the best model \(f^*\) that minimizes the cross-validation loss:
$$ f^* = \arg\min_{f \in \mathcal{H}} \; \hat{L}_{\text{cv}}(f) $$
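Below is a minimal sketch of the whole procedure (assuming a ridge-regression learner on synthetic data; the function names and the grid of \(\lambda\) values are our own illustrative choices), using \(\hat{L}_{\text{cv}}\) to compare hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(2)

def ridge_fit(X, Y, lam):
    """Closed-form ridge solution (X^T X + lambda * I)^{-1} X^T Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

def k_fold_cv(X, Y, lam, k=5):
    """Average validation MSE of ridge regression over k folds."""
    n = X.shape[0]
    idx = rng.permutation(n)
    folds = np.array_split(idx, k)
    losses = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        w = ridge_fit(X[train_idx], Y[train_idx], lam)
        losses.append(np.mean((X[val_idx] @ w - Y[val_idx]) ** 2))
    return float(np.mean(losses))

# Synthetic data and a small sweep over the regularization strength.
n, d = 200, 10
X = rng.normal(size=(n, d))
Y = X @ rng.normal(size=(d, 1)) + 0.3 * rng.normal(size=(n, 1))

for lam in [0.01, 0.1, 1.0, 10.0]:
    print(f"lambda = {lam:>5}   CV loss = {k_fold_cv(X, Y, lam):.4f}")
```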
Bias-Variance Decomposition
Given a dataset \(\mathcal{D} = \left\{ (x_i, y_i) \right\}_{i=1}^{n}\)
Let $$ y = f(x) + \epsilon $$
where \(\epsilon\) is the noise in the dataset and is independent of the \(x_i\)'s. We also assume the noise \(\epsilon\) comes from a Gaussian distribution with mean zero and variance \(\sigma^2\) (i.e. \(\epsilon \sim \mathcal{N}(0, \sigma^2)\)).
Given \(\hat{f}\), the True Risk is given as:
Now let's solve \(\mathbb{E}\left[(f(x)-\hat{f}(x))^2\right]\). Let \(\bar{f}\) be the average predictor, given as \(\bar{f}(x) = \mathbb{E}\left[\hat{f}(x)\right]\). Then we have:
Therefore our final equation can be expressed as:
where;
Thus the bias-variance decomposition is given as:
High Bias | High Variance |
---|---|
- Underfitting (high training and test error) | - Overfitting (low training error, high test error) |
- Model has not learned enough | - Model is too complex |
** Train longer | - Small training data |
- Model is too simple | ** Feature selection |
** Increase the features | ** Dimensionality reduction |
** Change hypothesis class | ** Data augmentation |
The above table compares high bias (underfitting) and high variance (overfitting): entries marked with - describe symptoms or causes, while entries marked with ** indicate common remedies for each issue.
Maximum Likelihood Estimation
Here we are going to use the concept of MLE to find negative log-likelihood and also come up with some of the objective/loss functions.
Given a dataset
Let \(h_{w} \in \mathcal{H}\), i.e., a function \(h\) with parameters \(w\) in the hypothesis space \(\mathcal{H}\). Assuming the outputs are drawn independently from a distribution \(P(y_i | x_i; w)\), the likelihood of the data is:
Since a product of probabilities (each between 0 and 1) quickly becomes very small, we take its \(\log\): the \(\log\) is an increasing function, so it preserves the maximizer, and it turns the product into a sum, which simplifies computation. This gives the log-likelihood:
Then, the Maximum Likelihood Estimation (MLE) objective is:
Or equivalently, we minimize the negative log-likelihood (NLL):
Examples: MLE Leading to Common Loss Functions
1. Mean Squared Error (MSE)
Given a dataset:
Assume that the target variable \(y_i\) is generated as:
where \(\epsilon_i \sim \mathcal{N}(0, \sigma^2)\). So the conditional probability of \(y_i\) given \(x_i\) is:
Assuming i.i.d. data:
Take the log:
We now minimize the negative log-likelihood:
Since the term involving \(\log(2\pi\sigma^2)\) and the factor \(\frac{1}{2\sigma^2}\) do not depend on \(w\), minimizing \(\mathcal{L}_{\text{NLL}}(w)\) is equivalent to minimizing:
Thus, minimizing the Mean Squared Error (MSE):
is equivalent to maximum likelihood estimation under the assumption of Gaussian noise in the regression model.
2. Binary Cross-Entropy
Given a dataset:
Assume that the target variable \(y_i\) is generated from a Bernoulli distribution, i.e.,
Since the \(\hat{y}_i\) are probabilities, we can use the sigmoid function to generate them:
Assuming i.i.d. data:
Take the log:
We now minimize the negative log-likelihood:
Thus the resulting loss function is called the Binary Cross-Entropy given as:
So minimizing binary cross-entropy is equivalent to maximum likelihood estimation under the assumption that labels are drawn from a Bernoulli distribution with probability given by \(\hat{y} = \frac{1}{1+e^{-w^\top x}}\).
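As a small sketch of this correspondence (toy data and variable names chosen only for illustration), the sigmoid produces the probabilities \(\hat{y}_i\), and the binary cross-entropy below is exactly the averaged negative log-likelihood of the Bernoulli model:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(w, X, y, eps=1e-12):
    """Average negative log-likelihood of Bernoulli labels with p = sigmoid(X @ w)."""
    p = sigmoid(X @ w)
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

# Tiny example: 4 samples, 2 features.
X = np.array([[0.5, 1.0], [-1.0, 0.2], [2.0, -0.3], [0.1, 0.1]])
y = np.array([1.0, 0.0, 1.0, 0.0])
w = np.array([0.7, -0.2])

print("BCE / NLL:", binary_cross_entropy(w, X, y))
```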
NOTE:
In Maximum Likelihood Estimation (MLE), to derive the negative log-likelihood, we assume that we know the distribution of the outputs \(y_i\) given the inputs \(x_i\) and model parameters \(w\); that is, we assume a probabilistic model of the form:
where \(w\) represents the model parameters.
Now, we introduce another concept known as Maximum A Posteriori (MAP) Estimation.
Maximum A Posteriori (MAP) Estimation
Given a dataset:
Let \(h_w \in \mathcal{H}\), i.e., a function \(h\) parameterized by \(w\), in hypothesis space \(\mathcal{H}\).
In Maximum A Posteriori (MAP) Estimation, we assume we have prior knowledge about the distribution of the parameters \(w\). Using Bayes’ Rule:
Where:
- \(P(w \mid \mathcal{D})\): Posterior — what MAP tries to maximize
- \(P(\mathcal{D} \mid w)\): Likelihood — same as in MLE
- \(P(w)\): Prior over parameters
- \(P(\mathcal{D})\): Marginal likelihood (a constant with respect to \(w\))
The MAP estimate maximizes the posterior:
Because probabilities are small and prone to numerical instability, we apply the log function (monotonic, so it preserves maxima):
In practice, we often minimize the negative log-posterior:
- The first term is the negative log-likelihood (same as MLE)
- The second term is the negative log-prior — acts as a regularization term
MAP is like MLE plus regularization, where the regularization reflects our prior belief about the parameters.
Example 1: MAP with a Gaussian Prior (L2 regularization)
Assume
Where \(d\) is the number of features and \(n\) is the number of samples, i.e., \(X \in \mathbb{R}^{n\times d}\), which implies that \(w \in \mathbb{R}^d\).
Now our prior will be
Now, solving for the log-prior:
Therefore we will have:
This shows that a Gaussian prior on \(w\) leads to L2 regularization in the MAP objective which is the same as Ridge Regression.
Example 2: MAP with a Laplace Prior (L1 regularization)
Assume the prior on each weight \(w_j\) is a Laplace distribution (double exponential distribution):
Which implies
Since the weights are assumed to be i.i.d., the prior is:
Now computing the log-prior:
Apply Bayes’ rule to get the MAP estimate:
This shows that a Laplace prior on \(w\) leads to L1 regularization in the MAP objective which is the same as Lasso Regression.
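The sketch below (synthetic linear-Gaussian data; the names and the value of \(\lambda\) are our own) simply evaluates the three objectives side by side, to show how the choice of prior adds either an L2 or an L1 penalty on top of the same negative log-likelihood:

```python
import numpy as np

rng = np.random.default_rng(8)

# Synthetic linear-Gaussian data; w is the parameter the prior is placed on.
n, d = 100, 5
X = rng.normal(size=(n, d))
Y = X @ rng.normal(size=(d, 1)) + 0.2 * rng.normal(size=(n, 1))
w = rng.normal(size=(d, 1))

nll = 0.5 * float(np.sum((X @ w - Y) ** 2))   # negative log-likelihood (up to constants)

lam = 0.5
map_gaussian = nll + lam * float(np.sum(w ** 2))      # L2 penalty -> ridge
map_laplace = nll + lam * float(np.sum(np.abs(w)))    # L1 penalty -> lasso

print("NLL only (MLE objective):      ", nll)
print("NLL + L2 (MAP, Gaussian prior):", map_gaussian)
print("NLL + L1 (MAP, Laplace prior): ", map_laplace)
```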
Gradient Descent from First-Order Taylor Approximation
The gradient of a function \(f\) at a point \(w\) yields the first-order Taylor approximation of \(f\) around \(w\), given as:
When \(f\) is a convex function, this approximation is a lower bound of \(f\). That is:
Therefore, for \(w\) close to \(w^{(t)}\), we can approximate \(f(w)\) as:
We note that the above approximation might become loose if \(w\) is far away from \(w^{(t)}\). Therefore, we minimize jointly the distance between \(w\) and \(w^{(t)}\) and the approximation of \(f\) around \(w^{(t)}\). We also introduce a parameter \(\eta\) which controls the trade-off between the two terms. This leads to the update rule as:
Solving the optimization problem by taking the derivative with respect to \(w\) and setting it to zero yields the gradient descent update rule.
Taking the objective function as \(J(w)\):
Now solving for \(\nabla_w J(w)\)
Next, we set the gradient to zero, i.e. \(\nabla_w J(w) = 0\):
Therefore, the gradient descent update rule is given as: $$ w^{(t+1)} = w^{(t)} - \eta \nabla f(w^{(t)}) $$
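Here is a minimal sketch of this update rule applied to a least-squares objective (synthetic data; the learning rate and iteration count are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)

# Least-squares objective f(w) = (1/n) * ||Xw - Y||^2 and its gradient.
n, d = 200, 5
X = rng.normal(size=(n, d))
Y = X @ rng.normal(size=(d, 1)) + 0.1 * rng.normal(size=(n, 1))

def grad(w):
    return (2.0 / n) * X.T @ (X @ w - Y)

eta = 0.1                       # learning rate
w = np.zeros((d, 1))
for t in range(500):
    w = w - eta * grad(w)       # w^{(t+1)} = w^{(t)} - eta * grad f(w^{(t)})

print("loss after GD:", float(np.mean((X @ w - Y) ** 2)))
```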
Gradient Descent (GD) Convergence
We want to show that GD converges, i.e.,
after a sufficient number of steps \(T\), where:
- \(f\) is convex and \(\rho\)-Lipschitz
- \(\bar{w} = \frac{1}{T} \sum_{t=1}^T w^{(t)}\)
- \(w^*\) is the optimal point and \(B\) is an upper bound on \(\|w^*\|\), i.e., \(\|w^*\| \leq B\)
- \(w^{(t+1)} = w^{(t)} - \eta \nabla f(w^{(t)})\)
Proof:
From the definition of \(\bar{w}\) and using Jensen's inequality, we have:
Due to convexity of \(f\) we have:
So combining the above two inequalities we obtain:
To bound the right-hand side of the above inequality we use the following lemma: given the update rule \(w^{(t+1)} = w^{(t)} - \eta \nabla f(w^{(t)})\), it satisfies:
If \(f\) is \(\rho\)-Lipschitz, then \(\| \nabla f(w^{(t)}) \| \leq \rho\). Therefore:
Now, choosing the learning rate as \(\eta = \frac{B}{\rho \sqrt{T}}\),
Then:
Dividing by \(T\):
If we run the GD algorithm on \(f\) for \(T\) steps with \(\eta = \frac{B}{\rho \sqrt{T}}\), then we will have,
To guarantee that the left-hand side is at most \(\varepsilon\), we solve:
Thus, Gradient Descent converges to within error \(\varepsilon\) after:
iterations, when minimizing a convex and \(\rho\)-Lipschitz function \(f\), under the assumption that the optimal solution satisfies \(\|w^*\| \leq B\).
Stochastic Gradient Descent (SGD)
Stochastic Gradient Descent is an iterative optimization algorithm used to minimize a differentiable objective function. Unlike standard (batch) gradient descent, which computes the gradient using the entire dataset, SGD approximates the gradient using a single randomly chosen data point at each iteration.
From above, we have seen that standard gradient descent has the following update rule:
Here, \(f(w) = \frac{1}{n} \sum_{i=1}^n f_i(w)\), where each \(f_i(w)\) typically represents the loss on the \(i\)-th training example.
In Stochastic Gradient Descent, the full gradient \(\nabla f(w^{(t)})\) is approximated by the gradient on a single example:
where \(i\) is chosen uniformly at random from \(\{1, \dots, n\}\) at each iteration.
This stochastic estimate is unbiased but introduces variance, which affects convergence stability:
Proof:
Given that \(i\) is selected uniformly at random from \(\{1, 2, \dots, n\}\), the probability of selecting any particular index \(i\) is \(\frac{1}{n}\). Therefore, the expectation of the stochastic gradient is:
This confirms that the stochastic gradient estimator is unbiased.
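The sketch below (same kind of synthetic least-squares problem as before; helper names are our own) first checks this unbiasedness numerically by averaging the per-sample gradients, and then runs plain single-sample SGD:

```python
import numpy as np

rng = np.random.default_rng(4)

n, d = 500, 5
X = rng.normal(size=(n, d))
Y = X @ rng.normal(size=(d, 1)) + 0.1 * rng.normal(size=(n, 1))

def full_grad(w):
    return (2.0 / n) * X.T @ (X @ w - Y)

def single_grad(w, i):
    xi, yi = X[i:i+1], Y[i:i+1]
    return 2.0 * xi.T @ (xi @ w - yi)

w = np.zeros((d, 1))

# Unbiasedness check: averaging the per-sample gradients recovers the full gradient.
avg = np.mean([single_grad(w, i) for i in range(n)], axis=0)
print("max |avg single-sample grad - full grad|:", float(np.max(np.abs(avg - full_grad(w)))))

# Plain SGD: one uniformly sampled data point per update.
eta = 0.01
for t in range(5000):
    i = rng.integers(n)
    w = w - eta * single_grad(w, i)

print("loss after SGD:", float(np.mean((X @ w - Y) ** 2)))
```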
Mini-Batch Stochastic Gradient Descent
Mini-Batch SGD is a compromise between standard (batch) gradient descent and stochastic gradient descent (SGD). Instead of computing the gradient using the entire dataset or a single sample, it computes the gradient over a random subset (mini-batch) of data points at each iteration.
Let \(B_t \subset \{1, 2, \dots, n\}\) be a randomly selected mini-batch of \(m\) examples at iteration \(t\). The update rule is then given as:
Where:
- \(m\): mini-batch size
- \(B_t\): the mini-batch sampled uniformly without replacement
We can also show that the Mini-Batch SGD gradient estimate is unbiased, i.e.,
Proof:
We’re taking the expected value of the mini-batch gradient over all possible random mini-batches \(B_t\) of size \(m\). Since the indices are sampled uniformly, each data point \(i\) has the same probability \(\frac{m}{n}\) of being included in the mini-batch.
Using linearity of expectation:
Since:
We substitute:
Therefore:
We can also note that Mini-Batch SGD has lower variance than single-sample SGD, due to averaging over multiple samples, as the sketch below illustrates.
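A minimal mini-batch SGD sketch under the same assumptions (synthetic least-squares problem; the batch size \(m = 32\) is an arbitrary choice): the only change from single-sample SGD is that the gradient is averaged over the sampled indices.

```python
import numpy as np

rng = np.random.default_rng(5)

n, d, m = 500, 5, 32            # m = mini-batch size
X = rng.normal(size=(n, d))
Y = X @ rng.normal(size=(d, 1)) + 0.1 * rng.normal(size=(n, 1))

def batch_grad(w, idx):
    """Gradient of the squared error averaged over the mini-batch `idx`."""
    Xb, Yb = X[idx], Y[idx]
    return (2.0 / len(idx)) * Xb.T @ (Xb @ w - Yb)

eta = 0.05
w = np.zeros((d, 1))
for t in range(2000):
    idx = rng.choice(n, size=m, replace=False)   # sample without replacement
    w = w - eta * batch_grad(w, idx)

print("loss after mini-batch SGD:", float(np.mean((X @ w - Y) ** 2)))
```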
Stochastic Gradient Descent (SGD) with Momentum
SGD with momentum is used to accelerate convergence and to smooth the updates, especially in directions of consistent gradients, by adding a "memory" of past gradients.
In SGD with Momentum we have the following update rule:
We introduce a velocity vector \(v^{(t)}\), which accumulates an exponentially decaying moving average of past gradients.
Where:
- \(\beta \in [0,1)\) is the momentum coefficient,
- \(v^{(0)} = 0\),
- \(\nabla f_i(w^{(t)})\) is a stochastic gradient sampled at iteration \(t\).
We can see that the velocity term is a weighted sum of previous gradients given as:
This shows that recent gradients have more weight (since \(\beta^k\) decays with \(k\)).
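Below is a sketch of one common formulation of the momentum update, in which the velocity accumulates gradients as \(v \leftarrow \beta v + g\) and the parameters move along \(-\eta v\); the problem setup and hyperparameters are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(6)

n, d, m = 500, 5, 32
X = rng.normal(size=(n, d))
Y = X @ rng.normal(size=(d, 1)) + 0.1 * rng.normal(size=(n, 1))

def batch_grad(w, idx):
    Xb, Yb = X[idx], Y[idx]
    return (2.0 / len(idx)) * Xb.T @ (Xb @ w - Yb)

eta, beta = 0.05, 0.9
w = np.zeros((d, 1))
v = np.zeros_like(w)              # velocity: decaying sum of past gradients
for t in range(2000):
    idx = rng.choice(n, size=m, replace=False)
    g = batch_grad(w, idx)
    v = beta * v + g              # accumulate past gradients
    w = w - eta * v               # step along the velocity, not the raw gradient

print("loss after SGD with momentum:", float(np.mean((X @ w - Y) ** 2)))
```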
Neural Network
A neural network is a machine learning model inspired by the human brain that maps input data to output predictions by learning weights through layers of interconnected nodes (neurons). Neural networks are capable of modeling complex non-linear decision boundaries by composing multiple linear transformations and non-linear activation functions.
A feedforward neural network is made up of:
- An input layer that takes the features.
- One or more hidden layers that learn intermediate representations.
- An output layer that produces predictions.
Each layer applies a linear transformation followed by a non-linear activation function.
Data Representation
Let:
- \(n\): number of training examples
- \(d\): number of features
- \(X \in \mathbb{R}^{d \times n}\): input matrix
- \(y \in \mathbb{R}^{1 \times n}\): target vector for regression, or \(y \in \{0, 1\}^{1\times n}\) for binary classification
For a 2-layer neural network (i.e., one hidden layer), we define:
- \(W^{[1]} \in \mathbb{R}^{h \times d}\): weights for hidden layer
- \(b^{[1]} \in \mathbb{R}^{h \times 1}\): bias for hidden layer
- \(W^{[2]} \in \mathbb{R}^{1 \times h}\): weights for output layer
- \(b^{[2]} \in \mathbb{R}\): bias for output layer
Neural Network Architecture
Here we are going to use an example of a 2-layer feedforward neural network with:
- Input layer (\(d\) neurons)
- Hidden layer (\(h_1\) neurons)
- Output layer (\(h_2\) neurons)
Assuming our task is binary classification, we use the sigmoid activation function \(\sigma\) on both layers.
Now let’s see how our data and parameters are represented.
- Data: Features \(X \in \mathbb{R}^{d\times n}\), which is the transpose of our original input data. Targets \(Y \in \mathbb{R}^{h_2\times n}\), which is the transpose of our original target data.
- Layer 1: Weight \(W_1 \in \mathbb{R}^{h_1\times d}\), Bias \(b_1 \in \mathbb{R}^{h_1\times 1}\)
- Layer 2: Weight \(W_2 \in \mathbb{R}^{h_2\times h_1}\), Bias \(b_2 \in \mathbb{R}^{h_2\times 1}\)
- Loss Function: Since this is a binary classification problem, we use binary cross-entropy. $$ \mathcal{L} = - \frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right] $$
Forward Propagation
Forward propagation, also called the forward pass, computes the output values of all network nodes, starting from the first hidden layer and proceeding to the final output layer, using either a subset (mini-batch) of the dataset or the entire dataset.
Layer 1
- linear transformation $$ Z_1 = W_1X+b_1 \quad \text{where } Z_1 \in \mathbb{R}^{h_1\times n} $$
- non-linear transformation with \(\sigma\) activation function $$ A_1 = \sigma (Z_1) \quad \text{where } A_1 \in \mathbb{R}^{h_1\times n} $$
- Number of parameters in layer 1 is given as: \((h_1 \times d) + h_1 = h_1(d + 1)\).
Layer 2
- linear transformation $$ Z_2 = W_2A_1+b_2 \quad \text{where } Z_2 \in \mathbb{R}^{h_2\times n} $$
- non-linear transformation with \(\sigma\) activation function $$ A_2 = \sigma (Z_2) \quad \text{where } A_2 \in \mathbb{R}^{h_2\times n} $$
- Number of parameters in layer 2 is given as: \((h_2 \times h_1) + h_2 = h_2(h_1+1)\).
Output Layer
- Output
$$ A_2 = \hat{Y} \in \mathbb{R}^{h_2\times n} $$
Loss Function
For binary classification, we use binary cross-entropy loss:
Matrix form:
Backward Propagation
Backpropagation, or backward propagation of errors, is an algorithm working from the output nodes to the input nodes of a neural network using the chain rule to compute how much each activation unit contributed to the overall error.
Backpropagation automatically computes error gradients to then repeatedly adjust all weights and biases to reduce the overall error.
From our example, our aim is to find the gradients of the loss with respect to the parameters \(W_2\), \(b_2\), \(W_1\), and \(b_1\).
Note that our loss is the binary cross-entropy above. Since we want to find its partial derivative w.r.t. \(A_2\), we rewrite it in terms of \(A_2\) as:
- \(\displaystyle \frac{\partial L}{\partial A_2}\)
- \(\displaystyle \frac{\partial L}{\partial Z_2} = \displaystyle \frac{\partial L}{\partial A_2} \times \displaystyle \frac{\partial A_2}{\partial Z_2}\)
Therefore we have,
- \(\displaystyle \frac{\partial L}{\partial W_2} = \displaystyle \frac{\partial L}{\partial Z_2} \times \displaystyle \frac{\partial Z_2}{\partial W_2}\)
Therefore we have,
- \(\displaystyle \frac{\partial L}{\partial b_2} = \displaystyle \frac{\partial L}{\partial Z_2} \times \displaystyle \frac{\partial Z_2}{\partial b_2}\)
Therefore we have,
- \(\displaystyle \frac{\partial L}{\partial A_1} = \displaystyle \frac{\partial L}{\partial Z_2} \times \displaystyle \frac{\partial Z_2}{\partial A_1}\)
Therefore we have,
- \(\displaystyle \frac{\partial L}{\partial Z_1} = \displaystyle \frac{\partial L}{\partial A_1} \times \displaystyle \frac{\partial A_1}{\partial Z_1}\)
Therefore we have,
- \(\displaystyle \frac{\partial L}{\partial W_1} = \displaystyle \frac{\partial L}{\partial Z_1} \times \displaystyle \frac{\partial Z_1}{\partial W_1}\)
Therefore we have,
- \(\displaystyle \frac{\partial L}{\partial b_1} = \displaystyle \frac{\partial L}{\partial Z_1} \times \displaystyle \frac{\partial Z_1}{\partial b_1}\)
Therefore we have,
Gradient Descent and Optimization
Using gradient descent, we update parameters in the direction that reduces the loss.
Let \(\eta\) be the learning rate. The update rules:
- \(W_1 \leftarrow W_1 - \eta \cdot \frac{\partial L}{\partial W_1}\)
- \(b_1 \leftarrow b_1 - \eta \cdot \frac{\partial L}{\partial b_1}\)
- \(W_2 \leftarrow W_2 - \eta \cdot \frac{\partial L}{\partial W_2}\)
- \(b_2 \leftarrow b_2 - \eta \cdot \frac{\partial L}{\partial b_2}\)
Repeat this process for multiple epochs until the loss converges.
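Putting the forward pass, backward pass, and these updates together, here is a compact end-to-end sketch of the 2-layer network above (sigmoid activations, binary cross-entropy), trained with full-batch gradient descent on a small synthetic XOR-like problem; the data, initialization scale, learning rate, and epoch count are illustrative assumptions rather than prescriptions.

```python
import numpy as np

rng = np.random.default_rng(7)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy binary-classification data laid out as in the text: columns are samples.
n, d, h1, h2 = 200, 2, 8, 1
X = rng.normal(size=(d, n))                      # X in R^{d x n}
Y = (X[0:1, :] * X[1:2, :] > 0).astype(float)    # Y in R^{h2 x n} (XOR-like labels)

# Parameter initialization: small random weights, zero biases.
W1 = 0.5 * rng.normal(size=(h1, d)); b1 = np.zeros((h1, 1))
W2 = 0.5 * rng.normal(size=(h2, h1)); b2 = np.zeros((h2, 1))

eta = 1.0
for epoch in range(5001):
    # ---- Forward pass ----
    Z1 = W1 @ X + b1;  A1 = sigmoid(Z1)          # hidden layer
    Z2 = W2 @ A1 + b2; A2 = sigmoid(Z2)          # output layer, A2 = Y_hat

    # Binary cross-entropy loss.
    loss = -np.mean(Y * np.log(A2 + 1e-12) + (1 - Y) * np.log(1 - A2 + 1e-12))

    # ---- Backward pass (chain rule) ----
    dZ2 = (A2 - Y) / n                           # dL/dZ2 for sigmoid + BCE
    dW2 = dZ2 @ A1.T;  db2 = dZ2.sum(axis=1, keepdims=True)
    dA1 = W2.T @ dZ2
    dZ1 = dA1 * A1 * (1 - A1)                    # sigmoid derivative
    dW1 = dZ1 @ X.T;   db1 = dZ1.sum(axis=1, keepdims=True)

    # ---- Gradient descent updates ----
    W1 -= eta * dW1; b1 -= eta * db1
    W2 -= eta * dW2; b2 -= eta * db2

    if epoch % 1000 == 0:
        acc = float(np.mean((A2 > 0.5) == Y))
        print(f"epoch {epoch:4d}  loss {loss:.4f}  accuracy {acc:.2f}")
```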
Vanishing and Exploding Gradient Problems in Deep Neural Networks
In a deep neural network, at layer \(\ell\), we define the pre-activation and activation as follows:
Here, \(\phi\) is the activation function.
Gradient Behavior in Backpropagation
During backpropagation, the gradient of the loss \(L\) with respect to \(Z^{\ell}\) can be approximated as:
This product can either explode or vanish, depending on the magnitudes of the weights and activation derivatives.
Case 1: Exploding Gradients
Occurs when:
Let
This exponential growth leads to exploding gradients, destabilizing training.
Solutions:
- Reduce depth using residual/skip connections.
- Regularization (e.g., \(L_1\), \(L_2\), or spectral norm regularization).
- Gradient clipping to limit gradient magnitude.
- Normalization techniques, such as BatchNorm or LayerNorm.
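As an example of one of these remedies, here is a minimal sketch of gradient clipping by global norm (the function name and threshold are our own):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Scale a list of gradient arrays so their combined L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads], total_norm

# Example: a pair of artificially large gradients is rescaled before the update.
grads = [np.array([[30.0, -40.0]]), np.array([5.0])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
print("norm before:", norm_before)
print("norm after: ", float(np.sqrt(sum(np.sum(g ** 2) for g in clipped))))
```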
Case 2: Vanishing Gradients
Occurs when:
This leads to gradients approaching zero, making it difficult for earlier layers to learn.
Solutions:
- Reduce depth via residual connections.
- Use non-saturating activation functions: prefer ReLU, Leaky ReLU, ELU, or Swish over sigmoid or tanh.
- Proper weight initialization (e.g., He or Xavier initialization).
Problem of Dying Neurons
With ReLU, neurons can "die" (i.e., output zero for all inputs): when a neuron's pre-activation is negative for every input, its gradient is zero, so its weights stop being updated.
Solutions:
- Use Leaky ReLU, PReLU, or ELU to maintain non-zero gradients.
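A tiny sketch contrasting the two activations (the slope \(\alpha = 0.01\) is an arbitrary choice): for negative pre-activations ReLU's gradient is exactly zero, whereas Leaky ReLU keeps a small slope so the neuron can still learn.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print("relu(z):        ", relu(z))
print("leaky_relu(z):  ", leaky_relu(z))
# ReLU's gradient is 0 for all negative pre-activations (the neuron can "die"),
# while Leaky ReLU keeps a constant slope alpha there.
print("relu grad:      ", (z > 0).astype(float))
print("leaky relu grad:", np.where(z > 0, 1.0, 0.01))
```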
Depth and Skip Connections
Depth refers to the number of layers or the length of the computational path from input to output. Skip connections help by providing alternate shorter paths for gradient flow, effectively reducing the network’s depth from a gradient propagation perspective.
Summary Table
Problem | Solutions |
---|---|
Vanishing Gradient | Use residual connections (reduce effective depth); use non-saturating activations; use proper initialization |
Exploding Gradient | Use residual connections; regularization (e.g., spectral norm); gradient clipping; normalization techniques |
Dying Neurons | Use Leaky ReLU, ELU, or PReLU |