Logistic Regression
Table of Contents
- Introduction
- Data Representation
- Model Hypothesis
- Loss Function
- Gradient of the Loss Function
- Method 2: When \(y \in \{-1, 1\}\)
- Conclusion
Introduction
Logistic regression models the probability that a binary target \(y\) equals 1 as a function of a linear combination of the input features \(X\). It applies the sigmoid function to that linear combination to produce an output between 0 and 1, which is then thresholded to make a binary class prediction. Matrix operations enable efficient computation and scalability to large datasets.
Data Representation
Let:
- \(n\): number of training examples
- \(d\): number of features
We define:
- \(\mathbf{X} \in \mathbb{R}^{n \times d}\): feature matrix
- \(\mathbf{y} \in \{0,1\}^n\): target vector
- \(\beta \in \mathbb{R}^{d \times 1}\): weight vector
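As a concrete illustration of these shapes, a minimal NumPy sketch (the values of \(n\) and \(d\) are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 100, 3                      # number of examples and features (illustrative values)
X = rng.normal(size=(n, d))        # feature matrix, shape (n, d)
y = rng.integers(0, 2, size=n)     # binary targets in {0, 1}, shape (n,)
beta = np.zeros(d)                 # weight vector, shape (d,)

print(X.shape, y.shape, beta.shape)   # (100, 3) (100,) (3,)
```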
Model Hypothesis
In logistic regression, we compute a linear combination of input features and model parameters, and then apply the sigmoid function to map the result into the range \([0, 1]\). The hypothesis function is:
$$ \hat{y}_{i} = \sigma(\beta^\top x_{i}), \qquad \hat{\mathbf{y}} = \sigma(\mathbf{X}\beta) $$
where the sigmoid function \(\sigma(z)\) is defined as:
$$ \sigma(z) = \frac{1}{1 + e^{-z}} $$
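As a rough illustration of the hypothesis, here is a minimal NumPy sketch (the function names `sigmoid`, `predict_proba`, and `predict` are my own, not from the text):

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, beta):
    # Hypothesis: y_hat_i = sigma(beta^T x_i), computed for every row of X at once
    return sigmoid(X @ beta)

def predict(X, beta, threshold=0.5):
    # Threshold the probabilities to obtain hard class labels in {0, 1}
    return (predict_proba(X, beta) >= threshold).astype(int)
```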
Loss Function
To derive the loss function for logistic regression, we use maximum likelihood estimation (MLE), assuming that each target label is drawn from a Bernoulli distribution.
Assumption: Bernoulli Targets
We assume that the training examples \(\{(x_i, y_i)\}_{i=1}^n\) are independent and identically distributed (iid), and that each target \(y_{i} \in \{0, 1\}\) follows a Bernoulli distribution parameterized by \(\theta_i\):
$$ y_{i} \sim \text{Bernoulli}(\theta_{i}) $$
where \(\theta_i = \hat{y}_{i} = \sigma(\beta^\top x_{i})\). The probability mass function of the Bernoulli distribution is:
$$ P(y_{i}; \theta_{i}) = \theta_{i}^{\,y_{i}} (1 - \theta_{i})^{\,1 - y_{i}}, \qquad y_{i} \in \{0, 1\} $$
Now computing the likelihood \(L(\beta)\):
$$ L(\beta) = \prod_{i=1}^{n} P(y_{i}; \theta_{i}) = \prod_{i=1}^{n} \theta_{i}^{\,y_{i}} (1 - \theta_{i})^{\,1 - y_{i}} $$
Next, we compute the log-likelihood \(\log L(\beta)\):
$$ \log L(\beta) = \sum_{i=1}^{n} \left( y_{i} \log \theta_{i} + (1 - y_{i}) \log (1 - \theta_{i}) \right) $$
Negating the log-likelihood, \(-\log L(\beta)\), gives our final loss function, the negative log-likelihood (\(NLL\)):
$$ NLL(\beta) = -\sum_{i=1}^{n} \left( y_{i} \log \theta_{i} + (1 - y_{i}) \log (1 - \theta_{i}) \right) $$
Since \(0 \le \theta_{i} \le 1\), we can model it with the sigmoid function, giving \(\theta_{i} = \hat{y}_{i} = \sigma(\beta^\top x_{i})\). Substituting this in, we obtain the loss function known as binary cross-entropy:
$$ L(\beta) = -\sum_{i=1}^{n} \left( y_{i} \log \sigma(\beta^\top x_{i}) + (1 - y_{i}) \log \left( 1 - \sigma(\beta^\top x_{i}) \right) \right) $$
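A hedged sketch of this binary cross-entropy loss in NumPy; the `eps` clipping is an implementation detail added here to avoid evaluating \(\log(0)\):

```python
import numpy as np

def binary_cross_entropy(X, y, beta, eps=1e-12):
    # L(beta) = -sum_i [ y_i * log(y_hat_i) + (1 - y_i) * log(1 - y_hat_i) ]
    y_hat = 1.0 / (1.0 + np.exp(-(X @ beta)))
    y_hat = np.clip(y_hat, eps, 1.0 - eps)   # guard against log(0)
    return -np.sum(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))
```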
Gradient of the Loss Function
Now we want to find the gradient of our loss function, the negative log-likelihood (\(NLL\)), using the chain rule, i.e. $$ L = - \sum_{i=1}^n\left(y_{i} \log \sigma(z_{i}) + (1 - y_{i}) \log (1 - \sigma(z_{i}))\right) $$ $$ \frac{d L}{d \beta} = \sum_{i=1}^{n} \frac{d L}{d z_{i}} \, \frac{d z_{i}}{d \beta} $$
NOTE
$$ \sigma(z_{i}) = \frac{1}{1 + e^{-z_{i}}}, \quad \text{where } z_{i} = \beta^\top x_{i} $$ To find the derivative of \(\sigma(z_{i})\) with respect to \(z_{i}\), we have:
$$ \frac{d \sigma(z_{i})}{d z_{i}} = \frac{e^{-z_{i}}}{\left(1 + e^{-z_{i}}\right)^{2}} = \sigma(z_{i}) \left( 1 - \sigma(z_{i}) \right) $$
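The identity \(\sigma'(z) = \sigma(z)(1 - \sigma(z))\) can be sanity-checked numerically with a central finite difference; a quick sketch (not part of the derivation):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5.0, 5.0, 11)
h = 1e-5
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2.0 * h)   # finite-difference derivative
analytic = sigmoid(z) * (1.0 - sigmoid(z))                # sigma(z) * (1 - sigma(z))

print(np.max(np.abs(numeric - analytic)))  # close to zero: the two expressions agree
```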
Now continuing with the derivative of our binary cross-entropy loss \(L\) w.r.t. \(z_{i}\), we have
$$ \frac{d L}{d z_{i}} = -\left( \frac{y_{i}}{\sigma(z_{i})} - \frac{1 - y_{i}}{1 - \sigma(z_{i})} \right) \sigma(z_{i}) \left( 1 - \sigma(z_{i}) \right) = \sigma(z_{i}) - y_{i} $$
Next we solve for \(\frac{d z_{i}}{d \beta}\)
\(z_{i} = \beta^\top x_{i}\)
\(\frac{d z_{i}}{d \beta} = x_{i}\)
Therefore we have our final gradient as
$$ \nabla_{\beta} L = \sum_{i=1}^{n} \left( \sigma(z_{i}) - y_{i} \right) x_{i} = \sum_{i=1}^{n} \left( \hat{y}_{i} - y_{i} \right) x_{i} $$
In vectorized form, with the data matrix \(\mathbf{X}\) and target vector \(\mathbf{y}\), the gradient becomes:
$$ \nabla_{\beta} L = \mathbf{X}^\top \left( \sigma(\mathbf{X}\beta) - \mathbf{y} \right) $$
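Putting the vectorized gradient to work, here is a minimal full-batch gradient-descent sketch (the learning rate, iteration count, and function names are illustrative choices, not prescribed by the text):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient(X, y, beta):
    # grad = X^T (sigma(X beta) - y)
    return X.T @ (sigmoid(X @ beta) - y)

def fit(X, y, lr=0.01, n_iters=1000):
    # Plain (full-batch) gradient descent on the binary cross-entropy loss
    beta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        beta -= lr * gradient(X, y, beta)
    return beta
```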
Method 2: When \(y \in \{-1, 1\}\)
In some variants, target labels are represented using \(-1\) and \(+1\). The logistic regression model and loss function can be adjusted accordingly.
Now we are going to derive the hypothesis, loss function, and the gradient.
Hypothesis
Let
$$ \frac{P(Y = 1 \mid x)}{P(Y = -1 \mid x)} = e^{w^\top x} $$
and \(\hat{y} = w^\top x\).
Assuming \(P(Y=-1\mid X)\) and \(P(Y=1\mid X)\) are the only two outcomes, they satisfy
$$ P(Y = 1 \mid x) + P(Y = -1 \mid x) = 1 $$
Now introducing the logarithm, we have
$$ \log \frac{P(Y = 1 \mid x)}{P(Y = -1 \mid x)} = w^\top x = \hat{y} $$
This is equivalent to
$$ P(Y = 1 \mid x) = e^{w^\top x} \left( 1 - P(Y = 1 \mid x) \right) $$
Therefore we have
$$ P(Y = 1 \mid x) = \frac{e^{w^\top x}}{1 + e^{w^\top x}} = \frac{1}{1 + e^{-w^\top x}} = \sigma(w^\top x) $$
And
$$ P(Y = -1 \mid x) = 1 - \sigma(w^\top x) = \frac{1}{1 + e^{w^\top x}} = \sigma(-w^\top x) $$
Therefore, we can conclude our hypothesis as
$$ P(Y = y \mid x) = \sigma(y \, w^\top x), \qquad y \in \{-1, +1\} $$
where \(\sigma(z) = \frac{1}{1 + e^{-z}}\) is the sigmoid function.
Loss Function
From Maximum Likelihood Estimation (MLE), the negative log-likelihood is given as
$$ NLL(w) = -\log \prod_{i=1}^{n} P(y_{i} \mid x_{i}) = -\sum_{i=1}^{n} \log P(y_{i} \mid x_{i}) $$
From the hypothesis derived earlier, we have
$$ P(y_{i} \mid x_{i}) = \sigma(y_{i} \, w^\top x_{i}) = \frac{1}{1 + e^{-y_{i} w^\top x_{i}}} $$
Now solving for \(NLL(w)\):
$$ NLL(w) = -\sum_{i=1}^{n} \log \frac{1}{1 + e^{-y_{i} w^\top x_{i}}} = \sum_{i=1}^{n} \log \left( 1 + e^{-y_{i} w^\top x_{i}} \right) $$
Therefore, our loss function is given as
$$ \mathcal{L}(w) = \sum_{i=1}^{n} \log \left( 1 + e^{-y_{i} w^\top x_{i}} \right) $$
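A small sketch of this loss for \(\pm 1\) labels, using `np.logaddexp` for a numerically stable \(\log(1 + e^{m})\) (the function name is my own):

```python
import numpy as np

def log_loss_pm1(X, y, w):
    # L(w) = sum_i log(1 + exp(-y_i * w^T x_i)), with y_i in {-1, +1}
    margins = y * (X @ w)
    return np.sum(np.logaddexp(0.0, -margins))   # log(1 + e^{-margin}) per example
```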
Gradient
Now we find the gradient of our loss function \(\mathcal{L}(w)\) with respect to \(w\):
$$ \nabla_w \mathcal{L} = \sum_{i=1}^{n} \frac{-y_{i} x_{i} \, e^{-y_{i} w^\top x_{i}}}{1 + e^{-y_{i} w^\top x_{i}}} = -\sum_{i=1}^{n} \frac{e^{-y_{i} w^\top x_{i}}}{1 + e^{-y_{i} w^\top x_{i}}} \, y_{i} x_{i} = -\sum_{i=1}^{n} \left( 1 - \sigma(y_{i} w^\top x_{i}) \right) y_{i} x_{i} $$
where \(\sigma(z) = \frac{1}{1 + e^{-z}}\) is the sigmoid function.
Therefore our gradient \(\nabla_w \mathcal{L}\) is given as: $$ \boxed{ \nabla_w \mathcal{L} = -\sum_{i=1}^{n} \left(1 - \sigma(y_i w^\top x_i)\right) y_i x_i } $$
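And a corresponding sketch of this gradient (again, the names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_pm1(X, y, w):
    # grad = -sum_i (1 - sigma(y_i * w^T x_i)) * y_i * x_i, with y_i in {-1, +1}
    weights = 1.0 - sigmoid(y * (X @ w))   # per-example factor (1 - sigma(y_i w^T x_i))
    return -X.T @ (weights * y)
```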
Conclusion
Here’s a table comparing the two formulations of logistic regression based on the label encoding: \(y \in \{0, 1\}\) vs. \(y \in \{-1, 1\}\).
| Feature | \(y \in \{0, 1\}\) | \(y \in \{-1, 1\}\) |
|---|---|---|
| Label Encoding | 0 for negative class, 1 for positive class | -1 for negative class, +1 for positive class |
| Hypothesis Function | \(\hat{y} = \sigma(\beta^\top x)\) | \(P(y \mid x) = \sigma(y \cdot w^\top x)\) |
| Sigmoid Argument | \(z = \beta^\top x\) | \(z = y \cdot w^\top x\) |
| Loss Function | \(-\sum_{i=1}^n \left[y_i \log \sigma(z_i) + (1-y_i) \log(1-\sigma(z_i))\right]\) | \(\sum_{i=1}^n \log(1 + e^{-y_i w^\top x_i})\) |
| Loss Name | Binary Cross-Entropy | Log-Loss (equivalent in nature, different form) |
| Gradient of Loss Function | \(\nabla_\beta L = X^\top(\sigma(X\beta) - y)\) | \(\nabla_w \mathcal{L} = -\sum_{i=1}^n (1 - \sigma(y_i w^\top x_i)) y_i x_i\) |
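As a quick numerical check that the two columns describe the same model, the sketch below evaluates both gradient formulas on the same random data, mapping labels via \(\tilde{y} = 2y - 1\) (all names and values are illustrative):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n, d = 50, 4
X = rng.normal(size=(n, d))
y01 = rng.integers(0, 2, size=n)      # labels in {0, 1}
ypm = 2 * y01 - 1                     # same labels mapped to {-1, +1}
w = rng.normal(size=d)                # shared parameter vector

grad_01 = X.T @ (sigmoid(X @ w) - y01)                      # X^T (sigma(X w) - y)
grad_pm = -X.T @ ((1.0 - sigmoid(ypm * (X @ w))) * ypm)     # -sum (1 - sigma(y w^T x)) y x

print(np.allclose(grad_01, grad_pm))  # True: the two gradients coincide
```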
Logistic regression models binary outcomes using a sigmoid function over a linear combination of features. We derived the binary cross-entropy loss using maximum likelihood and computed its gradient using the chain rule. The model can be expressed for both \(y \in \{0, 1\}\) and \(y \in \{-1, 1\}\), with slightly different forms of the loss function and gradient. The final vectorized gradient enables efficient optimization using gradient descent.