Perceptron Algorithm
Table of Contents
- The Algorithm
- Intuition Behind the Update Rule
- Perceptron Convergence Theorem
- Advantages & Disadvantages
The Algorithm
The Perceptron is an algorithm for learning a binary classifier. The goal is to find a hyperplane that linearly separates the data. It predicts labels \(\hat{y} \in \{-1, +1\}\) given an input \(x \in \mathbb{R}^n\).
Algorithm 1: Perceptron Algorithm
Input: Sequence of training examples \((x_1, y_1), (x_2, y_2), \ldots\), where \(x_i \in \mathbb{R}^n\), \(y_i \in \{-1, +1\}\)
Output: Final weight vector \(w \in \mathbb{R}^n\)
1. Initialize: \(w_1 = 0 \in \mathbb{R}^n\)
2. For \(t = 1, 2, \ldots\)
3. Sample an example \((x_i, y_i)\)
4. Predict: \(\hat{y} = \operatorname{sign}(w_t^\top x_i)\)
5. If \(\hat{y} \neq y_i\) then
6. \(w_{t+1} = w_t + y_i x_i\)
7. Else
8. \(w_{t+1} = w_t\)
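The pseudocode translates almost line for line into Python. Below is a minimal sketch, not from the original notes: it assumes the training set fits in memory as NumPy arrays `X` and `y`, and the function name `perceptron` and the `epochs` parameter are illustrative choices.

```python
import numpy as np

def perceptron(X, y, epochs=10):
    """Minimal Perceptron sketch: returns the final weight vector w.

    X : (m, n) array, one example per row
    y : (m,) array of labels in {-1, +1}
    """
    m, n = X.shape
    w = np.zeros(n)                      # step 1: w_1 = 0
    for _ in range(epochs):              # step 2: repeated passes over the data
        for i in range(m):               # step 3: take example (x_i, y_i)
            y_hat = np.sign(w @ X[i])    # step 4: predict sign(w^T x_i)
            if y_hat != y[i]:            # step 5: on a mistake...
                w = w + y[i] * X[i]      # step 6: ...apply the update rule
    return w
```

Note that with \(w_1 = 0\) the first prediction is \(\operatorname{sign}(0) = 0 \neq y_i\), so the very first example always triggers an update.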
Intuition Behind the Update Rule
Why does the update rule \(w_{t+1} = w_t + y_i x_i\) work?
We can analyze this by taking the dot product of the updated weight vector with the feature vector \(x_i\):
$$ w_{t+1}^\top x_i = (w_t + y_i x_i)^\top x_i = w_t^\top x_i + y_i \|x_i\|^2 $$
Let’s consider two cases:
Case 1: \(y_i = +1\) (positive example)
If the prediction was wrong, then \(w_t^\top x_i < 0\), and we have:
$$ w_{t+1}^\top x_i = w_t^\top x_i + \|x_i\|^2 > w_t^\top x_i $$
This moves the dot product closer to being positive, which corrects the classification.
Case 2: \(y_i = -1\) (negative example)
If the prediction was wrong, then \(w_t^\top x_i > 0\), and we have:
$$ w_{t+1}^\top x_i = w_t^\top x_i - \|x_i\|^2 < w_t^\top x_i $$
This moves the dot product closer to being negative, again correcting the classification.
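To see Case 1 concretely, here is a small numeric check with hypothetical values chosen for illustration: a misclassified positive example whose update raises the dot product by exactly \(\|x_i\|^2\).

```python
import numpy as np

w = np.array([1.0, -2.0])
x = np.array([1.0, 1.0])
y = +1                        # Case 1: a positive example

print(w @ x)                  # -1.0 -> sign is -1, so the example is misclassified
w_new = w + y * x             # the Perceptron update
print(w_new @ x)              #  1.0 -> the dot product rose by ||x||^2 = 2
```

Case 2 is symmetric: for a misclassified negative example, the update lowers the dot product by \(\|x_i\|^2\).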
Perceptron Convergence Theorem
Theorem (Perceptron Mistake Bound)
Assume:
- The data is linearly separable: there exists a unit vector \(w^* \in \mathbb{R}^n\) and a margin \(\gamma > 0\) such that
$$ \forall i,\quad y_i (w^{*\top} x_i) \geq \gamma $$
- Each example is bounded: \(\|x_i\| \leq R\)
Then, the Perceptron algorithm makes at most \(\frac{R^2}{\gamma^2}\) mistakes.
Proof:
When mistake \(k\) occurs at example \(t\), we update:
$$ w_{k+1} = w_k + y_t x_t $$
Multiply both sides by \(w^*\) and use the margin assumption:
$$ w_{k+1}^\top w^* = w_k^\top w^* + y_t (x_t^\top w^*) \geq w_k^\top w^* + \gamma $$
By induction (starting from \(w_1 = 0\)):
$$ w_{k+1}^\top w^* \geq k\gamma \tag{1} $$
Now compute the squared norm of \(w_{k+1}\). Since an update only happens on a mistake, \(y_t w_k^\top x_t \leq 0\), so
$$ \|w_{k+1}\|^2 = \|w_k\|^2 + 2 y_t w_k^\top x_t + \|x_t\|^2 \leq \|w_k\|^2 + R^2 $$
and, again by induction:
$$ \|w_{k+1}\|^2 \leq k R^2 \tag{2} $$
Using Cauchy-Schwarz and equations (1) and (2):
$$ k\gamma \leq w_{k+1}^\top w^* \leq \|w_{k+1}\| \, \|w^*\| \leq \sqrt{k}\, R $$
Solving for \(k\) gives \(k \leq \frac{R^2}{\gamma^2}\). Hence the Perceptron algorithm makes at most \(\frac{R^2}{\gamma^2}\) mistakes.
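As an empirical sanity check of the bound (a sketch under assumed data, not part of the original notes), the snippet below builds a separable dataset with margin at least `gamma` relative to a made-up unit separator `w_star`, runs the Perceptron until a full pass makes no mistakes, and compares the total mistake count with \(R^2/\gamma^2\).

```python
import numpy as np

rng = np.random.default_rng(0)
w_star = np.array([1.0, 1.0]) / np.sqrt(2)     # assumed unit-norm separator

# Sample points and keep only those with margin at least gamma w.r.t. w_star.
gamma = 0.1
X = rng.uniform(-1, 1, size=(2000, 2))
margins = X @ w_star
keep = np.abs(margins) >= gamma
X, y = X[keep], np.sign(margins[keep])         # labels satisfy y_i (w*^T x_i) >= gamma

R = np.max(np.linalg.norm(X, axis=1))          # bound on ||x_i||

w = np.zeros(2)
mistakes = 0
converged = False
while not converged:                           # loop until a mistake-free pass
    converged = True
    for xi, yi in zip(X, y):
        if np.sign(w @ xi) != yi:              # mistake: update and keep going
            w += yi * xi
            mistakes += 1
            converged = False

print(mistakes, "<=", (R / gamma) ** 2)        # mistake count vs. R^2 / gamma^2
```

The theorem guarantees the `while` loop terminates, since the total number of updates across all passes cannot exceed \(R^2/\gamma^2\).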
Advantages & Disadvantages
Advantages
- Cheap and easy to implement.
- Online algorithm: processes one example at a time, so it scales to large datasets.
- Flexible: any feature representation can be plugged in without changing the algorithm.
Disadvantages
- No notion of a best separator: any hyperplane that separates the training data is an acceptable output, even one with a tiny margin.
- Convergence depends on the margin \(\gamma\): if the classes are not well separated, convergence can be slow.