Large Margin Classifiers and Optimization
Date: 2025.10.27 Topic: Large Margin Classifiers, Optimization, Margin Definition
1. Introduction to Robust Classification
The lecture begins by shifting focus from generative methods to discriminative methods, specifically within a linearly separable setting.
- Problem Setting: The goal is to classify data that can be perfectly separated by a linear boundary (hyperplane).
- Robustness: While infinitely many linear classifiers may separate the data, the objective is to find the "best" one. The best classifier is defined as the most robust one, meaning it generalizes well to new test data and handles potential outliers effectively.
- Intuition: A robust classifier places the decision boundary in the middle of the gap between classes, maximizing the distance to the nearest data points.
2. Defining the Margin
The concept of the margin is introduced to mathematically define robustness.
- Definition: The margin is the distance between the decision hyperplane and the closest data points.
- Hyperplane Equation: The decision boundary is defined as $w^T x - b = 0$.
- Support Lines: To define the margin, we establish two parallel lines passing through the closest data points:
  - $w^T x - b = 1$ (for class +1).
  - $w^T x - b = -1$ (for class -1).
- The region between these lines contains no data points.
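As a quick illustration of these definitions, the sketch below evaluates the signed decision value $w^T x - b$ for a few points; the weight vector, offset, and points are made-up values, not taken from the lecture.

```python
import numpy as np

# Hypothetical hyperplane parameters (assumed values for illustration only)
w = np.array([2.0, 1.0])   # normal vector w
b = 1.0                    # offset b

points = np.array([
    [1.5, 0.5],   # falls on the +1 side, beyond the +1 support line
    [-0.5, 0.0],  # falls on the -1 side, beyond the -1 support line
    [0.5, 1.0],   # lies exactly on the +1 support line (w^T x - b = 1)
])

for x in points:
    score = w @ x - b                # signed decision value
    label = 1 if score >= 0 else -1  # sign gives the predicted class
    print(f"x = {x}, w^T x - b = {score:+.2f}, predicted class {label:+d}")
```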
3. Calculating the Margin Width
The lecture derives the mathematical expression for the margin width using vector projection.
- Vector Projection: The margin is calculated by projecting the vector connecting a point on the boundary ($x_0$) to a support vector ($x$) onto the normal vector $w$.
- Derivation:
  - The distance is the projection of the vector $(x - x_0)$ onto the unit normal vector $\frac{w}{||w||}$.
  - Using the constraints $w^T x - b = 1$ and $w^T x_0 - b = 0$, the derived margin distance is $\frac{1}{||w||}$.
- Conclusion: Maximizing the margin is equivalent to minimizing the norm of the weight vector $||w||$.
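Filling in the algebra behind this conclusion, the projection argument gives (using the two constraints from the bullets above):

```latex
\text{margin}
= \frac{w^T (x - x_0)}{||w||}
= \frac{(w^T x - b) - (w^T x_0 - b)}{||w||}
= \frac{1 - 0}{||w||}
= \frac{1}{||w||}
```

so a larger margin corresponds directly to a smaller $||w||$.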
4. The Optimization Problem
The task of finding the best classifier is formulated as a constrained optimization problem.
- Objective Function: $\min ||w||^2$ (Note: minimizing $||w||$ is computationally equivalent to minimizing $||w||^2$).
- Constraints: All data points must be correctly classified and lie outside the margin. This is formalized as:
  - $w^T x_i - b \ge 1$ for $y_i = 1$.
  - $w^T x_i - b \le -1$ for $y_i = -1$.
- Combined Constraint: $y_i (w^T x_i - b) \ge 1$ for all $i$.
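The lecture stays at the level of the formulation, but as a rough numerical sketch the same problem can be handed to a generic constrained solver; the toy dataset, the choice of scipy, and the SLSQP method are all assumptions made here for illustration.

```python
import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable toy dataset (assumed, not from the lecture)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Decision variables packed as params = [w_1, w_2, b]
def objective(params):
    w = params[:2]
    return w @ w                       # minimize ||w||^2

def margin_constraints(params):
    w, b = params[:2], params[2]
    return y * (X @ w - b) - 1.0       # y_i (w^T x_i - b) - 1 >= 0 for all i

result = minimize(
    objective,
    x0=np.array([1.0, 1.0, 0.0]),      # a simple feasible starting point
    constraints=[{"type": "ineq", "fun": margin_constraints}],
    method="SLSQP",
)

w_opt, b_opt = result.x[:2], result.x[2]
print("w =", w_opt, ", b =", b_opt, ", margin =", 1.0 / np.linalg.norm(w_opt))
```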
5. Optimization with Constraints (Lagrange Multipliers)
The lecture explains how to solve this optimization problem with Lagrange multipliers, starting from a general example.
- Problem Setup: Minimize an objective function $L(x)$ subject to a constraint $g(x) \ge 0$.
- Lagrangian: A new objective function is defined by combining the original loss and the constraint with a multiplier $\lambda$: $L'(x) = L(x) - \lambda g(x)$ (Note: the transcript discusses combining the components; the sign convention depends on the specific maximization/minimization formulation).
- Solution Cases: The solution involves setting the derivative $\frac{dL'}{dx} = 0$ and considering two cases:
  - Feasible Region ($\lambda = 0$): The unconstrained minimum of $L(x)$ naturally satisfies the constraint ($g(x) > 0$). In this case, the constraint is inactive.
  - Boundary Case ($\lambda > 0$): The unconstrained minimum violates the constraint. Therefore, the optimal solution lies on the boundary where $g(x) = 0$.
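As a supplementary note (not spelled out in the transcript, but standard for this setup), the two cases can be written as a single set of first-order conditions, where the last equation, $\lambda\, g(x) = 0$, forces either $\lambda = 0$ (inactive constraint) or $g(x) = 0$ (solution on the boundary):

```latex
\frac{\partial L'}{\partial x} = \frac{\partial L}{\partial x} - \lambda \frac{\partial g}{\partial x} = 0,
\qquad \lambda \ge 0,
\qquad g(x) \ge 0,
\qquad \lambda \, g(x) = 0
```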
6. Example: Constrained Minimization
A specific mathematical example is worked through to demonstrate the method.
- Objective: Minimize $x_1^2 + x_2^2$ (the squared distance from the origin).
- Constraint: $x_2 - x_1^2 - 1 \ge 0$ (the point must lie above a parabola).
- Solving:
  - The Lagrangian is set up: $L' = x_1^2 + x_2^2 - \lambda(x_2 - x_1^2 - 1)$.
  - Case 1 ($\lambda = 0$): Leads to $x_1 = 0, x_2 = 0$, which violates the constraint ($0 - 0 - 1 = -1 \not\ge 0$). This solution is discarded.
  - Case 2 (Boundary, $\lambda \ne 0$): The solution must lie on $x_2 - x_1^2 - 1 = 0$. Solving the system of equations (written out below) yields the valid minimum at $x_1 = 0, x_2 = 1$.
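The system of equations referenced in Case 2, written out explicitly:

```latex
\begin{aligned}
\frac{\partial L'}{\partial x_1} &= 2x_1 + 2\lambda x_1 = 2x_1(1 + \lambda) = 0
  &&\Rightarrow\; x_1 = 0 \quad (\text{since } 1 + \lambda > 0)\\
\frac{\partial L'}{\partial x_2} &= 2x_2 - \lambda = 0
  &&\Rightarrow\; x_2 = \lambda / 2\\
x_2 - x_1^2 - 1 &= 0
  &&\Rightarrow\; x_2 = 1,\ \lambda = 2 > 0
\end{aligned}
```

so the constrained minimum is at $(x_1, x_2) = (0, 1)$ with an active constraint, consistent with the boundary case.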
7. Next Steps: Support Vector Machines
The lecture concludes by linking this optimization framework back to the classifier.
- Support Vectors: The data points that lie exactly on the margin boundary ($g(x) = 0$, i.e., $y_i (w^T x_i - b) = 1$) are called "support vectors".
- Future Topic: This foundation leads into the Support Vector Machine (SVM) algorithm, which will be discussed in the next session to handle non-linearly separable data.