Support Vector Machines: Optimization, Dual Problem & Kernel Methods
Date: 2025.10.30 and 2025.11.03
Topic: SVM Dual Form, Lagrange Multipliers, Kernel Trick, Cover's Theorem, Mercer's Theorem
1. Introduction to SVM Mathematics
The lecture focuses on the fundamental mathematical concepts behind Support Vector Machines (SVM), specifically the Large Margin Classifier.
- Goal: The objective is to understand the flow and connection of formulas rather than memorizing them.
- Context: SVMs were the dominant model for a decade before deep learning and remain powerful for specific problem types.
- Core Concept: The algorithm seeks to maximize the margin to ensure the most robust classifier.
2. General Optimization with Constraints
The lecture reviews and expands on the method of Lagrange multipliers for solving optimization problems with constraints.
- Problem Setup: To minimize an objective function
L(x)subject to constraintsg(x) \ge 0, a new objective function (Lagrangian) is defined by combining the original function with the constraints using multipliers (\lambda). - KKT Conditions: The Karush-Kuhn-Tucker (KKT) conditions are introduced to solve this. There are two main solution cases:
- Feasible Region: The unconstrained minimum satisfies the constraint. Here,
\lambda = 0. - Boundary Case: The solution lies on the boundary where
g(x) = 0. Here,\lambda > 0.
- Feasible Region: The unconstrained minimum satisfies the constraint. Here,
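For reference, the conditions can be stated compactly for a single constraint, using the lecture's $L(x) - \lambda g(x)$ form of the Lagrangian:

$$\nabla_x L(x) - \lambda \nabla_x g(x) = 0, \qquad g(x) \ge 0, \qquad \lambda \ge 0, \qquad \lambda\, g(x) = 0$$

The last condition (complementary slackness) encodes exactly the case split above: either $\lambda = 0$ (the constraint is inactive) or $g(x) = 0$ (the solution sits on the boundary).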
3. Multi-Constraint Example
A specific example is provided to demonstrate optimization with multiple constraints.
- Objective: Minimize $x_1^2 + x_2^2$ subject to two linear constraints.
- Lagrangian: The function is defined as $L'(x) = L(x) - \lambda_1 g_1(x) - \lambda_2 g_2(x)$.
- Solving Strategy: With two constraints, there are four possible combinations of $\lambda$ values (both zero, one zero, or both positive).
  - The lecture demonstrates testing these cases. For instance, assuming both $\lambda = 0$ yields $x_1 = 0, x_2 = 0$, which violates the constraints.
  - The valid solution is found where the constraints intersect (Boundary Case); a numerical check of this reasoning follows below.
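A minimal numerical sketch of the four-case reasoning, using two hypothetical linear constraints $g_1(x) = x_1 - 1 \ge 0$ and $g_2(x) = x_2 - 1 \ge 0$ (the lecture's exact constraints are not reproduced here):

```python
# Minimize x1^2 + x2^2 subject to two hypothetical linear constraints.
from scipy.optimize import minimize

objective = lambda x: x[0] ** 2 + x[1] ** 2
constraints = [
    {"type": "ineq", "fun": lambda x: x[0] - 1},   # g1(x) = x1 - 1 >= 0
    {"type": "ineq", "fun": lambda x: x[1] - 1},   # g2(x) = x2 - 1 >= 0
]

res = minimize(objective, x0=[3.0, 3.0], constraints=constraints)
print(res.x)   # ~ [1, 1]: the unconstrained minimum (0, 0) is infeasible,
               # so the solution sits on the intersection of both boundaries.
```

Checking the KKT stationarity condition at $(1, 1)$ gives $\lambda_1 = \lambda_2 = 2 > 0$, i.e. both constraints are active, which is the boundary case described above.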
4. SVM Mathematical Formulation (Primal Problem)
The lecture applies these optimization principles specifically to the SVM Large Margin Classifier.
- Objective Function: Minimize $\frac{1}{2}\|w\|^2$ (equivalent to maximizing the margin).
- Constraints: All data points must be correctly classified outside the margin: $y_i(w^T x_i - b) \ge 1$.
- Lagrangian Formulation:
  $$L(w, b) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{N} \alpha_i \left[ y_i(w^T x_i - b) - 1 \right]$$
  Here, $\alpha_i$ represents the Lagrange multipliers.
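A quick empirical look at the primal quantities, assuming a linearly separable toy set; scikit-learn's `SVC` with a linear kernel exposes $w$ and the bias (note that scikit-learn's decision function uses the $w^T x + b$ convention, whereas the lecture writes $w^T x - b$):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6)   # very large C approximates the hard margin
clf.fit(X, y)

w = clf.coef_[0]
print("w =", w, " bias =", clf.intercept_[0])
print("margin width = 2 / ||w|| =", 2 / np.linalg.norm(w))
```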
5. Deriving the Dual Problem
To solve this, the partial derivatives of the Lagrangian with respect to the parameters $w$ and $b$ are set to zero.
- Derivative w.r.t. $w$: yields the relationship $w = \sum_i \alpha_i y_i x_i$. This shows $w$ is a linear combination of the data points.
- Derivative w.r.t. $b$: yields the constraint $\sum_i \alpha_i y_i = 0$.
- Substitution: By plugging these results back into the original Lagrangian, the "Primal" problem is converted into the "Dual" problem (the substitution is spelled out below).
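Carrying out the substitution explicitly, using $w = \sum_i \alpha_i y_i x_i$ and $\sum_i \alpha_i y_i = 0$:

$$L = \frac{1}{2}\sum_i\sum_j \alpha_i \alpha_j y_i y_j\, x_i^T x_j \;-\; \sum_i\sum_j \alpha_i \alpha_j y_i y_j\, x_i^T x_j \;+\; b\sum_i \alpha_i y_i \;+\; \sum_i \alpha_i
= \sum_i \alpha_i - \frac{1}{2}\sum_i\sum_j \alpha_i \alpha_j y_i y_j\, x_i^T x_j$$

The $b$ term vanishes because of the constraint, leaving an objective that depends only on the $\alpha_i$ and the pairwise inner products of the data.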
6. The Dual Form and Kernel Intuition
The final derived Dual objective function depends entirely on the dot product of data points.
- Dual Equation:
  $$\text{Maximize } \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j\, x_i^T x_j$$
  subject to $\sum_i \alpha_i y_i = 0$ and $\alpha_i \ge 0$.
- Primal vs. Dual:
  - Primal: depends on the number of features/parameters ($D$).
  - Dual: depends on the number of data points ($N$).
- Significance: The term $x_i^T x_j$ represents the inner product between data points. This structure allows for the "Kernel Trick" (discussed below), which handles non-linearly separable data by mapping it to higher dimensions without explicit calculation. A small numerical solve of this dual follows below.
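A minimal numerical solve of the dual on a tiny, hand-made linearly separable toy set (the dataset and the use of scipy's SLSQP solver are illustrative choices, not from the lecture). The bias is recovered with the lecture's $y_s(w^T x_s - b) = 1$ convention:

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
N = len(y)

# G[i, j] = y_i y_j x_i^T x_j -- the only data-dependent quantity in the dual.
G = (y[:, None] * X) @ (y[:, None] * X).T

def neg_dual(alpha):
    # Negative dual objective, since scipy minimizes.
    return 0.5 * alpha @ G @ alpha - alpha.sum()

res = minimize(
    neg_dual,
    x0=np.zeros(N),
    bounds=[(0.0, None)] * N,                            # alpha_i >= 0
    constraints={"type": "eq", "fun": lambda a: a @ y},  # sum_i alpha_i y_i = 0
)
alpha = res.x

w = (alpha * y) @ X            # w = sum_i alpha_i y_i x_i
s = int(np.argmax(alpha))      # index of a support vector (alpha_s > 0)
b = w @ X[s] - y[s]            # from y_s (w^T x_s - b) = 1
print("alpha =", alpha.round(3), " w =", w.round(3), " b =", round(float(b), 3))
```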
7. The Dual Form and Inner Products
In the previous section, the Dual Form of the SVM optimization problem was derived.
- Objective Function: The dual objective to maximize involves the parameters $\alpha$ and the data points:
  $$\sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \,(x_i^T x_j)$$
- Key Observation: The optimization depends solely on the inner product ($x_i^T x_j$) between data points. This inner product represents the similarity between two vectors, which is the foundational concept for the Kernel Method.
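In code, this means the dual only ever touches the data through the $N \times N$ matrix of pairwise inner products (the Gram matrix). A tiny sketch, reusing the toy data from the earlier example:

```python
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
gram = X @ X.T          # gram[i, j] = x_i^T x_j, a similarity between points
print(gram)
```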
8. Feature Mapping and Cover's Theorem
When data is not linearly separable in the original space (low-dimensional), we can transform it into a higher-dimensional space where a linear separator exists.
- Mapping Function ($\Phi$): We define a transformation rule, or mapping function $\Phi(x)$, that projects an input vector $x$ from the original space into a high-dimensional feature space.
  - Example 1 (1D to 2D): Mapping $x \to (x, x^2)$. After the mapping the data lies on a parabola in 2D, and a straight line can separate classes that were mixed on the 1D line.
  - Example 2 (2D to 3D): Mapping $x = (x_1, x_2)$ to $\Phi(x) = (x_1^2, x_2^2, \sqrt{2}\,x_1 x_2)$.
- Cover's Theorem: This theorem states that as the dimensionality of the feature space increases, the "power" of the linear method increases, making it more likely to find a linear separator.
  - Strategy: Apply a mapping function $\Phi$ to the original data, then find a linear classifier in that high-dimensional space (see the sketch after this list).
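A small sketch of Example 1, with hypothetical 1D data that no single threshold on $x$ can separate because the positive class sits on both sides of the negative class:

```python
import numpy as np

x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
y = np.array([ 1,    1,   -1,  -1,  -1,   1,   1 ])

phi = np.column_stack([x, x**2])      # Phi(x) = (x, x^2): 1D -> 2D

# In the mapped space the horizontal line x^2 = 2 separates the classes,
# i.e. a linear decision boundary now exists.
pred = np.where(phi[:, 1] > 2.0, 1, -1)
print((pred == y).all())              # True
```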
9. The Kernel Trick
Directly computing the mapping \Phi(x) can be computationally expensive or impossible (e.g., infinite dimensions). The Kernel Trick allows us to compute the similarity in the high-dimensional space using only the original low-dimensional vectors.
- Definition: A kernel function $K(x, y)$ calculates the inner product of the mapped vectors: $K(x, y) = \Phi(x)^T \Phi(y)$.
- Efficiency: The result is a scalar value calculated without knowing the explicit form of $\Phi$.
- Derivation Example (Polynomial Kernel): For 2D vectors $x$ and $y$, consider the kernel $K(x, y) = (x^T y)^2$:
  $$(x^T y)^2 = (x_1 y_1 + x_2 y_2)^2 = x_1^2 y_1^2 + x_2^2 y_2^2 + 2 x_1 y_1 x_2 y_2$$
  This is mathematically equivalent to the dot product of two mapped vectors where $\Phi(x) = (x_1^2, x_2^2, \sqrt{2}\,x_1 x_2)$. Thus, calculating $(x^T y)^2$ in the original space is equivalent to calculating similarity in the 3D space defined by $\Phi$ (a numerical check follows below).
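A quick numerical check of the identity on random vectors:

```python
import numpy as np

def phi(v):
    # Explicit 2D -> 3D map from the derivation above.
    return np.array([v[0]**2, v[1]**2, np.sqrt(2) * v[0] * v[1]])

rng = np.random.default_rng(0)
x, y_vec = rng.normal(size=2), rng.normal(size=2)

lhs = (x @ y_vec) ** 2          # kernel computed in the original 2D space
rhs = phi(x) @ phi(y_vec)       # inner product in the mapped 3D space
print(np.isclose(lhs, rhs))     # True
```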
10. Mercer's Theorem & Positive Definite Functions
How do we know if a function K(x, y) is a valid kernel? Mercer's Theorem provides the condition.
- The Theorem: If a function $K(x, y)$ is Positive Definite (P.D.), then there always exists a mapping function $\Phi$ such that $K(x, y) = \Phi(x)^T \Phi(y)$.
- Implication: We can choose any P.D. function as our kernel and be guaranteed that it corresponds to some high-dimensional space, without needing to derive $\Phi$ explicitly.
Positive Definiteness (Matrix Definition)
To check if a kernel is P.D., we analyze the Kernel Matrix (Gram Matrix) constructed from data points.
- For any non-zero vector $z$, a matrix $M$ is P.D. if $z^T M z > 0$ for all such $z$.
- Eigenvalue Condition: A (symmetric) matrix is P.D. if and only if all of its eigenvalues are positive.
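An empirical sketch of this check on a random sample of points: the Gram matrix of an RBF kernel (the $\gamma$ value here is an arbitrary illustrative choice) should have only positive eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
gamma = 0.5

sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-gamma * sq_dists)        # Gram matrix: K[i, j] = K(x_i, x_j)

eigvals = np.linalg.eigvalsh(K)      # symmetric matrix -> real eigenvalues
print(eigvals.min() > 0)             # expected: True (up to numerical precision)
```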
11. Infinite Dimensionality (RBF Kernel)
The lecture briefly touches upon the exponential (Gaussian/RBF) kernel.
- The exponential function can be expanded using a Taylor Series into an infinite sum.
- This implies that using an exponential-based kernel is equivalent to mapping the data into an infinite-dimensional space.
- Even though the dimension is infinite, computing $K(x, y)$ remains a simple scalar operation in the original space.
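As an illustration with the simplest exponential kernel, $K(x, y) = \exp(x^T y)$ (the Gaussian/RBF form replaces $x^T y$ with a scaled squared distance), the Taylor series gives:

$$\exp(x^T y) = \sum_{k=0}^{\infty} \frac{(x^T y)^k}{k!}$$

This is an infinite weighted sum of polynomial kernels of every degree $k$, so the corresponding $\Phi$ would need one coordinate per monomial of every degree: an infinite-dimensional feature space, evaluated with a single scalar computation.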
12. Final SVM Formulation with Kernels
By applying the Kernel Trick, the SVM formulation is generalized to non-linear problems.
- Dual Objective: Replace $x_i^T x_j$ with $K(x_i, x_j)$:
  $$\text{Maximize: } \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j\, K(x_i, x_j)$$
- Decision Rule: For a new test point $x'$, the classification is determined by:
  $$\sum_i \alpha_i y_i\, K(x_i, x') - b \ge 0$$
  An end-to-end example with a library implementation is sketched below.
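A sketch of the kernelized formulation in practice using scikit-learn's `SVC`, which solves this dual problem internally; the concentric-circles dataset and the RBF kernel choice are illustrative, not from the lecture:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Non-linearly separable data: one class forms a ring around the other.
X, y = make_circles(n_samples=200, noise=0.1, factor=0.4, random_state=0)

clf = SVC(kernel="rbf", C=1.0, gamma="scale")   # K(x, x') = exp(-gamma ||x - x'||^2)
clf.fit(X, y)

# Prediction evaluates the learned kernel expansion over the support vectors.
print("train accuracy:", clf.score(X, y))
print("number of support vectors:", clf.support_vectors_.shape[0])
```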
Next Lecture: The course will move on to Generative Methods (probability methods).