# Support Vector Machines: Optimization, Dual Problem & Kernel Methods
**Date:** 2025.10.30 and 2025.11.03
**Topic:** SVM Dual Form, Lagrange Multipliers, Kernel Trick, Cover's Theorem, Mercer's Theorem
---
### 1. Introduction to SVM Mathematics
The lecture focuses on the fundamental mathematical concepts behind Support Vector Machines (SVM), specifically the Large Margin Classifier.
* **Goal:** The objective is to understand the flow and connection of formulas rather than memorizing them.
* **Context:** SVMs were the dominant model for a decade before deep learning and remain powerful for specific problem types.
* **Core Concept:** The algorithm seeks to maximize the margin to ensure the most robust classifier.
### 2. General Optimization with Constraints
The lecture reviews and expands on the method of Lagrange multipliers for solving optimization problems with constraints.
* **Problem Setup:** To minimize an objective function $L(x)$ subject to constraints $g_i(x) \ge 0$, a new objective function (the Lagrangian) is defined by combining the original function with the constraints using multipliers $\lambda_i \ge 0$: $L'(x) = L(x) - \sum_i \lambda_i g_i(x)$.
* **KKT Conditions:** The Karush-Kuhn-Tucker (KKT) conditions are used to solve this. Complementary slackness ($\lambda \, g(x) = 0$) leaves two main solution cases for each constraint:
1. **Feasible Region:** The unconstrained minimum already satisfies the constraint. Here, $\lambda = 0$ and the constraint is inactive.
2. **Boundary Case:** The solution lies on the boundary where $g(x) = 0$. Here, $\lambda > 0$ and the constraint is active.
### 3. Multi-Constraint Example
A specific example is provided to demonstrate optimization with multiple constraints.
* **Objective:** Minimize $x_1^2 + x_2^2$ subject to two linear constraints.
* **Lagrangian:** The function is defined as $L'(x) = L(x) - \lambda_1 g_1(x) - \lambda_2 g_2(x)$.
* **Solving Strategy:** With two constraints, there are four possible combinations for $\lambda$ values (both zero, one zero, or both positive).
* The lecture demonstrates testing these cases. For instance, assuming both $\lambda=0$ yields $x_1=0, x_2=0$, which violates the constraints.
* The valid solution is found where the constraints intersect (Boundary Case).
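The same case analysis can be checked numerically. The sketch below is illustrative only: the lecture's exact constraints are not reproduced here, so two assumed linear constraints ($g_1(x) = x_1 + x_2 - 2 \ge 0$ and $g_2(x) = x_1 - 2x_2 \ge 0$) stand in for them.

```python
# Illustrative sketch with assumed constraints (not the lecture's exact ones):
# minimize x1^2 + x2^2  subject to  g1(x) = x1 + x2 - 2 >= 0,  g2(x) = x1 - 2*x2 >= 0.
import numpy as np
from scipy.optimize import minimize

objective = lambda x: x[0] ** 2 + x[1] ** 2              # L(x)
constraints = [
    {"type": "ineq", "fun": lambda x: x[0] + x[1] - 2},  # g1(x) >= 0
    {"type": "ineq", "fun": lambda x: x[0] - 2 * x[1]},  # g2(x) >= 0
]

res = minimize(objective, x0=np.array([2.0, 0.5]), method="SLSQP",
               constraints=constraints)
print(res.x)   # ~[1.333, 0.667]: the unconstrained minimum (0, 0) is infeasible,
               # so the solution sits at the intersection of both constraint
               # boundaries (both constraints active, both lambdas > 0).
```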
### 4. SVM Mathematical Formulation (Primal Problem)
The lecture applies these optimization principles specifically to the SVM Large Margin Classifier.
* **Objective Function:** Minimize $\frac{1}{2}||w||^2$ (equivalent to maximizing the margin).
* **Constraints:** All data points must be correctly classified outside the margin: $y_i(w^T x_i - b) \ge 1$.
* **Lagrangian Formulation:**
$$L(w, b) = \frac{1}{2}||w||^2 - \sum_{i=1}^{N} \alpha_i [y_i(w^T x_i - b) - 1]$$
Here, $\alpha_i$ represents the Lagrange multipliers.
### 5. Deriving the Dual Problem
To solve the primal problem, the partial derivatives of the Lagrangian with respect to the parameters $w$ and $b$ are set to zero.
* **Derivative w.r.t $w$:** Yields the relationship $w = \sum \alpha_i y_i x_i$. This shows $w$ is a linear combination of the data points (checked numerically in the sketch below).
* **Derivative w.r.t $b$:** Yields the constraint $\sum \alpha_i y_i = 0$.
* **Substitution:** By plugging these results back into the original Lagrangian equation, the "Primal" problem is converted into the "Dual" problem.
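As a quick sanity check of this relationship (not part of the lecture), a linear SVM trained with scikit-learn can confirm that the primal weight vector equals $\sum_i \alpha_i y_i x_i$: `SVC.dual_coef_` stores $\alpha_i y_i$ for the support vectors, so multiplying it by the support vectors should reproduce `SVC.coef_`.

```python
# Sanity-check sketch (not from the lecture): recover w = sum_i alpha_i y_i x_i
# from the dual coefficients of a linear SVM on a toy separable dataset.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, size=(20, 2)),    # class -1 cluster
               rng.normal(+2, 1, size=(20, 2))])   # class +1 cluster
y = np.array([-1] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1e6).fit(X, y)        # very large C ~ hard margin

w_from_dual = clf.dual_coef_ @ clf.support_vectors_   # dual_coef_ holds alpha_i * y_i
print(np.allclose(w_from_dual, clf.coef_))            # True: same weight vector
```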
### 6. The Dual Form and Kernel Intuition
The final derived Dual objective function depends entirely on the dot product of data points.
* **Dual Equation:**
$$\text{Maximize } \sum \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j x_i^T x_j$$
Subject to $\sum \alpha_i y_i = 0$ and $\alpha_i \ge 0$.
* **Primal vs. Dual:**
* **Primal:** Optimizes the weight parameters $w$ (and $b$), so its size scales with the number of features $D$.
* **Dual:** Optimizes the multipliers $\alpha_i$, so its size scales with the number of data points $N$ (see the numerical sketch at the end of this section).
* **Significance:** The term $x_i^T x_j$ represents the inner product between data points. This structure allows for the "Kernel Trick" (discussed below), which handles non-linearly separable data by mapping it to higher dimensions without explicit calculation.
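A minimal sketch of solving this dual problem directly is shown below. A toy dataset and a general-purpose SLSQP solver are assumed; dedicated solvers such as SMO are used in practice. Note that the optimization variable is the $N$-vector $\alpha$, illustrating the "depends on $N$" point above.

```python
# Minimal sketch: solve the SVM dual on a tiny toy dataset with a generic solver.
# maximize sum(alpha) - 1/2 * alpha^T Q alpha,  Q[i,j] = y_i y_j x_i^T x_j,
# subject to alpha >= 0 and sum_i alpha_i y_i = 0.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

Q = np.outer(y, y) * (X @ X.T)                  # y_i y_j x_i^T x_j (Gram matrix inside)

neg_dual = lambda a: 0.5 * a @ Q @ a - a.sum()  # negate: maximize -> minimize
res = minimize(neg_dual, x0=np.zeros(len(y)), method="SLSQP",
               bounds=[(0, None)] * len(y),
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])

alpha = res.x
w = (alpha * y) @ X                             # recover w = sum_i alpha_i y_i x_i
print(alpha.round(4), w.round(4))               # nonzero alphas mark the support vectors
```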
---
### 7. The Dual Form and Inner Products
In the previous section, the **Dual Form** of the SVM optimization problem was derived.
* **Objective Function:** The dual objective function to maximize involves the parameters $\alpha$ and the data points:
$$\sum \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j (x_i^T x_j)$$
* **Key Observation:** The optimization depends solely on the **inner product** ($x_i^T x_j$) between data points. This inner product represents the **similarity** between two vectors, which is the foundational concept for the Kernel Method.
---
### 8. Feature Mapping and Cover's Theorem
When data is not linearly separable in the original space (low-dimensional), we can transform it into a higher-dimensional space where a linear separator exists.
* **Mapping Function ($\Phi$):** We define a transformation rule, or mapping function $\Phi(x)$, that projects input vector $x$ from the original space to a high-dimensional feature space.
* **Example 1 (1D to 2D):** Mapping $x \to (x, x^2)$ places the points on a parabola in 2D; a straight line in that 2D space can then separate classes that were mixed on the original 1D line.
* **Example 2 (2D to 3D):** Mapping $x = (x_1, x_2)$ to $\Phi(x) = (x_1^2, x_2^2, \sqrt{2}x_1 x_2)$.
* **Cover's Theorem:** A classification problem cast non-linearly into a higher-dimensional feature space is more likely to be linearly separable there than in the original low-dimensional space.
* **Strategy:** Apply a mapping function $\Phi$ to the original data, then find a linear classifier in that high-dimensional space.
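A small illustrative sketch (assumed data, not from the lecture): points labelled by whether they fall inside the unit circle are not linearly separable in 2D, but after the mapping from Example 2 the circle becomes a plane, so a linear rule separates them exactly.

```python
# Illustrative sketch with assumed data: the 2D -> 3D map from Example 2.
# Inside/outside the unit circle is not linearly separable in 2D, but the
# circle x1^2 + x2^2 = 1 becomes the plane z1 + z2 = 1 after mapping.
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < 1.0, 1, -1)      # +1 inside the circle

def phi(X):
    """Phi(x) = (x1^2, x2^2, sqrt(2) * x1 * x2) applied row-wise."""
    return np.column_stack([X[:, 0] ** 2, X[:, 1] ** 2,
                            np.sqrt(2) * X[:, 0] * X[:, 1]])

Z = phi(X)
pred = np.where(Z[:, 0] + Z[:, 1] < 1.0, 1, -1)             # linear rule in the 3D space
print((pred == y).mean())                                   # 1.0: perfectly separated
```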
---
### 9. The Kernel Trick
Directly computing the mapping $\Phi(x)$ can be computationally expensive or impossible (e.g., infinite dimensions). The **Kernel Trick** allows us to compute the similarity in the high-dimensional space using only the original low-dimensional vectors.
* **Definition:** A Kernel function $K(x, y)$ calculates the inner product of the mapped vectors:
$$K(x, y) = \Phi(x)^T \Phi(y)$$
* **Efficiency:** The result is a scalar value calculated without knowing the explicit form of $\Phi$.
* **Derivation Example (Polynomial Kernel):**
For 2D vectors $x$ and $y$, consider the kernel $K(x, y) = (x^T y)^2$.
$$(x^T y)^2 = (x_1 y_1 + x_2 y_2)^2 = x_1^2 y_1^2 + x_2^2 y_2^2 + 2x_1 y_1 x_2 y_2$$
This is mathematically equivalent to the dot product of two mapped vectors where:
$$\Phi(x) = (x_1^2, x_2^2, \sqrt{2}x_1 x_2)$$
Thus, calculating $(x^T y)^2$ in the original space is equivalent to calculating similarity in the 3D space defined by $\Phi$.
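A short numerical check of this identity, using illustrative vectors:

```python
# Numerical check: (x^T y)^2 equals Phi(x)^T Phi(y) for the mapping above.
import numpy as np

phi = lambda v: np.array([v[0] ** 2, v[1] ** 2, np.sqrt(2) * v[0] * v[1]])

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print((x @ y) ** 2, phi(x) @ phi(y))   # both print 1.0 -- the same scalar
```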
---
### 10. Mercer's Theorem & Positive Definite Functions
How do we know if a function $K(x, y)$ is a valid kernel? **Mercer's Theorem** provides the condition.
* **The Theorem:** If a symmetric function $K(x, y)$ is **Positive Definite (P.D.)**, then there *always* exists a mapping function $\Phi$ such that $K(x, y) = \Phi(x)^T \Phi(y)$.
* **Implication:** We can choose any P.D. function as our kernel and be guaranteed that it corresponds to some high-dimensional space, without needing to derive $\Phi$ explicitly.
#### **Positive Definiteness (Matrix Definition)**
To check if a kernel is P.D., we analyze the Kernel Matrix (Gram Matrix) constructed from data points.
* A symmetric matrix $M$ is P.D. if $z^T M z > 0$ for every non-zero vector $z$.
* **Eigenvalue Condition:** A symmetric matrix is P.D. if and only if **all of its eigenvalues are positive**.
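A minimal sketch of this check, with assumed random data and the RBF kernel (which appears in the next section): build the Gram matrix $K_{ij} = K(x_i, x_j)$ and inspect its eigenvalues.

```python
# Sketch: Gram matrix of the RBF kernel K(x, y) = exp(-||x - y||^2 / (2*sigma^2))
# on random points, followed by the eigenvalue test for positive definiteness.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 2))
sigma = 1.0

sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)   # ||x_i - x_j||^2
K = np.exp(-sq_dists / (2 * sigma ** 2))                         # Gram matrix

eigvals = np.linalg.eigvalsh(K)        # symmetric matrix -> real eigenvalues
print(eigvals.min() > 0)               # True: all eigenvalues positive
```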
---
### 11. Infinite Dimensionality (RBF Kernel)
The lecture briefly touches upon the exponential (Gaussian/RBF) kernel.
* The exponential function can be expanded using a Taylor Series into an infinite sum.
* This implies that using an exponential-based kernel is equivalent to mapping the data into an **infinite-dimensional space**.
* Even though the dimension is infinite, the calculation $K(x, y)$ remains a simple scalar operation in the original space.
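A rough numerical illustration of this point, using illustrative vectors and the exponential inner-product factor contained in the Gaussian kernel:

```python
# Taylor-series view: exp(x^T y) = sum_k (x^T y)^k / k!, i.e. a weighted sum of
# polynomial kernels of every degree -- an infinite-dimensional feature map.
import numpy as np
from math import factorial

x, y = np.array([0.5, -0.3]), np.array([0.2, 0.7])
s = x @ y

truncated = sum(s ** k / factorial(k) for k in range(10))   # first 10 terms only
print(truncated, np.exp(s))                                 # already nearly identical
```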
---
### 12. Final SVM Formulation with Kernels
By applying the Kernel Trick, the SVM formulation is generalized to non-linear problems.
* **Dual Objective:** Replace $x_i^T x_j$ with $K(x_i, x_j)$:
$$\text{Maximize: } \sum \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j K(x_i, x_j)$$
* **Decision Rule:** For a new test point $x'$, the classification is determined by:
$$\sum \alpha_i y_i K(x_i, x') - b \ge 0$$
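As a concluding sketch (not from the lecture), the kernelized decision rule can be reproduced with scikit-learn: for a trained RBF-kernel `SVC`, `decision_function(x')` equals $\sum_i \alpha_i y_i K(x_i, x') + b$, with $\alpha_i y_i$ stored in `dual_coef_` and the bias folded in with a `+ intercept_` sign convention rather than the `- b` written above.

```python
# Sketch: reproduce the kernel decision rule sum_i alpha_i y_i K(x_i, x') + b
# using scikit-learn's stored dual coefficients and support vectors.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < 1.0, 1, -1)      # non-linear (circular) labels

gamma = 0.5
clf = SVC(kernel="rbf", gamma=gamma).fit(X, y)

X_test = rng.normal(size=(5, 2))
K = rbf_kernel(X_test, clf.support_vectors_, gamma=gamma)   # K(x', x_i) for each pair
manual = K @ clf.dual_coef_.ravel() + clf.intercept_        # sum_i alpha_i y_i K + b
print(np.allclose(manual, clf.decision_function(X_test)))   # True
```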
**Next Lecture:** The course will move on to Generative Methods (probabilistic methods).