# Support Vector Machines: Optimization, Dual Problem & Kernel Methods
**Date:** 2025.10.30 and 2025.11.03
**Topic:** SVM Dual Form, Lagrange Multipliers, Kernel Trick, Cover's Theorem, Mercer's Theorem
---
### 1. Introduction to SVM Mathematics
The lecture focuses on the fundamental mathematical concepts behind Support Vector Machines (SVM), specifically the Large Margin Classifier.
* **Goal:** The objective is to understand the flow and connection of formulas rather than memorizing them.
* **Context:** SVMs were the dominant model for a decade before deep learning and remain powerful for specific problem types.
* **Core Concept:** The algorithm seeks to maximize the margin to ensure the most robust classifier.
### 2. General Optimization with Constraints
The lecture reviews and expands on the method of Lagrange multipliers for solving optimization problems with constraints.
* **Problem Setup:** To minimize an objective function $L(x)$ subject to constraints $g_i(x) \ge 0$, a new objective function (the Lagrangian) is defined by combining the original function with the constraints using multipliers $\lambda_i \ge 0$: $L'(x) = L(x) - \sum_i \lambda_i g_i(x)$.
* **KKT Conditions:** The Karush-Kuhn-Tucker (KKT) conditions are used to solve this. Complementary slackness ($\lambda \, g(x) = 0$) leaves two main solution cases for each constraint:
1. **Feasible Region:** The unconstrained minimum already satisfies the constraint. Here, $\lambda = 0$ and the constraint is inactive.
2. **Boundary Case:** The solution lies on the boundary where $g(x) = 0$. Here, $\lambda > 0$ and the constraint is active.
### 3. Multi-Constraint Example
A specific example is provided to demonstrate optimization with multiple constraints.
* **Objective:** Minimize $x_1^2 + x_2^2$ subject to two linear constraints.
* **Lagrangian:** The function is defined as $L'(x) = L(x) - \lambda_1 g_1(x) - \lambda_2 g_2(x)$.
* **Solving Strategy:** With two constraints, there are four possible combinations for $\lambda$ values (both zero, one zero, or both positive).
* The lecture demonstrates testing these cases. For instance, assuming both $\lambda=0$ yields $x_1=0, x_2=0$, which violates the constraints.
* The valid solution is found where the constraints intersect (Boundary Case).
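The same case analysis can be checked numerically. The sketch below is illustrative only: the lecture's exact constraints are not reproduced here, so two assumed linear constraints ($g_1(x) = x_1 + x_2 - 2 \ge 0$ and $g_2(x) = x_1 - 2x_2 \ge 0$) stand in for them.

```python
# Illustrative sketch with assumed constraints (not the lecture's exact ones):
# minimize x1^2 + x2^2  subject to  g1(x) = x1 + x2 - 2 >= 0,  g2(x) = x1 - 2*x2 >= 0.
import numpy as np
from scipy.optimize import minimize

objective = lambda x: x[0] ** 2 + x[1] ** 2              # L(x)
constraints = [
    {"type": "ineq", "fun": lambda x: x[0] + x[1] - 2},  # g1(x) >= 0
    {"type": "ineq", "fun": lambda x: x[0] - 2 * x[1]},  # g2(x) >= 0
]

res = minimize(objective, x0=np.array([2.0, 0.5]), method="SLSQP",
               constraints=constraints)
print(res.x)   # ~[1.333, 0.667]: the unconstrained minimum (0, 0) is infeasible,
               # so the solution sits at the intersection of both constraint
               # boundaries (both constraints active, both lambdas > 0).
```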
### 4. SVM Mathematical Formulation (Primal Problem)
The lecture applies these optimization principles specifically to the SVM Large Margin Classifier.
* **Objective Function:** Minimize $\frac{1}{2}||w||^2$ (equivalent to maximizing the margin).
* **Constraints:** All data points must be correctly classified outside the margin: $y_i(w^T x_i - b) \ge 1$.
* **Lagrangian Formulation:**
$$L(w, b) = \frac{1}{2}||w||^2 - \sum_{i=1}^{N} \alpha_i [y_i(w^T x_i - b) - 1]$$
Here, $\alpha_i$ represents the Lagrange multipliers.
### 5. Deriving the Dual Problem
To solve the primal problem, the partial derivatives of the Lagrangian with respect to the parameters $w$ and $b$ are set to zero.
* **Derivative w.r.t $w$:** Yields the relationship $w = \sum \alpha_i y_i x_i$. This shows $w$ is a linear combination of the data points (checked numerically in the sketch below).
* **Derivative w.r.t $b$:** Yields the constraint $\sum \alpha_i y_i = 0$.
* **Substitution:** By plugging these results back into the original Lagrangian equation, the "Primal" problem is converted into the "Dual" problem.
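As a quick sanity check of this relationship (not part of the lecture), a linear SVM trained with scikit-learn can confirm that the primal weight vector equals $\sum_i \alpha_i y_i x_i$: `SVC.dual_coef_` stores $\alpha_i y_i$ for the support vectors, so multiplying it by the support vectors should reproduce `SVC.coef_`.

```python
# Sanity-check sketch (not from the lecture): recover w = sum_i alpha_i y_i x_i
# from the dual coefficients of a linear SVM on a toy separable dataset.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, size=(20, 2)),    # class -1 cluster
               rng.normal(+2, 1, size=(20, 2))])   # class +1 cluster
y = np.array([-1] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1e6).fit(X, y)        # very large C ~ hard margin

w_from_dual = clf.dual_coef_ @ clf.support_vectors_   # dual_coef_ holds alpha_i * y_i
print(np.allclose(w_from_dual, clf.coef_))            # True: same weight vector
```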
### 6. The Dual Form and Kernel Intuition
The final derived Dual objective function depends entirely on the dot product of data points.
* **Dual Equation:**
$$\text{Maximize } \sum \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j x_i^T x_j$$
Subject to $\sum \alpha_i y_i = 0$ and $\alpha_i \ge 0$.
* **Primal vs. Dual:**
* **Primal:** Optimizes the weight parameters $w$ (and $b$), so its size scales with the number of features $D$.
* **Dual:** Optimizes the multipliers $\alpha_i$, so its size scales with the number of data points $N$ (see the numerical sketch at the end of this section).
* **Significance:** The term $x_i^T x_j$ represents the inner product between data points. This structure allows for the "Kernel Trick" (discussed below), which handles non-linearly separable data by mapping it to higher dimensions without explicit calculation.
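A minimal sketch of solving this dual problem directly is shown below. A toy dataset and a general-purpose SLSQP solver are assumed; dedicated solvers such as SMO are used in practice. Note that the optimization variable is the $N$-vector $\alpha$, illustrating the "depends on $N$" point above.

```python
# Minimal sketch: solve the SVM dual on a tiny toy dataset with a generic solver.
# maximize sum(alpha) - 1/2 * alpha^T Q alpha,  Q[i,j] = y_i y_j x_i^T x_j,
# subject to alpha >= 0 and sum_i alpha_i y_i = 0.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

Q = np.outer(y, y) * (X @ X.T)                  # y_i y_j x_i^T x_j (Gram matrix inside)

neg_dual = lambda a: 0.5 * a @ Q @ a - a.sum()  # negate: maximize -> minimize
res = minimize(neg_dual, x0=np.zeros(len(y)), method="SLSQP",
               bounds=[(0, None)] * len(y),
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])

alpha = res.x
w = (alpha * y) @ X                             # recover w = sum_i alpha_i y_i x_i
print(alpha.round(4), w.round(4))               # nonzero alphas mark the support vectors
```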
---
### 7. The Dual Form and Inner Products
In the previous section, the **Dual Form** of the SVM optimization problem was derived.
* **Objective Function:** The dual objective function to maximize involves the parameters $\alpha$ and the data points:
$$\sum \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j (x_i^T x_j)$$
* **Key Observation:** The optimization depends solely on the **inner product** ($x_i^T x_j$) between data points. This inner product represents the **similarity** between two vectors, which is the foundational concept for the Kernel Method.
---
### 8. Feature Mapping and Cover's Theorem
When data is not linearly separable in the original space (low-dimensional), we can transform it into a higher-dimensional space where a linear separator exists.
* **Mapping Function ($\Phi$):** We define a transformation rule, or mapping function $\Phi(x)$, that projects input vector $x$ from the original space to a high-dimensional feature space.
* **Example 1 (1D to 2D):** Mapping $x \to (x, x^2)$ places the points on a parabola in 2D; a straight line in that 2D space can then separate classes that were mixed on the original 1D line.
* **Example 2 (2D to 3D):** Mapping $x = (x_1, x_2)$ to $\Phi(x) = (x_1^2, x_2^2, \sqrt{2}x_1 x_2)$.
* **Cover's Theorem:** A classification problem cast non-linearly into a higher-dimensional feature space is more likely to be linearly separable there than in the original low-dimensional space.
* **Strategy:** Apply a mapping function $\Phi$ to the original data, then find a linear classifier in that high-dimensional space.
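A small illustrative sketch (assumed data, not from the lecture): points labelled by whether they fall inside the unit circle are not linearly separable in 2D, but after the mapping from Example 2 the circle becomes a plane, so a linear rule separates them exactly.

```python
# Illustrative sketch with assumed data: the 2D -> 3D map from Example 2.
# Inside/outside the unit circle is not linearly separable in 2D, but the
# circle x1^2 + x2^2 = 1 becomes the plane z1 + z2 = 1 after mapping.
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < 1.0, 1, -1)      # +1 inside the circle

def phi(X):
    """Phi(x) = (x1^2, x2^2, sqrt(2) * x1 * x2) applied row-wise."""
    return np.column_stack([X[:, 0] ** 2, X[:, 1] ** 2,
                            np.sqrt(2) * X[:, 0] * X[:, 1]])

Z = phi(X)
pred = np.where(Z[:, 0] + Z[:, 1] < 1.0, 1, -1)             # linear rule in the 3D space
print((pred == y).mean())                                   # 1.0: perfectly separated
```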
---
### 9. The Kernel Trick
Directly computing the mapping $\Phi(x)$ can be computationally expensive or impossible (e.g., infinite dimensions). The **Kernel Trick** allows us to compute the similarity in the high-dimensional space using only the original low-dimensional vectors.
* **Definition:** A Kernel function $K(x, y)$ calculates the inner product of the mapped vectors:
$$K(x, y) = \Phi(x)^T \Phi(y)$$
* **Efficiency:** The result is a scalar value calculated without knowing the explicit form of $\Phi$.
* **Derivation Example (Polynomial Kernel):**
For 2D vectors $x$ and $y$, consider the kernel $K(x, y) = (x^T y)^2$.
$$(x^T y)^2 = (x_1 y_1 + x_2 y_2)^2 = x_1^2 y_1^2 + x_2^2 y_2^2 + 2x_1 y_1 x_2 y_2$$
This is mathematically equivalent to the dot product of two mapped vectors where:
$$\Phi(x) = (x_1^2, x_2^2, \sqrt{2}x_1 x_2)$$
Thus, calculating $(x^T y)^2$ in the original space is equivalent to calculating similarity in the 3D space defined by $\Phi$.
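A short numerical check of this identity, using illustrative vectors:

```python
# Numerical check: (x^T y)^2 equals Phi(x)^T Phi(y) for the mapping above.
import numpy as np

phi = lambda v: np.array([v[0] ** 2, v[1] ** 2, np.sqrt(2) * v[0] * v[1]])

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print((x @ y) ** 2, phi(x) @ phi(y))   # both print 1.0 -- the same scalar
```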
---
### 10. Mercer's Theorem & Positive Definite Functions
How do we know if a function $K(x, y)$ is a valid kernel? **Mercer's Theorem** provides the condition.
* **The Theorem:** If a symmetric function $K(x, y)$ is **Positive Definite (P.D.)**, then there *always* exists a mapping function $\Phi$ such that $K(x, y) = \Phi(x)^T \Phi(y)$.
* **Implication:** We can choose any P.D. function as our kernel and be guaranteed that it corresponds to some high-dimensional space, without needing to derive $\Phi$ explicitly.
#### **Positive Definiteness (Matrix Definition)**
To check if a kernel is P.D., we analyze the Kernel Matrix (Gram Matrix) constructed from data points.
* A symmetric matrix $M$ is P.D. if $z^T M z > 0$ for every non-zero vector $z$.
* **Eigenvalue Condition:** A symmetric matrix is P.D. if and only if **all of its eigenvalues are positive**.
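A minimal sketch of this check, with assumed random data and the RBF kernel (which appears in the next section): build the Gram matrix $K_{ij} = K(x_i, x_j)$ and inspect its eigenvalues.

```python
# Sketch: Gram matrix of the RBF kernel K(x, y) = exp(-||x - y||^2 / (2*sigma^2))
# on random points, followed by the eigenvalue test for positive definiteness.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 2))
sigma = 1.0

sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)   # ||x_i - x_j||^2
K = np.exp(-sq_dists / (2 * sigma ** 2))                         # Gram matrix

eigvals = np.linalg.eigvalsh(K)        # symmetric matrix -> real eigenvalues
print(eigvals.min() > 0)               # True: all eigenvalues positive
```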
---
### 11. Infinite Dimensionality (RBF Kernel)
The lecture briefly touches upon the exponential (Gaussian/RBF) kernel.
* The exponential function can be expanded using a Taylor Series into an infinite sum.
* This implies that using an exponential-based kernel is equivalent to mapping the data into an **infinite-dimensional space**.
* Even though the dimension is infinite, the calculation $K(x, y)$ remains a simple scalar operation in the original space.
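A rough numerical illustration of this point, using illustrative vectors and the exponential inner-product factor contained in the Gaussian kernel:

```python
# Taylor-series view: exp(x^T y) = sum_k (x^T y)^k / k!, i.e. a weighted sum of
# polynomial kernels of every degree -- an infinite-dimensional feature map.
import numpy as np
from math import factorial

x, y = np.array([0.5, -0.3]), np.array([0.2, 0.7])
s = x @ y

truncated = sum(s ** k / factorial(k) for k in range(10))   # first 10 terms only
print(truncated, np.exp(s))                                 # already nearly identical
```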
---
### 12. Final SVM Formulation with Kernels
By applying the Kernel Trick, the SVM formulation is generalized to non-linear problems.
* **Dual Objective:** Replace $x_i^T x_j$ with $K(x_i, x_j)$:
$$\text{Maximize: } \sum \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j K(x_i, x_j)$$
* **Decision Rule:** For a new test point $x'$, the classification is determined by:
$$\sum \alpha_i y_i K(x_i, x') - b \ge 0$$
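As a concluding sketch (not from the lecture), the kernelized decision rule can be reproduced with scikit-learn: for a trained RBF-kernel `SVC`, `decision_function(x')` equals $\sum_i \alpha_i y_i K(x_i, x') + b$, with $\alpha_i y_i$ stored in `dual_coef_` and the bias folded in with a `+ intercept_` sign convention rather than the `- b` written above.

```python
# Sketch: reproduce the kernel decision rule sum_i alpha_i y_i K(x_i, x') + b
# using scikit-learn's stored dual coefficients and support vectors.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < 1.0, 1, -1)      # non-linear (circular) labels

gamma = 0.5
clf = SVC(kernel="rbf", gamma=gamma).fit(X, y)

X_test = rng.normal(size=(5, 2))
K = rbf_kernel(X_test, clf.support_vectors_, gamma=gamma)   # K(x', x_i) for each pair
manual = K @ clf.dual_coef_.ravel() + clf.intercept_        # sum_i alpha_i y_i K + b
print(np.allclose(manual, clf.decision_function(X_test)))   # True
```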
**Next Lecture:** The course will move on to Generative Methods (probabilistic methods).