# Lecture Summary: Directed Graphical Models and Naive Bayes
**Date:** 2025.11.20

**Topic:** Parameter Reduction, Directed Graphical Models, Chain Rule, and Naive Bayes Classifier.

---
### **1. Motivation: The Need for Parameter Reduction**
The lecture begins by reviewing Generative Methods that model the data with a Gaussian distribution.

* **The Problem:** In high-dimensional settings (e.g., analyzing images or complex biological data), estimating the full Joint Probability Distribution is computationally expensive and data-intensive.

* For a $D$-dimensional Multivariate Gaussian, we must estimate the mean vector $\mu$ ($D$ parameters) and the Covariance Matrix $\Sigma$ (a symmetric $D \times D$ matrix).

* The total number of parameters therefore grows as $O(D^2)$, specifically $D + \frac{D(D+1)}{2}$ (see the worked example after this list).

* For large $D$, this requires a massive amount of training data to avoid overfitting.

* **The Solution:** We use **Prior Knowledge** (domain knowledge) about the relationships between variables to reduce the number of parameters.

* By assuming certain variables are independent, we can decompose the complex joint distribution into smaller, simpler conditional distributions.
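To make the growth concrete, here is the count for a larger input dimension (an illustrative number, not one used in the lecture):

$$D = 100: \qquad 100 + \frac{100 \times 101}{2} = 100 + 5050 = 5150 \text{ parameters}$$

Every one of these parameters must be estimated reliably from data, which is why the full covariance quickly becomes impractical.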
---
### **2. Directed Graphical Models (Bayesian Networks)**
A Directed Graphical Model represents random variables as nodes in a graph, where edges denote conditional dependencies.

#### **Decomposition via Chain Rule**

* The joint probability $P(x)$ can always be decomposed with the chain rule as $P(x_1, ..., x_D) = \prod_{i=1}^{D} P(x_i | x_1, ..., x_{i-1})$; the graph structure then lets us shrink each conditioning set down to the node's parents:

$$P(x_1, ..., x_D) = \prod_{i=1}^{D} P(x_i | \text{parents}(x_i))$$

* **Example Structure:**

If we have a graph where $x_1$ has no parents and, for example, $x_2$ and $x_3$ each depend only on $x_1$, the joint distribution splits into (see the numeric sketch below):

$$P(x) = P(x_1)P(x_2|x_1)P(x_3|x_1)...$$
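The following is a minimal sketch of this factorization for the toy graph $x_1 \to x_2$, $x_1 \to x_3$ with binary variables; the probability tables are made up purely for illustration and are not from the lecture.

```python
# Toy DAG x1 -> x2, x1 -> x3 with binary variables.
# All numbers below are made up purely for illustration.

p_x1 = 0.3                              # P(x1 = 1)
p_x2_given_x1 = {0: 0.1, 1: 0.8}        # P(x2 = 1 | x1)
p_x3_given_x1 = {0: 0.4, 1: 0.6}        # P(x3 = 1 | x1)

def bernoulli(p_one: float, value: int) -> float:
    """P(X = value) for a binary variable with P(X = 1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one

def joint(x1: int, x2: int, x3: int) -> float:
    """P(x1, x2, x3) = P(x1) * P(x2 | x1) * P(x3 | x1), as given by the graph."""
    return (bernoulli(p_x1, x1)
            * bernoulli(p_x2_given_x1[x1], x2)
            * bernoulli(p_x3_given_x1[x1], x3))

print(joint(1, 1, 0))   # 0.3 * 0.8 * 0.4 = 0.096
# Sanity check: the factorized joint still sums to 1 over all 2^3 configurations.
print(sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)))  # ~1.0
```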
#### **Parameter Counting Example (Gaussian Case)**
The lecture compares the number of parameters required for a "Full" Gaussian model vs. a "Reduced" Graphical Model.

* **Full Gaussian:** Assumes all variables are correlated.

* For a 10-dimensional vector ($D=10$), parameters = $10 + \frac{10 \times 11}{2} = 65$.

* **Reduced Model:** Uses a graph structure where variables are conditionally independent.

* Instead of one giant covariance matrix, we estimate parameters for several smaller conditional distributions (often univariate Gaussians).

* **Calculation:** For a univariate conditional Gaussian $P(x_i | x_j)$, we need parameters for the linear relationship (a coefficient per parent plus an intercept) and a variance.

* In the specific example from the lecture, the parameter count dropped from 65 to 57. While the reduction in this small example is modest, for high-dimensional data with sparse connections the reduction is drastic.
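Here is a minimal counting sketch under one common assumption: each conditional $P(x_i \mid \text{parents}(x_i))$ is linear-Gaussian, needing one coefficient per parent, one intercept, and one variance. The sparse graph below is hypothetical; the lecture's exact structure (which gave 57 parameters) is not reproduced here.

```python
# Parameter counting for D = 10, comparing the full Gaussian with a factorized
# model. Assumption: each conditional is linear-Gaussian, i.e. it needs
# len(parents) coefficients + 1 intercept + 1 variance.

D = 10

# Full multivariate Gaussian: mean vector + symmetric covariance matrix.
full_gaussian = D + D * (D + 1) // 2                 # 10 + 55 = 65

# Sanity check: a fully connected DAG (node i has all previous nodes as parents)
# assumes no independence and gives the same count.
fully_connected_dag = sum(i + 2 for i in range(D))   # 45 + 20 = 65

# Hypothetical sparse DAG (made up for illustration): node -> its parents.
parents = {
    0: [], 1: [0], 2: [0], 3: [1], 4: [1, 2],
    5: [2], 6: [3], 7: [4], 8: [5], 9: [6, 7],
}
sparse_dag = sum(len(p) + 2 for p in parents.values())  # 11 + 20 = 31

print(full_gaussian, fully_connected_dag, sparse_dag)   # 65 65 31
```

The sparser the graph, the fewer conditional parameters remain to estimate.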
---
### **3. The Naive Bayes Classifier**
The **Naive Bayes** classifier is the most extreme (and most popular) example of a Directed Graphical Model used for parameter reduction.

* **Assumption:** Given the class label $y$, all input features $x_1, ..., x_D$ are **mutually independent**.

* **Structure:** The class $y$ is the parent of all feature nodes $x_i$. There are no connections between the features themselves.

* **Formula:**

$$P(x|y) = P(x_1|y) P(x_2|y) \cdot ... \cdot P(x_D|y) = \prod_{d=1}^{D} P(x_d|y)$$

* **Advantage:** We only need to estimate the distribution of each feature individually, rather than their complex joint interactions (a small numeric sketch follows this list).
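A minimal numeric sketch of this factorization used for classification. All numbers are made up; combining the factorized likelihood with a class prior via Bayes' rule is standard practice, even though the prior step is not spelled out in this section.

```python
# Classifying one example with D = 3 binary features under the Naive Bayes
# factorization: P(y | x) is proportional to P(y) * prod_d P(x_d | y).
# All probabilities below are made up for illustration.

prior_y1 = 0.4                      # class prior P(y = 1)
p_xd_given_y = {
    0: [0.2, 0.1, 0.7],             # P(x_d = 1 | y = 0) for each feature d
    1: [0.8, 0.5, 0.3],             # P(x_d = 1 | y = 1) for each feature d
}

def unnormalized_posterior(x, y):
    """P(y) * prod_d P(x_d | y) for a binary feature vector x."""
    prior = prior_y1 if y == 1 else 1.0 - prior_y1
    likelihood = 1.0
    for d, value in enumerate(x):
        p_one = p_xd_given_y[y][d]
        likelihood *= p_one if value == 1 else 1.0 - p_one
    return prior * likelihood

x = [1, 0, 1]
scores = {y: unnormalized_posterior(x, y) for y in (0, 1)}
total = sum(scores.values())
print({y: s / total for y, s in scores.items()})  # normalized P(y | x)
```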
---
### **4. Application: Spam Classifier**
The lecture applies the Naive Bayes framework to a discrete problem: classifying emails as **Spam ($y=1$)** or **Not Spam ($y=0$)**.

#### **Feature Engineering**

* **Input:** Emails with varying text lengths.

* **Transformation:** A "Bag of Words" approach is used (sketched in code after these steps).

1. Create a dictionary of $N$ words (e.g., $N=10,000$).

2. Represent each email as a fixed-length binary vector $x \in \{0, 1\}^{10,000}$.

3. $x_i = 1$ if the $i$-th word appears in the email, $0$ otherwise.
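A minimal sketch of this transformation with a tiny hypothetical vocabulary; a real system would use the full $N = 10{,}000$-word dictionary and proper tokenization (lowercasing, punctuation stripping, etc.).

```python
# Minimal bag-of-words sketch with a tiny, made-up vocabulary.

vocabulary = ["free", "money", "meeting", "project", "winner"]

def to_binary_vector(email_text: str) -> list[int]:
    """Map an email of arbitrary length to a fixed-length binary vector:
    x[i] = 1 if the i-th vocabulary word occurs in the email, else 0."""
    words = set(email_text.lower().split())
    return [1 if word in words else 0 for word in vocabulary]

print(to_binary_vector("Claim your FREE money now, winner!"))
# -> [1, 1, 0, 0, 0]  ("winner!" does not match "winner" with this naive split)
```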
#### **The "Curse of Dimensionality" (Without Naive Bayes)**
* Since the features are discrete (binary), we cannot use Gaussian distributions; we must work with probability tables instead.

* If we tried to model the full joint distribution $P(x_1, ..., x_{10000} | y)$, we would need a probability table for every possible combination of words.

* **Parameter Count:** $2^{10,000}$ entries (one per possible word combination, for each class). Estimating or even storing such a table is impossible.
#### **Applying Naive Bayes**
* By assuming word independence given the class, we decompose the problem:

$$P(x|y) \approx \prod_{i=1}^{10,000} P(x_i|y)$$

* **Parameter Estimation:**

* We only need to estimate $P(x_i=1 | y=1)$ and $P(x_i=1 | y=0)$ for each word.

* This amounts to simply counting how often each word occurs in Spam vs. Non-Spam training emails (see the sketch after this list).

* **Reduced Parameter Count:**

* Instead of $2^{10,000}$, we need roughly $2 \times 10,000$ parameters (one probability per word per class).

* This transforms an impossible problem into a highly efficient and simple one.
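A minimal end-to-end sketch of this estimation-by-counting on a tiny hypothetical training set (vocabulary, emails, and labels are all made up):

```python
# Estimating the Naive Bayes word probabilities by counting, on a tiny made-up
# training set of binary bag-of-words vectors (label 1 = spam, 0 = not spam).
# Vocabulary order: ["free", "money", "meeting", "project"].

X = [
    [1, 1, 0, 0],  # spam
    [1, 0, 0, 0],  # spam
    [0, 1, 0, 1],  # spam
    [0, 0, 1, 1],  # not spam
    [0, 0, 1, 0],  # not spam
]
y = [1, 1, 1, 0, 0]

def estimate(X, y, label):
    """Return [P(x_i = 1 | y = label) for each word i], by relative frequency."""
    rows = [x for x, lab in zip(X, y) if lab == label]
    return [sum(x[i] for x in rows) / len(rows) for i in range(len(X[0]))]

p_word_given_spam = estimate(X, y, 1)      # [0.667, 0.667, 0.0, 0.333]
p_word_given_not_spam = estimate(X, y, 0)  # [0.0, 0.0, 1.0, 0.5]
prior_spam = sum(y) / len(y)               # P(y = 1) = 0.6

def score(x, p_words, prior):
    """Unnormalized P(y) * prod_i P(x_i | y) for a new binary vector x."""
    s = prior
    for value, p_one in zip(x, p_words):
        s *= p_one if value == 1 else 1.0 - p_one
    return s

new_email = [1, 0, 0, 1]
spam_score = score(new_email, p_word_given_spam, prior_spam)
ham_score = score(new_email, p_word_given_not_spam, 1.0 - prior_spam)
print("spam" if spam_score > ham_score else "not spam")
```

Note that raw relative frequencies can produce zero probabilities (as for "meeting" in the spam class here); in practice a small smoothing term is usually added, although smoothing is not covered in this summary.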
---
### **5. Summary**
* **Generative Methods** aim to model the underlying distribution $P(x, y)$.

* **Graphical Models** allow us to inject prior knowledge (independence assumptions) to make this feasible.

* **Naive Bayes** assumes full conditional independence, reducing parameter estimation from exponential to linear complexity, making it ideal for high-dimensional discrete data like text classification.