# Study Guide: Learning in Generative Methods & Bayes Optimal Classifier

**Date:** 2025.11.13
**Topic:** Maximum Likelihood Estimation (MLE), Missing Data Handling, and the Bayes Optimal Classifier.

---

### **1. Overview: Learning in Generative Methods**

The fundamental goal of generative methods is to **estimate the underlying distribution** of the data. Unlike discriminative methods (e.g., Logistic Regression, SVM), which focus on finding a separating boundary, generative methods introduce a probabilistic model and learn its parameters.

* **Discriminative Model:** Learns specific parameters (like $w, b$ in linear models) to separate classes.
* **Generative Model:** Learns parameters (like $\mu, \Sigma$ in Gaussian models) that best describe how the data is distributed.

#### **Why Gaussian?**

The Gaussian distribution is the standard model for generative methods because of its mathematical convenience: **both its conditional and marginal distributions are also Gaussian**. This property simplifies probabilistic inference significantly.

*[Figure: 3D plot of a multivariate Gaussian distribution]*

---

### **2. The Learning Process: Parameter Estimation**

"Learning" in this context means finding the best parameters ($\mu, \Sigma$) for the Gaussian model given the training data.

#### **Step 1: Define the Objective Function**

We need a metric to evaluate how well our model fits the data. The core idea is **likelihood**:

* **Goal:** We want to assign **high probability** to the observed (empirical) data points.
* **Likelihood Function:** For independent data points, the likelihood is the product of their individual probabilities.

$$P(Z | \mu, \Sigma) = \prod_{i=1}^{N} P(z_i | \mu, \Sigma)$$

#### **Step 2: Log-Likelihood (MLE)**

Directly maximizing the product is awkward. We apply the **logarithm** to convert the product into a sum, creating the **log-likelihood** function. Because the logarithm is monotonic, this does not change the location of the maximum.

* **Objective:** Maximize $\sum_{i=1}^{N} \ln P(z_i | \mu, \Sigma)$.

#### **Step 3: Optimization (Derivation)**

We take the partial derivatives of the log-likelihood with respect to the parameters and set them to zero to find the maximum.

* **Optimal Mean ($\hat{\mu}$):** The derivation yields the **empirical mean**, i.e., the average of the data points.

$$\hat{\mu} = \frac{1}{N} \sum_{i=1}^{N} z_i$$

* **Optimal Covariance ($\hat{\Sigma}$):** The derivation yields the **empirical covariance**.

$$\hat{\Sigma} = \frac{1}{N} \sum_{i=1}^{N} (z_i - \hat{\mu})(z_i - \hat{\mu})^T$$

**Conclusion:** "Learning" a Gaussian generative model amounts to computing the sample mean and sample covariance of the training data. This is a closed-form solution, so no iterative optimization is required.

---
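As a quick illustration of this closed-form "learning" step, here is a minimal sketch in Python/NumPy; the synthetic dataset and variable names below are illustrative assumptions, not from the lecture.

```python
import numpy as np

# Synthetic training data: N samples of a D-dimensional random vector z
# (parameters chosen only for illustration).
rng = np.random.default_rng(0)
Z = rng.multivariate_normal(mean=[1.0, -2.0],
                            cov=[[2.0, 0.5], [0.5, 1.0]],
                            size=500)
N = len(Z)

# MLE for a Gaussian = empirical mean and empirical covariance.
mu_hat = Z.mean(axis=0)                   # (1/N) * sum_i z_i
centered = Z - mu_hat
Sigma_hat = (centered.T @ centered) / N   # (1/N) * sum_i (z_i - mu)(z_i - mu)^T

print("mu_hat:", mu_hat)
print("Sigma_hat:\n", Sigma_hat)
```

For reference, `np.cov(Z.T, bias=True)` produces the same $1/N$ estimate; the default `bias=False` divides by $N-1$ (the unbiased estimator), which is not the MLE.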
### **3. Inference: Making Predictions**

Once the joint distribution $P(z)$ (where $z$ contains both the input features $x$ and the class label $y$) is learned, we can perform inference.

#### **Classification**

To classify a new data point $x_{new}$:

1. We aim to calculate the conditional probability $P(y | x_{new})$.
2. Using the properties of the multivariate Gaussian, we treat the label $y$ as just another dimension of the random vector.
3. We compute the conditional probability for each class and compare them (e.g., $P(y=1 | x)$ vs. $P(y=0 | x)$).

#### **Handling Missing Data**

Generative models offer a theoretically sound way to handle missing variables.

* **Scenario:** We have inputs $x = [x_1, x_2]$, but $x_2$ is missing during inference.
* **Method:** **Marginalization**.
  1. Start with the joint PDF.
  2. Integrate (marginalize) out the missing variable $x_2$:
     $$P(y | x_1) = \frac{\int P(x_1, x_2, y) \, dx_2}{P(x_1)}$$
  3. Because the model is Gaussian, marginalization is trivial: simply select the sub-vector of $\mu$ and the sub-matrix of $\Sigma$ corresponding to the observed variables.
* This is more principled than heuristic methods such as imputing the mean (see the code sketch at the end of this guide).

---

### **4. Bayes Optimal Classifier**

The lecture introduces the concept of the theoretical "perfect" classifier.

* **Definition:** The **Bayes Optimal Classifier** is the ideal classifier that would exist if we knew the *true* underlying distribution of the data.
* **Decision Rule:** It assigns the class with the highest posterior probability $P(C_k | x)$:
  $$P(C_1 | x_{new}) \ge P(C_2 | x_{new}) \rightarrow \text{Class 1}$$

#### **Bayes Error**

* Even the optimal classifier has an irreducible error called the **Bayes Error**.
* **Cause:** Classes often overlap in feature space. In the overlapping regions, even the best decision rule makes mistakes with some probability.
* **Implication:** No machine learning algorithm can genuinely achieve 0% error (100% accuracy) on non-trivial problems. The goal of ML is to get as close as possible to the Bayes error.
* **Mathematical Definition:** The error is the smaller posterior probability, weighted by the data density $p(x)$ and integrated over the input space:
  $$\text{Error} = \int \min\left[P(C_1|x),\, P(C_2|x)\right] \, p(x) \, dx$$
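To tie Sections 3 and 4 together, below is a minimal sketch assuming two class-conditional Gaussians with made-up parameters and equal priors; the `posterior` helper, the specific numbers, and the NumPy/SciPy usage are illustrative assumptions, not part of the lecture. It classifies with a missing feature by keeping only the observed sub-vector and sub-matrix, and numerically approximates the Bayes error for the single-feature problem.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

# Illustrative two-class model: class-conditional Gaussians p(x | C_k)
# with equal priors (parameters are made up for this example).
priors = {1: 0.5, 2: 0.5}
params = {
    1: (np.array([0.0, 0.0]), np.array([[1.0, 0.3], [0.3, 1.0]])),
    2: (np.array([2.0, 1.0]), np.array([[1.0, -0.2], [-0.2, 1.5]])),
}

def posterior(x_obs, obs_idx):
    """P(C_k | observed features). Missing dimensions are marginalized out,
    which for a Gaussian just means selecting the matching sub-vector of mu
    and sub-matrix of Sigma."""
    x_obs = np.atleast_1d(x_obs)
    unnorm = {}
    for k, (mu, Sigma) in params.items():
        mu_o = mu[obs_idx]
        Sigma_o = Sigma[np.ix_(obs_idx, obs_idx)]
        unnorm[k] = priors[k] * multivariate_normal(mu_o, Sigma_o).pdf(x_obs)
    total = sum(unnorm.values())
    return {k: v / total for k, v in unnorm.items()}

# Classification with all features vs. with x_2 missing (marginalized out):
print(posterior([1.0, 0.5], obs_idx=[0, 1]))  # both x_1 and x_2 observed
print(posterior([1.0],      obs_idx=[0]))     # x_2 missing

# Bayes error when only x_1 is available:
#   Error = integral of min_k P(C_k | x_1) p(x_1) dx_1
#         = integral of min_k [P(C_k) p(x_1 | C_k)] dx_1,
# approximated here with a simple Riemann sum on a grid.
grid = np.linspace(-6.0, 8.0, 4001)
dx = grid[1] - grid[0]
dens = {k: priors[k] * norm(params[k][0][0], np.sqrt(params[k][1][0, 0])).pdf(grid)
        for k in params}
bayes_error = np.sum(np.minimum(dens[1], dens[2])) * dx
print(f"Approximate Bayes error using x_1 only: {bayes_error:.4f}")
```

Because the two class-conditional densities overlap along $x_1$, the estimated Bayes error is strictly positive, which is exactly the point of Section 4: even the optimal decision rule cannot drive the error to zero.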