# Study Guide: Learning in Generative Methods & Bayes Optimal Classifier

**Date:** 2025.11.13
**Topic:** Maximum Likelihood Estimation (MLE), Missing Data Handling, and the Bayes Optimal Classifier.

---

### **1. Overview: Learning in Generative Methods**

The fundamental goal of generative methods is to **estimate the underlying distribution** of the data. Unlike discriminative methods (e.g., Logistic Regression, SVM), which focus on finding a separating boundary, generative methods introduce a probabilistic model and learn its parameters.

* **Discriminative Model:** Learns specific parameters (like $w, b$ in linear models) to separate classes.
* **Generative Model:** Learns parameters (like $\mu, \Sigma$ in Gaussian models) that best describe how the data is distributed.

#### **Why Gaussian?**

The Gaussian distribution is the standard model for generative methods because of its mathematical convenience: **both its conditional and marginal distributions are also Gaussian**. This property simplifies probabilistic inference significantly.

*[Figure: 3D plot of a multivariate Gaussian distribution]*

---

### **2. The Learning Process: Parameter Estimation**

"Learning" in this context means finding the best parameters ($\mu, \Sigma$) for the Gaussian model given the training data.

#### **Step 1: Define the Objective Function**

We need a metric to evaluate how well our model fits the data. The core idea is **likelihood**:

* **Goal:** We want to assign **high probability** to the observed (empirical) data points.
* **Likelihood Function:** For independent data points, the likelihood is the product of their individual probabilities.

$$P(Z | \mu, \Sigma) = \prod_{i=1}^{N} P(z_i | \mu, \Sigma)$$

#### **Step 2: Log-Likelihood (MLE)**

Directly maximizing the product is awkward. We apply the **logarithm** to convert the product into a sum, creating the **log-likelihood** function. Because the logarithm is monotonic, this does not change the location of the maximum.

* **Objective:** Maximize $\sum_{i=1}^{N} \ln P(z_i | \mu, \Sigma)$.

#### **Step 3: Optimization (Derivation)**

We take the partial derivatives of the log-likelihood with respect to the parameters and set them to zero to find the maximum.

* **Optimal Mean ($\hat{\mu}$):** The derivation yields the **empirical mean**, i.e., the average of the data points.

$$\hat{\mu} = \frac{1}{N} \sum_{i=1}^{N} z_i$$

* **Optimal Covariance ($\hat{\Sigma}$):** The derivation yields the **empirical covariance**.

$$\hat{\Sigma} = \frac{1}{N} \sum_{i=1}^{N} (z_i - \hat{\mu})(z_i - \hat{\mu})^T$$

**Conclusion:** "Learning" a Gaussian generative model amounts to computing the sample mean and sample covariance of the training data. This is a closed-form solution, so no iterative optimization is required.

---
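As a quick illustration of this closed-form "learning" step, here is a minimal sketch in Python/NumPy; the synthetic dataset and variable names below are illustrative assumptions, not from the lecture.

```python
import numpy as np

# Synthetic training data: N samples of a D-dimensional random vector z
# (parameters chosen only for illustration).
rng = np.random.default_rng(0)
Z = rng.multivariate_normal(mean=[1.0, -2.0],
                            cov=[[2.0, 0.5], [0.5, 1.0]],
                            size=500)
N = len(Z)

# MLE for a Gaussian = empirical mean and empirical covariance.
mu_hat = Z.mean(axis=0)                   # (1/N) * sum_i z_i
centered = Z - mu_hat
Sigma_hat = (centered.T @ centered) / N   # (1/N) * sum_i (z_i - mu)(z_i - mu)^T

print("mu_hat:", mu_hat)
print("Sigma_hat:\n", Sigma_hat)
```

For reference, `np.cov(Z.T, bias=True)` produces the same $1/N$ estimate; the default `bias=False` divides by $N-1$ (the unbiased estimator), which is not the MLE.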
### **3. Inference: Making Predictions**

Once the joint distribution $P(z)$ (where $z$ contains both the input features $x$ and the class label $y$) is learned, we can perform inference.

#### **Classification**

To classify a new data point $x_{new}$:

1. We aim to calculate the conditional probability $P(y | x_{new})$.
2. Using the properties of the multivariate Gaussian, we treat the label $y$ as just another dimension of the random vector.
3. We compute the conditional probability for each class and compare them (e.g., $P(y=1 | x)$ vs. $P(y=0 | x)$).

#### **Handling Missing Data**

Generative models offer a theoretically sound way to handle missing variables.

* **Scenario:** We have inputs $x = [x_1, x_2]$, but $x_2$ is missing during inference.
* **Method:** **Marginalization**.
  1. Start with the joint PDF.
  2. Integrate (marginalize) out the missing variable $x_2$:
     $$P(y | x_1) = \frac{\int P(x_1, x_2, y) \, dx_2}{P(x_1)}$$
  3. Because the model is Gaussian, marginalization is trivial: simply select the sub-vector of $\mu$ and the sub-matrix of $\Sigma$ corresponding to the observed variables.
* This is more principled than heuristic methods such as imputing the mean (see the code sketch at the end of this guide).

---

### **4. Bayes Optimal Classifier**

The lecture introduces the concept of the theoretical "perfect" classifier.

* **Definition:** The **Bayes Optimal Classifier** is the ideal classifier that would exist if we knew the *true* underlying distribution of the data.
* **Decision Rule:** It assigns the class with the highest posterior probability $P(C_k | x)$:
  $$P(C_1 | x_{new}) \ge P(C_2 | x_{new}) \rightarrow \text{Class 1}$$

#### **Bayes Error**

* Even the optimal classifier has an irreducible error called the **Bayes Error**.
* **Cause:** Classes often overlap in feature space. In the overlapping regions, even the best decision rule makes mistakes with some probability.
* **Implication:** No machine learning algorithm can genuinely achieve 0% error (100% accuracy) on non-trivial problems. The goal of ML is to get as close as possible to the Bayes error.
* **Mathematical Definition:** The error is the smaller posterior probability, weighted by the data density $p(x)$ and integrated over the input space:
  $$\text{Error} = \int \min\left[P(C_1|x),\, P(C_2|x)\right] \, p(x) \, dx$$
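To tie Sections 3 and 4 together, below is a minimal sketch assuming two class-conditional Gaussians with made-up parameters and equal priors; the `posterior` helper, the specific numbers, and the NumPy/SciPy usage are illustrative assumptions, not part of the lecture. It classifies with a missing feature by keeping only the observed sub-vector and sub-matrix, and numerically approximates the Bayes error for the single-feature problem.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

# Illustrative two-class model: class-conditional Gaussians p(x | C_k)
# with equal priors (parameters are made up for this example).
priors = {1: 0.5, 2: 0.5}
params = {
    1: (np.array([0.0, 0.0]), np.array([[1.0, 0.3], [0.3, 1.0]])),
    2: (np.array([2.0, 1.0]), np.array([[1.0, -0.2], [-0.2, 1.5]])),
}

def posterior(x_obs, obs_idx):
    """P(C_k | observed features). Missing dimensions are marginalized out,
    which for a Gaussian just means selecting the matching sub-vector of mu
    and sub-matrix of Sigma."""
    x_obs = np.atleast_1d(x_obs)
    unnorm = {}
    for k, (mu, Sigma) in params.items():
        mu_o = mu[obs_idx]
        Sigma_o = Sigma[np.ix_(obs_idx, obs_idx)]
        unnorm[k] = priors[k] * multivariate_normal(mu_o, Sigma_o).pdf(x_obs)
    total = sum(unnorm.values())
    return {k: v / total for k, v in unnorm.items()}

# Classification with all features vs. with x_2 missing (marginalized out):
print(posterior([1.0, 0.5], obs_idx=[0, 1]))  # both x_1 and x_2 observed
print(posterior([1.0],      obs_idx=[0]))     # x_2 missing

# Bayes error when only x_1 is available:
#   Error = integral of min_k P(C_k | x_1) p(x_1) dx_1
#         = integral of min_k [P(C_k) p(x_1 | C_k)] dx_1,
# approximated here with a simple Riemann sum on a grid.
grid = np.linspace(-6.0, 8.0, 4001)
dx = grid[1] - grid[0]
dens = {k: priors[k] * norm(params[k][0][0], np.sqrt(params[k][1][0, 0])).pdf(grid)
        for k in params}
bayes_error = np.sum(np.minimum(dens[1], dens[2])) * dx
print(f"Approximate Bayes error using x_1 only: {bayes_error:.4f}")
```

Because the two class-conditional densities overlap along $x_1$, the estimated Bayes error is strictly positive, which is exactly the point of Section 4: even the optimal decision rule cannot drive the error to zero.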