# Study Guide: Bayes Optimal Classifier & Graphical Models (Bayesian Networks)

**Date:** 2025.11.17
**Topic:** Bayes Error, Graphical Models (Directed), Conditional Independence, and D-separation.

---

### **1. Recap: Bayes Optimal Classifier and Bayes Error**

The lecture begins by revisiting the concept of the **Bayes Optimal Classifier**.

* **Decision Rule:** For a new data point $x_{new}$, the classifier compares the posterior probabilities $P(C_1 | x_{new})$ and $P(C_2 | x_{new})$ and assigns the label with the higher posterior.
* **Optimality:** No other classifier can outperform the Bayes Optimal Classifier; it achieves the theoretical minimum error rate.

#### **Bayes Error (Irreducible Error)**

* **Definition:** Even the optimal classifier makes mistakes, because classes typically overlap in feature space. This inevitable error is called the **Bayes Error**.
* **Cause:** It arises from inherent uncertainty, noise, or ambiguity in the data itself, not from the classifier's limitations.
* **Goal of ML:** The objective of a machine learning algorithm is not to achieve 0% error (which is impossible when classes overlap) but to approach the Bayes Error as closely as possible.
* **Formula:** The Bayes risk (minimum expected error) is obtained by integrating the pointwise minimum of the two class densities over the domain:
  $$R^* = \int \min[P_1(x), P_2(x)] \, dx$$
  With equal priors, this is proportional to the area of the overlap region between the two class-conditional densities.

---

### **2. Introduction to Graphical Models**

The focus shifts to **Generative Models**, specifically **Graphical Models** (also known as Bayesian Networks).

* **Motivation:**
    * A full multivariate Gaussian model requires estimating a **Covariance Matrix** with $D \times D$ entries.
    * The number of parameters grows quadratically ($O(D^2)$); by symmetry, the covariance matrix has $\frac{D(D+1)}{2}$ distinct parameters.
    * For high-dimensional data (e.g., images with millions of pixels), estimating these parameters reliably requires an enormous amount of data, which is often infeasible.
* **Solution:** Use **Prior Knowledge** to simplify the model. If we know that certain variables are independent, we can set the corresponding covariance terms to zero, significantly reducing the number of parameters to learn.

---

### **3. The Chain Rule and Independence**

Graphical models leverage the **Chain Rule of Probability** to decompose a complex joint distribution into simpler conditional probabilities.

* **General Chain Rule:**
  $$P(x_1, \dots, x_D) = P(x_1)\, P(x_2 \mid x_1)\, P(x_3 \mid x_1, x_2) \cdots P(x_D \mid x_1, \dots, x_{D-1})$$
* **Simplification with Independence:** If variable $x_3$ depends only on $x_1$ and is independent of $x_2$, then $P(x_3 \mid x_1, x_2)$ simplifies to $P(x_3 \mid x_1)$.
* **Structure:** This factorization defines a **Directed Acyclic Graph (DAG)** (a Bayesian Network) in which:
    * **Nodes** represent random variables.
    * **Edges (arrows)** represent direct conditional dependencies (often interpreted causally).

---

### **4. Building a Bayesian Network (Causal Graph)**

The lecture illustrates this with a practical example involving a crying baby.

* **Scenario:** We want to model the causes of a baby crying.
* **Variables:**
    * **Cry:** The observable effect.
    * **Hungry, Sick, Diaper:** Direct causes of crying.
    * **Pororo:** A distractor (e.g., watching a cartoon) that might stop the crying.
* **Dependencies:**
    * "Hungry" and "Sick" are plausibly independent of each other a priori.
    * "Cry" depends on all of its causes.
    * "Pororo" either depends on "Cry" (a parent turns on the TV *because* the baby is crying) or affects "Cry", depending on the chosen causal direction.

---

### **5. The Three Canonical Patterns of Independence**

To analyze complex graphs, we decompose them into three fundamental 3-node patterns. Understanding these patterns lets us determine whether variables are independent given some evidence.

#### **1. Tail-to-Tail (Common Cause)**

* **Structure:** $X \leftarrow Z \rightarrow Y$ ($Z$ causes both $X$ and $Y$).
* **Property:** $X$ and $Y$ are dependent in general. However, once $Z$ is observed (given), $X$ and $Y$ become conditionally **independent**.
* **Example:** If $Z$ (the common cause) determines both $X$ and $Y$, knowing $Z$ explains their correlation, decoupling $X$ and $Y$.

#### **2. Head-to-Tail (Causal Chain)**

* **Structure:** $X \rightarrow Z \rightarrow Y$ ($X$ causes $Z$, which causes $Y$).
* **Property:** $X$ and $Y$ are dependent in general. If $Z$ is observed, the path is blocked, and $X$ and $Y$ become conditionally **independent**.
* **Example:** $X$ influences $Y$ only through $Z$; once $Z$ is fixed, $X$ carries no further information about $Y$.

#### **3. Head-to-Head (Common Effect / V-Structure)**

* **Structure:** $X \rightarrow Z \leftarrow Y$ ($X$ and $Y$ both cause $Z$).
* **Property:** **Crucial difference.** $X$ and $Y$ are naturally **independent** (marginal independence). However, if $Z$ (or any of its descendants) is observed, they become **dependent**: the "explaining away" effect.
* **Example:** $X$ (Hungry) $\rightarrow$ $Z$ (Cry) $\leftarrow$ $Y$ (Sick).
    * Being hungry tells us nothing about being sick (independent).
    * But if we *know* the baby is crying ($Z$ observed), learning that the baby is hungry ($X$) makes it less likely that it is sick ($Y$). The causes compete to explain the effect.

---

### **6. D-Separation**

These rules form the basis of **D-separation** (directed separation), a formal method for determining conditional independence in any directed graph.

* If every path between two variables is "blocked" by the evidence set, the variables are D-separated (conditionally independent).
* A path is blocked if:
    * it contains a chain or fork whose middle node **is observed**, or
    * it contains a collider whose middle node (and all of its descendants) are **NOT observed**.
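The crying-baby v-structure and the chain-rule factorization can be checked numerically with a small sketch. The conditional probability values below are illustrative assumptions, not numbers from the lecture; the code enumerates the joint distribution $P(H)\,P(S)\,P(C \mid H, S)$ and verifies both the marginal independence of the two causes and the "explaining away" effect once Cry is observed.

```python
from itertools import product

# Hypothetical CPTs for a reduced crying-baby network (illustrative values):
# Hungry (H) and Sick (S) are independent root causes; Cry (C) is their
# common effect, i.e. a head-to-head (v-structure) node.
p_h = 0.30                       # P(Hungry = 1)
p_s = 0.10                       # P(Sick = 1)
p_cry = {                        # P(Cry = 1 | Hungry, Sick)
    (0, 0): 0.05,
    (1, 0): 0.80,
    (0, 1): 0.80,
    (1, 1): 0.95,
}

def joint(h, s, c):
    """P(H=h, S=s, C=c) via the chain-rule factorization P(H) P(S) P(C|H,S)."""
    ph = p_h if h else 1 - p_h
    ps = p_s if s else 1 - p_s
    pc = p_cry[(h, s)] if c else 1 - p_cry[(h, s)]
    return ph * ps * pc

def prob(query, evidence):
    """P(query | evidence) by brute-force enumeration; args are dicts like {'S': 1}."""
    def match(assign, constraints):
        return all(assign[k] == v for k, v in constraints.items())
    num = den = 0.0
    for h, s, c in product([0, 1], repeat=3):
        assign = {'H': h, 'S': s, 'C': c}
        p = joint(h, s, c)
        if match(assign, evidence):
            den += p
            if match(assign, query):
                num += p
    return num / den

# Marginal independence: Hungry carries no information about Sick...
print(prob({'S': 1}, {}))            # P(S)       ~ 0.10
print(prob({'S': 1}, {'H': 1}))      # P(S | H=1) ~ 0.10 (unchanged)

# ...but observing the common effect couples them ("explaining away"):
print(prob({'S': 1}, {'C': 1}))          # P(S | C=1) rises above the prior
print(prob({'S': 1}, {'C': 1, 'H': 1}))  # P(S | C=1, H=1) drops back down
```

With these numbers, observing Cry raises $P(S)$ from 0.10 to roughly 0.25, while additionally learning Hungry pulls it back down to about 0.12: exactly the competition between causes described in the head-to-head pattern.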