# Study Guide: Bayes Optimal Classifier & Graphical Models (Bayesian Networks)

**Date:** 2025.11.17
**Topic:** Bayes Error, Graphical Models (Directed), Conditional Independence, and D-separation.

---

### **1. Recap: Bayes Optimal Classifier and Bayes Error**

The lecture begins by revisiting the concept of the **Bayes Optimal Classifier**.

* **Decision Rule:** For a new data point $x_{new}$, the classifier compares the posterior probabilities $P(C_1 | x_{new})$ and $P(C_2 | x_{new})$ and assigns the label with the higher posterior.
* **Optimality:** No other classifier can outperform the Bayes Optimal Classifier; it achieves the theoretical minimum error rate.

#### **Bayes Error (Irreducible Error)**

* **Definition:** Even the optimal classifier makes mistakes, because classes typically overlap in feature space. This inevitable error is called the **Bayes Error**.
* **Cause:** It arises from inherent uncertainty, noise, or ambiguity in the data itself, not from the classifier's limitations.
* **Goal of ML:** The objective of a machine learning algorithm is not to achieve 0% error (which is impossible when classes overlap) but to approach the Bayes Error as closely as possible.
* **Formula:** The Bayes risk (minimum expected error) is obtained by integrating the pointwise minimum of the two class densities over the domain:
  $$R^* = \int \min[P_1(x), P_2(x)] \, dx$$
  With equal priors, this is proportional to the area of the overlap region between the two class-conditional densities.

---

### **2. Introduction to Graphical Models**

The focus shifts to **Generative Models**, specifically **Graphical Models** (also known as Bayesian Networks).

* **Motivation:**
    * A full multivariate Gaussian model requires estimating a **Covariance Matrix** with $D \times D$ entries.
    * The number of parameters grows quadratically ($O(D^2)$); by symmetry, the covariance matrix has $\frac{D(D+1)}{2}$ distinct parameters.
    * For high-dimensional data (e.g., images with millions of pixels), estimating these parameters reliably requires an enormous amount of data, which is often infeasible.
* **Solution:** Use **Prior Knowledge** to simplify the model. If we know that certain variables are independent, we can set the corresponding covariance terms to zero, significantly reducing the number of parameters to learn.

---

### **3. The Chain Rule and Independence**

Graphical models leverage the **Chain Rule of Probability** to decompose a complex joint distribution into simpler conditional probabilities.

* **General Chain Rule:**
  $$P(x_1, \dots, x_D) = P(x_1)\, P(x_2 \mid x_1)\, P(x_3 \mid x_1, x_2) \cdots P(x_D \mid x_1, \dots, x_{D-1})$$
* **Simplification with Independence:** If variable $x_3$ depends only on $x_1$ and is independent of $x_2$, then $P(x_3 \mid x_1, x_2)$ simplifies to $P(x_3 \mid x_1)$.
* **Structure:** This factorization defines a **Directed Acyclic Graph (DAG)** (a Bayesian Network) in which:
    * **Nodes** represent random variables.
    * **Edges (arrows)** represent direct conditional dependencies (often interpreted causally).

---

### **4. Building a Bayesian Network (Causal Graph)**

The lecture illustrates this with a practical example involving a crying baby.

* **Scenario:** We want to model the causes of a baby crying.
* **Variables:**
    * **Cry:** The observable effect.
    * **Hungry, Sick, Diaper:** Direct causes of crying.
    * **Pororo:** A distractor (e.g., watching a cartoon) that might stop the crying.
* **Dependencies:**
    * "Hungry" and "Sick" are plausibly independent of each other a priori.
    * "Cry" depends on all of its causes.
    * "Pororo" either depends on "Cry" (a parent turns on the TV *because* the baby is crying) or affects "Cry", depending on the chosen causal direction.

---

### **5. The Three Canonical Patterns of Independence**

To analyze complex graphs, we decompose them into three fundamental 3-node patterns. Understanding these patterns lets us determine whether variables are independent given some evidence.

#### **1. Tail-to-Tail (Common Cause)**

* **Structure:** $X \leftarrow Z \rightarrow Y$ ($Z$ causes both $X$ and $Y$).
* **Property:** $X$ and $Y$ are dependent in general. However, once $Z$ is observed (given), $X$ and $Y$ become conditionally **independent**.
* **Example:** If $Z$ (the common cause) determines both $X$ and $Y$, knowing $Z$ explains their correlation, decoupling $X$ and $Y$.

#### **2. Head-to-Tail (Causal Chain)**

* **Structure:** $X \rightarrow Z \rightarrow Y$ ($X$ causes $Z$, which causes $Y$).
* **Property:** $X$ and $Y$ are dependent in general. If $Z$ is observed, the path is blocked, and $X$ and $Y$ become conditionally **independent**.
* **Example:** $X$ influences $Y$ only through $Z$; once $Z$ is fixed, $X$ carries no further information about $Y$.

#### **3. Head-to-Head (Common Effect / V-Structure)**

* **Structure:** $X \rightarrow Z \leftarrow Y$ ($X$ and $Y$ both cause $Z$).
* **Property:** **Crucial difference.** $X$ and $Y$ are naturally **independent** (marginal independence). However, if $Z$ (or any of its descendants) is observed, they become **dependent**: the "explaining away" effect.
* **Example:** $X$ (Hungry) $\rightarrow$ $Z$ (Cry) $\leftarrow$ $Y$ (Sick).
    * Being hungry tells us nothing about being sick (independent).
    * But if we *know* the baby is crying ($Z$ observed), learning that the baby is hungry ($X$) makes it less likely that it is sick ($Y$). The causes compete to explain the effect.

---

### **6. D-Separation**

These rules form the basis of **D-separation** (directed separation), a formal method for determining conditional independence in any directed graph.

* If every path between two variables is "blocked" by the evidence set, the variables are D-separated (conditionally independent).
* A path is blocked if:
    * it contains a chain or fork whose middle node **is observed**, or
    * it contains a collider whose middle node (and all of its descendants) are **NOT observed**.
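The crying-baby v-structure and the chain-rule factorization can be checked numerically with a small sketch. The conditional probability values below are illustrative assumptions, not numbers from the lecture; the code enumerates the joint distribution $P(H)\,P(S)\,P(C \mid H, S)$ and verifies both the marginal independence of the two causes and the "explaining away" effect once Cry is observed.

```python
from itertools import product

# Hypothetical CPTs for a reduced crying-baby network (illustrative values):
# Hungry (H) and Sick (S) are independent root causes; Cry (C) is their
# common effect, i.e. a head-to-head (v-structure) node.
p_h = 0.30                       # P(Hungry = 1)
p_s = 0.10                       # P(Sick = 1)
p_cry = {                        # P(Cry = 1 | Hungry, Sick)
    (0, 0): 0.05,
    (1, 0): 0.80,
    (0, 1): 0.80,
    (1, 1): 0.95,
}

def joint(h, s, c):
    """P(H=h, S=s, C=c) via the chain-rule factorization P(H) P(S) P(C|H,S)."""
    ph = p_h if h else 1 - p_h
    ps = p_s if s else 1 - p_s
    pc = p_cry[(h, s)] if c else 1 - p_cry[(h, s)]
    return ph * ps * pc

def prob(query, evidence):
    """P(query | evidence) by brute-force enumeration; args are dicts like {'S': 1}."""
    def match(assign, constraints):
        return all(assign[k] == v for k, v in constraints.items())
    num = den = 0.0
    for h, s, c in product([0, 1], repeat=3):
        assign = {'H': h, 'S': s, 'C': c}
        p = joint(h, s, c)
        if match(assign, evidence):
            den += p
            if match(assign, query):
                num += p
    return num / den

# Marginal independence: Hungry carries no information about Sick...
print(prob({'S': 1}, {}))            # P(S)       ~ 0.10
print(prob({'S': 1}, {'H': 1}))      # P(S | H=1) ~ 0.10 (unchanged)

# ...but observing the common effect couples them ("explaining away"):
print(prob({'S': 1}, {'C': 1}))          # P(S | C=1) rises above the prior
print(prob({'S': 1}, {'C': 1, 'H': 1}))  # P(S | C=1, H=1) drops back down
```

With these numbers, observing Cry raises $P(S)$ from 0.10 to roughly 0.25, while additionally learning Hungry pulls it back down to about 0.12: exactly the competition between causes described in the head-to-head pattern.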