Study Guide: Bayes Optimal Classifier & Graphical Models (Bayesian Networks)
Date: 2025.11.17 Topic: Bayes Error, Graphical Models (Directed), Conditional Independence, and D-separation.
1. Recap: Bayes Optimal Classifier and Bayes Error
The lecture begins by revisiting the concept of the Bayes Optimal Classifier.
- Decision Rule: For a new data point x_{new}, the classifier compares the posterior probabilities P(C_1 | x_{new}) and P(C_2 | x_{new}) and assigns the label with the higher posterior.
- Optimality: No other classifier can outperform the Bayes Optimal Classifier; it achieves the theoretical minimum error rate.
Bayes Error (Irreducible Error)
- Definition: Even the perfect classifier will make mistakes because classes often overlap in the feature space. This inevitable error is called the Bayes Error.
- Cause: It arises from inherent uncertainty, noise, or ambiguity in the data itself, not from the classifier's limitations.
- Goal of ML: The objective of any machine learning algorithm is not to achieve 0% error (which is impossible) but to approach the Bayes Error limit as closely as possible.
- Formula: The risk (expected error) is the integral of the minimum of the two class probabilities over the domain:
R^* = \int \min[P_1(x), P_2(x)] \, dx
If the priors are equal, this simplifies to the integral of the overlap region between the two class densities (a numerical sketch follows).
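A minimal numerical sketch of this quantity, assuming two 1-D Gaussian classes with equal priors (the means and variances are illustrative, not from the lecture):

```python
# Estimate the Bayes error R* = integral of min[P_1(x), P_2(x)] dx on a grid,
# for two equal-prior Gaussian classes. All numbers are illustrative.
import numpy as np
from scipy.stats import norm

prior = 0.5                                    # equal priors P(C_1) = P(C_2)
x = np.linspace(-10, 10, 100_000)
dx = x[1] - x[0]

p1 = prior * norm.pdf(x, loc=-1.0, scale=1.0)  # P(C_1) * p(x | C_1)
p2 = prior * norm.pdf(x, loc=+1.0, scale=1.0)  # P(C_2) * p(x | C_2)

# The Bayes classifier picks the larger of p1, p2 at each x; its error is the
# probability mass of the smaller one (the overlap region).
bayes_error = np.minimum(p1, p2).sum() * dx
print(f"Estimated Bayes error: {bayes_error:.4f}")   # about 0.159 for this overlap
```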
2. Introduction to Graphical Models
The focus shifts to Generative Models, specifically Graphical Models (also known as Bayesian Networks).
- Motivation:
- A full multivariate Gaussian model requires estimating a D \times D Covariance Matrix.
- The number of parameters grows quadratically (O(D^2)); the symmetric covariance matrix contributes \frac{D(D+1)}{2} free parameters.
- For high-dimensional data (like images with millions of pixels), estimating these parameters requires an enormous amount of data, which is often infeasible.
- Solution: Use Prior Knowledge to simplify the model. If we know that certain variables are independent, we can set their covariance terms to zero, significantly reducing the number of parameters to learn (see the sketch below).
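A tiny back-of-the-envelope script (illustrative, not from the lecture) comparing the parameter count of a full covariance matrix with the diagonal covariance obtained under a full-independence assumption:

```python
# Parameter counts for a D-dimensional Gaussian: a full symmetric covariance
# needs D(D+1)/2 values, a diagonal (independence) covariance needs only D.
def gaussian_param_count(D: int) -> dict:
    return {
        "mean": D,                      # one mean per dimension
        "full_cov": D * (D + 1) // 2,   # symmetric D x D covariance matrix
        "diag_cov": D,                  # off-diagonal covariance terms set to zero
    }

for D in (10, 1_000, 1_000_000):        # e.g. a megapixel image has D ~ 10^6
    c = gaussian_param_count(D)
    print(f"D={D:>9}: full covariance {c['full_cov']:,} vs diagonal {c['diag_cov']:,}")
```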
3. The Chain Rule and Independence
Graphical models leverage the Chain Rule of Probability to decompose a complex joint distribution into simpler conditional probabilities.
- General Chain Rule:
P(x_1, ..., x_D) = P(x_1) P(x_2 | x_1) P(x_3 | x_1, x_2) \cdots P(x_D | x_1, ..., x_{D-1})
- Simplification with Independence: If variable x_3 depends only on x_1 and is independent of x_2 given x_1, then P(x_3 | x_1, x_2) simplifies to P(x_3 | x_1) (see the sketch after this list).
- Structure: This creates a Directed Acyclic Graph (DAG) (or Bayes Network) where:
- Nodes represent random variables.
- Edges (Arrows) represent conditional dependencies (causality).
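A minimal sketch of this factorization with three binary variables and made-up CPT numbers; the conditional independence of x_3 and x_2 given x_1 is the assumption being illustrated:

```python
# Chain rule with an independence assumption:
# P(x1, x2, x3) = P(x1) P(x2 | x1) P(x3 | x1),
# because x3 is assumed conditionally independent of x2 given x1.
P_x1 = {0: 0.7, 1: 0.3}                                       # P(x1)
P_x2_given_x1 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}    # P(x2 | x1)
P_x3_given_x1 = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.5, 1: 0.5}}    # P(x3 | x1)

def joint(x1, x2, x3):
    return P_x1[x1] * P_x2_given_x1[x1][x2] * P_x3_given_x1[x1][x3]

# Sanity check: the factorized joint sums to 1 over all 2^3 assignments.
print(sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)))  # 1.0
```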
4. Building a Bayesian Network (Causal Graph)
The lecture illustrates this with a practical example involving a crying baby.
- Scenario: We want to model the causes of a baby crying.
- Variables:
- Cry: The observable effect.
- Hungry, Sick, Diaper: Direct causes of crying.
- Pororo: A distractor (e.g., watching a cartoon) that might stop the crying.
- Dependencies:
- "Hungry" and "Sick" might be independent of each other generally.
- "Cry" depends on all of them.
- "Pororo" depends on "Cry" (parent turns on TV because baby is crying) or affects "Cry".
5. The Three Canonical Patterns of Independence
To understand complex graphs, we decompose them into three fundamental 3-node patterns. Understanding these patterns allows us to determine if variables are independent given some evidence.
1. Tail-to-Tail (Common Cause)
- Structure: X \leftarrow Z \rightarrow Y (Z causes both X and Y).
- Property: X and Y are dependent. However, if Z is observed (given), X and Y become conditionally independent.
- Example: If Z (the common cause) determines both X and Y, knowing Z explains the correlation, decoupling X and Y.
2. Head-to-Tail (Causal Chain)
- Structure: X \rightarrow Z \rightarrow Y (X causes Z, which causes Y).
- Property: X and Y are dependent. If Z is observed, the path is blocked, and X and Y become conditionally independent.
- Example: X influences Y only through Z. If Z is fixed, X cannot influence Y further.
3. Head-to-Head (Common Effect / V-Structure)
- Structure: X \rightarrow Z \leftarrow Y (X and Y both cause Z).
- Property: Crucial difference. X and Y are marginally independent. However, if Z (or any of its descendants) is observed, they become dependent ("explaining away").
- Example: X (Hungry) \rightarrow Z (Cry) \leftarrow Y (Sick).
- Being hungry tells us nothing about being sick (independent).
- But if we know the baby is crying (Z observed), finding out the baby is Hungry (X) makes it less likely that they are Sick (Y). The causes compete to explain the effect (see the numeric check below).
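A quick numeric check of explaining away on the v-structure Hungry \rightarrow Cry \leftarrow Sick, with illustrative probabilities (not given in the lecture):

```python
# Explaining away: P(Sick | Cry) drops once we also learn the baby is Hungry.
P_H, P_S = 0.3, 0.1        # Hungry and Sick are marginally independent

def p_cry(h, s):
    # Illustrative CPT for P(Cry = 1 | Hungry = h, Sick = s).
    return {(0, 0): 0.05, (1, 0): 0.8, (0, 1): 0.9, (1, 1): 0.95}[(h, s)]

def joint(h, s, c):
    p = (P_H if h else 1 - P_H) * (P_S if s else 1 - P_S)
    return p * (p_cry(h, s) if c else 1 - p_cry(h, s))

# P(Sick = 1 | Cry = 1): marginalize over Hungry.
p_sick_cry = sum(joint(h, 1, 1) for h in (0, 1)) / \
             sum(joint(h, s, 1) for h in (0, 1) for s in (0, 1))

# P(Sick = 1 | Cry = 1, Hungry = 1): Hungry "explains away" the crying.
p_sick_cry_hungry = joint(1, 1, 1) / sum(joint(1, s, 1) for s in (0, 1))

print(round(p_sick_cry, 3), round(p_sick_cry_hungry, 3))   # ~0.270 vs ~0.117
```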
6. D-Separation
These rules form the basis of D-separation (Directed Separation), a formal method to determine conditional independence in any directed graph.
- If every path between two variables is "blocked" by the evidence set, the variables are D-separated (conditionally independent given that evidence). A small checker for the crying-baby graph is sketched after the list below.
- A path is blocked if:
- It contains a chain or fork whose middle node is observed.
- It contains a collider (head-to-head node) where neither the middle node nor any of its descendants is observed.
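A small self-contained d-separation check written for this guide (it uses the standard ancestral-graph / moralization construction, not code from the lecture), applied to the crying-baby network:

```python
# D-separation via the moralized ancestral graph:
# 1. keep only ancestors of the query and evidence nodes,
# 2. moralize (marry co-parents) and drop edge directions,
# 3. remove evidence nodes; X and Y are d-separated iff they are disconnected.
from collections import deque

# Directed edges: parent -> children (Pororo modeled as a child of Cry here).
dag = {
    "Hungry": ["Cry"],
    "Sick":   ["Cry"],
    "Diaper": ["Cry"],
    "Cry":    ["Pororo"],
    "Pororo": [],
}

def parents(node):
    return [p for p, kids in dag.items() if node in kids]

def ancestors(nodes):
    seen, stack = set(nodes), list(nodes)
    while stack:
        for p in parents(stack.pop()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def d_separated(x, y, given):
    keep = ancestors({x, y} | set(given))
    # Moralize: undirected parent-child edges plus edges between co-parents.
    und = {n: set() for n in keep}
    for child in keep:
        ps = [p for p in parents(child) if p in keep]
        for p in ps:
            und[p].add(child); und[child].add(p)
        for i in range(len(ps)):
            for j in range(i + 1, len(ps)):
                und[ps[i]].add(ps[j]); und[ps[j]].add(ps[i])
    # Remove evidence nodes, then check reachability from x to y.
    blocked = set(given)
    queue, seen = deque([x]), {x}
    while queue:
        n = queue.popleft()
        if n == y:
            return False                       # a path survives -> not d-separated
        for nb in und[n]:
            if nb not in seen and nb not in blocked:
                seen.add(nb); queue.append(nb)
    return True

print(d_separated("Hungry", "Sick", []))          # True: marginally independent
print(d_separated("Hungry", "Sick", ["Cry"]))     # False: explaining away
print(d_separated("Hungry", "Sick", ["Pororo"]))  # False: a descendant of the collider is observed
```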