Study Guide: Bayes Optimal Classifier & Graphical Models (Bayesian Networks)
Date: 2025.11.17 Topic: Bayes Error, Graphical Models (Directed), Conditional Independence, and D-separation.
1. Recap: Bayes Optimal Classifier and Bayes Error
The lecture begins by revisiting the concept of the Bayes Optimal Classifier.
- Decision Rule: For a new data point x_{new}, the classifier compares the posterior probabilities P(C_1 | x_{new}) and P(C_2 | x_{new}) and assigns the label with the higher posterior.
- Optimality: No other classifier can outperform the Bayes Optimal Classifier; it achieves the theoretical minimum error rate.
Bayes Error (Irreducible Error)
- Definition: Even the perfect classifier will make mistakes because classes often overlap in the feature space. This inevitable error is called the Bayes Error.
- Cause: It arises from inherent uncertainty, noise, or ambiguity in the data itself, not from the classifier's limitations.
- Goal of ML: The objective of any machine learning algorithm is not to achieve 0% error (which is impossible) but to approach the Bayes Error limit as closely as possible.
- Formula: The risk (expected error) is the integral of the minimum of the two class probabilities over the domain:
R^* = \int \min[P_1(x), P_2(x)] \, dx
If the priors are equal, this simplifies to the integral of the overlap region between the two class densities (a numerical sketch follows).
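A minimal numerical sketch of this quantity, assuming two 1-D Gaussian classes with equal priors (the means and variances are illustrative, not from the lecture):

```python
# Estimate the Bayes error R* = integral of min[P_1(x), P_2(x)] dx on a grid,
# for two equal-prior Gaussian classes. All numbers are illustrative.
import numpy as np
from scipy.stats import norm

prior = 0.5                                    # equal priors P(C_1) = P(C_2)
x = np.linspace(-10, 10, 100_000)
dx = x[1] - x[0]

p1 = prior * norm.pdf(x, loc=-1.0, scale=1.0)  # P(C_1) * p(x | C_1)
p2 = prior * norm.pdf(x, loc=+1.0, scale=1.0)  # P(C_2) * p(x | C_2)

# The Bayes classifier picks the larger of p1, p2 at each x; its error is the
# probability mass of the smaller one (the overlap region).
bayes_error = np.minimum(p1, p2).sum() * dx
print(f"Estimated Bayes error: {bayes_error:.4f}")   # about 0.159 for this overlap
```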
2. Introduction to Graphical Models
The focus shifts to Generative Models, specifically Graphical Models (also known as Bayesian Networks).
- Motivation:
- A full multivariate Gaussian model requires estimating a D \times D Covariance Matrix.
- The number of parameters grows quadratically (O(D^2)); the symmetric covariance matrix contributes \frac{D(D+1)}{2} free parameters.
- For high-dimensional data (like images with millions of pixels), estimating these parameters requires an enormous amount of data, which is often infeasible.
- Solution: Use Prior Knowledge to simplify the model. If we know that certain variables are independent, we can set their covariance terms to zero, significantly reducing the number of parameters to learn (see the sketch below).
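A tiny back-of-the-envelope script (illustrative, not from the lecture) comparing the parameter count of a full covariance matrix with the diagonal covariance obtained under a full-independence assumption:

```python
# Parameter counts for a D-dimensional Gaussian: a full symmetric covariance
# needs D(D+1)/2 values, a diagonal (independence) covariance needs only D.
def gaussian_param_count(D: int) -> dict:
    return {
        "mean": D,                      # one mean per dimension
        "full_cov": D * (D + 1) // 2,   # symmetric D x D covariance matrix
        "diag_cov": D,                  # off-diagonal covariance terms set to zero
    }

for D in (10, 1_000, 1_000_000):        # e.g. a megapixel image has D ~ 10^6
    c = gaussian_param_count(D)
    print(f"D={D:>9}: full covariance {c['full_cov']:,} vs diagonal {c['diag_cov']:,}")
```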
3. The Chain Rule and Independence
Graphical models leverage the Chain Rule of Probability to decompose a complex joint distribution into simpler conditional probabilities.
- General Chain Rule:
P(x_1, ..., x_D) = P(x_1) P(x_2 | x_1) P(x_3 | x_1, x_2) \cdots P(x_D | x_1, ..., x_{D-1})
- Simplification with Independence: If variable x_3 depends only on x_1 and is independent of x_2 given x_1, then P(x_3 | x_1, x_2) simplifies to P(x_3 | x_1) (see the sketch after this list).
- Structure: This creates a Directed Acyclic Graph (DAG) (or Bayes Network) where:
- Nodes represent random variables.
- Edges (Arrows) represent conditional dependencies (causality).
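A minimal sketch of this factorization with three binary variables and made-up CPT numbers; the conditional independence of x_3 and x_2 given x_1 is the assumption being illustrated:

```python
# Chain rule with an independence assumption:
# P(x1, x2, x3) = P(x1) P(x2 | x1) P(x3 | x1),
# because x3 is assumed conditionally independent of x2 given x1.
P_x1 = {0: 0.7, 1: 0.3}                                       # P(x1)
P_x2_given_x1 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}    # P(x2 | x1)
P_x3_given_x1 = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.5, 1: 0.5}}    # P(x3 | x1)

def joint(x1, x2, x3):
    return P_x1[x1] * P_x2_given_x1[x1][x2] * P_x3_given_x1[x1][x3]

# Sanity check: the factorized joint sums to 1 over all 2^3 assignments.
print(sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)))  # 1.0
```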
4. Building a Bayesian Network (Causal Graph)
The lecture illustrates this with a practical example involving a crying baby.
- Scenario: We want to model the causes of a baby crying.
- Variables:
- Cry: The observable effect.
- Hungry, Sick, Diaper: Direct causes of crying.
- Pororo: A distractor (e.g., watching a cartoon) that might stop the crying.
- Dependencies:
- "Hungry" and "Sick" might be independent of each other generally.
- "Cry" depends on all of them.
- "Pororo" depends on "Cry" (parent turns on TV because baby is crying) or affects "Cry".
5. The Three Canonical Patterns of Independence
To understand complex graphs, we decompose them into three fundamental 3-node patterns. Understanding these patterns allows us to determine if variables are independent given some evidence.
1. Tail-to-Tail (Common Cause)
- Structure: X \leftarrow Z \rightarrow Y (Z causes both X and Y).
- Property: X and Y are dependent. However, if Z is observed (given), X and Y become conditionally independent.
- Example: If Z (the common cause) determines both X and Y, knowing Z explains the correlation, decoupling X and Y.
2. Head-to-Tail (Causal Chain)
- Structure: X \rightarrow Z \rightarrow Y (X causes Z, which causes Y).
- Property: X and Y are dependent. If Z is observed, the path is blocked, and X and Y become conditionally independent.
- Example: X influences Y only through Z. If Z is fixed, X cannot influence Y further.
3. Head-to-Head (Common Effect / V-Structure)
- Structure: X \rightarrow Z \leftarrow Y (X and Y both cause Z).
- Property: Crucial difference. X and Y are marginally independent. However, if Z (or any of its descendants) is observed, they become dependent ("explaining away").
- Example: X (Hungry) \rightarrow Z (Cry) \leftarrow Y (Sick).
- Being hungry tells us nothing about being sick (independent).
- But if we know the baby is crying (Z observed), finding out the baby is Hungry (X) makes it less likely that they are Sick (Y). The causes compete to explain the effect (see the numeric check below).
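A quick numeric check of explaining away on the v-structure Hungry \rightarrow Cry \leftarrow Sick, with illustrative probabilities (not given in the lecture):

```python
# Explaining away: P(Sick | Cry) drops once we also learn the baby is Hungry.
P_H, P_S = 0.3, 0.1        # Hungry and Sick are marginally independent

def p_cry(h, s):
    # Illustrative CPT for P(Cry = 1 | Hungry = h, Sick = s).
    return {(0, 0): 0.05, (1, 0): 0.8, (0, 1): 0.9, (1, 1): 0.95}[(h, s)]

def joint(h, s, c):
    p = (P_H if h else 1 - P_H) * (P_S if s else 1 - P_S)
    return p * (p_cry(h, s) if c else 1 - p_cry(h, s))

# P(Sick = 1 | Cry = 1): marginalize over Hungry.
p_sick_cry = sum(joint(h, 1, 1) for h in (0, 1)) / \
             sum(joint(h, s, 1) for h in (0, 1) for s in (0, 1))

# P(Sick = 1 | Cry = 1, Hungry = 1): Hungry "explains away" the crying.
p_sick_cry_hungry = joint(1, 1, 1) / sum(joint(1, s, 1) for s in (0, 1))

print(round(p_sick_cry, 3), round(p_sick_cry_hungry, 3))   # ~0.270 vs ~0.117
```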
6. D-Separation
These rules form the basis of D-separation (Directed Separation), a formal method to determine conditional independence in any directed graph.
- If every path between two variables is "blocked" by the evidence set, the variables are D-separated (conditionally independent given that evidence). A small checker for the crying-baby graph is sketched after the list below.
- A path is blocked if:
- It contains a chain or fork whose middle node is observed.
- It contains a collider (head-to-head node) where neither the middle node nor any of its descendants is observed.
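A small self-contained d-separation check written for this guide (it uses the standard ancestral-graph / moralization construction, not code from the lecture), applied to the crying-baby network:

```python
# D-separation via the moralized ancestral graph:
# 1. keep only ancestors of the query and evidence nodes,
# 2. moralize (marry co-parents) and drop edge directions,
# 3. remove evidence nodes; X and Y are d-separated iff they are disconnected.
from collections import deque

# Directed edges: parent -> children (Pororo modeled as a child of Cry here).
dag = {
    "Hungry": ["Cry"],
    "Sick":   ["Cry"],
    "Diaper": ["Cry"],
    "Cry":    ["Pororo"],
    "Pororo": [],
}

def parents(node):
    return [p for p, kids in dag.items() if node in kids]

def ancestors(nodes):
    seen, stack = set(nodes), list(nodes)
    while stack:
        for p in parents(stack.pop()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def d_separated(x, y, given):
    keep = ancestors({x, y} | set(given))
    # Moralize: undirected parent-child edges plus edges between co-parents.
    und = {n: set() for n in keep}
    for child in keep:
        ps = [p for p in parents(child) if p in keep]
        for p in ps:
            und[p].add(child); und[child].add(p)
        for i in range(len(ps)):
            for j in range(i + 1, len(ps)):
                und[ps[i]].add(ps[j]); und[ps[j]].add(ps[i])
    # Remove evidence nodes, then check reachability from x to y.
    blocked = set(given)
    queue, seen = deque([x]), {x}
    while queue:
        n = queue.popleft()
        if n == y:
            return False                       # a path survives -> not d-separated
        for nb in und[n]:
            if nb not in seen and nb not in blocked:
                seen.add(nb); queue.append(nb)
    return True

print(d_separated("Hungry", "Sick", []))          # True: marginally independent
print(d_separated("Hungry", "Sick", ["Cry"]))     # False: explaining away
print(d_separated("Hungry", "Sick", ["Pororo"]))  # False: a descendant of the collider is observed
```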