Study Guide: Discrete Probability Models & Undirected Graphical Models
Date: 2025.11.24
Topic: Multinomial Distribution, Maximum Likelihood Estimation (MLE), and Markov Random Fields (Undirected Graphical Models).
1. Discrete Probability Distributions
The lecture shifts focus from continuous models (like Gaussian) to discrete models, which are essential for tasks like text classification (e.g., Naive Bayes).
Binomial Distribution
- Scenario: A coin toss (Binary outcome: Head/Tail).
- Random Variables: m_1 (count of Heads), m_2 (count of Tails).
- Parameters: Probability of Head (\mu) and Tail (1 - \mu).
- Formula: For a sequence of N tosses, we count the number of ways to arrange the outcomes:
P(m_1, m_2) = \frac{N!}{m_1! m_2!} \mu^{m_1} (1-\mu)^{m_2}
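A quick numeric check of this formula, as a minimal Python sketch (the helper name binomial_pmf is just for illustration):

```python
from math import comb

def binomial_pmf(m1: int, m2: int, mu: float) -> float:
    """P(m1 heads, m2 tails) for N = m1 + m2 independent tosses."""
    n = m1 + m2
    return comb(n, m1) * mu**m1 * (1 - mu)**m2

# e.g., 3 heads and 1 tail with a fair coin: C(4, 3) * 0.5^4 = 0.25
print(binomial_pmf(3, 1, 0.5))  # 0.25
```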
Multinomial Distribution
- Scenario: Rolling a die with K faces (e.g., K = 6). This generalizes the binomial distribution.
- Definition:
  - We have N total events (trials).
  - We observe counts m_1, m_2, ..., m_K for each of the K possible outcomes.
  - Parameters \mu_1, ..., \mu_K represent the probability of each outcome.
- Probability Mass Function:
P(m_1, ..., m_K | \mu) = \frac{N!}{m_1! \cdots m_K!} \prod_{k=1}^{K} \mu_k^{m_k}
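The PMF can likewise be evaluated directly from the counts; a minimal sketch using only the standard library (the helper name multinomial_pmf is illustrative):

```python
from math import factorial, prod

def multinomial_pmf(counts, mu):
    """P(m_1, ..., m_K | mu) = N! / (m_1! ... m_K!) * prod_k mu_k^m_k."""
    n = sum(counts)
    coeff = factorial(n) / prod(factorial(m) for m in counts)
    return coeff * prod(p**m for p, m in zip(mu, counts))

# Fair six-sided die, 6 rolls, one of each face: 6! / 6^6
print(multinomial_pmf([1] * 6, [1 / 6] * 6))  # ~0.0154
```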
2. Learning: Maximum Likelihood Estimation (MLE)
How do we estimate the parameters (\mu_k) from data?
- Goal: Maximize the likelihood of the observed data subject to the constraint that the probabilities sum to 1 (\sum_k \mu_k = 1).
- Method: Lagrange Multipliers.
  - Objective: Maximize the log-likelihood L = \ln(N!) - \sum_k \ln(m_k!) + \sum_k m_k \ln(\mu_k).
  - Constraint: \sum_{k=1}^{K} \mu_k - 1 = 0.
  - Lagrangian: L' = \sum_{k=1}^{K} m_k \ln(\mu_k) + \lambda \left( \sum_{k=1}^{K} \mu_k - 1 \right). (Constant terms such as \ln(N!) vanish during differentiation.)
  - Derivation: Taking the derivative with respect to \mu_k and setting it to 0 yields \mu_k = -\frac{m_k}{\lambda}. Solving for \lambda using the constraint \sum_k \mu_k = 1 gives \lambda = -N.
- Result: \mu_k = \frac{m_k}{N}. The optimal parameter is simply the empirical fraction (the count of a specific outcome divided by the total number of events).
- This provides the theoretical justification for the simple "counting" method used in the Naive Bayes classifier discussed in previous lectures.
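In code, this MLE is literally just counting. A minimal sketch, assuming the K outcomes are encoded as integers 0..K-1 (the helper name mle_multinomial and the sample data are made up for illustration):

```python
from collections import Counter

def mle_multinomial(samples, K):
    """mu_k = m_k / N: the empirical fraction of each of the K outcomes."""
    counts = Counter(samples)
    N = len(samples)
    return [counts.get(k, 0) / N for k in range(K)]

rolls = [0, 2, 2, 5, 1, 2, 0, 5, 5, 5]   # 10 observed die faces (0..5)
print(mle_multinomial(rolls, K=6))       # [0.2, 0.1, 0.3, 0.0, 0.0, 0.4]
```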
3. Undirected Graphical Models (Markov Random Fields)
When causal relationships (direction) are unclear or interactions are symmetric (e.g., neighboring pixels in an image, social network friends), we use Undirected Graphs instead of Bayesian Networks (Directed Acyclic Graphs).
Comparison
- Directed (Bayesian Network): Uses conditional probabilities (e.g., P(A|B)). Represents causality or asymmetric relationships.
- Undirected (Markov Random Field, MRF): Uses "Potential Functions" (\psi). Represents correlation or symmetric constraints.
Conditional Independence in MRF
Determining independence is simpler in undirected graphs than in directed graphs (no D-separation rules needed).
- Global Markov Property: Two sets of nodes are conditionally independent given a separating set if all paths between them pass through the separating set.
- Example: If nodes X_1 and X_5 are not directly connected, they are conditionally independent given the intermediate nodes (e.g., X_3) that block the path.
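The global Markov property can be checked mechanically: remove the separating set and test whether any path remains. A minimal sketch using networkx (an assumed dependency; the graph and the helper separated are illustrative):

```python
import networkx as nx

# Undirected graph where x3 lies on the only path between x1 and x5
G = nx.Graph([("x1", "x2"), ("x1", "x3"), ("x3", "x5"), ("x3", "x4")])

def separated(G, A, B, S):
    """A and B are conditionally independent given S if every path between
    them passes through S, i.e. removing S disconnects A from B."""
    H = G.copy()
    H.remove_nodes_from(S)
    return not any(nx.has_path(H, a, b) for a in A for b in B)

print(separated(G, {"x1"}, {"x5"}, {"x3"}))  # True: x3 blocks the only path
print(separated(G, {"x1"}, {"x5"}, set()))   # False: the path x1-x3-x5 remains
```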
4. Factorization in Undirected Graphs
Since undirected edges give no direction for conditional probabilities (in general P(A|B) \neq P(B|A), and there is no natural ordering for a chain-rule factorization), we model the joint distribution using Cliques.
Cliques and Maximal Cliques
- Clique: A subgraph where every pair of nodes is connected (fully connected).
- Maximal Clique: A clique that cannot be expanded by including any other adjacent node.
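Maximal cliques can be enumerated with networkx (an assumed dependency); the graph below is chosen so that its maximal cliques match the decomposition example later in this section:

```python
import networkx as nx

# Graph whose maximal cliques are {x1, x2}, {x1, x3}, {x3, x4, x5}
G = nx.Graph([("x1", "x2"), ("x1", "x3"),
              ("x3", "x4"), ("x3", "x5"), ("x4", "x5")])

print(list(nx.find_cliques(G)))
# e.g. [['x1', 'x2'], ['x1', 'x3'], ['x3', 'x4', 'x5']] (order may vary)
```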
The Joint Distribution Formula
We associate a Potential Function (\psi_C) with each maximal clique C.
P(x) = \frac{1}{Z} \prod_{C} \psi_C(x_C)
- Potential Function (\psi): A non-negative function that scores the compatibility of the variables in a clique. It is not a probability (it does not need to sum to 1).
- Partition Function (Z): The normalization constant required to make the total probability sum to 1:
Z = \sum_x \prod_{C} \psi_C(x_C)
Example Decomposition
Given a graph with maximal cliques \{x_1, x_2\}, \{x_1, x_3\}, and \{x_3, x_4, x_5\}:
P(x) = \frac{1}{Z} \psi_{12}(x_1, x_2) \psi_{13}(x_1, x_3) \psi_{345}(x_3, x_4, x_5)
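A brute-force sketch of this decomposition for binary variables, with arbitrary toy potential tables (all names and values are illustrative); it makes the role of the partition function Z explicit:

```python
import itertools

# Toy potential tables: arbitrary non-negative "compatibility" scores,
# not probabilities.
psi_12 = {(a, b): 2.0 if a == b else 0.5 for a in (0, 1) for b in (0, 1)}
psi_13 = {(a, c): 1.5 if a == c else 1.0 for a in (0, 1) for c in (0, 1)}
psi_345 = {t: 3.0 if sum(t) % 2 == 0 else 0.3
           for t in itertools.product((0, 1), repeat=3)}

def unnormalized(x1, x2, x3, x4, x5):
    # Product of clique potentials: psi_12 * psi_13 * psi_345
    return psi_12[(x1, x2)] * psi_13[(x1, x3)] * psi_345[(x3, x4, x5)]

# Partition function Z: sum of the clique products over all 2^5 assignments.
Z = sum(unnormalized(*x) for x in itertools.product((0, 1), repeat=5))

def P(x1, x2, x3, x4, x5):
    return unnormalized(x1, x2, x3, x4, x5) / Z

print(P(0, 0, 0, 0, 0))
print(sum(P(*x) for x in itertools.product((0, 1), repeat=5)))  # 1.0
```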
Hammersley-Clifford Theorem
This theorem guarantees that a strictly positive distribution satisfies the conditional independence properties encoded by an undirected graph if and only if it can be factorized over the graph's (maximal) cliques.