# Study Guide: Discrete Probability Models & Undirected Graphical Models

**Date:** 2025.11.24
**Topic:** Multinomial Distribution, Maximum Likelihood Estimation (MLE), and Markov Random Fields (Undirected Graphical Models).

---

### **1. Discrete Probability Distributions**

The lecture shifts focus from continuous models (like the Gaussian) to discrete models, which are essential for tasks like text classification (e.g., Naive Bayes).

#### **Binomial Distribution**

* **Scenario:** A coin toss (binary outcome: Head/Tail).
* **Random Variables:** $m_1$ (count of Heads), $m_2$ (count of Tails).
* **Parameters:** Probability of Head ($\mu$) and Tail ($1-\mu$).
* **Formula:** For a sequence of $N$ tosses, we account for the number of ways to arrange the outcomes:
  $$P(m_1, m_2) = \frac{N!}{m_1! m_2!} \mu^{m_1} (1-\mu)^{m_2}$$

#### **Multinomial Distribution**

* **Scenario:** Rolling a die with $K$ faces (e.g., $K=6$). This generalizes the binomial distribution.
* **Definition:**
  * We have $N$ total events (trials).
  * We observe counts $m_1, m_2, \dots, m_K$ for the $K$ possible outcomes.
  * Parameters $\mu_1, \dots, \mu_K$ represent the probability of each outcome.
* **Probability Mass Function:**
  $$P(m_1, \dots, m_K \mid \mu) = \frac{N!}{m_1! \cdots m_K!} \prod_{k=1}^{K} \mu_k^{m_k}$$

---

### **2. Learning: Maximum Likelihood Estimation (MLE)**

How do we estimate the parameters ($\mu_k$) from data?

* **Goal:** Maximize the likelihood of the observed data subject to the constraint that the probabilities sum to 1 ($\sum_k \mu_k = 1$).
* **Method:** **Lagrange Multipliers**.
  1. **Objective:** Maximize the log-likelihood:
     $$L = \ln(N!) - \sum_k \ln(m_k!) + \sum_k m_k \ln(\mu_k)$$
  2. **Constraint:** $\sum_{k=1}^{K} \mu_k - 1 = 0$.
  3. **Lagrangian:**
     $$L' = \sum_{k=1}^{K} m_k \ln(\mu_k) + \lambda \left(\sum_{k=1}^{K} \mu_k - 1\right)$$
     (Note: constant terms like $\ln(N!)$ vanish during differentiation.)
  4. **Derivation:** Setting $\frac{\partial L'}{\partial \mu_k} = \frac{m_k}{\mu_k} + \lambda = 0$ yields $\mu_k = -\frac{m_k}{\lambda}$. Substituting into the constraint and using $\sum_k m_k = N$ gives $\lambda = -N$.
* **Result:**
  $$\mu_k = \frac{m_k}{N}$$
  * The optimal parameter is simply the **empirical fraction** (count of a specific outcome divided by the total number of events).
  * This provides the theoretical justification for the simple "counting" method used in the Naive Bayes classifier discussed in previous lectures.

---

### **3. Undirected Graphical Models (Markov Random Fields)**

When causal relationships (direction) are unclear or interactions are symmetric (e.g., neighboring pixels in an image, friends in a social network), we use **Undirected Graphs** instead of Bayesian Networks (Directed Acyclic Graphs).

#### **Comparison**

* **Directed (Bayesian Network):** Uses conditional probabilities (e.g., $P(A \mid B)$). Represents causality or asymmetric relationships.
* **Undirected (Markov Random Field - MRF):** Uses "potential functions" ($\psi$). Represents correlation or symmetric constraints.

#### **Conditional Independence in MRFs**

Determining independence is simpler in undirected graphs than in directed graphs (no D-separation rules needed).

* **Global Markov Property:** Two sets of nodes are conditionally independent given a separating set if all paths between them pass through the separating set.
  * *Example:* If nodes $X_1$ and $X_5$ are not directly connected, they are conditionally independent given the intermediate nodes (e.g., $X_3$) that block every path between them (see the sketch below).
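The separation criterion can be checked mechanically: delete the conditioning nodes and test whether any path between the two sets survives. Below is a minimal Python sketch, assuming a hypothetical chain graph $X_1 - X_2 - X_3 - X_4 - X_5$ (the lecture's exact graph is not given); the `is_separated` helper is an illustrative name, not part of the lecture material.

```python
from collections import deque

def is_separated(adj, set_a, set_b, sep):
    """Check the Global Markov path condition: return True if every path
    from set_a to set_b passes through the separating set `sep`.

    adj: dict mapping each node to the set of its neighbours (undirected graph).
    Strategy: delete the separating nodes, then test reachability by BFS.
    """
    blocked = set(sep)
    frontier = deque(n for n in set_a if n not in blocked)
    visited = set(frontier)
    while frontier:
        node = frontier.popleft()
        if node in set_b:
            return False  # found a path that avoids the separating set
        for nbr in adj[node]:
            if nbr not in blocked and nbr not in visited:
                visited.add(nbr)
                frontier.append(nbr)
    return True

# Hypothetical chain graph X1 - X2 - X3 - X4 - X5 (nodes labelled 1..5).
adj = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}
print(is_separated(adj, {1}, {5}, {3}))    # True:  X1 and X5 are independent given X3
print(is_separated(adj, {1}, {5}, set()))  # False: without conditioning, a path exists
```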
---

### **4. Factorization in Undirected Graphs**

Since we cannot factorize the joint distribution with a chain rule of conditional probabilities (the edges have no direction, and in general $P(A \mid B) \neq P(B \mid A)$), we model the joint distribution using **Cliques**.

#### **Cliques and Maximal Cliques**

* **Clique:** A subgraph in which every pair of nodes is connected (fully connected).
* **Maximal Clique:** A clique that cannot be expanded by including any other adjacent node.

#### **The Joint Distribution Formula**

We associate a **Potential Function** ($\psi_C$) with each maximal clique $C$:
$$P(x) = \frac{1}{Z} \prod_{C} \psi_C(x_C)$$

* **Potential Function ($\psi$):** A non-negative function that scores the compatibility of the variables in a clique. It is *not* a probability (it need not sum to 1).
* **Partition Function ($Z$):** The normalization constant required to make the total probability sum to 1:
  $$Z = \sum_x \prod_{C} \psi_C(x_C)$$

#### **Example Decomposition**

Given a graph with maximal cliques $\{x_1, x_2\}$, $\{x_1, x_3\}$, and $\{x_3, x_4, x_5\}$:
$$P(x) = \frac{1}{Z} \, \psi_{12}(x_1, x_2) \, \psi_{13}(x_1, x_3) \, \psi_{345}(x_3, x_4, x_5)$$
(A numerical sketch of this factorization is given at the end of the section.)

#### **Hammersley-Clifford Theorem**

This theorem provides the theoretical guarantee: a strictly positive distribution satisfies the conditional independence properties of an undirected graph if and only if it can be factorized over the graph's cliques.
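To make the example decomposition concrete, the sketch below defines hypothetical table potentials over binary variables (the values are arbitrary and chosen only for illustration), computes $Z$ by brute-force enumeration, and verifies that the normalized joint sums to 1.

```python
import itertools

# Hypothetical potentials over binary variables x1..x5 for the maximal cliques
# {x1, x2}, {x1, x3}, {x3, x4, x5}; values are arbitrary non-negative numbers.
def psi_12(x1, x2):
    return 2.0 if x1 == x2 else 0.5      # favours x1 and x2 agreeing

def psi_13(x1, x3):
    return 1.5 if x1 == x3 else 1.0

def psi_345(x3, x4, x5):
    return 3.0 if x3 == x4 == x5 else 1.0

def unnormalized(x1, x2, x3, x4, x5):
    """Product of the maximal-clique potentials."""
    return psi_12(x1, x2) * psi_13(x1, x3) * psi_345(x3, x4, x5)

# Partition function Z: sum of the unnormalized score over all 2^5 configurations.
states = list(itertools.product([0, 1], repeat=5))
Z = sum(unnormalized(*x) for x in states)

def joint(x1, x2, x3, x4, x5):
    """P(x) = (1/Z) * psi_12 * psi_13 * psi_345."""
    return unnormalized(x1, x2, x3, x4, x5) / Z

# The potentials do not sum to 1, but the normalized joint does.
total = sum(joint(*x) for x in states)
print(f"Z = {Z:.2f}, total probability = {total:.4f}")  # total probability = 1.0000
```

Enumerating every configuration is exponential in the number of variables; the sketch is only meant to show the role of $Z$, not an efficient way to compute it.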