# Study Guide: Discrete Probability Models & Undirected Graphical Models

**Date:** 2025.11.24
**Topic:** Multinomial Distribution, Maximum Likelihood Estimation (MLE), and Markov Random Fields (Undirected Graphical Models).

---

### **1. Discrete Probability Distributions**

The lecture shifts focus from continuous models (like the Gaussian) to discrete models, which are essential for tasks like text classification (e.g., Naive Bayes).

#### **Binomial Distribution**

* **Scenario:** A coin toss (binary outcome: Head/Tail).
* **Random Variables:** $m_1$ (count of Heads), $m_2$ (count of Tails).
* **Parameters:** Probability of Head ($\mu$) and Tail ($1-\mu$).
* **Formula:** For a sequence of $N$ tosses, we account for the number of ways to arrange the outcomes:
  $$P(m_1, m_2) = \frac{N!}{m_1! m_2!} \mu^{m_1} (1-\mu)^{m_2}$$

#### **Multinomial Distribution**

* **Scenario:** Rolling a die with $K$ faces (e.g., $K=6$). This generalizes the binomial distribution.
* **Definition:**
  * We have $N$ total events (trials).
  * We observe counts $m_1, m_2, \dots, m_K$ for the $K$ possible outcomes.
  * Parameters $\mu_1, \dots, \mu_K$ represent the probability of each outcome.
* **Probability Mass Function:**
  $$P(m_1, \dots, m_K \mid \mu) = \frac{N!}{m_1! \cdots m_K!} \prod_{k=1}^{K} \mu_k^{m_k}$$

---

### **2. Learning: Maximum Likelihood Estimation (MLE)**

How do we estimate the parameters ($\mu_k$) from data?

* **Goal:** Maximize the likelihood of the observed data subject to the constraint that the probabilities sum to 1 ($\sum_k \mu_k = 1$).
* **Method:** **Lagrange Multipliers**.
  1. **Objective:** Maximize the log-likelihood:
     $$L = \ln(N!) - \sum_k \ln(m_k!) + \sum_k m_k \ln(\mu_k)$$
  2. **Constraint:** $\sum_{k=1}^{K} \mu_k - 1 = 0$.
  3. **Lagrangian:**
     $$L' = \sum_{k=1}^{K} m_k \ln(\mu_k) + \lambda \left(\sum_{k=1}^{K} \mu_k - 1\right)$$
     (Note: constant terms like $\ln(N!)$ vanish during differentiation.)
  4. **Derivation:** Setting $\frac{\partial L'}{\partial \mu_k} = \frac{m_k}{\mu_k} + \lambda = 0$ yields $\mu_k = -\frac{m_k}{\lambda}$. Substituting into the constraint and using $\sum_k m_k = N$ gives $\lambda = -N$.
* **Result:**
  $$\mu_k = \frac{m_k}{N}$$
  * The optimal parameter is simply the **empirical fraction** (count of a specific outcome divided by the total number of events).
  * This provides the theoretical justification for the simple "counting" method used in the Naive Bayes classifier discussed in previous lectures.

---

### **3. Undirected Graphical Models (Markov Random Fields)**

When causal relationships (direction) are unclear or interactions are symmetric (e.g., neighboring pixels in an image, friends in a social network), we use **Undirected Graphs** instead of Bayesian Networks (Directed Acyclic Graphs).

#### **Comparison**

* **Directed (Bayesian Network):** Uses conditional probabilities (e.g., $P(A \mid B)$). Represents causality or asymmetric relationships.
* **Undirected (Markov Random Field - MRF):** Uses "potential functions" ($\psi$). Represents correlation or symmetric constraints.

#### **Conditional Independence in MRFs**

Determining independence is simpler in undirected graphs than in directed graphs (no D-separation rules needed).

* **Global Markov Property:** Two sets of nodes are conditionally independent given a separating set if all paths between them pass through the separating set.
  * *Example:* If nodes $X_1$ and $X_5$ are not directly connected, they are conditionally independent given the intermediate nodes (e.g., $X_3$) that block every path between them (see the sketch below).
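The separation criterion can be checked mechanically: delete the conditioning nodes and test whether any path between the two sets survives. Below is a minimal Python sketch, assuming a hypothetical chain graph $X_1 - X_2 - X_3 - X_4 - X_5$ (the lecture's exact graph is not given); the `is_separated` helper is an illustrative name, not part of the lecture material.

```python
from collections import deque

def is_separated(adj, set_a, set_b, sep):
    """Check the Global Markov path condition: return True if every path
    from set_a to set_b passes through the separating set `sep`.

    adj: dict mapping each node to the set of its neighbours (undirected graph).
    Strategy: delete the separating nodes, then test reachability by BFS.
    """
    blocked = set(sep)
    frontier = deque(n for n in set_a if n not in blocked)
    visited = set(frontier)
    while frontier:
        node = frontier.popleft()
        if node in set_b:
            return False  # found a path that avoids the separating set
        for nbr in adj[node]:
            if nbr not in blocked and nbr not in visited:
                visited.add(nbr)
                frontier.append(nbr)
    return True

# Hypothetical chain graph X1 - X2 - X3 - X4 - X5 (nodes labelled 1..5).
adj = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}
print(is_separated(adj, {1}, {5}, {3}))    # True:  X1 and X5 are independent given X3
print(is_separated(adj, {1}, {5}, set()))  # False: without conditioning, a path exists
```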
---

### **4. Factorization in Undirected Graphs**

Since we cannot factorize the joint distribution with a chain rule of conditional probabilities (the edges have no direction, and in general $P(A \mid B) \neq P(B \mid A)$), we model the joint distribution using **Cliques**.

#### **Cliques and Maximal Cliques**

* **Clique:** A subgraph in which every pair of nodes is connected (fully connected).
* **Maximal Clique:** A clique that cannot be expanded by including any other adjacent node.

#### **The Joint Distribution Formula**

We associate a **Potential Function** ($\psi_C$) with each maximal clique $C$:
$$P(x) = \frac{1}{Z} \prod_{C} \psi_C(x_C)$$

* **Potential Function ($\psi$):** A non-negative function that scores the compatibility of the variables in a clique. It is *not* a probability (it need not sum to 1).
* **Partition Function ($Z$):** The normalization constant required to make the total probability sum to 1:
  $$Z = \sum_x \prod_{C} \psi_C(x_C)$$

#### **Example Decomposition**

Given a graph with maximal cliques $\{x_1, x_2\}$, $\{x_1, x_3\}$, and $\{x_3, x_4, x_5\}$:
$$P(x) = \frac{1}{Z} \, \psi_{12}(x_1, x_2) \, \psi_{13}(x_1, x_3) \, \psi_{345}(x_3, x_4, x_5)$$
(A numerical sketch of this factorization is given at the end of the section.)

#### **Hammersley-Clifford Theorem**

This theorem provides the theoretical guarantee: a strictly positive distribution satisfies the conditional independence properties of an undirected graph if and only if it can be factorized over the graph's cliques.
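To make the example decomposition concrete, the sketch below defines hypothetical table potentials over binary variables (the values are arbitrary and chosen only for illustration), computes $Z$ by brute-force enumeration, and verifies that the normalized joint sums to 1.

```python
import itertools

# Hypothetical potentials over binary variables x1..x5 for the maximal cliques
# {x1, x2}, {x1, x3}, {x3, x4, x5}; values are arbitrary non-negative numbers.
def psi_12(x1, x2):
    return 2.0 if x1 == x2 else 0.5      # favours x1 and x2 agreeing

def psi_13(x1, x3):
    return 1.5 if x1 == x3 else 1.0

def psi_345(x3, x4, x5):
    return 3.0 if x3 == x4 == x5 else 1.0

def unnormalized(x1, x2, x3, x4, x5):
    """Product of the maximal-clique potentials."""
    return psi_12(x1, x2) * psi_13(x1, x3) * psi_345(x3, x4, x5)

# Partition function Z: sum of the unnormalized score over all 2^5 configurations.
states = list(itertools.product([0, 1], repeat=5))
Z = sum(unnormalized(*x) for x in states)

def joint(x1, x2, x3, x4, x5):
    """P(x) = (1/Z) * psi_12 * psi_13 * psi_345."""
    return unnormalized(x1, x2, x3, x4, x5) / Z

# The potentials do not sum to 1, but the normalized joint does.
total = sum(joint(*x) for x in states)
print(f"Z = {Z:.2f}, total probability = {total:.4f}")  # total probability = 1.0000
```

Enumerating every configuration is exponential in the number of variables; the sketch is only meant to show the role of $Z$, not an efficient way to compute it.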