
Commit 3d2fefb

Author: liuzi (committed)
Commit message: BN
1 parent 3d99ea7 commit 3d2fefb

File tree

5 files changed (+178, -9 lines)


_config.yml

Lines changed: 2 additions & 2 deletions

@@ -152,10 +152,10 @@ paginate: 10
baseurl: ""

# ------------ The following options are not recommended to be modified ------------------
- # math: true
+ math: true
kramdown:
footnote_backlink: "↩︎"
- # math_engine: mathjax
+ math_engine: mathjax
# input: GFM
syntax_highlighter: rouge
syntax_highlighter_opts: # Rouge Options › https://github.com/jneen/rouge#full-options

_posts/2025-02-15-generative-models.md

Lines changed: 176 additions & 7 deletions

@@ -3,7 +3,7 @@ layout: post
title: Generative Models
date: 2025-02-15 21:10 +0800
categories: [Fundamentals]
- tags: [generative models, discriminative models, bayes' theorem]
+ tags: [generative models, discriminative models, graphical models]
pin: true
math: true
mermaid: true

@@ -21,7 +21,7 @@ In discriminative modeling, the focus is on learning the conditional probability
$$p(y|{x})$$. Techniques like logistic regression exemplify this by modeling the probability of a label based on input features. Alternatively, methods such as the perceptron algorithm aim to find a decision boundary that maps new observations to specific labels $$\{0,1\}$$, such as $$0$$ for dogs and $$1$$ for elephants.

Conversely, generative modeling focuses on understanding how the data is generated by learning the joint probability distribution $$p(x,y)$$ or the likelihood
- $$p(x|{y})$$ along with the prior probability $$p(y)$$. This approach models the distribution of the input data for each class, enabling the generation of new data points and facilitating classification by applying Bayes' theorem to compute the posterior probability:
+ $$p(x|y)$$ along with the prior probability $$p(y)$$. This approach models the distribution of the input data for each class, enabling the generation of new data points and facilitating classification by applying Bayes' rule to compute the posterior probability:

$$
p(y|x) = \frac{p(x|y)p(y)}{p(x)}

@@ -214,14 +214,14 @@
The following figure shows the training set and the contours of two Gaussian distributions. These two Gaussian distributions share the same covariance matrix $$\Sigma$$, leading to the same shape and orientation of the contours. But they have different means $$\mu_0$$ and $$\mu_1$$, leading to different positions of the contours. The straight line shown in the figure is the decision boundary at which
$$p(y=1|x) = 0.5$$. Thus on the left side of the line, the model predicts $$y=0$$ and on the right side, the model predicts $$y=1$$.

![Gaussian Discriminant Analysis](/assets/img/posts/gaussian_discriminant_analysis.png){: width="450" height="300" }

**Figure 1:** Gaussian Discriminant Analysis. Image source: Section 4.1.2 on page 40 from [Stanford CS229 Notes](https://cs229.stanford.edu/main_notes.pdf).
{: .text-center .small}

### Comparison between Discriminative and Generative Models

- Apply the Bayes' theorem to the generative model GDA (Gaussian Discriminant Analysis), we have:
+ Applying Bayes' rule to the generative model GDA (Gaussian Discriminant Analysis), we have:

$$
\begin{aligned}

@@ -259,29 +259,198 @@ The GDA model and the logistic regression model produce distinct decision boundaries

- Because of these stronger assumptions, GDA excels when these assumptions accurately reflect the underlying data distribution. In contrast, logistic regression, with its more flexible and weaker assumptions, demonstrates greater robustness across diverse data distributions, provided there's sufficient training data available.

- To be more general, this comparison can be extended to all discriminative and generative models. Generative models learn the joint probability distribution p(x,y) and make stronger assumptions about the data, which is beneficial when these assumptions hold true and training data is limited. Discriminative models, on the other hand, directly learn the conditional probability
- p(y|x) without modeling the input distribution, making them more robust to misspecification of the data distribution but typically requiring larger datasets to achieve optimal performance.
+ To be more general, this comparison can be extended to all discriminative and generative models. Generative models learn the joint probability distribution $$p(x,y)$$ and make stronger assumptions about the data, which is beneficial when these assumptions hold true and training data is limited. Discriminative models, on the other hand, directly learn the conditional probability
+ $$p(y|x)$$ without modeling the input distribution, making them more robust to misspecification of the data distribution but typically requiring larger datasets to achieve optimal performance.

## From Bayesian Graphical Networks to Deep Generative Models

A key feature of generative models is their ability to learn and represent the underlying data distribution, allowing them to generate new data points. In essence, a generative model defines a probability distribution over the data, denoted as $$p(x)$$. This raises an important question: how can we efficiently model complex data distributions, especially in high-dimensional spaces where numerous random variables are involved?

For instance, when classifying grayscale images of digits (like in the MNIST dataset) with 28×28 pixels, we have 784 random variables. Even in a binarized black-and-white case, where each pixel takes only one of two values (0 or 255), there are $$2^{784}$$ possible configurations, an astronomically large number that makes direct estimation of the joint probability computationally intractable.

![Digit Picture](/assets/img/posts/grey_digit.png){: width="500" height="150" }

**Figure 2:** Example of a grayscale digit from the MNIST dataset.
{: .text-center .small}

### The Challenge of Joint Probability Estimation

How can we effectively model the joint probability distribution over all variables $$\{x_1, x_2, ..., x_n\}$$? To specify this distribution completely, we would need to estimate $$2^{n}-1$$ parameters (where $$n=784$$ in our digit example), with the last parameter determined by the constraint that probabilities must sum to 1. This exponential growth in parameters makes direct estimation practically impossible.
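
To get a feel for this scale, here is a quick back-of-the-envelope check in Python (purely illustrative arithmetic, not part of any model):

```python
# Free parameters needed to tabulate the full joint distribution over
# n binary pixels: one probability per configuration, minus one because
# the probabilities must sum to 1.
n = 28 * 28                  # 784 pixels in a binarized digit image
num_params = 2 ** n - 1

print(len(str(num_params)))  # 237 -> the parameter count has 237 decimal digits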

The solution lies in factorizing the joint probability distribution into a product of simpler conditional probability distributions (CPDs) that require fewer parameters to estimate. Let's explore several approaches to factorization:

### Factorization Approaches for Joint Distributions

#### 1. Independence Assumption

The simplest factorization assumes all variables are statistically independent:

$$
p(x_1, x_2, ..., x_n) = p(x_1)p(x_2) \ldots p(x_n)
$$

This reduces the parameter count from $$2^n-1$$ to just $$n$$, as each $$p(x_i)$$ requires only one parameter for binary variables. However, this assumption is rarely valid for structured data like images, where pixels exhibit strong spatial dependencies. Models built on this assumption fail to capture the coherent shape of digits, as illustrated below:

![IID Digit Picture](/assets/img/posts/iid_digit.png){: width="350" height="150" }

**Figure 3:** Digits generated under the pixel-independence assumption; note the lack of structure.
{: .text-center .small}
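
To see the independence assumption in action, here is a minimal sketch of such a fully factorized model: it fits one Bernoulli parameter per pixel and then samples new "digits" pixel by pixel. The random array stands in for a real binarized MNIST training set, so the data and variable names are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a binarized MNIST training set: 1000 images, 28x28 pixels in {0, 1}.
# Swap in real data to reproduce the effect shown in Figure 3.
train = rng.integers(0, 2, size=(1000, 28 * 28))

# Independence assumption: p(x_1, ..., x_n) = prod_i p(x_i), so the model
# only needs one Bernoulli parameter theta_i = p(x_i = 1) per pixel.
theta = train.mean(axis=0)          # n = 784 parameters in total

# Generating an image amounts to sampling every pixel independently.
sample = (rng.random(theta.shape) < theta).astype(int).reshape(28, 28)
print(sample.shape)                 # (28, 28), but structureless noise
```

Because real pixels are far from independent, samples drawn this way look like the noise in Figure 3 no matter how much data is used to fit the per-pixel parameters.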

#### 2. Chain Rule Factorization

The chain rule of probability offers a mathematically exact factorization:

$$
p(x_1, x_2, ..., x_n) = p(x_1)p(x_2|x_1)p(x_3|x_1,x_2)...p(x_n|x_1,x_2,...,x_{n-1})
$$

While this factorization is exact, it doesn't reduce computational complexity: for binary variables, the factor $$p(x_i|x_1,...,x_{i-1})$$ needs one parameter for each of the $$2^{i-1}$$ configurations of its conditioning variables, so the parameter count remains $$2^0+2^1+2^2+...+2^{n-1}=2^{n}-1$$, which is still exponential in the number of variables.
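
A quick numerical check of this geometric series, for a small $$n$$ where the exact numbers are easy to inspect:

```python
n = 10  # small n so the numbers stay readable
chain_rule_params = sum(2 ** (i - 1) for i in range(1, n + 1))  # 1 + 2 + ... + 2^(n-1)
assert chain_rule_params == 2 ** n - 1
print(chain_rule_params)  # 1023, exactly as many as the full joint table
```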

#### 3. Markov Chain Assumption

If we assume each variable depends only on its immediate predecessor (first-order Markov property:
$$X_{i+1} \perp X_1,...,X_{i-1} | X_i$$), we can simplify the chain-rule factorization to:

$$
\begin{align*}
p(x_1, \ldots, x_n) &= p(x_1)p(x_2|x_1)p(x_3|x_1,x_2) \ldots p(x_n|x_1,\ldots,x_{n-1})\\
&= p(x_1)p(x_2|x_1)p(x_3|x_2)\ldots p(x_n|x_{n-1})
\end{align*}
$$

Here we need 1 parameter for $$p(x_1)$$ and 2 parameters for each of the remaining factors $$p(x_i|x_{i-1})$$ (one for each value of $$x_{i-1}$$), reducing the parameter count to $$1+2(n-1)=2n-1$$, which scales linearly with the number of variables. This dramatic reduction demonstrates how conditional independence assumptions can make high-dimensional probability modeling tractable.
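
A minimal sketch of what this buys us in practice; all probabilities below are made up, chosen only to show the mechanics of ancestral sampling and the $$2n-1$$ parameter count:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 28 * 28

# Parameters of the first-order Markov factorization
# p(x_1) p(x_2|x_1) ... p(x_n|x_{n-1}) over binary pixels (values are arbitrary).
p_x1 = 0.3                             # 1 parameter: p(x_1 = 1)
p_next = rng.uniform(size=(n - 1, 2))  # 2 parameters per remaining factor:
                                       # row k holds p(next pixel = 1 | current pixel = 0 or 1)
total_params = 1 + 2 * (n - 1)         # = 2n - 1 = 1567, instead of 2^784 - 1

# Ancestral sampling: draw x_1 from p(x_1), then each pixel given its predecessor.
x = np.empty(n, dtype=int)
x[0] = rng.random() < p_x1
for i in range(1, n):
    x[i] = rng.random() < p_next[i - 1, x[i - 1]]

print(total_params, x[:10])
```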

### Bayesian Networks: Graphical Representation of Probabilistic Dependencies

The Markov chain assumption represents a special case of a more general framework called **Bayesian networks** (also known as **belief networks** or **directed graphical models**). Bayesian networks provide a powerful visual and mathematical framework for representing complex probabilistic relationships among multiple random variables.

A Bayesian network is a directed acyclic graph (DAG) that encodes the joint probability distribution of a collection of random variables. In this framework, we denote the graph as $$G=(V,E)$$, where:
- $$V$$ is the set of nodes, with each node corresponding to a random variable, denoted as $$x_i$$ for convenience.
- $$E$$ represents the set of directed edges, which indicate conditional dependencies between the random variables.

The fundamental principle of Bayesian networks is that the joint probability distribution over all variables $$\{x_1, x_2, ..., x_n\}$$ can be factorized into a product of conditional probability distributions (CPDs) based on the graph structure:

$$
p(x_1, x_2, ..., x_n) = \prod_{i=1}^n p(x_i | \text{pa}(x_i))
$$

where $$\text{pa}(x_i)$$ is the set of parents of $$x_i$$ in the graph $$G$$. This factorization leverages conditional independence assumptions encoded by the graph structure, potentially reducing the number of parameters needed to specify the distribution from exponential to linear or polynomial in the number of variables.

#### An Illustrative Example

Consider the following Bayesian network that models factors affecting a student's academic outcomes:

```mermaid
graph TB
I(Intelligence) --> G(Grade)
I --> S(SAT)
G --> L(Letter)
D(Difficulty) --> G
```

In this example:
- Intelligence (I) influences both Grade (G) and SAT score (S)
- Course Difficulty (D) influences Grade (G)
- Grade (G) influences the recommendation Letter (L)

Applying the general chain rule of probability, we could factorize the joint probability distribution as:

$$
p(d,i,g,s,l) = p(d)p(i|d)p(g|i,d)p(s|i,d,g)p(l|i,d,g,s)
$$

However, the Bayesian network structure encodes specific conditional independence assumptions that allow us to simplify this factorization to:

$$
p(d,i,g,s,l) = p(d)p(i)p(g|i,d)p(s|i)p(l|g)
$$

These simplifications reflect the following conditional independence assumptions:
$$I \perp D$$, $$S \perp \{D,G\} | I$$, and $$L \perp \{I,D,S\} | G$$. This example demonstrates how Bayesian networks can dramatically reduce the complexity of modeling joint distributions by encoding domain knowledge through graphical structure.
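
To make the factorization concrete, the sketch below tabulates CPDs for this network and evaluates the joint probability of one complete assignment. Every number is made up for illustration (binary $$D$$, $$I$$, $$S$$, $$L$$ and a three-valued grade $$G$$); the point is only that five small tables replace one big joint table.

```python
# Made-up CPDs for the student network; d, i, s, l in {0, 1}, g in {0, 1, 2}.
p_d = {0: 0.6, 1: 0.4}                                  # p(D)
p_i = {0: 0.7, 1: 0.3}                                  # p(I)
p_s_given_i = {0: {0: 0.95, 1: 0.05},                   # p(S | I)
               1: {0: 0.20, 1: 0.80}}
p_g_given_id = {(0, 0): {0: 0.30, 1: 0.40, 2: 0.30},    # p(G | I, D)
                (0, 1): {0: 0.05, 1: 0.25, 2: 0.70},
                (1, 0): {0: 0.90, 1: 0.08, 2: 0.02},
                (1, 1): {0: 0.50, 1: 0.30, 2: 0.20}}
p_l_given_g = {0: {0: 0.90, 1: 0.10},                   # p(L | G)
               1: {0: 0.40, 1: 0.60},
               2: {0: 0.01, 1: 0.99}}

def joint(d, i, g, s, l):
    """p(d, i, g, s, l) = p(d) p(i) p(g|i,d) p(s|i) p(l|g)."""
    return (p_d[d] * p_i[i] * p_g_given_id[(i, d)][g]
            * p_s_given_i[i][s] * p_l_given_g[g][l])

print(joint(d=0, i=1, g=2, s=1, l=1))
```

With these cardinalities, the factorized model needs $$1 + 1 + 4\times 2 + 2\times 1 + 3\times 1 = 15$$ free parameters, versus $$2\cdot 2\cdot 3\cdot 2\cdot 2 - 1 = 47$$ for the full joint table.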

#### Generative vs. Discriminative Bayesian Networks

Returning to our grayscale digit classification example, let's examine how Bayesian networks can represent both generative and discriminative approaches.

In a **generative model** with the naive Bayes assumption, we structure our network with the label variable $$y$$ as the parent of all feature variables:

```mermaid
graph TD
Y((Y)) --> X1((X₁))
Y --> X2((X₂))
Y --> X3((...))
Y --> Xn((Xₙ))
```

This structure encodes the assumption that all features are conditionally independent given the class label, resulting in the factorization:

$$
p(y,x_1,x_2,...,x_n) = p(y) \prod_{i=1}^n p(x_i|y)
$$

In the generative approach, we explicitly model both $$p(y)$$ (class prior) and
$$p(\mathbf{x}|y)$$ (conditional density for continuous features or conditional probability for discrete features) to represent $$p(\mathbf{x},y)$$.
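
As a concrete (if toy) instance of this generative factorization, here is a minimal Bernoulli naive Bayes classifier: it estimates $$p(y)$$ and every $$p(x_i|y)$$ by counting, then scores a new input with Bayes' rule in log space. The random arrays stand in for binarized digit images and labels, purely to keep the sketch self-contained:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data: 500 binarized 28x28 "images" with labels in {0, ..., 9}.
X = rng.integers(0, 2, size=(500, 784))
y = rng.integers(0, 10, size=500)
num_classes = 10

# Generative parameters of the naive Bayes factorization
# p(y, x) = p(y) * prod_i p(x_i | y), estimated with Laplace smoothing.
prior = np.array([(y == c).mean() for c in range(num_classes)])        # p(y = c)
theta = np.array([(X[y == c].sum(axis=0) + 1) / ((y == c).sum() + 2)   # p(x_i = 1 | y = c)
                  for c in range(num_classes)])

def predict(x):
    # log p(y = c | x) ∝ log p(y = c) + sum_i log p(x_i | y = c)   (Bayes' rule)
    log_post = (np.log(prior)
                + x @ np.log(theta).T
                + (1 - x) @ np.log(1 - theta).T)
    return int(np.argmax(log_post))

print(predict(X[0]))
```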

Alternatively, the **discriminative approach** directly models
$$p(y|\mathbf{x})$$ without explicitly representing how the features themselves are generated. Using the chain rule, $$p(\mathbf{x},y) = p(y)p(\mathbf{x}|y) = p(\mathbf{x})p(y|\mathbf{x})$$, we can see that for classification purposes we only need to learn $$p(y|\mathbf{x})$$: the inputs $$\mathbf{x}$$ are always observed, so $$p(\mathbf{x})$$ never has to be modeled.

Maintaining all dependencies in a generative model by applying the chain rule would give us:

$$
p(y,x_1,x_2,...,x_n) = p(y)p(x_1|y)p(x_2|y,x_1)p(x_3|y,x_1,x_2)...p(x_n|y,x_1,x_2,...,x_{n-1})
$$

For the discriminative approach, the corresponding factorization is:

$$
p(y,x_1,x_2,...,x_n) = p(x_1)p(x_2|x_1)p(x_3|x_1,x_2)...p(x_n|x_1,x_2,...,x_{n-1})p(y|x_1,x_2,...,x_n)
$$

![Generative vs Discriminative Models](/assets/img/posts/bn_gen_dis.png){: width="500" height="250"}

**Figure 4:** Visual representation of generative models (left) vs discriminative models (right). In generative models, we model how Y generates X values, while in discriminative models, we model how X values determine Y. Image source: [Stanford CS236 Lecture 2](https://deepgenerativemodels.github.io/assets/slides/cs236_lecture2.pdf).
{: .text-center .small}
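
For contrast, here is an equally minimal discriminative counterpart: a logistic regression that parameterizes $$p(y=1|\mathbf{x})$$ directly and never models $$p(\mathbf{x})$$. Again, the random binary data is only a stand-in (e.g. for a two-class digit problem):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in binary classification data (e.g. digit "0" vs digit "1").
X = rng.integers(0, 2, size=(500, 784)).astype(float)
y = rng.integers(0, 2, size=500).astype(float)

w = np.zeros(784)
b = 0.0
lr = 0.1

# A few steps of gradient ascent on the conditional log-likelihood
# sum_j log p(y_j | x_j); note that p(x) appears nowhere.
for _ in range(100):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # p(y = 1 | x) for every example
    grad_w = X.T @ (y - p) / len(y)
    grad_b = (y - p).mean()
    w += lr * grad_w
    b += lr * grad_b

print(float(p.mean()))   # average predicted p(y = 1 | x) after training
```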

## TO FINISH

<!--
#### Learning and Inference in Bayesian Networks -need to be verified

In practice, Bayesian networks require two key operations:
1. **Learning** - Estimating the structure of the network and/or the parameters of the conditional probability distributions
2. **Inference** - Computing probabilities of interest given observed evidence

Parameter learning can be done through maximum likelihood estimation or Bayesian approaches, while structure learning often involves scoring different network structures based on how well they fit the data while maintaining simplicity.

For inference, exact methods like variable elimination and belief propagation work well for small networks, while approximate methods like Markov Chain Monte Carlo (MCMC) sampling are used for more complex networks.

#### Connection to Modern Deep Generative Models

The conditional independence assumptions encoded in Bayesian networks provide the theoretical foundation for many modern deep generative models. Neural networks now allow us to represent complex conditional probability distributions that would be intractable with traditional parametric approaches.

For example, autoregressive models like PixelCNN and WaveNet implement the chain rule factorization using neural networks to model each conditional distribution $$p(x_i|x_1,...,x_{i-1})$$. Similarly, variational autoencoders (VAEs) can be interpreted as Bayesian networks with continuous latent variables and neural network-parameterized conditional distributions.

## Modern Deep Generative Models TODO

Modern deep generative models extend these ideas using neural networks to learn complex conditional probability distributions. These include:

1. **Autoregressive Models** - Model the joint distribution as a product of conditionals, often using recurrent or masked architectures
2. **Generative Adversarial Networks (GANs)** - Use an adversarial process to generate highly realistic samples
3. **Diffusion Models** - Build a Markov chain that gradually adds noise to the data until it becomes a simple distribution
4. **Normalizing Flows** - Transform simple distributions into complex ones through invertible transformations
5. **Variational Autoencoders (VAEs)** - Learn a latent space representation that captures the data distribution

These approaches maintain the spirit of Bayesian networks but leverage the expressive power of neural networks to model complex dependencies in high-dimensional data. -->

## References

[1] Ng, Andrew. "[CS229: Machine Learning Course Notes](https://cs229.stanford.edu/main_notes.pdf)". Stanford University, 2018. \\
[2] Salton, Gerard & McGill, Michael J. "[Introduction to Modern Information Retrieval](https://archive.org/details/introductiontomo00salt)". McGraw-Hill, 1983. \\
[3] Ermon, Stefano, et al. "[CS236 Deep Generative Models Module](https://deepgenerativemodels.github.io/)". Stanford University, Fall 2023.

assets/img/posts/bn_gen_dis.png (54.6 KB)

assets/img/posts/grey_digit.png (12.1 KB)

assets/img/posts/iid_digit.png (6.31 KB)
