Commit dcea80b

Author: liuzi (committed)
Commit message: generative models-eg2
1 parent ac52ef4 commit dcea80b

1 file changed: +42 -6 lines changed


_posts/2025-02-15-generative-models.md

Lines changed: 42 additions & 6 deletions
@@ -14,11 +14,11 @@ mermaid: true

When discussing **generative models**, it's essential to understand how machine learning approaches tasks. Consider a scenario where we aim to distinguish between elephants and dogs; there are primarily two modeling approaches, discriminative and generative:

-1. **Discriminative Modeling:** This approach involves building a model that directly predicts classification labels or identifies the decision boundary between elephants and dogs.
-2. **Generative Modeling:** This approach entails constructing separate models for elephants and dogs, capturing their respective characteristics. A new animal is then compared against each model to determine which it resembles more closely.
+1. **Discriminative Modeling:** This approach involves building a model that directly predicts **classification labels** or identifies the **decision boundary** between elephants and dogs.
+2. **Generative Modeling:** This approach entails constructing separate models for elephants and dogs, capturing their **respective characteristics**. A new animal is then compared against each model to determine which it resembles more closely.

In discriminative modeling, the focus is on learning the conditional probability of labels given the input data, denoted as
-$$ p(y|{x}) $$. Techniques like logistic regression exemplify this by modeling the probability of a label based on input features. Alternatively, methods such as the perceptron algorithm aim to find a decision boundary that maps new observations to specific labels $$\{0,1\}$$, such as $$0$$ for dogs and $$1$$ for elephants.
+$$p(y|{x})$$. Techniques like logistic regression exemplify this by modeling the probability of a label based on input features. Alternatively, methods such as the perceptron algorithm aim to find a decision boundary that maps new observations to specific labels $$\{0,1\}$$, such as $$0$$ for dogs and $$1$$ for elephants.

Conversely, generative modeling focuses on understanding how the data is generated by learning the joint probability distribution $$p(x,y)$$ or the likelihood
$$p(x|{y})$$ along with the prior probability $$p(y)$$. This approach models the distribution of the input data for each class, enabling the generation of new data points and facilitating classification by applying Bayes' theorem to compute the posterior probability:
@@ -37,7 +37,7 @@ p(x) &= \sum_{y} p(x,y) \\
\end{aligned}
$$

-Actually $$p(x)$$ acts as a normalization constant as it does not depend on the label $$y$$. To be more specific, $$p(x)$$ does not change no matter how $$y$$ varies. So when calculating
+Actually $$p(x)$$ acts as a **normalization constant** as it does not depend on the label $$y$$. To be more specific, $$p(x)$$ does not change no matter how $$y$$ varies. So when calculating
$$p(y|x)$$, we do not need to compute $$p(x)$$:

$$
@@ -144,7 +144,7 @@ $$
\end{aligned}
$$

-Therefore gives us the stochastic gradient ascent rule, where $$(y^{(i)}-h_{\theta}(x^{(i)}))x_j^{(i)}$$ is the gradient of the loss function with respect to the $$i$$-th training example:
+This gives us the stochastic gradient ascent rule, where $$(y^{(i)}-h_{\theta}(x^{(i)}))x_j^{(i)}$$ is the gradient of the log-likelihood on the $$i$$-th training example with respect to $$\theta_j$$:

$$
\theta_j := \theta_j + \alpha(y^{(i)}-h_{\theta}(x^{(i)}))x_j^{(i)}
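As an aside, here is roughly how this update looks in code. This is a minimal NumPy sketch, not taken from the post; the names `sga_epoch`, `sigmoid`, `theta`, `X`, `y`, and `alpha` are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sga_epoch(theta, X, y, alpha=0.1):
    """One pass of stochastic gradient ascent on the logistic log-likelihood.

    theta: (n,) parameters; X: (m, n) feature matrix; y: (m,) labels in {0, 1}.
    """
    for i in range(X.shape[0]):
        h = sigmoid(X[i] @ theta)                  # h_theta(x^(i))
        theta = theta + alpha * (y[i] - h) * X[i]  # theta_j += alpha * (y^(i) - h) * x_j^(i) for every j
    return theta
```

The update is ascent (a `+` sign) because the objective being maximized is the log-likelihood, not a loss being minimized.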
@@ -155,6 +155,41 @@ $$

### Example 2: Gaussian Discriminant Analysis as a Generative Model

+Let's say the feature vector $$x$$ of an email is built using TF-IDF[[2]](#references), which measures the importance of words in the email. TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical statistic that reflects the importance of a word in a document relative to a collection of documents (a corpus). It is calculated as:
+
+$$
+\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)
+$$
+
+where:
+
+$$
+\begin{aligned}
+\text{TF}(t, d) &= \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d} \\
+\text{IDF}(t) &= \log\left(\frac{\text{Total number of documents}}{\text{Number of documents containing term } t}\right)
+\end{aligned}
+$$
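For intuition, the formula above can be computed in a few lines. A minimal Python sketch, assuming documents are already tokenized into word lists; the function name `tf_idf` and the toy corpus are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(term, doc, corpus):
    """TF-IDF of `term` in `doc`; `doc` is a list of tokens, `corpus` a list of such lists."""
    tf = Counter(doc)[term] / len(doc)               # term frequency within this document
    df = sum(1 for d in corpus if term in d)         # number of documents containing the term
    idf = math.log(len(corpus) / df) if df else 0.0  # inverse document frequency (natural log)
    return tf * idf

# Toy corpus of three "emails", already tokenized.
corpus = [["cheap", "pills", "cheap"], ["meeting", "at", "noon"], ["cheap", "flights"]]
print(tf_idf("cheap", corpus[0], corpus))  # ~0.27: frequent in this doc, appears in 2 of 3 docs
```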
+
+As TF-IDF is continuous, we can model
+$$p(x|y)$$ as a multivariate normal distribution. Then the model can be represented as:
+
+$$
+\begin{aligned}
+y &\sim \text{Bernoulli}(\phi) \\
+x|y=0 &\sim \mathcal{N}(\mu_0, \Sigma) \\
+x|y=1 &\sim \mathcal{N}(\mu_1, \Sigma)
+\end{aligned}
+$$
+
+Writing out the distributions, we have:
+
+$$
+\begin{aligned}
+p(y) &= \phi^y(1-\phi)^{1-y} \\
+p(x|y=0) &= \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x-\mu_0)^T\Sigma^{-1}(x-\mu_0)\right) \\
+p(x|y=1) &= \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x-\mu_1)^T\Sigma^{-1}(x-\mu_1)\right)
+\end{aligned}
+$$
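To make the fitting and classification steps concrete, here is a rough NumPy sketch of GDA, not taken from the post. It uses the standard closed-form maximum-likelihood estimates for $$\phi$$, $$\mu_0$$, $$\mu_1$$, and $$\Sigma$$ (not derived in this excerpt), and the helper names `fit_gda`, `log_gaussian`, and `predict` are illustrative.

```python
import numpy as np

def fit_gda(X, y):
    """Closed-form maximum-likelihood estimates for phi, mu_0, mu_1 and the shared Sigma."""
    phi = y.mean()
    mu0 = X[y == 0].mean(axis=0)
    mu1 = X[y == 1].mean(axis=0)
    centered = X - np.where(y[:, None] == 1, mu1, mu0)  # subtract each point's class mean
    Sigma = centered.T @ centered / len(y)
    return phi, mu0, mu1, Sigma

def log_gaussian(x, mu, Sigma):
    """log N(x; mu, Sigma) for a single point x."""
    d = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (len(mu) * np.log(2 * np.pi) + logdet + d @ np.linalg.solve(Sigma, d))

def predict(x, phi, mu0, mu1, Sigma):
    """Pick the class maximizing log p(x|y) + log p(y); p(x) is never needed."""
    score0 = log_gaussian(x, mu0, Sigma) + np.log(1 - phi)
    score1 = log_gaussian(x, mu1, Sigma) + np.log(phi)
    return int(score1 > score0)
```

As with the normalization-constant discussion above, classification only compares the two unnormalized scores $$p(x|y)p(y)$$, so the marginal $$p(x)$$ is never computed.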
@@ -170,6 +205,7 @@ $$

## References

-[1] Ng, Andrew. "CS229: Machine Learning Course Notes." Stanford University, 2018.
+[1] Ng, Andrew. "[CS229: Machine Learning Course Notes](https://cs229.stanford.edu/main_notes.pdf)." Stanford University, 2018.
+[2] Manning, Christopher D., et al. "[Introduction to Information Retrieval](https://nlp.stanford.edu/IR-book/pdf/irbookprint.pdf)." Stanford University, 2009.
