Commit dcea80b

Author: liuzi (committed)
Commit message: generative models-eg2
1 parent ac52ef4 commit dcea80b

1 file changed: +42 -6 lines changed


_posts/2025-02-15-generative-models.md

Lines changed: 42 additions & 6 deletions
@@ -14,11 +14,11 @@ mermaid: true

When discussing **generative models**, it's essential to understand how machine learning approaches tasks. Consider a scenario where we aim to distinguish between elephants and dogs; there are primarily two modeling approaches, discriminative and generative:

-1. **Discriminative Modeling:** This approach involves building a model that directly predicts classification labels or identifies the decision boundary between elephants and dogs.
-2. **Generative Modeling:** This approach entails constructing separate models for elephants and dogs, capturing their respective characteristics. A new animal is then compared against each model to determine which it resembles more closely.
+1. **Discriminative Modeling:** This approach involves building a model that directly predicts **classification labels** or identifies the **decision boundary** between elephants and dogs.
+2. **Generative Modeling:** This approach entails constructing separate models for elephants and dogs, capturing their **respective characteristics**. A new animal is then compared against each model to determine which it resembles more closely.

In discriminative modeling, the focus is on learning the conditional probability of labels given the input data, denoted as
-$$ p(y|{x}) $$. Techniques like logistic regression exemplify this by modeling the probability of a label based on input features. Alternatively, methods such as the perceptron algorithm aim to find a decision boundary that maps new observations to specific labels $$\{0,1\}$$, such as $$0$$ for dogs and $$1$$ for elephants.
+$$p(y|{x})$$. Techniques like logistic regression exemplify this by modeling the probability of a label based on input features. Alternatively, methods such as the perceptron algorithm aim to find a decision boundary that maps new observations to specific labels $$\{0,1\}$$, such as $$0$$ for dogs and $$1$$ for elephants.

Conversely, generative modeling focuses on understanding how the data is generated by learning the joint probability distribution $$p(x,y)$$ or the likelihood
$$p(x|{y})$$ along with the prior probability $$p(y)$$. This approach models the distribution of the input data for each class, enabling the generation of new data points and facilitating classification by applying Bayes' theorem to compute the posterior probability:
@@ -37,7 +37,7 @@ p(x) &= \sum_{y} p(x,y) \\
\end{aligned}
$$

-Actually $$p(x)$$ acts as a normalization constant as it does not depend on the label $$y$$. To be more specific, $$p(x)$$ does not change no matter how $$y$$ varies. So when calculating
+Actually $$p(x)$$ acts as a **normalization constant** as it does not depend on the label $$y$$. To be more specific, $$p(x)$$ does not change no matter how $$y$$ varies. So when calculating
$$p(y|x)$$, we do not need to compute $$p(x)$$:

$$
@@ -144,7 +144,7 @@ $$
\end{aligned}
$$

-Therefore gives us the stochastic gradient ascent rule, where $$(y^{(i)}-h_{\theta}(x^{(i)}))x_j^{(i)}$$ is the gradient of the loss function with respect to the $$i$$-th training example:
+This gives us the stochastic gradient ascent rule, where $$(y^{(i)}-h_{\theta}(x^{(i)}))x_j^{(i)}$$ is the gradient of the log-likelihood on the $$i$$-th training example with respect to $$\theta_j$$:

$$
\theta_j := \theta_j + \alpha(y^{(i)}-h_{\theta}(x^{(i)}))x_j^{(i)}
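As an aside, here is roughly how this update looks in code. This is a minimal NumPy sketch, not taken from the post; the names `sga_epoch`, `sigmoid`, `theta`, `X`, `y`, and `alpha` are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sga_epoch(theta, X, y, alpha=0.1):
    """One pass of stochastic gradient ascent on the logistic log-likelihood.

    theta: (n,) parameters; X: (m, n) feature matrix; y: (m,) labels in {0, 1}.
    """
    for i in range(X.shape[0]):
        h = sigmoid(X[i] @ theta)                  # h_theta(x^(i))
        theta = theta + alpha * (y[i] - h) * X[i]  # theta_j += alpha * (y^(i) - h) * x_j^(i) for every j
    return theta
```

The update is ascent (a `+` sign) because the objective being maximized is the log-likelihood, not a loss being minimized.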
@@ -155,6 +155,41 @@ $$

### Example 2: Gaussian Discriminant Analysis as a Generative Model

+Let's say the feature vector $$x$$ of an email is built using TF-IDF[[2]](#references), which measures the importance of words in the email. TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical statistic that reflects the importance of a word in a document relative to a collection of documents (a corpus). It is calculated as:
+
+$$
+\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)
+$$
+
+where:
+
+$$
+\begin{aligned}
+\text{TF}(t, d) &= \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d} \\
+\text{IDF}(t) &= \log\left(\frac{\text{Total number of documents}}{\text{Number of documents containing term } t}\right)
+\end{aligned}
+$$
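For intuition, the formula above can be computed in a few lines. A minimal Python sketch, assuming documents are already tokenized into word lists; the function name `tf_idf` and the toy corpus are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(term, doc, corpus):
    """TF-IDF of `term` in `doc`; `doc` is a list of tokens, `corpus` a list of such lists."""
    tf = Counter(doc)[term] / len(doc)               # term frequency within this document
    df = sum(1 for d in corpus if term in d)         # number of documents containing the term
    idf = math.log(len(corpus) / df) if df else 0.0  # inverse document frequency (natural log)
    return tf * idf

# Toy corpus of three "emails", already tokenized.
corpus = [["cheap", "pills", "cheap"], ["meeting", "at", "noon"], ["cheap", "flights"]]
print(tf_idf("cheap", corpus[0], corpus))  # ~0.27: frequent in this doc, appears in 2 of 3 docs
```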
+
+As TF-IDF is continuous, we can model
+$$p(x|y)$$ as a multivariate normal distribution. Then the model can be represented as:
+
+$$
+\begin{aligned}
+y &\sim \text{Bernoulli}(\phi) \\
+x|y=0 &\sim \mathcal{N}(\mu_0, \Sigma) \\
+x|y=1 &\sim \mathcal{N}(\mu_1, \Sigma)
+\end{aligned}
+$$
+
+Writing out the distributions, we have:
+
+$$
+\begin{aligned}
+p(y) &= \phi^y(1-\phi)^{1-y} \\
+p(x|y=0) &= \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x-\mu_0)^T\Sigma^{-1}(x-\mu_0)\right) \\
+p(x|y=1) &= \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x-\mu_1)^T\Sigma^{-1}(x-\mu_1)\right)
+\end{aligned}
+$$
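To make the fitting and classification steps concrete, here is a rough NumPy sketch of GDA, not taken from the post. It uses the standard closed-form maximum-likelihood estimates for $$\phi$$, $$\mu_0$$, $$\mu_1$$, and $$\Sigma$$ (not derived in this excerpt), and the helper names `fit_gda`, `log_gaussian`, and `predict` are illustrative.

```python
import numpy as np

def fit_gda(X, y):
    """Closed-form maximum-likelihood estimates for phi, mu_0, mu_1 and the shared Sigma."""
    phi = y.mean()
    mu0 = X[y == 0].mean(axis=0)
    mu1 = X[y == 1].mean(axis=0)
    centered = X - np.where(y[:, None] == 1, mu1, mu0)  # subtract each point's class mean
    Sigma = centered.T @ centered / len(y)
    return phi, mu0, mu1, Sigma

def log_gaussian(x, mu, Sigma):
    """log N(x; mu, Sigma) for a single point x."""
    d = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (len(mu) * np.log(2 * np.pi) + logdet + d @ np.linalg.solve(Sigma, d))

def predict(x, phi, mu0, mu1, Sigma):
    """Pick the class maximizing log p(x|y) + log p(y); p(x) is never needed."""
    score0 = log_gaussian(x, mu0, Sigma) + np.log(1 - phi)
    score1 = log_gaussian(x, mu1, Sigma) + np.log(phi)
    return int(score1 > score0)
```

As with the normalization-constant discussion above, classification only compares the two unnormalized scores $$p(x|y)p(y)$$, so the marginal $$p(x)$$ is never computed.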
@@ -170,6 +205,7 @@ $$

## References

-[1] Ng, Andrew. "CS229: Machine Learning Course Notes." Stanford University, 2018.
+[1] Ng, Andrew. "[CS229: Machine Learning Course Notes](https://cs229.stanford.edu/main_notes.pdf)." Stanford University, 2018.
+[2] Manning, Christopher D., et al. "[Introduction to Information Retrieval](https://nlp.stanford.edu/IR-book/pdf/irbookprint.pdf)." Stanford University, 2009.
