diff --git a/07_RegressionModels/01_01_introduction/index.Rmd b/07_RegressionModels/01_01_introduction/index.Rmd
index 1fadce87c..357f819f3 100644
--- a/07_RegressionModels/01_01_introduction/index.Rmd
+++ b/07_RegressionModels/01_01_introduction/index.Rmd
@@ -1,211 +1,213 @@
----
-title : Introduction to regression
-subtitle : Regression
-author : Brian Caffo, Jeff Leek and Roger Peng
-job : Johns Hopkins Bloomberg School of Public Health
-logo : bloomberg_shield.png
-framework : io2012 # {io2012, html5slides, shower, dzslides, ...}
-highlighter : highlight.js # {highlight.js, prettify, highlight}
-hitheme : tomorrow #
-url:
- lib: ../../libraries
- assets: ../../assets
-widgets : [mathjax] # {mathjax, quiz, bootstrap}
-mode : selfcontained # {standalone, draft}
----
-## A famous motivating example
-
-```{r setup, cache = F, echo = F, message = F, warning = F, tidy = F, results='hide'}
-# make this an external chunk that can be included in any file
-options(width = 100)
-opts_chunk$set(message = F, error = F, warning = F, comment = NA, fig.align = 'center', dpi = 100, tidy = F, cache.path = '.cache/', fig.path = 'fig/')
-
-options(xtable.type = 'html')
-knit_hooks$set(inline = function(x) {
- if(is.numeric(x)) {
- round(x, getOption('digits'))
- } else {
- paste(as.character(x), collapse = ', ')
- }
-})
-knit_hooks$set(plot = knitr:::hook_plot_html)
-runif(1)
-```
-
-
-
-### (Perhaps surprisingly, this example is still relevant)
-
-
-
-[http://www.nature.com/ejhg/journal/v17/n8/full/ejhg20095a.html](http://www.nature.com/ejhg/journal/v17/n8/full/ejhg20095a.html)
-
-[Predicting height: the Victorian approach beats modern genomics](http://www.wired.com/wiredscience/2009/03/predicting-height-the-victorian-approach-beats-modern-genomics/)
-
----
-## Questions for this class
-* Consider trying to answer the following kinds of questions:
- * To use the parents' heights to predict childrens' heights.
- * To try to find a parsimonious, easily described mean
- relationship between parent and children's heights.
- * To investigate the variation in childrens' heights that appears
- unrelated to parents' heights (residual variation).
- * To quantify what impact genotype information has beyond parental height in explaining child height.
- * To figure out how/whether and what assumptions are needed to
- generalize findings beyond the data in question.
- * Why do children of very tall parents tend to be
- tall, but a little shorter than their parents and why children of very short parents tend to be short, but a little taller than their parents? (This is a famous question called 'Regression to the mean'.)
-
----
-## Galton's Data
-
-* Let's look at the data first, used by Francis Galton in 1885.
-* Galton was a statistician who invented the term and concepts
- of regression and correlation, founded the journal Biometrika,
- and was the cousin of Charles Darwin.
-* You may need to run `install.packages("UsingR")` if the `UsingR` library is not installed.
-* Let's look at the marginal (parents disregarding children and children disregarding parents) distributions first.
- * Parent distribution is all heterosexual couples.
- * Correction for gender via multiplying female heights by 1.08.
- * Overplotting is an issue from discretization.
-
----
-## Code
-
-```{r galton,fig.height=3.5,fig.width=8}
-library(UsingR); data(galton)
-par(mfrow=c(1,2))
-hist(galton$child,col="blue",breaks=100)
-hist(galton$parent,col="blue",breaks=100)
-```
-
----
-## Finding the middle via least squares
-* Consider only the children's heights.
- * How could one describe the "middle"?
- * One definition, let $Y_i$ be the height of child $i$ for $i = 1, \ldots, n = 928$, then define the middle as the value of $\mu$
- that minimizes $$\sum_{i=1}^n (Y_i - \mu)^2$$
-* This is physical center of mass of the histrogram.
-* You might have guessed that the answer $\mu = \bar X$.
-
-
----
-## Experiment
-### Use R studio's manipulate to see what value of $\mu$ minimizes the sum of the squared deviations.
-
-```
-library(manipulate)
-myHist <- function(mu){
- hist(galton$child,col="blue",breaks=100)
- lines(c(mu, mu), c(0, 150),col="red",lwd=5)
- mse <- mean((galton$child - mu)^2)
- text(63, 150, paste("mu = ", mu))
- text(63, 140, paste("MSE = ", round(mse, 2)))
-}
-manipulate(myHist(mu), mu = slider(62, 74, step = 0.5))
-```
-
----
-## The least squares estimate is the empirical mean
-```{r lsm, dependson="galton",fig.height=4,fig.width=4}
- hist(galton$child,col="blue",breaks=100)
- meanChild <- mean(galton$child)
- lines(rep(meanChild,100),seq(0,150,length=100),col="red",lwd=5)
-```
-
----
-### The math follows as:
-$$
-\begin{align}
-\sum_{i=1}^n (Y_i - \mu)^2 & = \
-\sum_{i=1}^n (Y_i - \bar Y + \bar Y - \mu)^2 \\
-& = \sum_{i=1}^n (Y_i - \bar Y)^2 + \
-2 \sum_{i=1}^n (Y_i - \bar Y) (\bar Y - \mu) +\
-\sum_{i=1}^n (\bar Y - \mu)^2 \\
-& = \sum_{i=1}^n (Y_i - \bar Y)^2 + \
-2 (\bar Y - \mu) \sum_{i=1}^n (Y_i - \bar Y) +\
-\sum_{i=1}^n (\bar Y - \mu)^2 \\
-& = \sum_{i=1}^n (Y_i - \bar Y)^2 + \
-2 (\bar Y - \mu) (\sum_{i=1}^n Y_i - n \bar Y) +\
-\sum_{i=1}^n (\bar Y - \mu)^2 \\
-& = \sum_{i=1}^n (Y_i - \bar Y)^2 + \sum_{i=1}^n (\bar Y - \mu)^2\\
-& \geq \sum_{i=1}^n (Y_i - \bar Y)^2 \
-\end{align}
-$$
-
----
-## Comparing childrens' heights and their parents' heights
-
-```{r, dependson="galton",fig.height=4,fig.width=4}
-plot(galton$parent,galton$child,pch=19,col="blue")
-```
-
----
-Size of point represents number of points at that (X, Y) combination (See the Rmd file for the code).
-
-```{r freqGalton, dependson="galton",fig.height=6,fig.width=6,echo=FALSE}
-freqData <- as.data.frame(table(galton$child, galton$parent))
-names(freqData) <- c("child", "parent", "freq")
-plot(as.numeric(as.vector(freqData$parent)),
- as.numeric(as.vector(freqData$child)),
- pch = 21, col = "black", bg = "lightblue",
- cex = .15 * freqData$freq,
- xlab = "parent", ylab = "child")
-```
-
----
-## Regression through the origin
-* Suppose that $X_i$ are the parents' heights.
-* Consider picking the slope $\beta$ that minimizes $$\sum_{i=1}^n (Y_i - X_i \beta)^2$$
-* This is exactly using the origin as a pivot point picking the
-line that minimizes the sum of the squared vertical distances
-of the points to the line
-* Use R studio's manipulate function to experiment
-* Subtract the means so that the origin is the mean of the parent
-and children's heights
-
----
-```
-myPlot <- function(beta){
- y <- galton$child - mean(galton$child)
- x <- galton$parent - mean(galton$parent)
- freqData <- as.data.frame(table(x, y))
- names(freqData) <- c("child", "parent", "freq")
- plot(
- as.numeric(as.vector(freqData$parent)),
- as.numeric(as.vector(freqData$child)),
- pch = 21, col = "black", bg = "lightblue",
- cex = .15 * freqData$freq,
- xlab = "parent",
- ylab = "child"
- )
- abline(0, beta, lwd = 3)
- points(0, 0, cex = 2, pch = 19)
- mse <- mean( (y - beta * x)^2 )
- title(paste("beta = ", beta, "mse = ", round(mse, 3)))
-}
-manipulate(myPlot(beta), beta = slider(0.6, 1.2, step = 0.02))
-```
-
----
-## The solution
-### In the next few lectures we'll talk about why this is the solution
-```{r}
-lm(I(child - mean(child))~ I(parent - mean(parent)) - 1, data = galton)
-```
-
----
-## Visualizing the best fit line
-### Size of points are frequencies at that X, Y combination
-```{r, fig.height=5,fig.width=5,echo=FALSE}
-freqData <- as.data.frame(table(galton$child, galton$parent))
-names(freqData) <- c("child", "parent", "freq")
-plot(as.numeric(as.vector(freqData$parent)),
- as.numeric(as.vector(freqData$child)),
- pch = 21, col = "black", bg = "lightblue",
- cex = .05 * freqData$freq,
- xlab = "parent", ylab = "child")
-lm1 <- lm(galton$child ~ galton$parent)
-lines(galton$parent,lm1$fitted,col="red",lwd=3)
-```
-
+---
+title : Introduction to regression
+subtitle : Regression
+author : Brian Caffo, Jeff Leek and Roger Peng
+job : Johns Hopkins Bloomberg School of Public Health
+logo : bloomberg_shield.png
+framework : io2012 # {io2012, html5slides, shower, dzslides, ...}
+highlighter : highlight.js # {highlight.js, prettify, highlight}
+hitheme : tomorrow #
+url:
+ lib: ../../libraries
+ assets: ../../assets
+widgets : [mathjax] # {mathjax, quiz, bootstrap}
+mode : selfcontained # {standalone, draft}
+---
+## A famous motivating example
+
+```{r setup, cache = F, echo = F, message = F, warning = F, tidy = F, results='hide'}
+# Must explicitly load knitr or the chunks will fail, at least on Linux
+library(knitr)
+# make this an external chunk that can be included in any file
+options(width = 100)
+opts_chunk$set(message = F, error = F, warning = F, comment = NA, fig.align = 'center', dpi = 100, tidy = F, cache.path = '.cache/', fig.path = 'fig/')
+
+options(xtable.type = 'html')
+knit_hooks$set(inline = function(x) {
+ if(is.numeric(x)) {
+ round(x, getOption('digits'))
+ } else {
+ paste(as.character(x), collapse = ', ')
+ }
+})
+knit_hooks$set(plot = knitr:::hook_plot_html)
+runif(1)
+```
+
+
+
+### (Perhaps surprisingly, this example is still relevant)
+
+
+
+[http://www.nature.com/ejhg/journal/v17/n8/full/ejhg20095a.html](http://www.nature.com/ejhg/journal/v17/n8/full/ejhg20095a.html)
+
+[Predicting height: the Victorian approach beats modern genomics](http://www.wired.com/wiredscience/2009/03/predicting-height-the-victorian-approach-beats-modern-genomics/)
+
+
+## Questions for this class
+* Consider trying to answer the following kinds of questions:
+  * To use the parents' heights to predict children's heights.
+  * To try to find a parsimonious, easily described mean
+    relationship between parents' and children's heights.
+  * To investigate the variation in children's heights that appears
+    unrelated to parents' heights (residual variation).
+  * To quantify what impact genotype information has beyond parental height in explaining child height.
+  * To figure out how/whether and what assumptions are needed to
+    generalize findings beyond the data in question.
+  * Why do children of very tall parents tend to be
+    tall, but a little shorter than their parents, and why do children of very short parents tend to be short, but a little taller than their parents? (This is a famous question called 'Regression to the mean'.)
+
+
+## Galton's Data
+
+* Let's first look at the data, which Francis Galton used in 1885.
+* Galton was a statistician who invented the term and concepts
+ of regression and correlation, founded the journal Biometrika,
+ and was the cousin of Charles Darwin.
+* You may need to run `install.packages("UsingR")` if the `UsingR` library is not installed.
+* Let's look at the marginal (parents disregarding children and children disregarding parents) distributions first.
+ * Parent distribution is all heterosexual couples.
+ * Correction for gender via multiplying female heights by 1.08.
+ * Overplotting is an issue from discretization.
+
+
+## Code
+
+```{r galton,fig.height=3.5,fig.width=8}
+library(UsingR); data(galton)
+par(mfrow=c(1,2))
+hist(galton$child,col="blue",breaks=100)
+hist(galton$parent,col="blue",breaks=100)
+```
+
+
+## Finding the middle via least squares
+* Consider only the children's heights.
+  * How could one describe the "middle"?
+  * One definition: let $Y_i$ be the height of child $i$ for $i = 1, \ldots, n = 928$, and define the middle as the value of $\mu$
+  that minimizes $$\sum_{i=1}^n (Y_i - \mu)^2$$
+* This is the physical center of mass of the histogram.
+* You might have guessed that the answer is $\mu = \bar Y$; the quick check below confirms it.
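+
+A quick numerical check of this claim (a sketch; it assumes `galton` has been loaded by the `Code` chunk above):
+
+```
+# the value of mu minimizing the sum of squared deviations matches the sample mean
+sumSq <- function(mu) sum((galton$child - mu)^2)
+optimize(sumSq, interval = range(galton$child))$minimum  # agrees with the mean up to numerical tolerance
+mean(galton$child)
+```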
+
+
+
+## Experiment
+### Use RStudio's manipulate function to see what value of $\mu$ minimizes the sum of the squared deviations.
+
+```
+library(manipulate)
+myHist <- function(mu){
+  hist(galton$child,col="blue",breaks=100)
+  # vertical red line at the candidate value of mu
+  lines(c(mu, mu), c(0, 150),col="red",lwd=5)
+  # mean squared deviation of the child heights from mu
+  mse <- mean((galton$child - mu)^2)
+  text(63, 150, paste("mu = ", mu))
+  text(63, 140, paste("MSE = ", round(mse, 2)))
+}
+manipulate(myHist(mu), mu = slider(62, 74, step = 0.5))
+```
+
+
+## The least squares estimate is the empirical mean
+```{r lsm, dependson="galton",fig.height=4,fig.width=4}
+ hist(galton$child,col="blue",breaks=100)
+ meanChild <- mean(galton$child)
+ lines(rep(meanChild,100),seq(0,150,length=100),col="red",lwd=5)
+```
+
+
+### The math follows as:
+$$
+\begin{align}
+\sum_{i=1}^n (Y_i - \mu)^2 & = \
+\sum_{i=1}^n (Y_i - \bar Y + \bar Y - \mu)^2 \\
+& = \sum_{i=1}^n (Y_i - \bar Y)^2 + \
+2 \sum_{i=1}^n (Y_i - \bar Y) (\bar Y - \mu) +\
+\sum_{i=1}^n (\bar Y - \mu)^2 \\
+& = \sum_{i=1}^n (Y_i - \bar Y)^2 + \
+2 (\bar Y - \mu) \sum_{i=1}^n (Y_i - \bar Y) +\
+\sum_{i=1}^n (\bar Y - \mu)^2 \\
+& = \sum_{i=1}^n (Y_i - \bar Y)^2 + \
+2 (\bar Y - \mu) (\sum_{i=1}^n Y_i - n \bar Y) +\
+\sum_{i=1}^n (\bar Y - \mu)^2 \\
+& = \sum_{i=1}^n (Y_i - \bar Y)^2 + \sum_{i=1}^n (\bar Y - \mu)^2\\
+& \geq \sum_{i=1}^n (Y_i - \bar Y)^2 \
+\end{align}
+$$
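+
+The key step is the cancellation of the cross term: deviations from the sample mean sum to zero. A one-line check on the Galton data loaded above:
+
+```
+# deviations from the mean sum to zero (up to floating point error)
+sum(galton$child - mean(galton$child))
+```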
+
+
+## Comparing children's heights and their parents' heights
+
+```{r, dependson="galton",fig.height=4,fig.width=4}
+plot(galton$parent,galton$child,pch=19,col="blue")
+```
+
+
+The size of each point represents the number of observations at that (X, Y) combination (see the Rmd file for the code).
+
+```{r freqGalton, dependson="galton",fig.height=6,fig.width=6,echo=FALSE}
+freqData <- as.data.frame(table(galton$child, galton$parent))
+names(freqData) <- c("child", "parent", "freq")
+plot(as.numeric(as.vector(freqData$parent)),
+ as.numeric(as.vector(freqData$child)),
+ pch = 21, col = "black", bg = "lightblue",
+ cex = .15 * freqData$freq,
+ xlab = "parent", ylab = "child")
+```
+
+
+## Regression through the origin
+* Suppose that $X_i$ are the parents' heights.
+* Consider picking the slope $\beta$ that minimizes $$\sum_{i=1}^n (Y_i - X_i \beta)^2$$
+* This is exactly using the origin as a pivot point and picking the
+line that minimizes the sum of the squared vertical distances
+from the points to the line.
+* Use RStudio's manipulate function to experiment.
+* Subtract the means so that the origin is the mean of the parents'
+and children's heights.
+
+
+```
+myPlot <- function(beta){
+  # center the child (y) and parent (x) heights at zero
+  y <- galton$child - mean(galton$child)
+  x <- galton$parent - mean(galton$parent)
+  # table(x, y) puts the parent heights in the first column and the child heights in the second
+  freqData <- as.data.frame(table(x, y))
+  names(freqData) <- c("parent", "child", "freq")
+ plot(
+ as.numeric(as.vector(freqData$parent)),
+ as.numeric(as.vector(freqData$child)),
+ pch = 21, col = "black", bg = "lightblue",
+ cex = .15 * freqData$freq,
+ xlab = "parent",
+ ylab = "child"
+ )
+ abline(0, beta, lwd = 3)
+ points(0, 0, cex = 2, pch = 19)
+ mse <- mean( (y - beta * x)^2 )
+ title(paste("beta = ", beta, "mse = ", round(mse, 3)))
+}
+manipulate(myPlot(beta), beta = slider(0.6, 1.2, step = 0.02))
+```
+
+
+## The solution
+### In the next few lectures we'll talk about why this is the solution
+```{r}
+lm(I(child - mean(child))~ I(parent - mean(parent)) - 1, data = galton)
+```
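+
+For comparison, a sketch of the same slope computed directly from the cross products of the centered data (this anticipates the formula derived in the least squares lecture; it assumes `galton` is loaded as above):
+
+```
+yc <- galton$child - mean(galton$child)
+xc <- galton$parent - mean(galton$parent)
+# slope of the least squares line through the origin for the centered data
+sum(yc * xc) / sum(xc^2)
+```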
+
+
+## Visualizing the best fit line
+### Point size shows the frequency at each (X, Y) combination
+```{r, fig.height=5,fig.width=5,echo=FALSE}
+freqData <- as.data.frame(table(galton$child, galton$parent))
+names(freqData) <- c("child", "parent", "freq")
+plot(as.numeric(as.vector(freqData$parent)),
+ as.numeric(as.vector(freqData$child)),
+ pch = 21, col = "black", bg = "lightblue",
+ cex = .05 * freqData$freq,
+ xlab = "parent", ylab = "child")
+lm1 <- lm(galton$child ~ galton$parent)
+lines(galton$parent,lm1$fitted,col="red",lwd=3)
+```
+
diff --git a/07_RegressionModels/01_01_introduction/index.html b/07_RegressionModels/01_01_introduction/index.html
index 4bd2ea6b9..e47ef0c64 100644
--- a/07_RegressionModels/01_01_introduction/index.html
+++ b/07_RegressionModels/01_01_introduction/index.html
diff --git a/07_RegressionModels/01_02_notation/index.Rmd b/07_RegressionModels/01_02_notation/index.Rmd
index 040699101..e8f0e7963 100644
--- a/07_RegressionModels/01_02_notation/index.Rmd
+++ b/07_RegressionModels/01_02_notation/index.Rmd
@@ -20,7 +20,7 @@ mode : selfcontained # {standalone, draft}
* We will try to minimize the amount of mathematics required for this class.
* No caclculus is required.
----
+
## Notation for data
@@ -35,7 +35,7 @@ mode : selfcontained # {standalone, draft}
* $X_i$ is a conceptual random variable.
* $x$ is a number that we plug into.
----
+
## The empirical mean
* Define the empirical mean as
@@ -55,7 +55,7 @@ The the mean of the $\tilde X_i$ is 0.
\sum_{i=1}^n (X_i - \mu)^2
$$
----
+
## The emprical standard deviation and variance
@@ -71,7 +71,7 @@ $S = \sqrt{S^2}$. Notice that the standard deviation has the same units as the d
* Sometimes people divide by $n$ rather than $n-1$ (the latter
produces an unbiased estimate.)
----
+
## Normalization
* The the data defined by
@@ -84,7 +84,7 @@ have empirical mean zero and empirical standard deviation 1.
* Example, a value of 2 form normalized data means that data point
was two standard deviations larger than the mean.
----
+
## The empirical covariance
* Consider now when we have pairs of data, $(X_i, Y_i)$.
* Their empirical covariance is
@@ -102,7 +102,7 @@ $$
where $S_x$ and $S_y$ are the estimates of standard deviations
for the $X$ observations and $Y$ observations, respectively.
----
+
## Some facts about correlation
* $Cor(X, Y) = Cor(Y, X)$
* $-1 \leq Cor(X, Y) \leq 1$
diff --git a/07_RegressionModels/01_02_notation/index.html b/07_RegressionModels/01_02_notation/index.html
index 42e277dcf..54c6ae95a 100644
--- a/07_RegressionModels/01_02_notation/index.html
+++ b/07_RegressionModels/01_02_notation/index.html
diff --git a/07_RegressionModels/01_03_ols/index.Rmd b/07_RegressionModels/01_03_ols/index.Rmd
index 17f93b5c5..d9a429e02 100644
--- a/07_RegressionModels/01_03_ols/index.Rmd
+++ b/07_RegressionModels/01_03_ols/index.Rmd
@@ -1,252 +1,254 @@
----
-title : Least squares estimation of regression lines
-subtitle : Regression via least squares
-author : Brian Caffo, Jeff Leek and Roger Peng
-job : Johns Hopkins Bloomberg School of Public Health
-logo : bloomberg_shield.png
-framework : io2012 # {io2012, html5slides, shower, dzslides, ...}
-highlighter : highlight.js # {highlight.js, prettify, highlight}
-hitheme : tomorrow #
-url:
- lib: ../../libraries
- assets: ../../assets
-widgets : [mathjax] # {mathjax, quiz, bootstrap}
-mode : selfcontained # {standalone, draft}
-
----
-
-```{r setup, cache = FALSE, echo = FALSE, message = FALSE, warning = FALSE, tidy = FALSE}
-# make this an external chunk that can be included in any file
-options(width = 100)
-opts_chunk$set(message = F, error = F, warning = F, comment = NA, fig.align = 'center', dpi = 100, tidy = F, cache.path = '.cache/', fig.path = 'fig/')
-
-options(xtable.type = 'html')
-knit_hooks$set(inline = function(x) {
- if(is.numeric(x)) {
- round(x, getOption('digits'))
- } else {
- paste(as.character(x), collapse = ', ')
- }
-})
-knit_hooks$set(plot = knitr:::hook_plot_html)
-```
-
-## General least squares for linear equations
-Consider again the parent and child height data from Galton
-
-```{r, fig.height=5, fig.width=5, echo=FALSE}
-library(UsingR)
-data(galton)
-freqData <- as.data.frame(table(galton$child, galton$parent))
-names(freqData) <- c("child", "parent", "freq")
-plot(as.numeric(as.vector(freqData$parent)),
- as.numeric(as.vector(freqData$child)),
- pch = 21, col = "black", bg = "lightblue",
- cex = .05 * freqData$freq,
- xlab = "parent", ylab = "child")
-```
-
----
-## Fitting the best line
-* Let $Y_i$ be the $i^{th}$ child's height and $X_i$ be the
-$i^{th}$ (average over the pair of) parents' heights.
-* Consider finding the best line
- * Child's Height = $\beta_0$ + Parent's Height $\beta_1$
-* Use least squares
- $$
- \sum_{i=1}^n \{Y_i - (\beta_0 + \beta_1 X_i)\}^2
- $$
-* How do we do it?
-
----
-## Let's solve this problem generally
-* Let $\mu_i = \beta_0 + \beta_1 X_i$ and our estimates be
-$\hat \mu_i = \hat \beta_0 + \hat \beta_1 X_i$.
-* We want to minimize
-$$ \dagger \sum_{i=1}^n (Y_i - \mu_i)^2 = \sum_{i=1}^n (Y_i - \hat \mu_i) ^ 2 + 2 \sum_{i=1}^n (Y_i - \hat \mu_i) (\hat \mu_i - \mu_i) + \sum_{i=1}^n (\hat \mu_i - \mu_i)^2$$
-* Suppose that $$\sum_{i=1}^n (Y_i - \hat \mu_i) (\hat \mu_i - \mu_i) = 0$$ then
-$$ \dagger
-=\sum_{i=1}^n (Y_i - \hat \mu_i) ^ 2 + \sum_{i=1}^n (\hat \mu_i - \mu_i)^2\geq \sum_{i=1}^n (Y_i - \hat \mu_i) ^ 2$$
-
----
-## Mean only regression
-* So we know that if:
-$$ \sum_{i=1}^n (Y_i - \hat \mu_i) (\hat \mu_i - \mu_i) = 0$$
-where $\mu_i = \beta_0 + \beta_1 X_i$ and $\hat \mu_i = \hat \beta_0 + \hat \beta_1 X_i$ then the line
-$$Y = \hat \beta_0 + \hat \beta_1 X$$
-is the least squares line.
-* Consider forcing $\beta_1 = 0$ and thus $\hat \beta_1=0$;
-that is, only considering horizontal lines
-* The solution works out to be
-$$\hat \beta_0 = \bar Y.$$
-
----
-## Let's show it
-$$\begin{align} \
-\sum_{i=1}^n (Y_i - \hat \mu_i) (\hat \mu_i - \mu_i)
-= & \sum_{i=1}^n (Y_i - \hat \beta_0) (\hat \beta_0 - \beta_0) \\
-= & (\hat \beta_0 - \beta_0) \sum_{i=1}^n (Y_i - \hat \beta_0) \
-\end{align} $$
-
-Thus, this will equal 0 if $\sum_{i=1}^n (Y_i - \hat \beta_0)
-= n\bar Y - n \hat \beta_0=0$
-
-Thus $\hat \beta_0 = \bar Y.$
-
----
-## Regression through the origin
-* Recall that if:
-$$ \sum_{i=1}^n (Y_i - \hat \mu_i) (\hat \mu_i - \mu_i) = 0$$
-where $\mu_i = \beta_0 + \beta_1 X_i$ and $\hat \mu_i = \hat \beta_0 + \hat \beta_1 X_i$ then the line
-$$Y = \hat \beta_0 + \hat \beta_1 X$$
-is the least squares line.
-* Consider forcing $\beta_0 = 0$ and thus $\hat \beta_0=0$;
-that is, only considering lines through the origin
-* The solution works out to be
-$$\hat \beta_1 = \frac{\sum_{i=1^n} Y_i X_i}{\sum_{i=1}^n X_i^2}.$$
-
----
-## Let's show it
-$$\begin{align} \
-\sum_{i=1}^n (Y_i - \hat \mu_i) (\hat \mu_i - \mu_i)
-= & \sum_{i=1}^n (Y_i - \hat \beta_1 X_i) (\hat \beta_1 X_i - \beta_1 X_i) \\
-= & (\hat \beta_1 - \beta_1) \sum_{i=1}^n (Y_i X_i - \hat \beta_1 X_i ^2) \
-\end{align} $$
-
-Thus, this will equal 0 if $\sum_{i=1}^n (Y_i X_i - \hat \beta_1 X_i ^2) = \sum_{i=1}^n Y_i X_i - \hat \beta_1 \sum_{i=1}^n X_i^2 =0$
-
-Thus
-$$\hat \beta_1 = \frac{\sum_{i=1^n} Y_i X_i}{\sum_{i=1}^n X_i^2}.$$
-
-
----
-## Recapping what we know
-* If we define $\mu_i = \beta_0$ then $\hat \beta_0 = \bar Y$.
- * If we only look at horizontal lines, the least squares estimate of the intercept of that line is the average of the outcomes.
-* If we define $\mu_i = X_i \beta_1$ then $\hat \beta_1 = \frac{\sum_{i=1^n} Y_i X_i}{\sum_{i=1}^n X_i^2}$
- * If we only look at lines through the origin, we get the estimated slope is the cross product of the X and Ys divided by the cross product of the Xs with themselves.
-* What about when $\mu_i = \beta_0 + \beta_1 X_i$? That is, we don't want to restrict ourselves to horizontal lines or lines through the origin.
-
----
-## Let's figure it out
-$$\begin{align} \
-\sum_{i=1}^n (Y_i - \hat \mu_i) (\hat \mu_i - \mu_i)
-= & \sum_{i=1}^n (Y_i - \hat\beta_0 - \hat\beta_1 X_i) (\hat \beta_0 + \hat \beta_1 X_i - \beta_0 - \beta_1 X_i) \\
-= & (\hat \beta_0 - \beta_0) \sum_{i=1}^n (Y_i - \hat\beta_0 - \hat \beta_1 X_i) + (\beta_1 - \beta_1)\sum_{i=1}^n (Y_i - \hat\beta_0 - \hat \beta_1 X_i)X_i\\
-\end{align} $$
-Note that
-
-$$0=\sum_{i=1}^n (Y_i - \hat\beta_0 - \hat \beta_1 X_i) = n \bar Y - n \hat \beta_0 - n \hat \beta_1 \bar X ~~\mbox{implies that}~~\hat \beta_0 = \bar Y - \hat \beta_1 \bar X $$
-
-Then
-$$\sum_{i=1}^n (Y_i - \hat\beta_0 - \hat \beta_1 X_i) X_i = \sum_{i=1}^n (Y_i - \bar Y + \hat \beta_1 \bar X - \hat \beta_1 X_i)X_i$$
-
----
-## Continued
-$$=\sum_{i=1}^n \{(Y_i - \bar Y) - \hat \beta_1 (X_i - \bar X) \}X_i$$
-And thus
-$$ \sum_{i=1}^n (Y_i - \bar Y)X_i - \hat \beta_1 \sum_{i=1}^n
-(X_i - \bar X) X_i = 0.$$
-So we arrive at
-$$
-\hat \beta_1 =
-\frac{\sum_{i=1}^n \{(Y_i - \bar Y)X_i}{\sum_{i=1}^n
-(X_i - \bar X) X_i} =
-\frac{\sum_{i=1}^n (Y_i - \bar Y)(X_i - \bar X)}{\sum_{i=1}^n
-(X_i - \bar X) (X_i - \bar X)}
-= Cor(Y, X) \frac{Sd(Y)}{Sd(X)}.
-$$
-And recall
-$$
-\hat \beta_0 = \bar Y - \hat \beta_1 \bar X.
-$$
-
----
-## Consequences
-* The least squares model fit to the line $Y = \beta_0 + \beta_1 X$ through the data pairs $(X_i, Y_i)$ with $Y_i$ as the outcome obtains the line $Y = \hat \beta_0 + \hat \beta_1 X$ where
- $$\hat \beta_1 = Cor(Y, X) \frac{Sd(Y)}{Sd(X)} ~~~ \hat \beta_0 = \bar Y - \hat \beta_1 \bar X$$
-* $\hat \beta_1$ has the units of $Y / X$, $\hat \beta_0$ has the units of $Y$.
-* The line passes through the point $(\bar X, \bar Y$)
-* The slope of the regression line with $X$ as the outcome and $Y$ as the predictor is $Cor(Y, X) Sd(X)/ Sd(Y)$.
-* The slope is the same one you would get if you centered the data,
-$(X_i - \bar X, Y_i - \bar Y)$, and did regression through the origin.
-* If you normalized the data, $\{ \frac{X_i - \bar X}{Sd(X)}, \frac{Y_i - \bar Y}{Sd(Y)}\}$, the slope is $Cor(Y, X)$.
-
----
-## Revisiting Galton's data
-### Double check our calculations using R
-```{r, fig.height=4,fig.width=4,echo=TRUE}
-y <- galton$child
-x <- galton$parent
-beta1 <- cor(y, x) * sd(y) / sd(x)
-beta0 <- mean(y) - beta1 * mean(x)
-rbind(c(beta0, beta1), coef(lm(y ~ x)))
-```
-
----
-## Revisiting Galton's data
-### Reversing the outcome/predictor relationship
-```{r, fig.height=4,fig.width=4,echo=TRUE}
-beta1 <- cor(y, x) * sd(x) / sd(y)
-beta0 <- mean(x) - beta1 * mean(y)
-rbind(c(beta0, beta1), coef(lm(x ~ y)))
-```
-
----
-## Revisiting Galton's data
-### Regression through the origin yields an equivalent slope if you center the data first
-```{r, fig.height=4,fig.width=4,echo=TRUE}
-yc <- y - mean(y)
-xc <- x - mean(x)
-beta1 <- sum(yc * xc) / sum(xc ^ 2)
-c(beta1, coef(lm(y ~ x))[2])
-```
-
----
-## Revisiting Galton's data
-### Normalizing variables results in the slope being the correlation
-```{r, echo=TRUE}
-yn <- (y - mean(y))/sd(y)
-xn <- (x - mean(x))/sd(x)
-c(cor(y, x), cor(yn, xn), coef(lm(yn ~ xn))[2])
-```
-
-
----
-## Plotting the fit
-* Size of points are frequencies at that X, Y combination.
-* For the red lie the child is outcome.
-* For the blue, the parent is the outcome (accounting for the fact that the response is plotted on the horizontal axis).
-* Black line assumes $Cor(Y, X) = 1$ (slope is $Sd(Y)/Sd(x)$).
-* Big black dot is $(\bar X, \bar Y)$.
-
----
-The code to add the lines
-
-```
-abline(mean(y) - mean(x) * cor(y, x) * sd(y) / sd(x),
- sd(y) / sd(x) * cor(y, x),
- lwd = 3, col = "red")
-abline(mean(y) - mean(x) * sd(y) / sd(x) / cor(y, x),
- sd(y) cor(y, x) / sd(x),
- lwd = 3, col = "blue")
-abline(mean(y) - mean(x) * sd(y) / sd(x),
- sd(y) / sd(x),
- lwd = 2)
-points(mean(x), mean(y), cex = 2, pch = 19)
-```
-
----
-```{r, fig.height=6,fig.width=6,echo=FALSE}
-freqData <- as.data.frame(table(galton$child, galton$parent))
-names(freqData) <- c("child", "parent", "freq")
-plot(as.numeric(as.vector(freqData$parent)),
- as.numeric(as.vector(freqData$child)),
- pch = 21, col = "black", bg = "lightblue",
- cex = .05 * freqData$freq,
- xlab = "parent", ylab = "child", xlim = c(62, 74), ylim = c(62, 74))
-abline(mean(y) - mean(x) * cor(y, x) * sd(y) / sd(x), sd(y) / sd(x) * cor(y, x), lwd = 3, col = "red")
-abline(mean(y) - mean(x) * sd(y) / sd(x) / cor(y, x), sd(y) / sd(x) / cor(y, x), lwd = 3, col = "blue")
-abline(mean(y) - mean(x) * sd(y) / sd(x), sd(y) / sd(x), lwd = 2)
-points(mean(x), mean(y), cex = 2, pch = 19)
-```
+---
+title : Least squares estimation of regression lines
+subtitle : Regression via least squares
+author : Brian Caffo, Jeff Leek and Roger Peng
+job : Johns Hopkins Bloomberg School of Public Health
+logo : bloomberg_shield.png
+framework : io2012 # {io2012, html5slides, shower, dzslides, ...}
+highlighter : highlight.js # {highlight.js, prettify, highlight}
+hitheme : tomorrow #
+url:
+ lib: ../../libraries
+ assets: ../../assets
+widgets : [mathjax] # {mathjax, quiz, bootstrap}
+mode : selfcontained # {standalone, draft}
+---
+
+
+```{r setup, cache = FALSE, echo = FALSE, message = FALSE, warning = FALSE, tidy = FALSE}
+# Must explicitly load knitr or the chunks will fail, at least on Linux
+library(knitr)
+# make this an external chunk that can be included in any file
+options(width = 100)
+opts_chunk$set(message = F, error = F, warning = F, comment = NA, fig.align = 'center', dpi = 100, tidy = F, cache.path = '.cache/', fig.path = 'fig/')
+
+options(xtable.type = 'html')
+knit_hooks$set(inline = function(x) {
+ if(is.numeric(x)) {
+ round(x, getOption('digits'))
+ } else {
+ paste(as.character(x), collapse = ', ')
+ }
+})
+knit_hooks$set(plot = knitr:::hook_plot_html)
+```
+
+## General least squares for linear equations
+Consider again the parent and child height data from Galton
+
+```{r, fig.height=5, fig.width=5, echo=FALSE}
+library(UsingR)
+data(galton)
+freqData <- as.data.frame(table(galton$child, galton$parent))
+names(freqData) <- c("child", "parent", "freq")
+plot(as.numeric(as.vector(freqData$parent)),
+ as.numeric(as.vector(freqData$child)),
+ pch = 21, col = "black", bg = "lightblue",
+ cex = .05 * freqData$freq,
+ xlab = "parent", ylab = "child")
+```
+
+
+## Fitting the best line
+* Let $Y_i$ be the $i^{th}$ child's height and $X_i$ be the
+$i^{th}$ (average over the pair of) parents' heights.
+* Consider finding the best line
+  * Child's Height $= \beta_0 + \beta_1 \times$ Parent's Height
+* Use least squares
+ $$
+ \sum_{i=1}^n \{Y_i - (\beta_0 + \beta_1 X_i)\}^2
+ $$
+* How do we do it?
+
+
+## Let's solve this problem generally
+* Let $\mu_i = \beta_0 + \beta_1 X_i$ and our estimates be
+$\hat \mu_i = \hat \beta_0 + \hat \beta_1 X_i$.
+* We want to minimize
+$$ \dagger \sum_{i=1}^n (Y_i - \mu_i)^2 = \sum_{i=1}^n (Y_i - \hat \mu_i) ^ 2 + 2 \sum_{i=1}^n (Y_i - \hat \mu_i) (\hat \mu_i - \mu_i) + \sum_{i=1}^n (\hat \mu_i - \mu_i)^2$$
+* Suppose that $$\sum_{i=1}^n (Y_i - \hat \mu_i) (\hat \mu_i - \mu_i) = 0$$ then
+$$ \dagger
+=\sum_{i=1}^n (Y_i - \hat \mu_i) ^ 2 + \sum_{i=1}^n (\hat \mu_i - \mu_i)^2\geq \sum_{i=1}^n (Y_i - \hat \mu_i) ^ 2$$
+
+
+## Mean only regression
+* So we know that if:
+$$ \sum_{i=1}^n (Y_i - \hat \mu_i) (\hat \mu_i - \mu_i) = 0$$
+where $\mu_i = \beta_0 + \beta_1 X_i$ and $\hat \mu_i = \hat \beta_0 + \hat \beta_1 X_i$ then the line
+$$Y = \hat \beta_0 + \hat \beta_1 X$$
+is the least squares line.
+* Consider forcing $\beta_1 = 0$ and thus $\hat \beta_1=0$;
+that is, only considering horizontal lines
+* The solution works out to be
+$$\hat \beta_0 = \bar Y.$$
+
+
+## Let's show it
+$$\begin{align} \
+\sum_{i=1}^n (Y_i - \hat \mu_i) (\hat \mu_i - \mu_i)
+= & \sum_{i=1}^n (Y_i - \hat \beta_0) (\hat \beta_0 - \beta_0) \\
+= & (\hat \beta_0 - \beta_0) \sum_{i=1}^n (Y_i - \hat \beta_0) \
+\end{align} $$
+
+Thus, this will equal 0 if $\sum_{i=1}^n (Y_i - \hat \beta_0)
+= n\bar Y - n \hat \beta_0=0$
+
+Thus $\hat \beta_0 = \bar Y.$
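+
+A quick numerical check (a sketch; it uses the Galton data loaded by the plotting chunk at the start of this lecture):
+
+```
+# an intercept-only least squares fit recovers the sample mean
+y <- galton$child
+c(mean(y), coef(lm(y ~ 1)))
+```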
+
+
+## Regression through the origin
+* Recall that if:
+$$ \sum_{i=1}^n (Y_i - \hat \mu_i) (\hat \mu_i - \mu_i) = 0$$
+where $\mu_i = \beta_0 + \beta_1 X_i$ and $\hat \mu_i = \hat \beta_0 + \hat \beta_1 X_i$ then the line
+$$Y = \hat \beta_0 + \hat \beta_1 X$$
+is the least squares line.
+* Consider forcing $\beta_0 = 0$ and thus $\hat \beta_0=0$;
+that is, only considering lines through the origin
+* The solution works out to be
+$$\hat \beta_1 = \frac{\sum_{i=1}^n Y_i X_i}{\sum_{i=1}^n X_i^2}.$$
+
+
+## Let's show it
+$$\begin{align} \
+\sum_{i=1}^n (Y_i - \hat \mu_i) (\hat \mu_i - \mu_i)
+= & \sum_{i=1}^n (Y_i - \hat \beta_1 X_i) (\hat \beta_1 X_i - \beta_1 X_i) \\
+= & (\hat \beta_1 - \beta_1) \sum_{i=1}^n (Y_i X_i - \hat \beta_1 X_i ^2) \
+\end{align} $$
+
+Thus, this will equal 0 if $\sum_{i=1}^n (Y_i X_i - \hat \beta_1 X_i ^2) = \sum_{i=1}^n Y_i X_i - \hat \beta_1 \sum_{i=1}^n X_i^2 =0$
+
+Thus
+$$\hat \beta_1 = \frac{\sum_{i=1}^n Y_i X_i}{\sum_{i=1}^n X_i^2}.$$
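+
+Again, a quick numerical sketch (here the through-the-origin formula is applied to the raw, uncentered Galton heights loaded at the start of this lecture):
+
+```
+y <- galton$child; x <- galton$parent
+# cross-product formula versus lm with the intercept removed
+c(sum(y * x) / sum(x^2), coef(lm(y ~ x - 1)))
+```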
+
+
+
+## Recapping what we know
+* If we define $\mu_i = \beta_0$ then $\hat \beta_0 = \bar Y$.
+ * If we only look at horizontal lines, the least squares estimate of the intercept of that line is the average of the outcomes.
+* If we define $\mu_i = X_i \beta_1$ then $\hat \beta_1 = \frac{\sum_{i=1}^n Y_i X_i}{\sum_{i=1}^n X_i^2}$
+  * If we only look at lines through the origin, the estimated slope is the cross product of the Xs and Ys divided by the cross product of the Xs with themselves.
+* What about when $\mu_i = \beta_0 + \beta_1 X_i$? That is, we don't want to restrict ourselves to horizontal lines or lines through the origin.
+
+
+## Let's figure it out
+$$\begin{align} \
+\sum_{i=1}^n (Y_i - \hat \mu_i) (\hat \mu_i - \mu_i)
+= & \sum_{i=1}^n (Y_i - \hat\beta_0 - \hat\beta_1 X_i) (\hat \beta_0 + \hat \beta_1 X_i - \beta_0 - \beta_1 X_i) \\
+= & (\hat \beta_0 - \beta_0) \sum_{i=1}^n (Y_i - \hat\beta_0 - \hat \beta_1 X_i) + (\hat \beta_1 - \beta_1)\sum_{i=1}^n (Y_i - \hat\beta_0 - \hat \beta_1 X_i)X_i\\
+\end{align} $$
+Note that
+
+$$0=\sum_{i=1}^n (Y_i - \hat\beta_0 - \hat \beta_1 X_i) = n \bar Y - n \hat \beta_0 - n \hat \beta_1 \bar X ~~\mbox{implies that}~~\hat \beta_0 = \bar Y - \hat \beta_1 \bar X $$
+
+Then
+$$\sum_{i=1}^n (Y_i - \hat\beta_0 - \hat \beta_1 X_i) X_i = \sum_{i=1}^n (Y_i - \bar Y + \hat \beta_1 \bar X - \hat \beta_1 X_i)X_i$$
+
+
+## Continued
+$$=\sum_{i=1}^n \{(Y_i - \bar Y) - \hat \beta_1 (X_i - \bar X) \}X_i$$
+And thus
+$$ \sum_{i=1}^n (Y_i - \bar Y)X_i - \hat \beta_1 \sum_{i=1}^n
+(X_i - \bar X) X_i = 0.$$
+So we arrive at
+$$
+\hat \beta_1 =
+\frac{\sum_{i=1}^n (Y_i - \bar Y)X_i}{\sum_{i=1}^n
+(X_i - \bar X) X_i} =
+\frac{\sum_{i=1}^n (Y_i - \bar Y)(X_i - \bar X)}{\sum_{i=1}^n
+(X_i - \bar X) (X_i - \bar X)}
+= Cor(Y, X) \frac{Sd(Y)}{Sd(X)}.
+$$
+And recall
+$$
+\hat \beta_0 = \bar Y - \hat \beta_1 \bar X.
+$$
+
+
+## Consequences
+* The least squares model fit to the line $Y = \beta_0 + \beta_1 X$ through the data pairs $(X_i, Y_i)$ with $Y_i$ as the outcome obtains the line $Y = \hat \beta_0 + \hat \beta_1 X$ where
+ $$\hat \beta_1 = Cor(Y, X) \frac{Sd(Y)}{Sd(X)} ~~~ \hat \beta_0 = \bar Y - \hat \beta_1 \bar X$$
+* $\hat \beta_1$ has the units of $Y / X$, $\hat \beta_0$ has the units of $Y$.
+* The line passes through the point $(\bar X, \bar Y)$.
+* The slope of the regression line with $X$ as the outcome and $Y$ as the predictor is $Cor(Y, X) Sd(X)/ Sd(Y)$.
+* The slope is the same one you would get if you centered the data,
+$(X_i - \bar X, Y_i - \bar Y)$, and did regression through the origin.
+* If you normalized the data, $\{ \frac{X_i - \bar X}{Sd(X)}, \frac{Y_i - \bar Y}{Sd(Y)}\}$, the slope is $Cor(Y, X)$.
+
+
+## Revisiting Galton's data
+### Double check our calculations using R
+```{r, fig.height=4,fig.width=4,echo=TRUE}
+y <- galton$child
+x <- galton$parent
+beta1 <- cor(y, x) * sd(y) / sd(x)
+beta0 <- mean(y) - beta1 * mean(x)
+rbind(c(beta0, beta1), coef(lm(y ~ x)))
+```
+
+
+## Revisiting Galton's data
+### Reversing the outcome/predictor relationship
+```{r, fig.height=4,fig.width=4,echo=TRUE}
+beta1 <- cor(y, x) * sd(x) / sd(y)
+beta0 <- mean(x) - beta1 * mean(y)
+rbind(c(beta0, beta1), coef(lm(x ~ y)))
+```
+
+
+## Revisiting Galton's data
+### Regression through the origin yields an equivalent slope if you center the data first
+```{r, fig.height=4,fig.width=4,echo=TRUE}
+yc <- y - mean(y)
+xc <- x - mean(x)
+beta1 <- sum(yc * xc) / sum(xc ^ 2)
+c(beta1, coef(lm(y ~ x))[2])
+```
+
+
+## Revisiting Galton's data
+### Normalizing variables results in the slope being the correlation
+```{r, echo=TRUE}
+yn <- (y - mean(y))/sd(y)
+xn <- (x - mean(x))/sd(x)
+c(cor(y, x), cor(yn, xn), coef(lm(yn ~ xn))[2])
+```
+
+
+
+## Plotting the fit
+* Point size shows the frequency at each (X, Y) combination.
+* For the red line, the child is the outcome.
+* For the blue line, the parent is the outcome (accounting for the fact that the response is plotted on the horizontal axis).
+* The black line assumes $Cor(Y, X) = 1$ (slope is $Sd(Y)/Sd(X)$).
+* Big black dot is $(\bar X, \bar Y)$.
+
+
+The code to add the lines
+
+```
+abline(mean(y) - mean(x) * cor(y, x) * sd(y) / sd(x),
+ sd(y) / sd(x) * cor(y, x),
+ lwd = 3, col = "red")
+abline(mean(y) - mean(x) * sd(y) / sd(x) / cor(y, x),
+       sd(y) / sd(x) / cor(y, x),
+       lwd = 3, col = "blue")
+abline(mean(y) - mean(x) * sd(y) / sd(x),
+ sd(y) / sd(x),
+ lwd = 2)
+points(mean(x), mean(y), cex = 2, pch = 19)
+```
+
+
+```{r, fig.height=6,fig.width=6,echo=FALSE}
+freqData <- as.data.frame(table(galton$child, galton$parent))
+names(freqData) <- c("child", "parent", "freq")
+plot(as.numeric(as.vector(freqData$parent)),
+ as.numeric(as.vector(freqData$child)),
+ pch = 21, col = "black", bg = "lightblue",
+ cex = .05 * freqData$freq,
+ xlab = "parent", ylab = "child", xlim = c(62, 74), ylim = c(62, 74))
+abline(mean(y) - mean(x) * cor(y, x) * sd(y) / sd(x), sd(y) / sd(x) * cor(y, x), lwd = 3, col = "red")
+abline(mean(y) - mean(x) * sd(y) / sd(x) / cor(y, x), sd(y) / sd(x) / cor(y, x), lwd = 3, col = "blue")
+abline(mean(y) - mean(x) * sd(y) / sd(x), sd(y) / sd(x), lwd = 2)
+points(mean(x), mean(y), cex = 2, pch = 19)
+```
diff --git a/07_RegressionModels/01_03_ols/index.html b/07_RegressionModels/01_03_ols/index.html
index 2f7fa3efb..fc132cc82 100644
--- a/07_RegressionModels/01_03_ols/index.html
+++ b/07_RegressionModels/01_03_ols/index.html
diff --git a/07_RegressionModels/01_04_rttm/index.Rmd b/07_RegressionModels/01_04_rttm/index.Rmd
index 2759fbb8f..0dc0ab3c3 100644
--- a/07_RegressionModels/01_04_rttm/index.Rmd
+++ b/07_RegressionModels/01_04_rttm/index.Rmd
@@ -1,104 +1,106 @@
----
-title : Historical side note, Regression to Mediocrity
-subtitle : Regression to the mean
-author : Brian Caffo, Jeff Leek, Roger Peng PhD
-job : Johns Hopkins Bloomberg School of Public Health
-logo : bloomberg_shield.png
-framework : io2012 # {io2012, html5slides, shower, dzslides, ...}
-highlighter : highlight.js # {highlight.js, prettify, highlight}
-hitheme : tomorrow #
-url:
- lib: ../../libraries
- assets: ../../assets
-widgets : [mathjax] # {mathjax, quiz, bootstrap}
-mode : selfcontained # {standalone, draft}
-
----
-
-```{r setup, cache = FALSE, echo = FALSE, message = FALSE, warning = FALSE, tidy = FALSE}
-# make this an external chunk that can be included in any file
-options(width = 100)
-opts_chunk$set(message = F, error = F, warning = F, comment = NA, fig.align = 'center', dpi = 100, tidy = F, cache.path = '.cache/', fig.path = 'fig/')
-
-options(xtable.type = 'html')
-knit_hooks$set(inline = function(x) {
- if(is.numeric(x)) {
- round(x, getOption('digits'))
- } else {
- paste(as.character(x), collapse = ', ')
- }
-})
-knit_hooks$set(plot = knitr:::hook_plot_html)
-```
-
-## A historically famous idea, Regression to the Mean
-* Why is it that the children of tall parents tend to be tall, but not as tall as their parents?
-* Why do children of short parents tend to be short, but not as short as their parents?
-* Why do parents of very short children, tend to be short, but not a short as their child? And the same with parents of very tall children?
-* Why do the best performing athletes this year tend to do a little worse the following?
-
----
-## Regression to the mean
-* These phenomena are all examples of so-called regression to the mean
-* Invented by Francis Galton in the paper "Regression towards mediocrity in hereditary stature" The Journal of the Anthropological Institute of Great Britain and Ireland , Vol. 15, (1886).
-* Think of it this way, imagine if you simulated pairs of random normals
- * The largest first ones would be the largest by chance, and the probability that there are smaller for the second simulation is high.
- * In other words $P(Y < x | X = x)$ gets bigger as $x$ heads into the very large values.
- * Similarly $P(Y > x | X = x)$ gets bigger as $x$ heads to very small values.
-* Think of the regression line as the intrisic part.
- * Unless $Cor(Y, X) = 1$ the intrinsic part isn't perfect
-
----
-## Regression to the mean
-* Suppose that we normalize $X$ (child's height) and $Y$ (parent's height) so that they both have mean 0 and variance 1.
-* Then, recall, our regression line passes through $(0, 0)$ (the mean of the X and Y).
-* If the slope of the regression line is $Cor(Y,X)$, regardless of which variable is the outcome (recall, both standard deviations are 1).
-* Notice if $X$ is the outcome and you create a plot where $X$ is the horizontal axis, the slope of the least squares line that you plot is $1/Cor(Y, X)$.
-
----
-## Normalizing the data and setting plotting parameters
-```{r, echo = TRUE}
-library(UsingR)
-data(father.son)
-y <- (father.son$sheight - mean(father.son$sheight)) / sd(father.son$sheight)
-x <- (father.son$fheight - mean(father.son$fheight)) / sd(father.son$fheight)
-rho <- cor(x, y)
-myPlot <- function(x, y) {
- plot(x, y,
- xlab = "Father's height, normalized",
- ylab = "Son's height, normalized",
- xlim = c(-3, 3), ylim = c(-3, 3),
- bg = "lightblue", col = "black", cex = 1.1, pch = 21,
- frame = FALSE)
-}
-```
-
----
-## Plot the data, code
-```
-myPlot(x, y)
-abline(0, 1) # if there were perfect correlation
-abline(0, rho, lwd = 2) # father predicts son
-abline(0, 1 / rho, lwd = 2) # son predicts father, son on vertical axis
-abline(h = 0); abline(v = 0) # reference lines for no relathionship
-```
-
----
-## Plot the data, results
-```{r, fig.height=6,fig.width=6,echo=FALSE}
-myPlot(x, y)
-abline(0, 1)
-abline(0, rho, lwd = 2)
-abline(0, 1 / rho, lwd = 2)
-abline(h = 0); abline(v = 0)
-```
-
----
-## Discussion
-* If you had to predict a son's normalized height, it would be
- $Cor(Y, X) * X_i$
-* If you had to predict a father's normalized height, it would be
- $Cor(Y, X) * Y_i$
-* Multiplication by this correlation shrinks toward 0 (regression toward the mean)
-* If the correlation is 1 there is no regression to the mean (if father's height perfectly determine's child's height and vice versa)
-* Note, regression to the mean has been thought about quite a bit and generalized
+---
+title : Historical side note, Regression to Mediocrity
+subtitle : Regression to the mean
+author : Brian Caffo, Jeff Leek, Roger Peng PhD
+job : Johns Hopkins Bloomberg School of Public Health
+logo : bloomberg_shield.png
+framework : io2012 # {io2012, html5slides, shower, dzslides, ...}
+highlighter : highlight.js # {highlight.js, prettify, highlight}
+hitheme : tomorrow #
+url:
+ lib: ../../libraries
+ assets: ../../assets
+widgets : [mathjax] # {mathjax, quiz, bootstrap}
+mode : selfcontained # {standalone, draft}
+
+---
+
+```{r setup, cache = FALSE, echo = FALSE, message = FALSE, warning = FALSE, tidy = FALSE}
+# Must explicitly include knitr or the chunks will fail, at least in Linux
+library(knitr)
+# make this an external chunk that can be included in any file
+options(width = 100)
+opts_chunk$set(message = F, error = F, warning = F, comment = NA, fig.align = 'center', dpi = 100, tidy = F, cache.path = '.cache/', fig.path = 'fig/')
+
+options(xtable.type = 'html')
+knit_hooks$set(inline = function(x) {
+ if(is.numeric(x)) {
+ round(x, getOption('digits'))
+ } else {
+ paste(as.character(x), collapse = ', ')
+ }
+})
+knit_hooks$set(plot = knitr:::hook_plot_html)
+```
+
+## A historically famous idea, Regression to the Mean
+* Why is it that the children of tall parents tend to be tall, but not as tall as their parents?
+* Why do children of short parents tend to be short, but not as short as their parents?
+* Why do parents of very short children tend to be short, but not as short as their child? And the same with parents of very tall children?
+* Why do the best performing athletes this year tend to do a little worse the following year?
+
+
+## Regression to the mean
+* These phenomena are all examples of so-called regression to the mean
+* Invented by Francis Galton in the paper "Regression towards mediocrity in hereditary stature", The Journal of the Anthropological Institute of Great Britain and Ireland, Vol. 15 (1886).
+* Think of it this way: imagine simulating pairs of random normals (see the sketch after this list).
+  * The pairs whose first draw is largest are largest partly by chance, so the probability that the paired second draw is smaller is high.
+ * In other words $P(Y < x | X = x)$ gets bigger as $x$ heads into the very large values.
+ * Similarly $P(Y > x | X = x)$ gets bigger as $x$ heads to very small values.
+* Think of the regression line as the intrinsic part.
+ * Unless $Cor(Y, X) = 1$ the intrinsic part isn't perfect
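+
+For concreteness, here is a minimal simulation of this idea (a sketch using only base R; the correlation value 0.5 is arbitrary):
+
+```
+# Sketch: regression to the mean with simulated pairs of correlated normals
+set.seed(1)
+n <- 10000
+rho0 <- 0.5                                   # assumed correlation within each pair
+first  <- rnorm(n)
+second <- rho0 * first + sqrt(1 - rho0^2) * rnorm(n)
+mean(second[first > 2] < first[first > 2])    # partners of very large draws are usually smaller
+mean(second[first < -2] > first[first < -2])  # partners of very small draws are usually larger
+```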
+
+
+## Regression to the mean
+* Suppose that we normalize $X$ (the father's height) and $Y$ (the son's height) so that they both have mean 0 and variance 1.
+* Then, recall, our regression line passes through $(0, 0)$ (the means of $X$ and $Y$).
+* Then the slope of the regression line is $Cor(Y,X)$, regardless of which variable is the outcome (recall that both standard deviations are 1).
+* Notice that if $X$ is the outcome and you still plot $X$ on the horizontal axis, the slope of the least squares line that you plot is $1/Cor(Y, X)$ (see the quick check below).
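+
+A quick numerical check of the last two bullets (a sketch on simulated data; any standardized pair would do):
+
+```
+# Sketch: with standardized variables the fitted slope equals the correlation
+set.seed(2)
+u <- rnorm(500); v <- 0.6 * u + rnorm(500)     # any correlated pair will do
+u <- (u - mean(u)) / sd(u); v <- (v - mean(v)) / sd(v)
+c(slope = unname(coef(lm(v ~ u))[2]), correlation = cor(v, u))
+# Regressing u on v also gives slope cor(v, u); drawn with u on the horizontal
+# axis, that fitted line has slope 1 / cor(v, u)
+unname(coef(lm(u ~ v))[2])
+```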
+
+
+## Normalizing the data and setting plotting parameters
+```{r, echo = TRUE}
+library(UsingR)
+data(father.son)
+y <- (father.son$sheight - mean(father.son$sheight)) / sd(father.son$sheight)
+x <- (father.son$fheight - mean(father.son$fheight)) / sd(father.son$fheight)
+rho <- cor(x, y)
+myPlot <- function(x, y) {
+ plot(x, y,
+ xlab = "Father's height, normalized",
+ ylab = "Son's height, normalized",
+ xlim = c(-3, 3), ylim = c(-3, 3),
+ bg = "lightblue", col = "black", cex = 1.1, pch = 21,
+ frame = FALSE)
+}
+```
+
+
+## Plot the data, code
+```
+myPlot(x, y)
+abline(0, 1) # if there were perfect correlation
+abline(0, rho, lwd = 2) # father predicts son
+abline(0, 1 / rho, lwd = 2) # son predicts father, son on vertical axis
+abline(h = 0); abline(v = 0) # reference lines for no relationship
+```
+
+
+## Plot the data, results
+```{r, fig.height=6,fig.width=6,echo=FALSE}
+myPlot(x, y)
+abline(0, 1)
+abline(0, rho, lwd = 2)
+abline(0, 1 / rho, lwd = 2)
+abline(h = 0); abline(v = 0)
+```
+
+
+## Discussion
+* If you had to predict a son's normalized height, it would be
+ $Cor(Y, X) * X_i$
+* If you had to predict a father's normalized height, it would be
+ $Cor(Y, X) * Y_i$
+* Multiplication by this correlation shrinks the prediction toward 0 (regression toward the mean); a quick check follows below
+* If the correlation is 1 there is no regression to the mean (the father's height perfectly determines the child's height and vice versa)
+* Note, regression to the mean has been thought about quite a bit and generalized
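+
+As a quick check of the shrinkage claim (a sketch reusing the normalized `x`, `y`, and `rho` defined in the earlier chunk):
+
+```
+# Sketch: the least squares prediction of the son's normalized height is rho * x
+yhatManual <- rho * x
+yhatLm <- predict(lm(y ~ x))
+max(abs(yhatManual - yhatLm))   # essentially zero (floating point error only)
+```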
diff --git a/07_RegressionModels/01_04_rttm/index.html b/07_RegressionModels/01_04_rttm/index.html
index 3cf677f25..fc0f678c5 100644
--- a/07_RegressionModels/01_04_rttm/index.html
+++ b/07_RegressionModels/01_04_rttm/index.html
@@ -1,211 +1,222 @@
[rendered HTML slides for 01_04_rttm/index.html — the title, headings, bullet text, and MathJax mirror the index.Rmd changes above; markup stripped]
diff --git a/07_RegressionModels/01_05_linearRegression/index.Rmd b/07_RegressionModels/01_05_linearRegression/index.Rmd
index b41a996be..2a3f998fe 100644
--- a/07_RegressionModels/01_05_linearRegression/index.Rmd
+++ b/07_RegressionModels/01_05_linearRegression/index.Rmd
@@ -1,215 +1,218 @@
----
-title : Statistical linear regression models
-subtitle :
-author : Brian Caffo, Jeff Leek, Roger Peng
-job : Johns Hopkins Bloomberg School of Public Health
-logo : bloomberg_shield.png
-framework : io2012 # {io2012, html5slides, shower, dzslides, ...}
-highlighter : highlight.js # {highlight.js, prettify, highlight}
-hitheme : tomorrow #
-url:
- lib: ../../libraries
- assets: ../../assets
-widgets : [mathjax] # {mathjax, quiz, bootstrap}
-mode : selfcontained # {standalone, draft}
-
----
-```{r setup, cache = FALSE, echo = FALSE, message = FALSE, warning = FALSE, tidy = FALSE}
-# make this an external chunk that can be included in any file
-options(width = 100)
-opts_chunk$set(message = F, error = F, warning = F, comment = NA, fig.align = 'center', dpi = 100, tidy = F, cache.path = '.cache/', fig.path = 'fig/')
-
-options(xtable.type = 'html')
-knit_hooks$set(inline = function(x) {
- if(is.numeric(x)) {
- round(x, getOption('digits'))
- } else {
- paste(as.character(x), collapse = ', ')
- }
-})
-knit_hooks$set(plot = knitr:::hook_plot_html)
-```
-## Basic regression model with additive Gaussian errors.
-* Least squares is an estimation tool, how do we do inference?
-* Consider developing a probabilistic model for linear regression
-$$
-Y_i = \beta_0 + \beta_1 X_i + \epsilon_{i}
-$$
-* Here the $\epsilon_{i}$ are assumed iid $N(0, \sigma^2)$.
-* Note, $E[Y_i ~|~ X_i = x_i] = \mu_i = \beta_0 + \beta_1 x_i$
-* Note, $Var(Y_i ~|~ X_i = x_i) = \sigma^2$.
-* Likelihood equivalent model specification is that the $Y_i$ are independent $N(\mu_i, \sigma^2)$.
-
----
-## Likelihood
-$$
-{\cal L}(\beta, \sigma)
-= \prod_{i=1}^n \left\{(2 \pi \sigma^2)^{-1/2}\exp\left(-\frac{1}{2\sigma^2}(y_i - \mu_i)^2 \right) \right\}
-$$
-so that the twice the negative log (base e) likelihood is
-$$
--2 \log\{ {\cal L}(\beta, \sigma) \}
-= \frac{1}{\sigma^2} \sum_{i=1}^n (y_i - \mu_i)^2 + n\log(\sigma^2)
-$$
-Discussion
-* Maximizing the likelihood is the same as minimizing -2 log likelihood
-* The least squares estimate for $\mu_i = \beta_0 + \beta_1 x_i$ is exactly the maximimum likelihood estimate (regardless of $\sigma$)
-
----
-## Recap
-* Model $Y_i = \mu_i + \epsilon_i = \beta_0 + \beta_1 X_i + \epsilon_i$ where $\epsilon_i$ are iid $N(0, \sigma^2)$
-* ML estimates of $\beta_0$ and $\beta_1$ are the least squares estimates
- $$\hat \beta_1 = Cor(Y, X) \frac{Sd(Y)}{Sd(X)} ~~~ \hat \beta_0 = \bar Y - \hat \beta_1 \bar X$$
-* $E[Y ~|~ X = x] = \beta_0 + \beta_1 x$
-* $Var(Y ~|~ X = x) = \sigma^2$
-
----
-## Interpretting regression coefficients, the itc
-* $\beta_0$ is the expected value of the response when the predictor is 0
-$$
-E[Y | X = 0] = \beta_0 + \beta_1 \times 0 = \beta_0
-$$
-* Note, this isn't always of interest, for example when $X=0$ is impossible or far outside of the range of data. (X is blood pressure, or height etc.)
-* Consider that
-$$
-Y_i = \beta_0 + \beta_1 X_i + \epsilon_i
-= \beta_0 + a \beta_1 + \beta_1 (X_i - a) + \epsilon_i
-= \tilde \beta_0 + \beta_1 (X_i - a) + \epsilon_i
-$$
-So, shifting you $X$ values by value $a$ changes the intercept, but not the slope.
-* Often $a$ is set to $\bar X$ so that the intercept is interpretted as the expected response at the average $X$ value.
-
----
-## Interpretting regression coefficients, the slope
-* $\beta_1$ is the expected change in response for a 1 unit change in the predictor
-$$
-E[Y ~|~ X = x+1] - E[Y ~|~ X = x] =
-\beta_0 + \beta_1 (x + 1) - (\beta_0 + \beta_1 x ) = \beta_1
-$$
-* Consider the impact of changing the units of $X$.
-$$
-Y_i = \beta_0 + \beta_1 X_i + \epsilon_i
-= \beta_0 + \frac{\beta_1}{a} (X_i a) + \epsilon_i
-= \beta_0 + \tilde \beta_1 (X_i a) + \epsilon_i
-$$
-* Therefore, multiplication of $X$ by a factor $a$ results in dividing the coefficient by a factor of $a$.
-* Example: $X$ is height in $m$ and $Y$ is weight in $kg$. Then $\beta_1$ is $kg/m$. Converting $X$ to $cm$ implies multiplying $X$ by $100 cm/m$. To get $\beta_1$ in the right units, we have to divide by $100 cm /m$ to get it to have the right units.
-$$
-X m \times \frac{100cm}{m} = (100 X) cm
-~~\mbox{and}~~
-\beta_1 \frac{kg}{m} \times\frac{1 m}{100cm} =
-\left(\frac{\beta_1}{100}\right)\frac{kg}{cm}
-$$
-
----
-## Using regression coeficients for prediction
-* If we would like to guess the outcome at a particular
- value of the predictor, say $X$, the regression model guesses
- $$
- \hat \beta_0 + \hat \beta_1 X
- $$
-* Note that at the observed value of $X$s, we obtain the
- predictions
- $$
- \hat \mu_i = \hat Y_i = \hat \beta_0 + \hat \beta_1 X_i
- $$
-* Remember that least squares minimizes
-$$
-\sum_{i=1}^n (Y_i - \mu_i)
-$$
-for $\mu_i$ expressed as points on a line
-
----
-## Example
-### `diamond` data set from `UsingR`
-Data is diamond prices (Signapore dollars) and diamond weight
-in carats (standard measure of diamond mass, 0.2 $g$). To get the data use `library(UsingR); data(diamond)`
-
-Plotting the fitted regression line and data
-```
-data(diamond)
-plot(diamond$carat, diamond$price,
- xlab = "Mass (carats)",
- ylab = "Price (SIN $)",
- bg = "lightblue",
- col = "black", cex = 1.1, pch = 21,frame = FALSE)
-abline(lm(price ~ carat, data = diamond), lwd = 2)
-```
-
----
-## The plot
-```{r, echo = FALSE, fig.height=5,fig.width=5}
-data(diamond)
-plot(diamond$carat, diamond$price,
- xlab = "Mass (carats)",
- ylab = "Price (SIN $)",
- bg = "lightblue",
- col = "black", cex = 1.1, pch = 21,frame = FALSE)
-abline(lm(price ~ carat, data = diamond), lwd = 2)
-```
-
----
-## Fitting the linear regression model
-```{r}
-fit <- lm(price ~ carat, data = diamond)
-coef(fit)
-```
-
-* We estimate an expected `r round(coef(fit)[2], 2)` (SIN) dollar increase in price for every carat increase in mass of diamond.
-* The intercept `r round(coef(fit)[1], 2)` is the expected price
- of a 0 carat diamond.
-
----
-## Getting a more interpretable intercept
-```{r, echo = TRUE}
-fit2 <- lm(price ~ I(carat - mean(carat)), data = diamond)
-coef(fit2)
-```
-
-Thus $`r round(coef(fit2)[1], 1)` is the expected price for
-the average sized diamond of the data (`r mean(diamond$carat)` carats).
-
----
-## Changing scale
-* A one carat increase in a diamond is pretty big, what about
- changing units to 1/10th of a carat?
-* We can just do this by just dividing the coeficient by 10.
- * We expect a `r round(coef(fit)[2], 2) / 10` (SIN) dollar change in price for every 1/10th of a carat increase in mass of diamond.
-* Showing that it's the same if we rescale the Xs and refit
-```{r, echo = TRUE}
-fit3 <- lm(price ~ I(carat * 10), data = diamond)
-coef(fit3)
-```
-
----
-## Predicting the price of a diamond
-```{r, echo = TRUE}
-newx <- c(0.16, 0.27, 0.34)
-coef(fit)[1] + coef(fit)[2] * newx
-predict(fit, newdata = data.frame(carat = newx))
-```
-
----
-Predicted values at the observed Xs (red)
-and at the new Xs (lines)
-```{r, echo = FALSE, fig.height=5,fig.width=5}
-data(diamond)
-plot(diamond$carat, diamond$price,
- xlab = "Mass (carats)",
- ylab = "Price (SIN $)",
- bg = "lightblue",
- col = "black", cex = 1.1, pch = 21,frame = FALSE)
-abline(fit, lwd = 2)
-points(diamond$carat, predict(fit), pch = 19, col = "red")
-lines(c(0.16, 0.16, 0.12),
- c(200, coef(fit)[1] + coef(fit)[2] * 0.16,
- coef(fit)[1] + coef(fit)[2] * 0.16))
-lines(c(0.27, 0.27, 0.12),
- c(200, coef(fit)[1] + coef(fit)[2] * 0.27,
- coef(fit)[1] + coef(fit)[2] * 0.27))
-lines(c(0.34, 0.34, 0.12),
- c(200, coef(fit)[1] + coef(fit)[2] * 0.34,
- coef(fit)[1] + coef(fit)[2] * 0.34))
-text(newx, rep(250, 3), labels = newx, pos = 2)
-```
-
+---
+title : Statistical linear regression models
+subtitle :
+author : Brian Caffo, Jeff Leek, Roger Peng
+job : Johns Hopkins Bloomberg School of Public Health
+logo : bloomberg_shield.png
+framework : io2012 # {io2012, html5slides, shower, dzslides, ...}
+highlighter : highlight.js # {highlight.js, prettify, highlight}
+hitheme : tomorrow #
+url:
+ lib: ../../libraries
+ assets: ../../assets
+widgets : [mathjax] # {mathjax, quiz, bootstrap}
+mode : selfcontained # {standalone, draft}
+
+---
+```{r setup, cache = FALSE, echo = FALSE, message = FALSE, warning = FALSE, tidy = FALSE}
+# Must explicitly include knitr or the chunks will fail, at least in Linux
+library(knitr)
+# make this an external chunk that can be included in any file
+options(width = 100)
+opts_chunk$set(message = F, error = F, warning = F, comment = NA, fig.align = 'center', dpi = 100, tidy = F, cache.path = '.cache/', fig.path = 'fig/')
+
+options(xtable.type = 'html')
+knit_hooks$set(inline = function(x) {
+ if(is.numeric(x)) {
+ round(x, getOption('digits'))
+ } else {
+ paste(as.character(x), collapse = ', ')
+ }
+})
+knit_hooks$set(plot = knitr:::hook_plot_html)
+```
+## Basic regression model with additive Gaussian errors.
+* Least squares is an estimation tool, how do we do inference?
+* Consider developing a probabilistic model for linear regression
+$$
+Y_i = \beta_0 + \beta_1 X_i + \epsilon_{i}
+$$
+* Here the $\epsilon_{i}$ are assumed iid $N(0, \sigma^2)$.
+* Note, $E[Y_i ~|~ X_i = x_i] = \mu_i = \beta_0 + \beta_1 x_i$
+* Note, $Var(Y_i ~|~ X_i = x_i) = \sigma^2$.
+* Likelihood equivalent model specification is that the $Y_i$ are independent $N(\mu_i, \sigma^2)$.
+
+
+## Likelihood
+$$
+{\cal L}(\beta, \sigma)
+= \prod_{i=1}^n \left\{(2 \pi \sigma^2)^{-1/2}\exp\left(-\frac{1}{2\sigma^2}(y_i - \mu_i)^2 \right) \right\}
+$$
+so that twice the negative log (base e) likelihood is
+$$
+-2 \log\{ {\cal L}(\beta, \sigma) \}
+= \frac{1}{\sigma^2} \sum_{i=1}^n (y_i - \mu_i)^2 + n\log(\sigma^2)
+$$
+Discussion
+* Maximizing the likelihood is the same as minimizing -2 log likelihood
+* The least squares estimate for $\mu_i = \beta_0 + \beta_1 x_i$ is exactly the maximum likelihood estimate (regardless of $\sigma$), as the sketch below checks numerically
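+
+To see this numerically, here is a minimal sketch on simulated data: `optim` minimizes minus twice the log likelihood over $\beta_0$, $\beta_1$ and $\log \sigma$, and lands on the least squares coefficients.
+
+```
+# Sketch: minimizing -2 log likelihood reproduces the least squares fit
+set.seed(3)
+x <- runif(50); y <- 1 + 2 * x + rnorm(50, sd = 0.3)
+neg2LogLik <- function(par) {
+  mu <- par[1] + par[2] * x
+  sigma <- exp(par[3])                        # sigma parameterized on the log scale
+  sum((y - mu)^2) / sigma^2 + length(y) * log(sigma^2)
+}
+mlFit <- optim(c(0, 0, 0), neg2LogLik)
+rbind(ml = mlFit$par[1:2], leastSquares = coef(lm(y ~ x)))  # agree up to optimizer tolerance
+```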
+
+
+## Recap
+* Model $Y_i = \mu_i + \epsilon_i = \beta_0 + \beta_1 X_i + \epsilon_i$ where $\epsilon_i$ are iid $N(0, \sigma^2)$
+* ML estimates of $\beta_0$ and $\beta_1$ are the least squares estimates
+ $$\hat \beta_1 = Cor(Y, X) \frac{Sd(Y)}{Sd(X)} ~~~ \hat \beta_0 = \bar Y - \hat \beta_1 \bar X$$
+* $E[Y ~|~ X = x] = \beta_0 + \beta_1 x$
+* $Var(Y ~|~ X = x) = \sigma^2$
+
+
+## Interpreting regression coefficients, the intercept
+* $\beta_0$ is the expected value of the response when the predictor is 0
+$$
+E[Y | X = 0] = \beta_0 + \beta_1 \times 0 = \beta_0
+$$
+* Note, this isn't always of interest, for example when $X=0$ is impossible or far outside the range of the data (e.g., when $X$ is blood pressure or height).
+* Consider that
+$$
+Y_i = \beta_0 + \beta_1 X_i + \epsilon_i
+= \beta_0 + a \beta_1 + \beta_1 (X_i - a) + \epsilon_i
+= \tilde \beta_0 + \beta_1 (X_i - a) + \epsilon_i
+$$
+So, shifting your $X$ values by a value $a$ changes the intercept, but not the slope (the short check below illustrates this).
+* Often $a$ is set to $\bar X$, so that the intercept is interpreted as the expected response at the average $X$ value.
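+
+A minimal check of the shift claim, using simulated data (the shift value of 10 is arbitrary):
+
+```
+# Sketch: shifting X changes only the intercept, not the slope
+set.seed(4)
+x <- rnorm(100, mean = 20); y <- 3 + 0.5 * x + rnorm(100)
+coef(lm(y ~ x))             # original intercept and slope
+coef(lm(y ~ I(x - 10)))     # intercept increases by 0.5 * 10; slope unchanged
+```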
+
+
+## Interpreting regression coefficients, the slope
+* $\beta_1$ is the expected change in response for a 1 unit change in the predictor
+$$
+E[Y ~|~ X = x+1] - E[Y ~|~ X = x] =
+\beta_0 + \beta_1 (x + 1) - (\beta_0 + \beta_1 x ) = \beta_1
+$$
+* Consider the impact of changing the units of $X$.
+$$
+Y_i = \beta_0 + \beta_1 X_i + \epsilon_i
+= \beta_0 + \frac{\beta_1}{a} (X_i a) + \epsilon_i
+= \beta_0 + \tilde \beta_1 (X_i a) + \epsilon_i
+$$
+* Therefore, multiplication of $X$ by a factor $a$ results in dividing the coefficient by a factor of $a$.
+* Example: $X$ is height in $m$ and $Y$ is weight in $kg$. Then $\beta_1$ has units $kg/m$. Converting $X$ to $cm$ means multiplying $X$ by $100 cm/m$, so to keep the fit the same we have to divide $\beta_1$ by $100 cm/m$, giving units of $kg/cm$ (see the check after the display below).
+$$
+X m \times \frac{100cm}{m} = (100 X) cm
+~~\mbox{and}~~
+\beta_1 \frac{kg}{m} \times\frac{1 m}{100cm} =
+\left(\frac{\beta_1}{100}\right)\frac{kg}{cm}
+$$
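+
+The same point numerically, as a sketch with simulated heights in meters and weights in kilograms:
+
+```
+# Sketch: rescaling X by a factor divides its coefficient by that factor
+set.seed(5)
+height_m <- rnorm(100, mean = 1.7, sd = 0.1)
+weight_kg <- 60 * height_m - 40 + rnorm(100, sd = 5)
+coef(lm(weight_kg ~ height_m))[2]           # kg per m
+coef(lm(weight_kg ~ I(height_m * 100)))[2]  # kg per cm = previous value / 100
+```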
+
+
+## Using regression coefficients for prediction
+* If we would like to guess the outcome at a particular
+ value of the predictor, say $X$, the regression model guesses
+ $$
+ \hat \beta_0 + \hat \beta_1 X
+ $$
+* Note that at the observed value of $X$s, we obtain the
+ predictions
+ $$
+ \hat \mu_i = \hat Y_i = \hat \beta_0 + \hat \beta_1 X_i
+ $$
+* Remember that least squares minimizes
+$$
+\sum_{i=1}^n (Y_i - \mu_i)^2
+$$
+for $\mu_i$ expressed as points on a line
+
+
+## Example
+### `diamond` data set from `UsingR`
+The data are diamond prices (Singapore dollars) and diamond weight
+in carats (standard measure of diamond mass, 0.2 $g$). To get the data use `library(UsingR); data(diamond)`
+
+Plotting the fitted regression line and data
+```{r}
+library(UsingR)
+data(diamond)
+plot(diamond$carat, diamond$price,
+ xlab = "Mass (carats)",
+ ylab = "Price (SIN $)",
+ bg = "lightblue",
+ col = "black", cex = 1.1, pch = 21,frame = FALSE)
+abline(lm(price ~ carat, data = diamond), lwd = 2)
+```
+
+
+## The plot
+```{r, echo = FALSE, fig.height=5,fig.width=5}
+data(diamond)
+plot(diamond$carat, diamond$price,
+ xlab = "Mass (carats)",
+ ylab = "Price (SIN $)",
+ bg = "lightblue",
+ col = "black", cex = 1.1, pch = 21,frame = FALSE)
+abline(lm(price ~ carat, data = diamond), lwd = 2)
+```
+
+
+## Fitting the linear regression model
+```{r}
+fit <- lm(price ~ carat, data = diamond)
+coef(fit)
+```
+
+* We estimate an expected `r round(coef(fit)[2], 2)` (SIN) dollar increase in price for every carat increase in the mass of the diamond.
+* The intercept `r round(coef(fit)[1], 2)` is the expected price
+ of a 0 carat diamond.
+
+
+## Getting a more interpretable intercept
+```{r, echo = TRUE}
+fit2 <- lm(price ~ I(carat - mean(carat)), data = diamond)
+coef(fit2)
+```
+
+Thus $`r round(coef(fit2)[1], 1)` is the expected price for
+the average-sized diamond in the data (`r mean(diamond$carat)` carats).
+
+
+## Changing scale
+* A one carat increase in a diamond is pretty big; what about
+  changing units to 1/10th of a carat?
+* We can do this by simply dividing the coefficient by 10.
+  * We expect a `r round(coef(fit)[2], 2) / 10` (SIN) dollar change in price for every 1/10th of a carat increase in the mass of the diamond.
+* Showing that it's the same if we rescale the Xs and refit
+```{r, echo = TRUE}
+fit3 <- lm(price ~ I(carat * 10), data = diamond)
+coef(fit3)
+```
+
+
+## Predicting the price of a diamond
+```{r, echo = TRUE}
+newx <- c(0.16, 0.27, 0.34)
+coef(fit)[1] + coef(fit)[2] * newx
+predict(fit, newdata = data.frame(carat = newx))
+```
+
+
+Predicted values at the observed Xs (red)
+and at the new Xs (lines)
+```{r, echo = FALSE, fig.height=5,fig.width=5}
+data(diamond)
+plot(diamond$carat, diamond$price,
+ xlab = "Mass (carats)",
+ ylab = "Price (SIN $)",
+ bg = "lightblue",
+ col = "black", cex = 1.1, pch = 21,frame = FALSE)
+abline(fit, lwd = 2)
+points(diamond$carat, predict(fit), pch = 19, col = "red")
+lines(c(0.16, 0.16, 0.12),
+ c(200, coef(fit)[1] + coef(fit)[2] * 0.16,
+ coef(fit)[1] + coef(fit)[2] * 0.16))
+lines(c(0.27, 0.27, 0.12),
+ c(200, coef(fit)[1] + coef(fit)[2] * 0.27,
+ coef(fit)[1] + coef(fit)[2] * 0.27))
+lines(c(0.34, 0.34, 0.12),
+ c(200, coef(fit)[1] + coef(fit)[2] * 0.34,
+ coef(fit)[1] + coef(fit)[2] * 0.34))
+text(newx, rep(250, 3), labels = newx, pos = 2)
+```
+
diff --git a/07_RegressionModels/01_05_linearRegression/index.html b/07_RegressionModels/01_05_linearRegression/index.html
index 876178f27..821d96383 100644
--- a/07_RegressionModels/01_05_linearRegression/index.html
+++ b/07_RegressionModels/01_05_linearRegression/index.html
@@ -1,373 +1,302 @@
[rendered HTML slides for 01_05_linearRegression/index.html — headings, bullet text, and MathJax mirror the index.Rmd changes above; markup stripped]
diff --git a/07_RegressionModels/01_06_residualVariation/index.Rmd b/07_RegressionModels/01_06_residualVariation/index.Rmd
index f4c090164..80ad1c0f9 100644
--- a/07_RegressionModels/01_06_residualVariation/index.Rmd
+++ b/07_RegressionModels/01_06_residualVariation/index.Rmd
@@ -1,243 +1,246 @@
----
-title : Residuals and residual variation
-subtitle :
-author : Brian Caffo, Jeff Leek and Roger Peng
-job : Johns Hopkins Bloomberg School of Public Health
-logo : bloomberg_shield.png
-framework : io2012 # {io2012, html5slides, shower, dzslides, ...}
-highlighter : highlight.js # {highlight.js, prettify, highlight}
-hitheme : tomorrow #
-url:
- lib: ../../libraries
- assets: ../../assets
-widgets : [mathjax] # {mathjax, quiz, bootstrap}
-mode : selfcontained # {standalone, draft}
-
----
-```{r setup, cache = FALSE, echo = FALSE, message = FALSE, warning = FALSE, tidy = FALSE}
-# make this an external chunk that can be included in any file
-options(width = 100)
-opts_chunk$set(message = F, error = F, warning = F, comment = NA, fig.align = 'center', dpi = 100, tidy = F, cache.path = '.cache/', fig.path = 'fig/')
-
-options(xtable.type = 'html')
-knit_hooks$set(inline = function(x) {
- if(is.numeric(x)) {
- round(x, getOption('digits'))
- } else {
- paste(as.character(x), collapse = ', ')
- }
-})
-knit_hooks$set(plot = knitr:::hook_plot_html)
-```
-## Residuals
-* Model $Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$ where $\epsilon_i \sim N(0, \sigma^2)$.
-* Observed outcome $i$ is $Y_i$ at predictor value $X_i$
-* Predicted outcome $i$ is $\hat Y_i$ at predictor valuve $X_i$ is
- $$
- \hat Y_i = \hat \beta_0 + \hat \beta_1 X_i
- $$
-* Residual, the between the observed and predicted outcome
- $$
- e_i = Y_i - \hat Y_i
- $$
- * The vertical distance between the observed data point and the regression line
-* Least squares minimizes $\sum_{i=1}^n e_i^2$
-* The $e_i$ can be thought of as estimates of the $\epsilon_i$.
-
----
-## Properties of the residuals
-* $E[e_i] = 0$.
-* If an intercept is included, $\sum_{i=1}^n e_i = 0$
-* If a regressor variable, $X_i$, is included in the model $\sum_{i=1}^n e_i X_i = 0$.
-* Residuals are useful for investigating poor model fit.
-* Positive residuals are above the line, negative residuals are below.
-* Residuals can be thought of as the outcome ($Y$) with the
- linear association of the predictor ($X$) removed.
-* One differentiates residual variation (variation after removing
-the predictor) from systematic variation (variation explained by the regression model).
-* Residual plots highlight poor model fit.
-
----
-## Code
-
-```{r}
-data(diamond)
-y <- diamond$price; x <- diamond$carat; n <- length(y)
-fit <- lm(y ~ x)
-e <- resid(fit)
-yhat <- predict(fit)
-max(abs(e -(y - yhat)))
-max(abs(e - (y - coef(fit)[1] - coef(fit)[2] * x)))
-```
-
----
-## Residuals are the signed length of the red lines
-```{r, echo = FALSE, fig.height=5, fig.width=5}
-plot(diamond$carat, diamond$price,
- xlab = "Mass (carats)",
- ylab = "Price (SIN $)",
- bg = "lightblue",
- col = "black", cex = 1.1, pch = 21,frame = FALSE)
-abline(fit, lwd = 2)
-for (i in 1 : n)
- lines(c(x[i], x[i]), c(y[i], yhat[i]), col = "red" , lwd = 2)
-```
-
----
-## Residuals versus X
-```{r, echo = FALSE, fig.height=5, fig.width=5}
-plot(diamond$carat, e,
- xlab = "Mass (carats)",
- ylab = "Residuals (SIN $)",
- bg = "lightblue",
- col = "black", cex = 1.1, pch = 21,frame = FALSE)
-abline(h = 0, lwd = 2)
-for (i in 1 : n)
- lines(c(x[i], x[i]), c(e[i], 0), col = "red" , lwd = 2)
-```
-
----
-## Non-linear data
-```{r, echo = TRUE, fig.height=5, fig.width=5}
-x <- runif(100, -3, 3); y <- x + sin(x) + rnorm(100, sd = .2);
-plot(x, y); abline(lm(y ~ x))
-```
-
----
-```{r, echo = TRUE, fig.height=5, fig.width=5}
-plot(x, resid(lm(y ~ x)));
-abline(h = 0)
-```
-
----
-## Heteroskedasticity
-```{r, echo = TRUE, fig.height=4.5, fig.width=4.5}
-x <- runif(100, 0, 6); y <- x + rnorm(100, mean = 0, sd = .001 * x);
-plot(x, y); abline(lm(y ~ x))
-```
-
----
-## Getting rid of the blank space can be helpful
-```{r, echo = TRUE, fig.height=4.5, fig.width=4.5}
-plot(x, resid(lm(y ~ x)));
-abline(h = 0)
-```
-
----
-## Estimating residual variation
-* Model $Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$ where $\epsilon_i \sim N(0, \sigma^2)$.
-* The ML estimate of $\sigma^2$ is $\frac{1}{n}\sum_{i=1}^n e_i^2$,
-the average squared residual.
-* Most people use
- $$
- \hat \sigma^2 = \frac{1}{n-2}\sum_{i=1}^n e_i^2.
- $$
-* The $n-2$ instead of $n$ is so that $E[\hat \sigma^2] = \sigma^2$
-
----
-## Diamond example
-```{r, echo = TRUE}
-y <- diamond$price; x <- diamond$carat; n <- length(y)
-fit <- lm(y ~ x)
-summary(fit)$sigma
-sqrt(sum(resid(fit)^2) / (n - 2))
-```
-
----
-## Summarizing variation
-$$
-\begin{align}
-\sum_{i=1}^n (Y_i - \bar Y)^2
-& = \sum_{i=1}^n (Y_i - \hat Y_i + \hat Y_i - \bar Y)^2 \\
-& = \sum_{i=1}^n (Y_i - \hat Y_i)^2 +
-2 \sum_{i=1}^n (Y_i - \hat Y_i)(\hat Y_i - \bar Y) +
-\sum_{i=1}^n (\hat Y_i - \bar Y)^2 \\
-\end{align}
-$$
-
-****
-### Scratch work
-$(Y_i - \hat Y_i) = \{Y_i - (\bar Y - \hat \beta_1 \bar X) - \hat \beta_1 X_i\} = (Y_i - \bar Y) - \hat \beta_1 (X_i - \bar X)$
-
-$(\hat Y_i - \bar Y) = (\bar Y - \hat \beta_1 \bar X - \hat \beta_1 X_i - \bar Y )
-= \hat \beta_1 (X_i - \bar X)$
-
-$\sum_{i=1}^n (Y_i - \hat Y_i)(\hat Y_i - \bar Y)
-= \sum_{i=1}^n \{(Y_i - \bar Y) - \hat \beta_1 (X_i - \bar X))\}\{\hat \beta_1 (X_i - \bar X)\}$
-
-$=\hat \beta_1 \sum_{i=1}^n (Y_i - \bar Y)(X_i - \bar X) -\hat\beta_1^2\sum_{i=1}^n (X_i - \bar X)^2$
-
-$= \hat \beta_1^2 \sum_{i=1}^n (X_i - \bar X)^2-\hat\beta_1^2\sum_{i=1}^n (X_i - \bar X)^2 = 0$
-
----
-## Summarizing variation
-$$
-\sum_{i=1}^n (Y_i - \bar Y)^2
-= \sum_{i=1}^n (Y_i - \hat Y_i)^2 + \sum_{i=1}^n (\hat Y_i - \bar Y)^2
-$$
-
-Or
-
-Total Variation = Residual Variation + Regression Variation
-
-Define the percent of total varation described by the model as
-$$
-R^2 = \frac{\sum_{i=1}^n (\hat Y_i - \bar Y)^2}{\sum_{i=1}^n (Y_i - \bar Y)^2}
-= 1 - \frac{\sum_{i=1}^n (Y_i - \hat Y_i)^2}{\sum_{i=1}^n (Y_i - \bar Y)^2}
-$$
-
----
-## Relation between $R^2$ and $r$ (the corrrelation)
-Recall that $(\hat Y_i - \bar Y) = \hat \beta_1 (X_i - \bar X)$
-so that
-$$
-R^2 = \frac{\sum_{i=1}^n (\hat Y_i - \bar Y)^2}{\sum_{i=1}^n (Y_i - \bar Y)^2}
-= \hat \beta_1^2 \frac{\sum_{i=1}^n(X_i - \bar X)}{\sum_{i=1}^n (Y_i - \bar Y)^2}
-= Cor(Y, X)^2
-$$
-Since, recall,
-$$
-\hat \beta_1 = Cor(Y, X)\frac{Sd(Y)}{Sd(X)}
-$$
-So, $R^2$ is literally $r$ squared.
-
----
-## Some facts about $R^2$
-* $R^2$ is the percentage of variation explained by the regression model.
-* $0 \leq R^2 \leq 1$
-* $R^2$ is the sample correlation squared.
-* $R^2$ can be a misleading summary of model fit.
- * Deleting data can inflate $R^2$.
- * (For later.) Adding terms to a regression model always increases $R^2$.
-* Do `example(anscombe)` to see the following data.
- * Basically same mean and variance of X and Y.
- * Identical correlations (hence same $R^2$ ).
- * Same linear regression relationship.
-
----
-## `data(anscombe);example(anscombe)`
-```{r, echo = FALSE, fig.height=5, fig.width=5, results='hide'}
-require(stats); require(graphics); data(anscombe)
-ff <- y ~ x
-mods <- setNames(as.list(1:4), paste0("lm", 1:4))
-for(i in 1:4) {
- ff[2:3] <- lapply(paste0(c("y","x"), i), as.name)
- ## or ff[[2]] <- as.name(paste0("y", i))
- ## ff[[3]] <- as.name(paste0("x", i))
- mods[[i]] <- lmi <- lm(ff, data = anscombe)
- #print(anova(lmi))
-}
-
-
-## Now, do what you should have done in the first place: PLOTS
-op <- par(mfrow = c(2, 2), mar = 0.1+c(4,4,1,1), oma = c(0, 0, 2, 0))
-for(i in 1:4) {
- ff[2:3] <- lapply(paste0(c("y","x"), i), as.name)
- plot(ff, data = anscombe, col = "red", pch = 21, bg = "orange", cex = 1.2,
- xlim = c(3, 19), ylim = c(3, 13))
- abline(mods[[i]], col = "blue")
-}
-mtext("Anscombe's 4 Regression data sets", outer = TRUE, cex = 1.5)
-par(op)
-```
-
+---
+title : Residuals and residual variation
+subtitle :
+author : Brian Caffo, Jeff Leek and Roger Peng
+job : Johns Hopkins Bloomberg School of Public Health
+logo : bloomberg_shield.png
+framework : io2012 # {io2012, html5slides, shower, dzslides, ...}
+highlighter : highlight.js # {highlight.js, prettify, highlight}
+hitheme : tomorrow #
+url:
+ lib: ../../libraries
+ assets: ../../assets
+widgets : [mathjax] # {mathjax, quiz, bootstrap}
+mode : selfcontained # {standalone, draft}
+
+---
+```{r setup, cache = FALSE, echo = FALSE, message = FALSE, warning = FALSE, tidy = FALSE}
+# Must explicitly include knitr or the chunks will fail, at least in Linux
+library(knitr)
+# make this an external chunk that can be included in any file
+options(width = 100)
+opts_chunk$set(message = F, error = F, warning = F, comment = NA, fig.align = 'center', dpi = 100, tidy = F, cache.path = '.cache/', fig.path = 'fig/')
+
+options(xtable.type = 'html')
+knit_hooks$set(inline = function(x) {
+ if(is.numeric(x)) {
+ round(x, getOption('digits'))
+ } else {
+ paste(as.character(x), collapse = ', ')
+ }
+})
+knit_hooks$set(plot = knitr:::hook_plot_html)
+```
+## Residuals
+* Model $Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$ where $\epsilon_i \sim N(0, \sigma^2)$.
+* Observed outcome $i$ is $Y_i$ at predictor value $X_i$
+* Predicted outcome $i$ at predictor value $X_i$ is
+ $$
+ \hat Y_i = \hat \beta_0 + \hat \beta_1 X_i
+ $$
+* The residual is the difference between the observed and predicted outcome
+ $$
+ e_i = Y_i - \hat Y_i
+ $$
+ * The vertical distance between the observed data point and the regression line
+* Least squares minimizes $\sum_{i=1}^n e_i^2$
+* The $e_i$ can be thought of as estimates of the $\epsilon_i$.
+
+
+## Properties of the residuals
+* $E[e_i] = 0$.
+* If an intercept is included, $\sum_{i=1}^n e_i = 0$
+* If a regressor variable, $X_i$, is included in the model, then $\sum_{i=1}^n e_i X_i = 0$ (both of these facts are checked numerically in the sketch after this list).
+* Residuals are useful for investigating poor model fit.
+* Positive residuals are above the line, negative residuals are below.
+* Residuals can be thought of as the outcome ($Y$) with the
+ linear association of the predictor ($X$) removed.
+* One differentiates residual variation (variation after removing
+the predictor) from systematic variation (variation explained by the regression model).
+* Residual plots highlight poor model fit.
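+
+A quick numerical check of the first three properties (a sketch that refits the diamond model so it stands alone):
+
+```
+# Sketch: residuals sum to zero and are orthogonal to the regressor
+library(UsingR); data(diamond)
+fitCheck <- lm(price ~ carat, data = diamond)
+eCheck <- resid(fitCheck)
+c(sum(eCheck), sum(eCheck * diamond$carat))   # both essentially zero
+```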
+
+
+## Code
+
+```{r}
+library(UsingR)
+data(diamond)
+y <- diamond$price; x <- diamond$carat; n <- length(y)
+fit <- lm(y ~ x)
+e <- resid(fit)
+yhat <- predict(fit)
+max(abs(e -(y - yhat)))
+max(abs(e - (y - coef(fit)[1] - coef(fit)[2] * x)))
+```
+
+
+## Residuals are the signed length of the red lines
+```{r, echo = FALSE, fig.height=5, fig.width=5}
+plot(diamond$carat, diamond$price,
+ xlab = "Mass (carats)",
+ ylab = "Price (SIN $)",
+ bg = "lightblue",
+ col = "black", cex = 1.1, pch = 21,frame = FALSE)
+abline(fit, lwd = 2)
+for (i in 1 : n)
+ lines(c(x[i], x[i]), c(y[i], yhat[i]), col = "red" , lwd = 2)
+```
+
+
+## Residuals versus X
+```{r, echo = FALSE, fig.height=5, fig.width=5}
+plot(diamond$carat, e,
+ xlab = "Mass (carats)",
+ ylab = "Residuals (SIN $)",
+ bg = "lightblue",
+ col = "black", cex = 1.1, pch = 21,frame = FALSE)
+abline(h = 0, lwd = 2)
+for (i in 1 : n)
+ lines(c(x[i], x[i]), c(e[i], 0), col = "red" , lwd = 2)
+```
+
+
+## Non-linear data
+```{r, echo = TRUE, fig.height=5, fig.width=5}
+x <- runif(100, -3, 3); y <- x + sin(x) + rnorm(100, sd = .2);
+plot(x, y); abline(lm(y ~ x))
+```
+
+
+```{r, echo = TRUE, fig.height=5, fig.width=5}
+plot(x, resid(lm(y ~ x)));
+abline(h = 0)
+```
+
+
+## Heteroskedasticity
+```{r, echo = TRUE, fig.height=4.5, fig.width=4.5}
+x <- runif(100, 0, 6); y <- x + rnorm(100, mean = 0, sd = .001 * x);
+plot(x, y); abline(lm(y ~ x))
+```
+
+
+## Getting rid of the blank space can be helpful
+```{r, echo = TRUE, fig.height=4.5, fig.width=4.5}
+plot(x, resid(lm(y ~ x)));
+abline(h = 0)
+```
+
+
+## Estimating residual variation
+* Model $Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$ where $\epsilon_i \sim N(0, \sigma^2)$.
+* The ML estimate of $\sigma^2$ is $\frac{1}{n}\sum_{i=1}^n e_i^2$,
+the average squared residual.
+* Most people use
+ $$
+ \hat \sigma^2 = \frac{1}{n-2}\sum_{i=1}^n e_i^2.
+ $$
+* The $n-2$ instead of $n$ is so that $E[\hat \sigma^2] = \sigma^2$
+
+
+## Diamond example
+```{r, echo = TRUE}
+y <- diamond$price; x <- diamond$carat; n <- length(y)
+fit <- lm(y ~ x)
+summary(fit)$sigma
+sqrt(sum(resid(fit)^2) / (n - 2))
+```
+
+
+## Summarizing variation
+$$
+\begin{align}
+\sum_{i=1}^n (Y_i - \bar Y)^2
+& = \sum_{i=1}^n (Y_i - \hat Y_i + \hat Y_i - \bar Y)^2 \\
+& = \sum_{i=1}^n (Y_i - \hat Y_i)^2 +
+2 \sum_{i=1}^n (Y_i - \hat Y_i)(\hat Y_i - \bar Y) +
+\sum_{i=1}^n (\hat Y_i - \bar Y)^2 \\
+\end{align}
+$$
+
+****
+### Scratch work
+$(Y_i - \hat Y_i) = \{Y_i - (\bar Y - \hat \beta_1 \bar X) - \hat \beta_1 X_i\} = (Y_i - \bar Y) - \hat \beta_1 (X_i - \bar X)$
+
+$(\hat Y_i - \bar Y) = (\bar Y - \hat \beta_1 \bar X - \hat \beta_1 X_i - \bar Y )
+= \hat \beta_1 (X_i - \bar X)$
+
+$\sum_{i=1}^n (Y_i - \hat Y_i)(\hat Y_i - \bar Y)
+= \sum_{i=1}^n \{(Y_i - \bar Y) - \hat \beta_1 (X_i - \bar X))\}\{\hat \beta_1 (X_i - \bar X)\}$
+
+$=\hat \beta_1 \sum_{i=1}^n (Y_i - \bar Y)(X_i - \bar X) -\hat\beta_1^2\sum_{i=1}^n (X_i - \bar X)^2$
+
+$= \hat \beta_1^2 \sum_{i=1}^n (X_i - \bar X)^2-\hat\beta_1^2\sum_{i=1}^n (X_i - \bar X)^2 = 0$
+
+
+## Summarizing variation
+$$
+\sum_{i=1}^n (Y_i - \bar Y)^2
+= \sum_{i=1}^n (Y_i - \hat Y_i)^2 + \sum_{i=1}^n (\hat Y_i - \bar Y)^2
+$$
+
+Or
+
+Total Variation = Residual Variation + Regression Variation
+
+Define the percent of total variation described by the model as
+$$
+R^2 = \frac{\sum_{i=1}^n (\hat Y_i - \bar Y)^2}{\sum_{i=1}^n (Y_i - \bar Y)^2}
+= 1 - \frac{\sum_{i=1}^n (Y_i - \hat Y_i)^2}{\sum_{i=1}^n (Y_i - \bar Y)^2}
+$$
+
+
+## Relation between $R^2$ and $r$ (the correlation)
+Recall that $(\hat Y_i - \bar Y) = \hat \beta_1 (X_i - \bar X)$
+so that
+$$
+R^2 = \frac{\sum_{i=1}^n (\hat Y_i - \bar Y)^2}{\sum_{i=1}^n (Y_i - \bar Y)^2}
+= \hat \beta_1^2 \frac{\sum_{i=1}^n(X_i - \bar X)^2}{\sum_{i=1}^n (Y_i - \bar Y)^2}
+= Cor(Y, X)^2
+$$
+Since, recall,
+$$
+\hat \beta_1 = Cor(Y, X)\frac{Sd(Y)}{Sd(X)}
+$$
+So, $R^2$ is literally $r$ squared.
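+
+A quick numerical confirmation with the diamond fit (a sketch; `fit`, `x`, and `y` are as defined in the code above):
+
+```
+# Sketch: R^2 from sums of squares equals the squared correlation
+ssReg <- sum((predict(fit) - mean(y))^2)
+ssTot <- sum((y - mean(y))^2)
+c(ssReg / ssTot, cor(y, x)^2, summary(fit)$r.squared)   # all three agree
+```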
+
+
+## Some facts about $R^2$
+* $R^2$ is the percentage of variation explained by the regression model.
+* $0 \leq R^2 \leq 1$
+* $R^2$ is the sample correlation squared.
+* $R^2$ can be a misleading summary of model fit.
+ * Deleting data can inflate $R^2$.
+ * (For later.) Adding terms to a regression model always increases $R^2$.
+* Do `example(anscombe)` to see the following data.
+ * Basically same mean and variance of X and Y.
+ * Identical correlations (hence same $R^2$ ).
+ * Same linear regression relationship.
+
+
+## `data(anscombe);example(anscombe)`
+```{r, echo = FALSE, fig.height=5, fig.width=5, results='hide'}
+require(stats); require(graphics); data(anscombe)
+ff <- y ~ x
+mods <- setNames(as.list(1:4), paste0("lm", 1:4))
+for(i in 1:4) {
+ ff[2:3] <- lapply(paste0(c("y","x"), i), as.name)
+ ## or ff[[2]] <- as.name(paste0("y", i))
+ ## ff[[3]] <- as.name(paste0("x", i))
+ mods[[i]] <- lmi <- lm(ff, data = anscombe)
+ #print(anova(lmi))
+}
+
+
+## Now, do what you should have done in the first place: PLOTS
+op <- par(mfrow = c(2, 2), mar = 0.1+c(4,4,1,1), oma = c(0, 0, 2, 0))
+for(i in 1:4) {
+ ff[2:3] <- lapply(paste0(c("y","x"), i), as.name)
+ plot(ff, data = anscombe, col = "red", pch = 21, bg = "orange", cex = 1.2,
+ xlim = c(3, 19), ylim = c(3, 13))
+ abline(mods[[i]], col = "blue")
+}
+mtext("Anscombe's 4 Regression data sets", outer = TRUE, cex = 1.5)
+par(op)
+```
+
diff --git a/07_RegressionModels/01_06_residualVariation/index.html b/07_RegressionModels/01_06_residualVariation/index.html
index d2ca4d492..861c2ca41 100644
--- a/07_RegressionModels/01_06_residualVariation/index.html
+++ b/07_RegressionModels/01_06_residualVariation/index.html
@@ -1,404 +1,318 @@
[rendered HTML slides for 01_06_residualVariation/index.html — headings, bullet text, and MathJax mirror the index.Rmd changes above; markup stripped]
diff --git a/07_RegressionModels/01_07_inference/index.Rmd b/07_RegressionModels/01_07_inference/index.Rmd
index 57a40b612..08a2ca623 100644
--- a/07_RegressionModels/01_07_inference/index.Rmd
+++ b/07_RegressionModels/01_07_inference/index.Rmd
@@ -1,214 +1,216 @@
----
-title : Inference in regression
-subtitle :
-author : Brian Caffo, Jeff Leek and Roger Peng
-job : Johns Hopkins Bloomberg School of Public Health
-logo : bloomberg_shield.png
-framework : io2012 # {io2012, html5slides, shower, dzslides, ...}
-highlighter : highlight.js # {highlight.js, prettify, highlight}
-hitheme : tomorrow #
-url:
- lib: ../../libraries
- assets: ../../assets
-widgets : [mathjax] # {mathjax, quiz, bootstrap}
-mode : selfcontained # {standalone, draft}
----
-```{r setup, cache = F, echo = F, message = F, warning = F, tidy = F, results='hide'}
-# make this an external chunk that can be included in any file
-options(width = 100)
-opts_chunk$set(message = F, error = F, warning = F, comment = NA, fig.align = 'center', dpi = 100, tidy = F, cache.path = '.cache/', fig.path = 'fig/')
-
-options(xtable.type = 'html')
-knit_hooks$set(inline = function(x) {
- if(is.numeric(x)) {
- round(x, getOption('digits'))
- } else {
- paste(as.character(x), collapse = ', ')
- }
-})
-knit_hooks$set(plot = knitr:::hook_plot_html)
-runif(1)
-```
-## Recall our model and fitted values
-* Consider the model
-$$
-Y_i = \beta_0 + \beta_1 X_i + \epsilon_i
-$$
-* $\epsilon \sim N(0, \sigma^2)$.
-* We assume that the true model is known.
-* We assume that you've seen confidence intervals and hypothesis tests before.
-* $\hat \beta_0 = \bar Y - \hat \beta_1 \bar X$
-* $\hat \beta_1 = Cor(Y, X) \frac{Sd(Y)}{Sd(X)}$.
-
----
-## Review
-* Statistics like $\frac{\hat \theta - \theta}{\hat \sigma_{\hat \theta}}$ often have the following properties:
-  1. They are normally distributed, and have a finite sample Student's $t$ distribution if the estimated variance is used in place of the true variance (under normality assumptions).
-  2. They can be used to test $H_0 : \theta = \theta_0$ versus $H_a : \theta >, <, \neq \theta_0$.
-  3. They can be used to create a confidence interval for $\theta$ via $\hat \theta \pm Q_{1-\alpha/2} \hat \sigma_{\hat \theta}$
- where $Q_{1-\alpha/2}$ is the relevant quantile from either a normal or T distribution.
-* In the case of regression with iid sampling assumptions and normal errors, our inferences will follow
-very similarly to what you saw in your inference class.
-* We won't cover asymptotics for regression analysis, but suffice it to say that under assumptions
-on the ways in which the $X$ values are collected, the iid sampling model, and mean model,
-the usual normal-theory results hold for creating confidence intervals and performing hypothesis tests.
-
----
-## Standard errors (conditioned on X)
-$$
-\begin{align}
-Var(\hat \beta_1) & =
-Var\left(\frac{\sum_{i=1}^n (Y_i - \bar Y) (X_i - \bar X)}{\sum_{i=1}^n (X_i - \bar X)^2}\right) \\
-& = \frac{Var\left(\sum_{i=1}^n Y_i (X_i - \bar X) \right) }{\left(\sum_{i=1}^n (X_i - \bar X)^2 \right)^2} \\
-& = \frac{\sum_{i=1}^n \sigma^2(X_i - \bar X)^2}{\left(\sum_{i=1}^n (X_i - \bar X)^2 \right)^2} \\
-& = \frac{\sigma^2}{\sum_{i=1}^n (X_i - \bar X)^2} \\
-\end{align}
-$$
-
----
-## Results
-* $\sigma_{\hat \beta_1}^2 = Var(\hat \beta_1) = \sigma^2 / \sum_{i=1}^n (X_i - \bar X)^2$
-* $\sigma_{\hat \beta_0}^2 = Var(\hat \beta_0) = \left(\frac{1}{n} + \frac{\bar X^2}{\sum_{i=1}^n (X_i - \bar X)^2 }\right)\sigma^2$
-* In practice, $\sigma$ is replaced by its estimate.
-* It's probably not surprising that under iid Gaussian errors
-$$
-\frac{\hat \beta_j - \beta_j}{\hat \sigma_{\hat \beta_j}}
-$$
-follows a $t$ distribution with $n-2$ degrees of freedom and a normal distribution for large $n$.
-* This can be used to create confidence intervals and perform
-hypothesis tests.
-
----
-## Example diamond data set
-```{r}
-library(UsingR); data(diamond)
-y <- diamond$price; x <- diamond$carat; n <- length(y)
-beta1 <- cor(y, x) * sd(y) / sd(x)
-beta0 <- mean(y) - beta1 * mean(x)
-e <- y - beta0 - beta1 * x
-sigma <- sqrt(sum(e^2) / (n-2))
-ssx <- sum((x - mean(x))^2)
-seBeta0 <- (1 / n + mean(x) ^ 2 / ssx) ^ .5 * sigma
-seBeta1 <- sigma / sqrt(ssx)
-tBeta0 <- beta0 / seBeta0; tBeta1 <- beta1 / seBeta1
-pBeta0 <- 2 * pt(abs(tBeta0), df = n - 2, lower.tail = FALSE)
-pBeta1 <- 2 * pt(abs(tBeta1), df = n - 2, lower.tail = FALSE)
-coefTable <- rbind(c(beta0, seBeta0, tBeta0, pBeta0), c(beta1, seBeta1, tBeta1, pBeta1))
-colnames(coefTable) <- c("Estimate", "Std. Error", "t value", "P(>|t|)")
-rownames(coefTable) <- c("(Intercept)", "x")
-```
-
----
-## Example continued
-
-```{r}
-coefTable
-fit <- lm(y ~ x);
-summary(fit)$coefficients
-```
-
----
-## Getting a confidence interval
-```{r}
-sumCoef <- summary(fit)$coefficients
-sumCoef[1,1] + c(-1, 1) * qt(.975, df = fit$df) * sumCoef[1, 2]
-sumCoef[2,1] + c(-1, 1) * qt(.975, df = fit$df) * sumCoef[2, 2]
-```
-With 95% confidence, we estimate that a 0.1 carat increase in
-diamond size results in a `r round((sumCoef[2,1] - qt(.975, df = fit$df) * sumCoef[2, 2]) / 10, 1)` to `r round((sumCoef[2,1] + qt(.975, df = fit$df) * sumCoef[2, 2]) / 10, 1)` increase in price in (Singapore) dollars.
-
----
-## Prediction of outcomes
-* Consider predicting $Y$ at a value of $X$
- * Predicting the price of a diamond given the carat
- * Predicting the height of a child given the height of the parents
-* The obvious estimate for prediction at point $x_0$ is
-$$
-\hat \beta_0 + \hat \beta_1 x_0
-$$
-* A standard error is needed to create a prediction interval.
-* There's a distinction between intervals for the regression
- line at point $x_0$ and the prediction of what a $y$ would be
- at point $x_0$.
-* Line at $x_0$ se, $\hat \sigma\sqrt{\frac{1}{n} + \frac{(x_0 - \bar X)^2}{\sum_{i=1}^n (X_i - \bar X)^2}}$
-* Prediction interval se at $x_0$, $\hat \sigma\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar X)^2}{\sum_{i=1}^n (X_i - \bar X)^2}}$
-
----
-## Plotting the prediction intervals
-
-```
-plot(x, y, frame=FALSE,xlab="Carat",ylab="Dollars",pch=21,col="black", bg="lightblue", cex=2)
-abline(fit, lwd = 2)
-xVals <- seq(min(x), max(x), by = .01)
-yVals <- beta0 + beta1 * xVals
-se1 <- sigma * sqrt(1 / n + (xVals - mean(x))^2/ssx)
-se2 <- sigma * sqrt(1 + 1 / n + (xVals - mean(x))^2/ssx)
-lines(xVals, yVals + 2 * se1)
-lines(xVals, yVals - 2 * se1)
-lines(xVals, yVals + 2 * se2)
-lines(xVals, yVals - 2 * se2)
-```
-
----
-## Plotting the prediction intervals
-```{r, fig.height=5, fig.width==5, echo = FALSE, results='hide'}
-plot(x, y, frame=FALSE,xlab="Carat",ylab="Dollars",pch=21,col="black", bg="lightblue", cex=2)
-abline(fit, lwd = 2)
-xVals <- seq(min(x), max(x), by = .01)
-yVals <- beta0 + beta1 * xVals
-se1 <- sigma * sqrt(1 / n + (xVals - mean(x))^2/ssx)
-se2 <- sigma * sqrt(1 + 1 / n + (xVals - mean(x))^2/ssx)
-lines(xVals, yVals + 2 * se1)
-lines(xVals, yVals - 2 * se1)
-lines(xVals, yVals + 2 * se2)
-lines(xVals, yVals - 2 * se2)
-```
-
----
-## Discussion
-* Both intervals have varying widths.
- * Least width at the mean of the Xs.
-* We are quite confident in the regression line, so that
- interval is very narrow.
- * If we knew $\beta_0$ and $\beta_1$ this interval would have zero width.
-* The prediction interval must incorporate the variabilibity
- in the data around the line.
- * Even if we knew $\beta_0$ and $\beta_1$ this interval would still have width.
-
----
-
-## In R
-```
-newdata <- data.frame(x = xVals)
-p1 <- predict(fit, newdata, interval = ("confidence"))
-p2 <- predict(fit, newdata, interval = ("prediction"))
-plot(x, y, frame=FALSE,xlab="Carat",ylab="Dollars",pch=21,col="black", bg="lightblue", cex=2)
-abline(fit, lwd = 2)
-lines(xVals, p1[,2]); lines(xVals, p1[,3])
-lines(xVals, p2[,2]); lines(xVals, p2[,3])
-```
-
----
----
-## In R
-
-```{r, fig.height=5, fig.width=5, echo=FALSE,results='hide'}
-newdata <- data.frame(x = xVals)
-p1 <- predict(fit, newdata, interval = ("confidence"))
-p2 <- predict(fit, newdata, interval = ("prediction"))
-plot(x, y, frame=FALSE,xlab="Carat",ylab="Dollars",pch=21,col="black", bg="lightblue", cex=2)
-abline(fit, lwd = 2)
-lines(xVals, p1[,2]); lines(xVals, p1[,3])
-lines(xVals, p2[,2]); lines(xVals, p2[,3])
-```
-
-
-
-
-
-
-
-
-
-
-
+---
+title : Inference in regression
+subtitle :
+author : Brian Caffo, Jeff Leek and Roger Peng
+job : Johns Hopkins Bloomberg School of Public Health
+logo : bloomberg_shield.png
+framework : io2012 # {io2012, html5slides, shower, dzslides, ...}
+highlighter : highlight.js # {highlight.js, prettify, highlight}
+hitheme : tomorrow #
+url:
+ lib: ../../libraries
+ assets: ../../assets
+widgets : [mathjax] # {mathjax, quiz, bootstrap}
+mode : selfcontained # {standalone, draft}
+---
+```{r setup, cache = F, echo = F, message = F, warning = F, tidy = F, results='hide'}
+# Must explicitly include knitr or the chunks will fail, at least in Linux
+library(knitr)
+# make this an external chunk that can be included in any file
+options(width = 100)
+opts_chunk$set(message = F, error = F, warning = F, comment = NA, fig.align = 'center', dpi = 100, tidy = F, cache.path = '.cache/', fig.path = 'fig/')
+
+options(xtable.type = 'html')
+knit_hooks$set(inline = function(x) {
+ if(is.numeric(x)) {
+ round(x, getOption('digits'))
+ } else {
+ paste(as.character(x), collapse = ', ')
+ }
+})
+knit_hooks$set(plot = knitr:::hook_plot_html)
+runif(1)
+```
+## Recall our model and fitted values
+* Consider the model
+$$
+Y_i = \beta_0 + \beta_1 X_i + \epsilon_i
+$$
+* $\epsilon \sim N(0, \sigma^2)$.
+* We assume that the true model is known.
+* We assume that you've seen confidence intervals and hypothesis tests before.
+* $\hat \beta_0 = \bar Y - \hat \beta_1 \bar X$
+* $\hat \beta_1 = Cor(Y, X) \frac{Sd(Y)}{Sd(X)}$.
+
+
+## Review
+* Statistics like $\frac{\hat \theta - \theta}{\hat \sigma_{\hat \theta}}$ often have the following properties.
+  1. They are normally distributed, and follow a finite-sample Student's $t$ distribution when the variance is replaced by its sample estimate (under normality assumptions).
+  2. They can be used to test $H_0 : \theta = \theta_0$ versus $H_a : \theta >, <, \neq \theta_0$.
+  3. They can be used to create a confidence interval for $\theta$ via $\hat \theta \pm Q_{1-\alpha/2} \hat \sigma_{\hat \theta}$
+    where $Q_{1-\alpha/2}$ is the relevant quantile from either a normal or $t$ distribution.
+* In the case of regression with iid sampling assumptions and normal errors, our inferences will follow
+very similarly to what you saw in your inference class.
+* We won't cover asymptotics for regression analysis, but suffice it to say that under assumptions
+on the ways in which the $X$ values are collected, the iid sampling model, and the mean model,
+the normal results still hold, so we can construct confidence intervals and perform hypothesis tests.
+
+
+## Standard errors (conditioned on X)
+$$
+\begin{align}
+Var(\hat \beta_1) & =
+Var\left(\frac{\sum_{i=1}^n (Y_i - \bar Y) (X_i - \bar X)}{\sum_{i=1}^n (X_i - \bar X)^2}\right) \\
+& = \frac{Var\left(\sum_{i=1}^n Y_i (X_i - \bar X) \right) }{\left(\sum_{i=1}^n (X_i - \bar X)^2 \right)^2} \\
+& = \frac{\sum_{i=1}^n \sigma^2(X_i - \bar X)^2}{\left(\sum_{i=1}^n (X_i - \bar X)^2 \right)^2} \\
+& = \frac{\sigma^2}{\sum_{i=1}^n (X_i - \bar X)^2} \\
+\end{align}
+$$
+
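+As a quick sanity check of this formula, we can simulate from the model with a fixed set of $X$ values (a minimal sketch; the particular $n$, $\beta_0$, $\beta_1$ and $\sigma$ below are arbitrary) and compare the empirical standard deviation of $\hat \beta_1$ to $\sigma / \sqrt{\sum_{i=1}^n (X_i - \bar X)^2}$.
+
+```{r}
+set.seed(1)
+n <- 50; x <- rnorm(n); sigma <- 2; beta0 <- 1; beta1 <- 0.5
+# simulate 1000 data sets with the same x values and refit the slope each time
+slopes <- replicate(1000, {
+  y <- beta0 + beta1 * x + rnorm(n, sd = sigma)
+  coef(lm(y ~ x))[2]
+})
+sd(slopes)                          # empirical standard error of the slope
+sigma / sqrt(sum((x - mean(x))^2))  # theoretical standard error
+```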
+
+## Results
+* $\sigma_{\hat \beta_1}^2 = Var(\hat \beta_1) = \sigma^2 / \sum_{i=1}^n (X_i - \bar X)^2$
+* $\sigma_{\hat \beta_0}^2 = Var(\hat \beta_0) = \left(\frac{1}{n} + \frac{\bar X^2}{\sum_{i=1}^n (X_i - \bar X)^2 }\right)\sigma^2$
+* In practice, $\sigma$ is replaced by its estimate.
+* It's probably not surprising that under iid Gaussian errors
+$$
+\frac{\hat \beta_j - \beta_j}{\hat \sigma_{\hat \beta_j}}
+$$
+follows a $t$ distribution with $n-2$ degrees of freedom and a normal distribution for large $n$.
+* This can be used to create confidence intervals and perform
+hypothesis tests.
+
+
+## Example diamond data set
+```{r}
+library(UsingR); data(diamond)
+y <- diamond$price; x <- diamond$carat; n <- length(y)
+beta1 <- cor(y, x) * sd(y) / sd(x)
+beta0 <- mean(y) - beta1 * mean(x)
+e <- y - beta0 - beta1 * x
+sigma <- sqrt(sum(e^2) / (n-2))
+ssx <- sum((x - mean(x))^2)
+seBeta0 <- (1 / n + mean(x) ^ 2 / ssx) ^ .5 * sigma
+seBeta1 <- sigma / sqrt(ssx)
+tBeta0 <- beta0 / seBeta0; tBeta1 <- beta1 / seBeta1
+pBeta0 <- 2 * pt(abs(tBeta0), df = n - 2, lower.tail = FALSE)
+pBeta1 <- 2 * pt(abs(tBeta1), df = n - 2, lower.tail = FALSE)
+coefTable <- rbind(c(beta0, seBeta0, tBeta0, pBeta0), c(beta1, seBeta1, tBeta1, pBeta1))
+colnames(coefTable) <- c("Estimate", "Std. Error", "t value", "P(>|t|)")
+rownames(coefTable) <- c("(Intercept)", "x")
+```
+
+
+## Example continued
+
+```{r}
+coefTable
+fit <- lm(y ~ x);
+summary(fit)$coefficients
+```
+
+
+## Getting a confidence interval
+```{r}
+sumCoef <- summary(fit)$coefficients
+sumCoef[1,1] + c(-1, 1) * qt(.975, df = fit$df) * sumCoef[1, 2]
+sumCoef[2,1] + c(-1, 1) * qt(.975, df = fit$df) * sumCoef[2, 2]
+```
+With 95% confidence, we estimate that a 0.1 carat increase in
+diamond size results in an expected increase in price of `r round((sumCoef[2,1] - qt(.975, df = fit$df) * sumCoef[2, 2]) / 10, 1)` to `r round((sumCoef[2,1] + qt(.975, df = fit$df) * sumCoef[2, 2]) / 10, 1)` (Singapore) dollars.
+
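+Equivalently, `confint` computes these $t$-based intervals directly from the fitted model (per one-carat change; divide by 10 for the 0.1 carat interpretation above).
+
+```{r}
+confint(fit)            # 95% intervals for the intercept and slope
+confint(fit)[2, ] / 10  # slope interval rescaled to a 0.1 carat increase
+```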
+
+## Prediction of outcomes
+* Consider predicting $Y$ at a value of $X$
+ * Predicting the price of a diamond given the carat
+ * Predicting the height of a child given the height of the parents
+* The obvious estimate for prediction at point $x_0$ is
+$$
+\hat \beta_0 + \hat \beta_1 x_0
+$$
+* A standard error is needed to create a prediction interval.
+* There's a distinction between intervals for the regression
+ line at point $x_0$ and the prediction of what a $y$ would be
+ at point $x_0$.
+* Standard error for the regression line at $x_0$: $\hat \sigma\sqrt{\frac{1}{n} + \frac{(x_0 - \bar X)^2}{\sum_{i=1}^n (X_i - \bar X)^2}}$
+* Standard error for predicting a new $Y$ at $x_0$: $\hat \sigma\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar X)^2}{\sum_{i=1}^n (X_i - \bar X)^2}}$ (both are checked numerically below)
+
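+As a quick check of these formulas (using `fit`, `sigma`, `ssx`, `n` and `x` from the diamond example, and an arbitrary illustrative value $x_0 = 0.3$), `predict` with `se.fit = TRUE` returns the standard error of the line, and adding $\hat \sigma^2$ recovers the prediction standard error.
+
+```{r}
+x0 <- 0.3                                   # an arbitrary carat value for illustration
+seLine <- sigma * sqrt(1 / n + (x0 - mean(x))^2 / ssx)
+sePred <- sigma * sqrt(1 + 1 / n + (x0 - mean(x))^2 / ssx)
+c(line = seLine, prediction = sePred)
+p <- predict(fit, data.frame(x = x0), se.fit = TRUE)
+c(p$se.fit, sqrt(p$se.fit^2 + sigma^2))     # the same two quantities via predict()
+```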
+
+## Plotting the prediction intervals
+
+```
+plot(x, y, frame=FALSE,xlab="Carat",ylab="Dollars",pch=21,col="black", bg="lightblue", cex=2)
+abline(fit, lwd = 2)
+xVals <- seq(min(x), max(x), by = .01)
+yVals <- beta0 + beta1 * xVals
+se1 <- sigma * sqrt(1 / n + (xVals - mean(x))^2/ssx)      # se of the estimated line
+se2 <- sigma * sqrt(1 + 1 / n + (xVals - mean(x))^2/ssx)  # se for predicting a new y
+lines(xVals, yVals + 2 * se1)
+lines(xVals, yVals - 2 * se1)
+lines(xVals, yVals + 2 * se2)
+lines(xVals, yVals - 2 * se2)
+```
+
+
+## Plotting the prediction intervals
+```{r, fig.height=5, fig.width=5, echo = FALSE, results='hide'}
+plot(x, y, frame=FALSE,xlab="Carat",ylab="Dollars",pch=21,col="black", bg="lightblue", cex=2)
+abline(fit, lwd = 2)
+xVals <- seq(min(x), max(x), by = .01)
+yVals <- beta0 + beta1 * xVals
+se1 <- sigma * sqrt(1 / n + (xVals - mean(x))^2/ssx)
+se2 <- sigma * sqrt(1 + 1 / n + (xVals - mean(x))^2/ssx)
+lines(xVals, yVals + 2 * se1)
+lines(xVals, yVals - 2 * se1)
+lines(xVals, yVals + 2 * se2)
+lines(xVals, yVals - 2 * se2)
+```
+
+
+## Discussion
+* Both intervals have varying widths.
+ * Least width at the mean of the Xs.
+* We are quite confident in the regression line, so that
+ interval is very narrow.
+ * If we knew $\beta_0$ and $\beta_1$ this interval would have zero width.
+* The prediction interval must incorporate the variability
+  in the data around the line.
+ * Even if we knew $\beta_0$ and $\beta_1$ this interval would still have width.
+
+
+
+## In R
+```
+newdata <- data.frame(x = xVals)
+p1 <- predict(fit, newdata, interval = "confidence")
+p2 <- predict(fit, newdata, interval = "prediction")
+plot(x, y, frame=FALSE,xlab="Carat",ylab="Dollars",pch=21,col="black", bg="lightblue", cex=2)
+abline(fit, lwd = 2)
+lines(xVals, p1[,2]); lines(xVals, p1[,3])
+lines(xVals, p2[,2]); lines(xVals, p2[,3])
+```
+
+
+
+## In R
+
+```{r, fig.height=5, fig.width=5, echo=FALSE,results='hide'}
+newdata <- data.frame(x = xVals)
+p1 <- predict(fit, newdata, interval = "confidence")
+p2 <- predict(fit, newdata, interval = "prediction")
+plot(x, y, frame=FALSE,xlab="Carat",ylab="Dollars",pch=21,col="black", bg="lightblue", cex=2)
+abline(fit, lwd = 2)
+lines(xVals, p1[,2]); lines(xVals, p1[,3])
+lines(xVals, p2[,2]); lines(xVals, p2[,3])
+```
+
+
+
+
+
+
+
+
+
+
+
diff --git a/07_RegressionModels/01_07_inference/index.html b/07_RegressionModels/01_07_inference/index.html
index cb9c530d3..e36f8397b 100644
--- a/07_RegressionModels/01_07_inference/index.html
+++ b/07_RegressionModels/01_07_inference/index.html
diff --git a/07_RegressionModels/02_01_multivariate/index.Rmd b/07_RegressionModels/02_01_multivariate/index.Rmd
index 2a8ccbce8..70d66ba5b 100644
--- a/07_RegressionModels/02_01_multivariate/index.Rmd
+++ b/07_RegressionModels/02_01_multivariate/index.Rmd
@@ -1,278 +1,257 @@
----
-title : Multivariable regression
-subtitle :
-author : Brian Caffo, Roger Peng and Jeff Leek
-job : Johns Hopkins Bloomberg School of Public Health
-logo : bloomberg_shield.png
-framework : io2012 # {io2012, html5slides, shower, dzslides, ...}
-highlighter : highlight.js # {highlight.js, prettify, highlight}
-hitheme : tomorrow #
-url:
- lib: ../../libraries
- assets: ../../assets
-widgets : [mathjax] # {mathjax, quiz, bootstrap}
-mode : selfcontained # {standalone, draft}
----
-```{r setup, cache = F, echo = F, message = F, warning = F, tidy = F, results='hide'}
-# make this an external chunk that can be included in any file
-options(width = 100)
-opts_chunk$set(message = F, error = F, warning = F, comment = NA, fig.align = 'center', dpi = 100, tidy = F, cache.path = '.cache/', fig.path = 'fig/')
-
-options(xtable.type = 'html')
-knit_hooks$set(inline = function(x) {
- if(is.numeric(x)) {
- round(x, getOption('digits'))
- } else {
- paste(as.character(x), collapse = ', ')
- }
-})
-knit_hooks$set(plot = knitr:::hook_plot_html)
-runif(1)
-```
-## Multivariable regression analyses
-* If I were to present evidence of a relationship between
-breath mint useage (mints per day, X) and pulmonary function
-(measured in FEV), you would be skeptical.
- * Likely, you would say, 'smokers tend to use more breath mints than non smokers, smoking is related to a loss in pulmonary function. That's probably the culprit.'
- * If asked what would convince you, you would likely say, 'If non-smoking breath mint users had lower lung function than non-smoking non-breath mint users and, similarly, if smoking breath mint users had lower lung function than smoking non-breath mint users, I'd be more inclined to believe you'.
-* In other words, to even consider my results, I would have to demonstrate that they hold while holding smoking status fixed.
-
----
-## Multivariable regression analyses
-* An insurance company is interested in how last year's claims can predict a person's time in the hospital this year.
- * They want to use an enormous amount of data contained in claims to predict a single number. Simple linear regression is not equipped to handle more than one predictor.
-* How can one generalize SLR to incoporate lots of regressors for
-the purpose of prediction?
-* What are the consequences of adding lots of regressors?
- * Surely there must be consequences to throwing variables in that aren't related to Y?
- * Surely there must be consequences to omitting variables that are?
-
----
-## The linear model
-* The general linear model extends simple linear regression (SLR)
-by adding terms linearly into the model.
-$$
-Y_i = \beta_1 X_{1i} + \beta_2 X_{2i} + \ldots +
-\beta_{p} X_{pi} + \epsilon_{i}
-= \sum_{k=1}^p X_{ik} \beta_j + \epsilon_{i}
-$$
-* Here $X_{1i}=1$ typically, so that an intercept is included.
-* Least squares (and hence ML estimates under iid Gaussianity
-of the errors) minimizes
-$$
-\sum_{i=1}^n \left(Y_i - \sum_{k=1}^p X_{ki} \beta_j\right)^2
-$$
-* Note, the important linearity is linearity in the coefficients.
-Thus
-$$
-Y_i = \beta_1 X_{1i}^2 + \beta_2 X_{2i}^2 + \ldots +
-\beta_{p} X_{pi}^2 + \epsilon_{i}
-$$
-is still a linear model. (We've just squared the elements of the
-predictor variables.)
-
----
-## How to get estimates
-* The real way requires linear algebra. We'll go over an intuitive development instead.
-* Recall that the LS estimate for regression through the origin, $E[Y_i]=X_{1i}\beta_1$, was $\sum X_i Y_i / \sum X_i^2$.
-* Let's consider two regressors, $E[Y_i] = X_{1i}\beta_1 + X_{2i}\beta_2 = \mu_i$.
-* Also, recall, that if $\hat \mu_i$ satisfies
-$$
-\sum_{i=1} (Y_i - \hat \mu_i) (\hat \mu_i - \mu_i) = 0
-$$
-for all possible values of $\mu_i$, then we've found the LS estimates.
-
----
-$$
-\sum_{i=1}^n (Y_i - \hat \mu_i) (\hat \mu_i - \mu_i) = \sum_{i=1}^n (Y_i - \hat \beta_1 X_{1i} - \hat \beta_2 X_{2i})
-\left\{X_{1i}(\hat \beta_1 - \beta_1) + X_{2i}(\hat \beta_2 - \beta_2) \right\}
-$$
-* Thus we need
- 1. $\sum_{i=1}^n (Y_i - \hat \beta_1 X_{1i} - \hat \beta_2 X_{2i}) X_{1i} = 0$
- 2. $\sum_{i=1}^n (Y_i - \hat \beta_1 X_{1i} - \hat \beta_2 X_{2i}) X_{2i} = 0$
-* Hold $\hat \beta_1$ fixed in 2. and solve and we get that
-$$
-\hat \beta_2 = \frac{\sum_{i=1} (Y_i - X_{1i}\hat \beta_1)X_{2i}}{\sum_{i=1}^n X_{2i}^2}
-$$
-* Plugging this into 1. we get that
-$$
-0 = \sum_{i=1}^n \left\{Y_i - \frac{\sum_j X_{2j}Y_j}{\sum_j X_{2j}^2}X_{2i} +
-\beta_1 \left(X_{1i} - \frac{\sum_j X_{2j}X_{1j}}{\sum_j X_{2j}^2} X_{2i}\right)\right\} X_{1i}
-$$
-
----
-## Continued
-* Re writing this we get
-$$
-0 = \sum_{i=1}^n \left\{ e_{i, Y | X_2} - \hat \beta_1 e_{i, X_1 | X_2}
-\right\} X_{1i}
-$$
-where $e_{i, a | b} = a_i - \frac{\sum_{j=1}^n a_j b_j }{\sum_{i=1}^n b_j^2} b_i$ is the residual when regressing $b$ from $a$ without an intercept.
-* We get the solution
-$$
-\hat \beta_1 = \frac{\sum_{i=1}^n e_{i, Y | X_2} e_{i, X_1 | X_2}}{\sum_{i=1}^n e_{i, X_1 | X_2} X_1}
-$$
-
----
-* But note that
-$$
-\sum_{i=1}^n e_{i, X_1 | X_2}^2
-= \sum_{i=1}^n e_{i, X_1 | X_2} \left(X_{1i} - \frac{\sum_j X_{2j}X_{1j}}{\sum_j X_{2j}^2} X_{2i}\right)
-$$
-$$
-= \sum_{i=1}^n e_{i, X_1 | X_2} X_{1i} - \frac{\sum_j X_{2j}X_{1j}}{\sum_j X_{2j}^2} \sum_{i=1}^n e_{i, X_1 | X_2} X_{2i}
-$$
-But $\sum_{i=1}^n e_{i, X_1 | X_2} X_{2i} = 0$. So we get that
-$$
-\sum_{i=1}^n e_{i, X_1 | X_2}^2 = \sum_{i=1}^n e_{i, X_1 | X_2} X_{1i}
-$$
-Thus we get that
-$$
-\hat \beta_1 = \frac{\sum_{i=1}^n e_{i, Y | X_2} e_{i, X_1 | X_2}}{\sum_{i=1}^n e_{i, X_1 | X_2}^2}
-$$
-
----
-## Summing up fitting with two regressors
-$$\hat \beta_1 = \frac{\sum_{i=1}^n e_{i, Y | X_2} e_{i, X_1 | X_2}}{\sum_{i=1}^n e_{i, X_1 | X_2}^2}$$
-* That is, the regression estimate for $\beta_1$ is the regression
-through the origin estimate having regressed $X_2$ out of both
-the response and the predictor.
-* (Similarly, the regression estimate for $\beta_2$ is the regression through the origin estimate having regressed $X_1$ out of both the response and the predictor.)
-* More generally, multivariate regression estimates are exactly those having removed the linear relationship of the other variables
-from both the regressor and response.
-
----
-## Example with two variables, simple linear regression
-* $Y_{i} = \beta_1 X_{1i} + \beta_2 X_{2i}$ where $X_{2i} = 1$ is an intercept term.
-* Then $\frac{\sum_j X_{2j}X_{1j}}{\sum_j X_{2j}^2}X_{2i} =
-\frac{\sum_j X_{1j}}{n} = \bar X_1$.
-* $e_{i, X_1 | X_2} = X_{1i} - \bar X_1$.
-* Simiarly $e_{i, Y | X_2} = Y_i - \bar Y$.
-* Thus
-$$
-\hat \beta_1 = \frac{\sum_{i=1}^n e_{i, Y | X_2} e_{i, X_1 | X_2}}{\sum_{i=1}^n e_{i, X_1 | X_2}^2} = \frac{\sum_{i=1}^n (X_i - \bar X)(Y_i - \bar Y)}{\sum_{i=1}^n (X_i - \bar X)^2}
-= Cor(X, Y) \frac{Sd(Y)}{Sd(X)}
-$$
-
----
-## The general case
-* The equations
-$$
-\sum_{i=1}^n (Y_i - X_{1i}\hat \beta_1 - \ldots - X_{ip}\hat \beta_p) X_k = 0
-$$
-for $k = 1, \ldots, p$ yields $p$ equations with $p$ unknowns.
-* Solving them yields the least squares estimates. (With obtaining a good, fast, general solution requiring some knowledge of linear algebra.)
-* The least squares estimate for the coefficient of a multivariate regression model is exactly regression through the origin with the linear relationships with the other regressors removed from both the regressor and outcome by taking residuals.
-* In this sense, multivariate regression "adjusts" a coefficient for the linear impact of the other variables.
-
----
-## Fitting LS equations
-Just so I don't leave you hanging, let's show a way to get estimates. Recall the equations:
-$$
-\sum_{i=1}^n (Y_i - X_{1i}\hat \beta_1 - \ldots - X_{ip}\hat \beta_p) X_k = 0
-$$
-If I hold $\hat \beta_1, \ldots, \hat \beta_{p-1}$ fixed then
-we get that
-$$
-\hat \beta_p = \frac{\sum_{i=1}^n (Y_i - X_{1i}\hat \beta_1 - \ldots - X_{i,p-1}\hat \beta_{p-1}) X_{ip} }{\sum_{i=1}^n X_{ip}^2}
-$$
-Plugging this back into the equations, we wind up with
-$$
-\sum_{i=1}^n (e_{i,Y|X_p} - e_{i, X_{1} | X_p} \hat \beta_1 - \ldots - e_{i, X_{p-1} | X_{p}} \hat \beta_{p-1}) X_k = 0
-$$
-
----
-## We can tidy it up a bit more, though
-Note that
-$$
-X_k = e_{i,X_k|X_p} + \frac{\sum_{i=1}^n X_{ik} X_{ip}}{\sum_{i=1}^n X_{ip}^2} X_p
-$$
-and $\sum_{i=1}^n e_{i,X_j | X_p} X_{ip} = 0$.
-Thus
-$$
-\sum_{i=1}^n (e_{i,Y|X_p} - e_{i, X_{1} | X_p} \hat \beta_1 - \ldots - e_{i, X_{p-1} | X_{p}} \hat \beta_{p-1}) X_k = 0
-$$
-is equal to
-$$
-\sum_{i=1}^n (e_{i,Y|X_p} - e_{i, X_{1} | X_p} \hat \beta_1 - \ldots - e_{i, X_{p-1} | X_{p}} \hat \beta_{p-1}) e_{i,X_k|X_p} = 0
-$$
-
----
-## To sum up
-* We've reduced $p$ LS equations and $p$ unknowns to $p-1$ LS equations and $p-1$ unknowns.
- * Every variable has been replaced by its residual with $X_p$.
- * This process can then be iterated until only Y and one
-variable remains.
-* Think of it as follows. If we want an adjusted relationship between y and x, keep taking residuals over confounders and do regression through the origin.
- * The order that you do the confounders doesn't matter.
- * (It can't because our choice of doing $p$ first
- was arbitrary.)
-* This isn't a terribly efficient way to get estimates. But, it's nice conceputally, as it shows how regression estimates are adjusted for the linear relationship with other variables.
-
----
-## Demonstration that it works using an example
-### Linear model with two variables and an intercept
-```{r}
-n <- 100; x <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
-y <- x + x2 + x3 + rnorm(n, sd = .1)
-e <- function(a, b) a - sum( a * b ) / sum( b ^ 2) * b
-ey <- e(e(y, x2), e(x3, x2))
-ex <- e(e(x, x2), e(x3, x2))
-sum(ey * ex) / sum(ex ^ 2)
-coef(lm(y ~ x + x2 + x3 - 1)) #the -1 removes the intercept term
-```
-
----
-## Showing that order doesn't matter
-```{r}
-ey <- e(e(y, x3), e(x2, x3))
-ex <- e(e(x, x3), e(x2, x3))
-sum(ey * ex) / sum(ex ^ 2)
-coef(lm(y ~ x + x2 + x3 - 1)) #the -1 removes the intercept term
-```
-
----
-## Residuals again
-```{r}
-ey <- resid(lm(y ~ x2 + x3 - 1))
-ex <- resid(lm(x ~ x2 + x3 - 1))
-sum(ey * ex) / sum(ex ^ 2)
-coef(lm(y ~ x + x2 + x3 - 1)) #the -1 removes the intercept term
-```
-
-
----
-## Interpretation of the coeficient
-$$E[Y | X_1 = x_1, \ldots, X_p = x_p] = \sum_{k=1}^p x_{k} \beta_k$$
-So that
-$$
-E[Y | X_1 = x_1 + 1, \ldots, X_p = x_p] - E[Y | X_1 = x_1, \ldots, X_p = x_p]$$
-$$= (x_1 + 1) \beta_1 + \sum_{k=2}^p x_{k}+ \sum_{k=1}^p x_{k} \beta_k = \beta_1 $$
-So that the interpretation of a multivariate regression coefficient is the expected change in the response per unit change in the regressor, holding all of the other regressors fixed.
-
-In the next lecture, we'll do examples and go over context-specific
-interpretations.
-
----
-## Fitted values, residuals and residual variation
-All of our SLR quantities can be extended to linear models
-* Model $Y_i = \sum_{k=1}^p X_{ik} \beta_{k} + \epsilon_{i}$ where $\epsilon_i \sim N(0, \sigma^2)$
-* Fitted responses $\hat Y_i = \sum_{k=1}^p X_{ik} \hat \beta_{k}$
-* Residuals $e_i = Y_i - \hat Y_i$
-* Variance estimate $\hat \sigma^2 = \frac{1}{n-p} \sum_{i=1}^n e_i ^2$
-* To get predicted responses at new values, $x_1, \ldots, x_p$, simply plug them into the linear model $\sum_{k=1}^p x_{k} \hat \beta_{k}$
-* Coefficients have standard errors, $\hat \sigma_{\hat \beta_k}$, and
-$\frac{\hat \beta_k - \beta_k}{\hat \sigma_{\hat \beta_k}}$
-follows a $T$ distribution with $n-p$ degrees of freedom.
-* Predicted responses have standard errors and we can calculate predicted and expected response intervals.
-
----
-## Linear models
-* Linear models are the single most important applied statistical and machine learning techniqe, *by far*.
-* Some amazing things that you can accomplish with linear models
- * Decompose a signal into its harmonics.
- * Flexibly fit complicated functions.
- * Fit factor variables as predictors.
- * Uncover complex multivariate relationships with the response.
- * Build accurate prediction models.
-
+---
+title : Multivariable regression
+subtitle :
+author : Brian Caffo, Roger Peng and Jeff Leek
+job : Johns Hopkins Bloomberg School of Public Health
+logo : bloomberg_shield.png
+framework : io2012 # {io2012, html5slides, shower, dzslides, ...}
+highlighter : highlight.js # {highlight.js, prettify, highlight}
+hitheme : tomorrow #
+url:
+ lib: ../../libraries
+ assets: ../../assets
+widgets : [mathjax] # {mathjax, quiz, bootstrap}
+mode : selfcontained # {standalone, draft}
+---
+```{r setup, cache = F, echo = F, message = F, warning = F, tidy = F, results='hide'}
+# Must explicitly include knitr or the chunks will fail, at least in Linux
+library(knitr)
+# make this an external chunk that can be included in any file
+options(width = 100)
+opts_chunk$set(message = F, error = F, warning = F, comment = NA, fig.align = 'center', dpi = 100, tidy = F, cache.path = '.cache/', fig.path = 'fig/')
+
+options(xtable.type = 'html')
+knit_hooks$set(inline = function(x) {
+ if(is.numeric(x)) {
+ round(x, getOption('digits'))
+ } else {
+ paste(as.character(x), collapse = ', ')
+ }
+})
+knit_hooks$set(plot = knitr:::hook_plot_html)
+runif(1)
+```
+## Multivariable regression analyses
+* If I were to present evidence of a relationship between
+breath mint usage (mints per day, X) and pulmonary function
+(measured in FEV), you would be skeptical.
+  * Likely, you would say, 'smokers tend to use more breath mints than non-smokers, and smoking is related to a loss in pulmonary function. That's probably the culprit.'
+ * If asked what would convince you, you would likely say, 'If non-smoking breath mint users had lower lung function than non-smoking non-breath mint users and, similarly, if smoking breath mint users had lower lung function than smoking non-breath mint users, I'd be more inclined to believe you'.
+* In other words, to even consider my results, I would have to demonstrate that they hold while holding smoking status fixed.
+* An insurance company is interested in how last year's claims can predict a person's time in the hospital this year.
+ * They want to use an enormous amount of data contained in claims to predict a single number. Simple linear regression is not equipped to handle more than one predictor.
+* How can one generalize SLR to incorporate lots of regressors for
+the purpose of prediction?
+* What are the consequences of adding lots of regressors?
+ * Surely there must be consequences to throwing variables in that aren't related to Y?
+ * Surely there must be consequences to omitting variables that are?
+
+## The linear model
+* The general linear model extends simple linear regression (SLR)
+by adding terms linearly into the model.
+$$
+Y_i = \beta_1 X_{1i} + \beta_2 X_{2i} + \ldots +
+\beta_{p} X_{pi} + \epsilon_{i}
+= \sum_{k=1}^p X_{ik} \beta_k + \epsilon_{i}
+$$
+* Here $X_{1i}=1$ typically, so that an intercept is included.
+* Least squares (and hence ML estimates under iid Gaussianity
+of the errors) minimizes
+$$
+\sum_{i=1}^n \left(Y_i - \sum_{k=1}^p X_{ik} \beta_k\right)^2
+$$
+* Note, the important linearity is linearity in the coefficients.
+Thus
+$$
+Y_i = \beta_1 X_{1i}^2 + \beta_2 X_{2i}^2 + \ldots +
+\beta_{p} X_{pi}^2 + \epsilon_{i}
+$$
+is still a linear model. (We've just squared the elements of the
+predictor variables; a small example follows below.)
+
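+For instance, here is a minimal sketch (simulated data with arbitrary coefficients) showing that a model quadratic in the predictor is still fit with `lm`, because the coefficients enter linearly.
+
+```{r}
+set.seed(1)
+x <- runif(100, -2, 2)
+y <- 1 + 2 * x - 3 * x^2 + rnorm(100, sd = .5)
+# the regressors are x and x^2; the model is still linear in the betas
+fit2 <- lm(y ~ x + I(x^2))
+coef(fit2)
+```
+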
+## How to get estimates
+* The real way requires linear algebra. We'll go over an intuitive development instead.
+* Recall that the LS estimate for regression through the origin, $E[Y_i]=X_{1i}\beta_1$, was $\sum X_i Y_i / \sum X_i^2$.
+* Let's consider two regressors, $E[Y_i] = X_{1i}\beta_1 + X_{2i}\beta_2 = \mu_i$.
+* Also, recall, that if $\hat \mu_i$ satisfies
+$$
+\sum_{i=1}^n (Y_i - \hat \mu_i) (\hat \mu_i - \mu_i) = 0
+$$
+for all possible values of $\mu_i$, then we've found the LS estimates.
+
+$$
+\sum_{i=1}^n (Y_i - \hat \mu_i) (\hat \mu_i - \mu_i) = \sum_{i=1}^n (Y_i - \hat \beta_1 X_{1i} - \hat \beta_2 X_{2i})
+\left\{X_{1i}(\hat \beta_1 - \beta_1) + X_{2i}(\hat \beta_2 - \beta_2) \right\}
+$$
+* Thus we need
+ 1. $\sum_{i=1}^n (Y_i - \hat \beta_1 X_{1i} - \hat \beta_2 X_{2i}) X_{1i} = 0$
+ 2. $\sum_{i=1}^n (Y_i - \hat \beta_1 X_{1i} - \hat \beta_2 X_{2i}) X_{2i} = 0$
+* Holding $\hat \beta_1$ fixed in 2. and solving, we get that
+$$
+\hat \beta_2 = \frac{\sum_{i=1}^n (Y_i - X_{1i}\hat \beta_1)X_{2i}}{\sum_{i=1}^n X_{2i}^2}
+$$
+* Plugging this into 1., we get that
+$$
+0 = \sum_{i=1}^n \left\{Y_i - \frac{\sum_j X_{2j}Y_j}{\sum_j X_{2j}^2}X_{2i} -
+\hat \beta_1 \left(X_{1i} - \frac{\sum_j X_{2j}X_{1j}}{\sum_j X_{2j}^2} X_{2i}\right)\right\} X_{1i}
+$$
+
+## Continued
+* Rewriting this, we get
+$$
+0 = \sum_{i=1}^n \left\{ e_{i, Y | X_2} - \hat \beta_1 e_{i, X_1 | X_2}
+\right\} X_{1i}
+$$
+where $e_{i, a | b} = a_i - \frac{\sum_{j=1}^n a_j b_j }{\sum_{j=1}^n b_j^2} b_i$ is the residual from regressing $a$ on $b$ without an intercept (that is, having regressed $b$ out of $a$).
+* We get the solution
+$$
+\hat \beta_1 = \frac{\sum_{i=1}^n e_{i, Y | X_2} e_{i, X_1 | X_2}}{\sum_{i=1}^n e_{i, X_1 | X_2} X_{1i}}
+$$
+
+* But note that
+$$
+\sum_{i=1}^n e_{i, X_1 | X_2}^2
+= \sum_{i=1}^n e_{i, X_1 | X_2} \left(X_{1i} - \frac{\sum_j X_{2j}X_{1j}}{\sum_j X_{2j}^2} X_{2i}\right)
+$$
+$$
+= \sum_{i=1}^n e_{i, X_1 | X_2} X_{1i} - \frac{\sum_j X_{2j}X_{1j}}{\sum_j X_{2j}^2} \sum_{i=1}^n e_{i, X_1 | X_2} X_{2i}
+$$
+But $\sum_{i=1}^n e_{i, X_1 | X_2} X_{2i} = 0$. So we get that
+$$
+\sum_{i=1}^n e_{i, X_1 | X_2}^2 = \sum_{i=1}^n e_{i, X_1 | X_2} X_{1i}
+$$
+Thus we get that
+$$
+\hat \beta_1 = \frac{\sum_{i=1}^n e_{i, Y | X_2} e_{i, X_1 | X_2}}{\sum_{i=1}^n e_{i, X_1 | X_2}^2}
+$$
+
+## Summing up fitting with two regressors
+$$\hat \beta_1 = \frac{\sum_{i=1}^n e_{i, Y | X_2} e_{i, X_1 | X_2}}{\sum_{i=1}^n e_{i, X_1 | X_2}^2}$$
+* That is, the regression estimate for $\beta_1$ is the regression
+through the origin estimate having regressed $X_2$ out of both
+the response and the predictor.
+* (Similarly, the regression estimate for $\beta_2$ is the regression through the origin estimate having regressed $X_1$ out of both the response and the predictor.)
+* More generally, multivariate regression estimates are exactly those having removed the linear relationship of the other variables
+from both the regressor and response.
+
+## Example with two variables, simple linear regression
+* $Y_{i} = \beta_1 X_{1i} + \beta_2 X_{2i}$ where $X_{2i} = 1$ is an intercept term.
+* Then $\frac{\sum_j X_{2j}X_{1j}}{\sum_j X_{2j}^2}X_{2i} =
+\frac{\sum_j X_{1j}}{n} = \bar X_1$.
+* $e_{i, X_1 | X_2} = X_{1i} - \bar X_1$.
+* Similarly, $e_{i, Y | X_2} = Y_i - \bar Y$.
+* Thus
+$$
+\hat \beta_1 = \frac{\sum_{i=1}^n e_{i, Y | X_2} e_{i, X_1 | X_2}}{\sum_{i=1}^n e_{i, X_1 | X_2}^2} = \frac{\sum_{i=1}^n (X_i - \bar X)(Y_i - \bar Y)}{\sum_{i=1}^n (X_i - \bar X)^2}
+= Cor(X, Y) \frac{Sd(Y)}{Sd(X)}
+$$
+
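+We can check this special case numerically on simulated data (the values below are arbitrary): regression through the origin on the mean-centered $Y$ and $X$ reproduces the usual SLR slope.
+
+```{r}
+set.seed(2)
+n <- 50; x1 <- rnorm(n); y <- 2 + 3 * x1 + rnorm(n)
+ey <- y - mean(y); ex1 <- x1 - mean(x1)   # residuals after regressing out the intercept
+sum(ey * ex1) / sum(ex1^2)                # regression through the origin on the residuals
+cor(y, x1) * sd(y) / sd(x1)               # Cor(X, Y) Sd(Y) / Sd(X)
+coef(lm(y ~ x1))[2]                       # slope from lm
+```
+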
+## The general case
+* The equations
+$$
+\sum_{i=1}^n (Y_i - X_{1i}\hat \beta_1 - \ldots - X_{ip}\hat \beta_p) X_{ik} = 0
+$$
+for $k = 1, \ldots, p$ yield $p$ equations with $p$ unknowns.
+* Solving them yields the least squares estimates. (Obtaining a good, fast, general solution requires some knowledge of linear algebra.)
+* The least squares estimate for the coefficient of a multivariate regression model is exactly regression through the origin with the linear relationships with the other regressors removed from both the regressor and outcome by taking residuals.
+* In this sense, multivariate regression "adjusts" a coefficient for the linear impact of the other variables.
+
+## Fitting LS equations
+Just so I don't leave you hanging, let's show a way to get estimates. Recall the equations:
+$$
+\sum_{i=1}^n (Y_i - X_{1i}\hat \beta_1 - \ldots - X_{ip}\hat \beta_p) X_{ik} = 0
+$$
+If I hold $\hat \beta_1, \ldots, \hat \beta_{p-1}$ fixed then
+we get that
+$$
+\hat \beta_p = \frac{\sum_{i=1}^n (Y_i - X_{1i}\hat \beta_1 - \ldots - X_{i,p-1}\hat \beta_{p-1}) X_{ip} }{\sum_{i=1}^n X_{ip}^2}
+$$
+Plugging this back into the equations, we wind up with
+$$
+\sum_{i=1}^n (e_{i,Y|X_p} - e_{i, X_{1} | X_p} \hat \beta_1 - \ldots - e_{i, X_{p-1} | X_{p}} \hat \beta_{p-1}) X_{ik} = 0
+$$
+
+## We can tidy it up a bit more, though
+Note that
+$$
+X_{ik} = e_{i,X_k|X_p} + \frac{\sum_{j=1}^n X_{jk} X_{jp}}{\sum_{j=1}^n X_{jp}^2} X_{ip}
+$$
+and $\sum_{i=1}^n e_{i,X_k | X_p} X_{ip} = 0$.
+Thus
+$$
+\sum_{i=1}^n (e_{i,Y|X_p} - e_{i, X_{1} | X_p} \hat \beta_1 - \ldots - e_{i, X_{p-1} | X_{p}} \hat \beta_{p-1}) X_{ik} = 0
+$$
+is equal to
+$$
+\sum_{i=1}^n (e_{i,Y|X_p} - e_{i, X_{1} | X_p} \hat \beta_1 - \ldots - e_{i, X_{p-1} | X_{p}} \hat \beta_{p-1}) e_{i,X_k|X_p} = 0
+$$
+
+## To sum up
+* We've reduced $p$ LS equations and $p$ unknowns to $p-1$ LS equations and $p-1$ unknowns.
+ * Every variable has been replaced by its residual with $X_p$.
+  * This process can then be iterated until only Y and one
+variable remain.
+* Think of it as follows. If we want an adjusted relationship between y and x, keep taking residuals over confounders and do regression through the origin.
+ * The order that you do the confounders doesn't matter.
+ * (It can't because our choice of doing $p$ first
+ was arbitrary.)
+* This isn't a terribly efficient way to get estimates. But it's nice conceptually, as it shows how regression estimates are adjusted for the linear relationship with other variables.
+
+## Demonstration that it works using an example
+### Linear model with two variables and an intercept
+```{r}
+n <- 100; x <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
+y <- x + x2 + x3 + rnorm(n, sd = .1)
+e <- function(a, b) a - sum( a * b ) / sum( b ^ 2) * b  # residual of a after regressing b out (no intercept)
+ey <- e(e(y, x2), e(x3, x2))
+ex <- e(e(x, x2), e(x3, x2))
+sum(ey * ex) / sum(ex ^ 2)
+coef(lm(y ~ x + x2 + x3 - 1)) #the -1 removes the intercept term
+```
+
+## Showing that order doesn't matter
+```{r}
+ey <- e(e(y, x3), e(x2, x3))
+ex <- e(e(x, x3), e(x2, x3))
+sum(ey * ex) / sum(ex ^ 2)
+coef(lm(y ~ x + x2 + x3 - 1)) #the -1 removes the intercept term
+```
+
+## Residuals again
+```{r}
+ey <- resid(lm(y ~ x2 + x3 - 1))
+ex <- resid(lm(x ~ x2 + x3 - 1))
+sum(ey * ex) / sum(ex ^ 2)
+coef(lm(y ~ x + x2 + x3 - 1)) #the -1 removes the intercept term
+```
+
+## Interpretation of the coefficient
+$$E[Y | X_1 = x_1, \ldots, X_p = x_p] = \sum_{k=1}^p x_{k} \beta_k$$
+So that
+$$
+E[Y | X_1 = x_1 + 1, \ldots, X_p = x_p] - E[Y | X_1 = x_1, \ldots, X_p = x_p]$$
+$$= (x_1 + 1) \beta_1 + \sum_{k=2}^p x_{k} \beta_k - \sum_{k=1}^p x_{k} \beta_k = \beta_1 $$
+Thus the interpretation of a multivariate regression coefficient is the expected change in the response per unit change in the regressor, holding all of the other regressors fixed.
+
+In the next lecture, we'll do examples and go over context-specific
+interpretations.
+
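+As a small numerical illustration, here is a minimal sketch on simulated data (the coefficients are arbitrary): increasing one regressor by a unit while holding the other fixed changes the fitted mean by exactly that regressor's estimated coefficient.
+
+```{r}
+set.seed(3)
+n <- 100; x1 <- rnorm(n); x2 <- rnorm(n)
+y <- 1 + 2 * x1 - 3 * x2 + rnorm(n)
+fit <- lm(y ~ x1 + x2)
+# predicted means at x1 and x1 + 1, with x2 held fixed
+new0 <- data.frame(x1 = 1, x2 = 0.5)
+new1 <- data.frame(x1 = 2, x2 = 0.5)
+predict(fit, new1) - predict(fit, new0)  # equals the estimated coefficient of x1
+coef(fit)["x1"]
+```
+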
+## Fitted values, residuals and residual variation
+All of our SLR quantities can be extended to linear models
+* Model $Y_i = \sum_{k=1}^p X_{ik} \beta_{k} + \epsilon_{i}$ where $\epsilon_i \sim N(0, \sigma^2)$
+* Fitted responses $\hat Y_i = \sum_{k=1}^p X_{ik} \hat \beta_{k}$
+* Residuals $e_i = Y_i - \hat Y_i$
+* Variance estimate $\hat \sigma^2 = \frac{1}{n-p} \sum_{i=1}^n e_i ^2$
+* To get predicted responses at new values, $x_1, \ldots, x_p$, simply plug them into the linear model $\sum_{k=1}^p x_{k} \hat \beta_{k}$
+* Coefficients have standard errors, $\hat \sigma_{\hat \beta_k}$, and
+$\frac{\hat \beta_k - \beta_k}{\hat \sigma_{\hat \beta_k}}$
+follows a $T$ distribution with $n-p$ degrees of freedom.
+* Predicted responses have standard errors and we can calculate predicted and expected response intervals.
+
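+A brief sketch of these quantities (again on simulated data with arbitrary values): the variance estimate with the $n-p$ denominator computed by hand agrees with what `summary` reports.
+
+```{r}
+set.seed(4)
+n <- 100; x1 <- rnorm(n); x2 <- rnorm(n)
+y <- 1 + x1 + x2 + rnorm(n, sd = 2)
+fit <- lm(y ~ x1 + x2)
+e <- resid(fit)              # residuals Y_i - hat Y_i
+p <- length(coef(fit))       # number of estimated coefficients
+sqrt(sum(e^2) / (n - p))     # manual residual standard deviation estimate
+summary(fit)$sigma           # the same quantity reported by summary()
+```
+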
+## Linear models
+* Linear models are the single most important applied statistical and machine learning technique, *by far*.
+* Some amazing things that you can accomplish with linear models
+ * Decompose a signal into its harmonics.
+ * Flexibly fit complicated functions.
+ * Fit factor variables as predictors.
+ * Uncover complex multivariate relationships with the response.
+ * Build accurate prediction models.
diff --git a/07_RegressionModels/02_01_multivariate/index.html b/07_RegressionModels/02_01_multivariate/index.html
index bb8b9aff5..f33140751 100644
--- a/07_RegressionModels/02_01_multivariate/index.html
+++ b/07_RegressionModels/02_01_multivariate/index.html
@@ -1,540 +1,343 @@
-
-
-
- Multivariable regression
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
Multivariable regression
-
-
Brian Caffo, Roger Peng and Jeff Leek Johns Hopkins Bloomberg School of Public Health
-
-
-
-
-
-
-
Multivariable regression analyses
-
-
-
-
If I were to present evidence of a relationship between
-breath mint useage (mints per day, X) and pulmonary function
-(measured in FEV), you would be skeptical.
-
-
-
Likely, you would say, 'smokers tend to use more breath mints than non smokers, smoking is related to a loss in pulmonary function. That's probably the culprit.'
-
If asked what would convince you, you would likely say, 'If non-smoking breath mint users had lower lung function than non-smoking non-breath mint users and, similarly, if smoking breath mint users had lower lung function than smoking non-breath mint users, I'd be more inclined to believe you'.
-
-
In other words, to even consider my results, I would have to demonstrate that they hold while holding smoking status fixed.
-
-
-
-
-
-
-
-
-
Multivariable regression analyses
-
-
-
-
An insurance company is interested in how last year's claims can predict a person's time in the hospital this year.
-
-
-
They want to use an enormous amount of data contained in claims to predict a single number. Simple linear regression is not equipped to handle more than one predictor.
-
-
How can one generalize SLR to incoporate lots of regressors for
-the purpose of prediction?
-
What are the consequences of adding lots of regressors?
-
-
-
Surely there must be consequences to throwing variables in that aren't related to Y?
-
Surely there must be consequences to omitting variables that are?
-
-
-
-
-
-
-
-
-
-
The linear model
-
-
-
-
The general linear model extends simple linear regression (SLR)
-by adding terms linearly into the model.
-\[
-Y_i = \beta_1 X_{1i} + \beta_2 X_{2i} + \ldots +
-\beta_{p} X_{pi} + \epsilon_{i}
-= \sum_{k=1}^p X_{ik} \beta_j + \epsilon_{i}
-\]
-
Here \(X_{1i}=1\) typically, so that an intercept is included.
-
Least squares (and hence ML estimates under iid Gaussianity
-of the errors) minimizes
-\[
-\sum_{i=1}^n \left(Y_i - \sum_{k=1}^p X_{ki} \beta_j\right)^2
-\]
-
Note, the important linearity is linearity in the coefficients.
-Thus
-\[
-Y_i = \beta_1 X_{1i}^2 + \beta_2 X_{2i}^2 + \ldots +
-\beta_{p} X_{pi}^2 + \epsilon_{i}
-\]
-is still a linear model. (We've just squared the elements of the
-predictor variables.)
-
-
-
-
-
-
-
-
-
How to get estimates
-
-
-
-
The real way requires linear algebra. We'll go over an intuitive development instead.
-
Recall that the LS estimate for regression through the origin, \(E[Y_i]=X_{1i}\beta_1\), was \(\sum X_i Y_i / \sum X_i^2\).
Also, recall, that if \(\hat \mu_i\) satisfies
-\[
-\sum_{i=1} (Y_i - \hat \mu_i) (\hat \mu_i - \mu_i) = 0
-\]
-for all possible values of \(\mu_i\), then we've found the LS estimates.
Hold \(\hat \beta_1\) fixed in 2. and solve and we get that
-\[
-\hat \beta_2 = \frac{\sum_{i=1} (Y_i - X_{1i}\hat \beta_1)X_{2i}}{\sum_{i=1}^n X_{2i}^2}
-\]
-
Plugging this into 1. we get that
-\[
-0 = \sum_{i=1}^n \left\{Y_i - \frac{\sum_j X_{2j}Y_j}{\sum_j X_{2j}^2}X_{2i} +
-\beta_1 \left(X_{1i} - \frac{\sum_j X_{2j}X_{1j}}{\sum_j X_{2j}^2} X_{2i}\right)\right\} X_{1i}
-\]
-
-
-
-
-
-
-
-
-
Continued
-
-
-
-
Re writing this we get
-\[
-0 = \sum_{i=1}^n \left\{ e_{i, Y | X_2} - \hat \beta_1 e_{i, X_1 | X_2}
-\right\} X_{1i}
-\]
-where \(e_{i, a | b} = a_i - \frac{\sum_{j=1}^n a_j b_j }{\sum_{i=1}^n b_j^2} b_i\) is the residual when regressing \(b\) from \(a\) without an intercept.
-
We get the solution
-\[
-\hat \beta_1 = \frac{\sum_{i=1}^n e_{i, Y | X_2} e_{i, X_1 | X_2}}{\sum_{i=1}^n e_{i, X_1 | X_2} X_1}
-\]
-
-
-
-
-
-
-
-
-
-
-
-
-
But note that
-\[
-\sum_{i=1}^n e_{i, X_1 | X_2}^2
-= \sum_{i=1}^n e_{i, X_1 | X_2} \left(X_{1i} - \frac{\sum_j X_{2j}X_{1j}}{\sum_j X_{2j}^2} X_{2i}\right)
-\]
-\[
-= \sum_{i=1}^n e_{i, X_1 | X_2} X_{1i} - \frac{\sum_j X_{2j}X_{1j}}{\sum_j X_{2j}^2} \sum_{i=1}^n e_{i, X_1 | X_2} X_{2i}
-\]
-But \(\sum_{i=1}^n e_{i, X_1 | X_2} X_{2i} = 0\). So we get that
-\[
-\sum_{i=1}^n e_{i, X_1 | X_2}^2 = \sum_{i=1}^n e_{i, X_1 | X_2} X_{1i}
-\]
-Thus we get that
-\[
-\hat \beta_1 = \frac{\sum_{i=1}^n e_{i, Y | X_2} e_{i, X_1 | X_2}}{\sum_{i=1}^n e_{i, X_1 | X_2}^2}
-\]
That is, the regression estimate for \(\beta_1\) is the regression
-through the origin estimate having regressed \(X_2\) out of both
-the response and the predictor.
-
(Similarly, the regression estimate for \(\beta_2\) is the regression through the origin estimate having regressed \(X_1\) out of both the response and the predictor.)
-
More generally, multivariate regression estimates are exactly those having removed the linear relationship of the other variables
-from both the regressor and response.
-
-
-
-
-
-
-
-
-
Example with two variables, simple linear regression
-
-
-
-
\(Y_{i} = \beta_1 X_{1i} + \beta_2 X_{2i}\) where \(X_{2i} = 1\) is an intercept term.
-
Then \(\frac{\sum_j X_{2j}X_{1j}}{\sum_j X_{2j}^2}X_{2i} =
-\frac{\sum_j X_{1j}}{n} = \bar X_1\).
Solving them yields the least squares estimates. (With obtaining a good, fast, general solution requiring some knowledge of linear algebra.)
-
The least squares estimate for the coefficient of a multivariate regression model is exactly regression through the origin with the linear relationships with the other regressors removed from both the regressor and outcome by taking residuals.
-
In this sense, multivariate regression "adjusts" a coefficient for the linear impact of the other variables.
-
-
-
-
-
-
-
-
-
Fitting LS equations
-
-
-
Just so I don't leave you hanging, let's show a way to get estimates. Recall the equations:
-\[
-\sum_{i=1}^n (Y_i - X_{1i}\hat \beta_1 - \ldots - X_{ip}\hat \beta_p) X_k = 0
-\]
-If I hold \(\hat \beta_1, \ldots, \hat \beta_{p-1}\) fixed then
-we get that
-\[
-\hat \beta_p = \frac{\sum_{i=1}^n (Y_i - X_{1i}\hat \beta_1 - \ldots - X_{i,p-1}\hat \beta_{p-1}) X_{ip} }{\sum_{i=1}^n X_{ip}^2}
-\]
-Plugging this back into the equations, we wind up with
-\[
-\sum_{i=1}^n (e_{i,Y|X_p} - e_{i, X_{1} | X_p} \hat \beta_1 - \ldots - e_{i, X_{p-1} | X_{p}} \hat \beta_{p-1}) X_k = 0
-\]
We've reduced \(p\) LS equations and \(p\) unknowns to \(p-1\) LS equations and \(p-1\) unknowns.
-
-
-
Every variable has been replaced by its residual with \(X_p\).
-
This process can then be iterated until only Y and one
-variable remains.
-
-
Think of it as follows. If we want an adjusted relationship between y and x, keep taking residuals over confounders and do regression through the origin.
-
-
-
The order that you do the confounders doesn't matter.
-
(It can't because our choice of doing \(p\) first
-was arbitrary.)
-
-
This isn't a terribly efficient way to get estimates. But, it's nice conceputally, as it shows how regression estimates are adjusted for the linear relationship with other variables.
-
-
-
-
-
-
-
-
-
Demonstration that it works using an example
-
-
-
Linear model with two variables and an intercept
-
-
n <- 100; x <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
-y <- x + x2 + x3 + rnorm(n, sd = .1)
-e <- function(a, b) a - sum( a * b ) / sum( b ^ 2) * b
-ey <- e(e(y, x2), e(x3, x2))
-ex <- e(e(x, x2), e(x3, x2))
-sum(ey * ex) / sum(ex ^ 2)
-
-
-
[1] 1.004
-
-
-
coef(lm(y ~ x + x2 + x3 - 1)) #the -1 removes the intercept term
-
coef(lm(y ~ x + x2 + x3 - 1)) #the -1 removes the intercept term
-
-
-
x x2 x3
-1.0040 0.9899 1.0078
-
-
-
-
-
-
-
-
-
Interpretation of the coeficient
-
-
-
\[E[Y | X_1 = x_1, \ldots, X_p = x_p] = \sum_{k=1}^p x_{k} \beta_k\]
-So that
-\[
-E[Y | X_1 = x_1 + 1, \ldots, X_p = x_p] - E[Y | X_1 = x_1, \ldots, X_p = x_p]\]
-\[= (x_1 + 1) \beta_1 + \sum_{k=2}^p x_{k}+ \sum_{k=1}^p x_{k} \beta_k = \beta_1 \]
-So that the interpretation of a multivariate regression coefficient is the expected change in the response per unit change in the regressor, holding all of the other regressors fixed.
-
-
In the next lecture, we'll do examples and go over context-specific
-interpretations.
-
-
-
-
-
-
-
-
Fitted values, residuals and residual variation
-
-
-
All of our SLR quantities can be extended to linear models
-
-
-
Model \(Y_i = \sum_{k=1}^p X_{ik} \beta_{k} + \epsilon_{i}\) where \(\epsilon_i \sim N(0, \sigma^2)\)
To get predicted responses at new values, \(x_1, \ldots, x_p\), simply plug them into the linear model \(\sum_{k=1}^p x_{k} \hat \beta_{k}\)
-
Coefficients have standard errors, \(\hat \sigma_{\hat \beta_k}\), and
-\(\frac{\hat \beta_k - \beta_k}{\hat \sigma_{\hat \beta_k}}\)
-follows a \(T\) distribution with \(n-p\) degrees of freedom.
-
Predicted responses have standard errors and we can calculate predicted and expected response intervals.
-
-
-
-
-
-
-
-
-
Linear models
-
-
-
-
Linear models are the single most important applied statistical and machine learning techniqe, by far.
-
Some amazing things that you can accomplish with linear models
-
-
-
Decompose a signal into its harmonics.
-
Flexibly fit complicated functions.
-
Fit factor variables as predictors.
-
Uncover complex multivariate relationships with the response.
If I were to present evidence of a relationship between breath mint useage (mints per day, X) and pulmonary function (measured in FEV), you would be skeptical.
+
Likely, you would say, ‘smokers tend to use more breath mints than non smokers, smoking is related to a loss in pulmonary function. That’s probably the culprit.’
+
If asked what would convince you, you would likely say, ‘If non-smoking breath mint users had lower lung function than non-smoking non-breath mint users and, similarly, if smoking breath mint users had lower lung function than smoking non-breath mint users, I’d be more inclined to believe you’.
+
In other words, to even consider my results, I would have to demonstrate that they hold while holding smoking status fixed.
+
An insurance company is interested in how last year’s claims can predict a person’s time in the hospital this year.
+
They want to use an enormous amount of data contained in claims to predict a single number. Simple linear regression is not equipped to handle more than one predictor.
+
How can one generalize SLR to incoporate lots of regressors for the purpose of prediction?
+
What are the consequences of adding lots of regressors?
+
Surely there must be consequences to throwing variables in that aren’t related to Y?
+
Surely there must be consequences to omitting variables that are?
+
+
+
+
The linear model
+
+
The general linear model extends simple linear regression (SLR) by adding terms linearly into the model. \[
+Y_i = \beta_1 X_{1i} + \beta_2 X_{2i} + \ldots +
+\beta_{p} X_{pi} + \epsilon_{i}
+= \sum_{k=1}^p X_{ik} \beta_j + \epsilon_{i}
+\]
+
Here \(X_{1i}=1\) typically, so that an intercept is included.
+
Least squares (and hence ML estimates under iid Gaussianity of the errors) minimizes \[
+\sum_{i=1}^n \left(Y_i - \sum_{k=1}^p X_{ki} \beta_j\right)^2
+\]
+
Note, the important linearity is linearity in the coefficients. Thus \[
+Y_i = \beta_1 X_{1i}^2 + \beta_2 X_{2i}^2 + \ldots +
+\beta_{p} X_{pi}^2 + \epsilon_{i}
+\] is still a linear model. (We’ve just squared the elements of the predictor variables.)
+
+
+
+
How to get estimates
+
+
The real way requires linear algebra. We’ll go over an intuitive development instead.
+
Recall that the LS estimate for regression through the origin, \(E[Y_i]=X_{1i}\beta_1\), was \(\sum X_i Y_i / \sum X_i^2\).
Also, recall, that if \(\hat \mu_i\) satisfies \[
+\sum_{i=1} (Y_i - \hat \mu_i) (\hat \mu_i - \mu_i) = 0
+\] for all possible values of \(\mu_i\), then we’ve found the LS estimates.
+
+
\[
+\sum_{i=1}^n (Y_i - \hat \mu_i) (\hat \mu_i - \mu_i) = \sum_{i=1}^n (Y_i - \hat \beta_1 X_{1i} - \hat \beta_2 X_{2i})
+\left\{X_{1i}(\hat \beta_1 - \beta_1) + X_{2i}(\hat \beta_2 - \beta_2) \right\}
+\] * Thus we need 1. \(\sum_{i=1}^n (Y_i - \hat \beta_1 X_{1i} - \hat \beta_2 X_{2i}) X_{1i} = 0\) 2. \(\sum_{i=1}^n (Y_i - \hat \beta_1 X_{1i} - \hat \beta_2 X_{2i}) X_{2i} = 0\) * Hold \(\hat \beta_1\) fixed in 2. and solve and we get that \[
+\hat \beta_2 = \frac{\sum_{i=1} (Y_i - X_{1i}\hat \beta_1)X_{2i}}{\sum_{i=1}^n X_{2i}^2}
+\] * Plugging this into 1. we get that \[
+0 = \sum_{i=1}^n \left\{Y_i - \frac{\sum_j X_{2j}Y_j}{\sum_j X_{2j}^2}X_{2i} +
+\beta_1 \left(X_{1i} - \frac{\sum_j X_{2j}X_{1j}}{\sum_j X_{2j}^2} X_{2i}\right)\right\} X_{1i}
+\]
+
+
+
Continued
+
+
Re writing this we get \[
+0 = \sum_{i=1}^n \left\{ e_{i, Y | X_2} - \hat \beta_1 e_{i, X_1 | X_2}
+\right\} X_{1i}
+\] where \(e_{i, a | b} = a_i - \frac{\sum_{j=1}^n a_j b_j }{\sum_{i=1}^n b_j^2} b_i\) is the residual when regressing \(b\) from \(a\) without an intercept.
+
We get the solution \[
+\hat \beta_1 = \frac{\sum_{i=1}^n e_{i, Y | X_2} e_{i, X_1 | X_2}}{\sum_{i=1}^n e_{i, X_1 | X_2} X_1}
+\]
+
But note that \[
+\sum_{i=1}^n e_{i, X_1 | X_2}^2
+= \sum_{i=1}^n e_{i, X_1 | X_2} \left(X_{1i} - \frac{\sum_j X_{2j}X_{1j}}{\sum_j X_{2j}^2} X_{2i}\right)
+\]\[
+= \sum_{i=1}^n e_{i, X_1 | X_2} X_{1i} - \frac{\sum_j X_{2j}X_{1j}}{\sum_j X_{2j}^2} \sum_{i=1}^n e_{i, X_1 | X_2} X_{2i}
+\] But \(\sum_{i=1}^n e_{i, X_1 | X_2} X_{2i} = 0\). So we get that \[
+\sum_{i=1}^n e_{i, X_1 | X_2}^2 = \sum_{i=1}^n e_{i, X_1 | X_2} X_{1i}
+\] Thus we get that \[
+\hat \beta_1 = \frac{\sum_{i=1}^n e_{i, Y | X_2} e_{i, X_1 | X_2}}{\sum_{i=1}^n e_{i, X_1 | X_2}^2}
+\]
+
+
+
+
Summing up fitting with two regressors
+
\[\hat \beta_1 = \frac{\sum_{i=1}^n e_{i, Y | X_2} e_{i, X_1 | X_2}}{\sum_{i=1}^n e_{i, X_1 | X_2}^2}\] * That is, the regression estimate for \(\beta_1\) is the regression through the origin estimate having regressed \(X_2\) out of both the response and the predictor. * (Similarly, the regression estimate for \(\beta_2\) is the regression through the origin estimate having regressed \(X_1\) out of both the response and the predictor.) * More generally, multivariate regression estimates are exactly those having removed the linear relationship of the other variables from both the regressor and response.
+
+
+
Example with two variables, simple linear regression
+
+
\(Y_{i} = \beta_1 X_{1i} + \beta_2 X_{2i}\) where \(X_{2i} = 1\) is an intercept term.
+
Then \(\frac{\sum_j X_{2j}X_{1j}}{\sum_j X_{2j}^2}X_{2i} = \frac{\sum_j X_{1j}}{n} = \bar X_1\).
The equations \[
+\sum_{i=1}^n (Y_i - X_{1i}\hat \beta_1 - \ldots - X_{ip}\hat \beta_p) X_k = 0
+\] for \(k = 1, \ldots, p\) yields \(p\) equations with \(p\) unknowns.
+
Solving them yields the least squares estimates. (With obtaining a good, fast, general solution requiring some knowledge of linear algebra.)
+
The least squares estimate for the coefficient of a multivariate regression model is exactly regression through the origin with the linear relationships with the other regressors removed from both the regressor and outcome by taking residuals.
+
In this sense, multivariate regression “adjusts” a coefficient for the linear impact of the other variables.
+
+
+
+
Fitting LS equations
+
Just so I don’t leave you hanging, let’s show a way to get estimates. Recall the equations: \[
+\sum_{i=1}^n (Y_i - X_{1i}\hat \beta_1 - \ldots - X_{ip}\hat \beta_p) X_{ik} = 0
+\] If I hold \(\hat \beta_1, \ldots, \hat \beta_{p-1}\) fixed then we get that \[
+\hat \beta_p = \frac{\sum_{i=1}^n (Y_i - X_{1i}\hat \beta_1 - \ldots - X_{i,p-1}\hat \beta_{p-1}) X_{ip} }{\sum_{i=1}^n X_{ip}^2}
+\] Plugging this back into the equations, we wind up with \[
+\sum_{i=1}^n (e_{i,Y|X_p} - e_{i, X_{1} | X_p} \hat \beta_1 - \ldots - e_{i, X_{p-1} | X_{p}} \hat \beta_{p-1}) X_{ik} = 0
+\]
We’ve reduced \(p\) LS equations and \(p\) unknowns to \(p-1\) LS equations and \(p-1\) unknowns.
+
Every variable has been replaced by its residual with \(X_p\).
+
This process can then be iterated until only Y and one variable remain.
+
Think of it as follows. If we want an adjusted relationship between \(Y\) and \(X_1\), keep taking residuals over the confounders and do regression through the origin.
+
The order in which you remove the confounders doesn't matter.
+
(It can't, because our choice of eliminating \(X_p\) first was arbitrary.)
+
This isn’t a terribly efficient way to get estimates. But, it’s nice conceptually, as it shows how regression estimates are adjusted for the linear relationship with other variables.
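
As a sketch of that iteration for a model without an intercept (the helper adjusted_coef below is hypothetical, written only to mirror the description above):

# Repeatedly regress the last column out of y and out of the remaining columns,
# then finish with regression through the origin on the single surviving variable.
adjusted_coef <- function(y, X) {
  while (ncol(X) > 1) {
    b <- X[, ncol(X)]
    e <- function(a) a - sum(a * b) / sum(b ^ 2) * b
    y <- e(y)
    X <- apply(X[, -ncol(X), drop = FALSE], 2, e)
  }
  sum(y * X[, 1]) / sum(X[, 1] ^ 2)  # coefficient of the first column
}

With the simulated data from the demonstration below, adjusted_coef(y, cbind(x, x2, x3)) agrees with coef(lm(y ~ x + x2 + x3 - 1))["x"].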
+
+
+
+
Demonstration that it works using an example
+
+
Linear model with three regressors and no intercept
+
n <- 100; x <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
+y <- x + x2 + x3 + rnorm(n, sd = .1)
+# e(a, b) returns the residual of a after regression through the origin on b
+e <- function(a, b) a - sum( a * b ) / sum( b ^ 2) * b
+# residualize y and x on x2, then on x3 (itself residualized on x2)
+ey <- e(e(y, x2), e(x3, x2))
+ex <- e(e(x, x2), e(x3, x2))
+# regression through the origin of the residuals gives the coefficient of x
+sum(ey * ex) / sum(ex ^ 2)
+
[1] 0.9943411
+
coef(lm(y ~ x + x2 + x3 - 1)) #the -1 removes the intercept term
+
x x2 x3
+0.9943411 0.9922047 1.0027025
+
+
+
Interpretation of the coefficient
+
\[E[Y | X_1 = x_1, \ldots, X_p = x_p] = \sum_{k=1}^p x_{k} \beta_k\] So that \[
+E[Y | X_1 = x_1 + 1, \ldots, X_p = x_p] - E[Y | X_1 = x_1, \ldots, X_p = x_p]\]\[= (x_1 + 1) \beta_1 + \sum_{k=2}^p x_{k} \beta_k - \sum_{k=1}^p x_{k} \beta_k = \beta_1 \] So that the interpretation of a multivariate regression coefficient is the expected change in the response per unit change in the regressor, holding all of the other regressors fixed.
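
A minimal sketch of this interpretation on simulated data (the variables and values below are made up): increasing x1 by one unit while holding the other regressors fixed changes the fitted expected response by exactly the x1 coefficient.

n <- 100; x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y <- 1 + 2 * x1 - x2 + .5 * x3 + rnorm(n)
fit <- lm(y ~ x1 + x2 + x3)
new0 <- data.frame(x1 = 1, x2 = 3, x3 = -2)
new1 <- data.frame(x1 = 2, x2 = 3, x3 = -2)  # x1 increased by one unit
c(difference = unname(predict(fit, new1) - predict(fit, new0)),
  coefficient = unname(coef(fit)["x1"]))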
+
In the next lecture, we’ll do examples and go over context-specific interpretations.
+
+
+
Fitted values, residuals and residual variation
+
All of our SLR quantities can be extended to linear models. (A small numerical check follows below.)
* Model \(Y_i = \sum_{k=1}^p X_{ik} \beta_{k} + \epsilon_{i}\) where \(\epsilon_i \sim N(0, \sigma^2)\)
* Fitted responses \(\hat Y_i = \sum_{k=1}^p X_{ik} \hat \beta_{k}\)
* Residuals \(e_i = Y_i - \hat Y_i\)
* Variance estimate \(\hat \sigma^2 = \frac{1}{n-p} \sum_{i=1}^n e_i ^2\)
* To get predicted responses at new values, \(x_1, \ldots, x_p\), simply plug them into the linear model \(\sum_{k=1}^p x_{k} \hat \beta_{k}\)
* Coefficients have standard errors, \(\hat \sigma_{\hat \beta_k}\), and \(\frac{\hat \beta_k - \beta_k}{\hat \sigma_{\hat \beta_k}}\) follows a \(T\) distribution with \(n-p\) degrees of freedom.
* Predicted responses have standard errors and we can calculate predicted and expected response intervals.
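
A minimal check of these formulas on simulated data (made-up names), comparing the hand-computed \(\hat \sigma\) with what lm reports:

n <- 100; x1 <- rnorm(n); x2 <- rnorm(n)
y <- 1 + x1 + x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)  # p = 3 coefficients, counting the intercept
e <- resid(fit)         # residuals Y_i - hat Y_i
p <- length(coef(fit))
c(by_hand = sqrt(sum(e ^ 2) / (n - p)), lm = summary(fit)$sigma)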
+
+
+
Linear models
+
+
Linear models are the single most important applied statistical and machine learning technique, by far.
+
Some amazing things that you can accomplish with linear models:
+
Decompose a signal into its harmonics.
+
Flexibly fit complicated functions.
+
Fit factor variables as predictors.
+
Uncover complex multivariate relationships with the response.
+
Build accurate prediction models.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
diff --git a/07_RegressionModels/02_02_multivariateExamples/index.Rmd b/07_RegressionModels/02_02_multivariateExamples/index.Rmd
index 5a053aab9..87c736dc1 100644
--- a/07_RegressionModels/02_02_multivariateExamples/index.Rmd
+++ b/07_RegressionModels/02_02_multivariateExamples/index.Rmd
@@ -1,430 +1,399 @@
----
-title : Multivariable regression examples
-subtitle : Regression Models
-author : Brian Caffo, Jeff Leek and Roger Peng
-job : Johns Hopkins Bloomberg School of Public Health
-logo : bloomberg_shield.png
-framework : io2012 # {io2012, html5slides, shower, dzslides, ...}
-highlighter : highlight.js # {highlight.js, prettify, highlight}
-hitheme : tomorrow #
-url:
- lib: ../../librariesNew
- assets: ../../assets
-widgets : [mathjax] # {mathjax, quiz, bootstrap}
-mode : selfcontained # {standalone, draft}
----
-```{r setup, cache = F, echo = F, message = F, warning = F, tidy = F, results='hide'}
-# make this an external chunk that can be included in any file
-options(width = 100)
-opts_chunk$set(message = F, error = F, warning = F, comment = NA, fig.align = 'center', dpi = 100, tidy = F, cache.path = '.cache/', fig.path = 'fig/')
-
-options(xtable.type = 'html')
-knit_hooks$set(inline = function(x) {
- if(is.numeric(x)) {
- round(x, getOption('digits'))
- } else {
- paste(as.character(x), collapse = ', ')
- }
-})
-knit_hooks$set(plot = knitr:::hook_plot_html)
-runif(1)
-```
-## Swiss fertility data
-```{r, fig.height=4.5, fig.width=4.5}
-library(datasets); data(swiss); require(stats); require(graphics)
-pairs(swiss, panel = panel.smooth, main = "Swiss data", col = 3 + (swiss$Catholic > 50))
-```
-
----
-## `?swiss`
-### Description
-Standardized fertility measure and socio-economic indicators for each of 47 French-speaking provinces of Switzerland at about 1888.
-
-A data frame with 47 observations on 6 variables, each of which is in percent, i.e., in [0, 100].
-
-* [,1] Fertility Ig, ‘ common standardized fertility measure’
-* [,2] Agriculture % of males involved in agriculture as occupation
-* [,3] Examination % draftees receiving highest mark on army examination
-* [,4] Education % education beyond primary school for draftees.
-* [,5] Catholic % ‘catholic’ (as opposed to ‘protestant’).
-* [,6] Infant.Mortality live births who live less than 1 year.
-
-All variables but ‘Fertility’ give proportions of the population.
-
-
----
-## Calling `lm`
-`summary(lm(Fertility ~ . , data = swiss))`
-```{r, echo = FALSE}
-summary(lm(Fertility ~ . , data = swiss))$coefficients
-```
-
----
-## Example interpretation
-* Agriculture is expressed in percentages (0 - 100)
-* Estimate is -0.1721.
-* We estimate an expected 0.17 decrease in standardized fertility for every 1\% increase in percentage of males involved in agriculture in holding the remaining variables constant.
-* The t-test for $H_0: \beta_{Agri} = 0$ versus $H_a: \beta_{Agri} \neq 0$ is significant.
-* Interestingly, the unadjusted estimate is
-```{r}
-summary(lm(Fertility ~ Agriculture, data = swiss))$coefficients
-```
-
----
-How can adjustment reverse the sign of an effect? Let's try a simulation.
-```{r, echo = TRUE}
-n <- 100; x2 <- 1 : n; x1 <- .01 * x2 + runif(n, -.1, .1); y = -x1 + x2 + rnorm(n, sd = .01)
-summary(lm(y ~ x1))$coef
-summary(lm(y ~ x1 + x2))$coef
-```
-
----
-```{r, echo = FALSE, fig.height=5, fig.width=10, results = 'show'}
-par(mfrow = c(1, 2))
-plot(x1, y, pch=21,col="black",bg=topo.colors(n)[x2], frame = FALSE, cex = 1.5)
-title('Unadjusted, color is X2')
-abline(lm(y ~ x1), lwd = 2)
-plot(resid(lm(x1 ~ x2)), resid(lm(y ~ x2)), pch = 21, col = "black", bg = "lightblue", frame = FALSE, cex = 1.5)
-title('Adjusted')
-abline(0, coef(lm(y ~ x1 + x2))[2], lwd = 2)
-```
-
----
-## Back to this data set
-* The sign reverses itself with the inclusion of Examination and Education, but of which are negatively correlated with Agriculture.
-* The percent of males in the province working in agriculture is negatively related to educational attainment (correlation of `r cor(swiss$Agriculture, swiss$Education)`) and Education and Examination (correlation of `r cor(swiss$Education, swiss$Examination)`) are obviously measuring similar things.
- * Is the positive marginal an artifact for not having accounted for, say, Education level? (Education does have a stronger effect, by the way.)
-* At the minimum, anyone claiming that provinces that are more agricultural have higher fertility rates would immediately be open to criticism.
-
----
-## What if we include an unnecessary variable?
-z adds no new linear information, since it's a linear
-combination of variables already included. R just drops
-terms that are linear combinations of other terms.
-```{r, echo = TRUE}
-z <- swiss$Agriculture + swiss$Education
-lm(Fertility ~ . + z, data = swiss)
-```
-
-
----
-## Dummy variables are smart
-* Consider the linear model
-$$
-Y_i = \beta_0 + X_{i1} \beta_1 + \epsilon_{i}
-$$
-where each $X_{i1}$ is binary so that it is a 1 if measurement $i$ is in a group and 0 otherwise. (Treated versus not in a clinical trial, for example.)
-* Then for people in the group $E[Y_i] = \beta_0 + \beta_1$
-* And for people not in the group $E[Y_i] = \beta_0$
-* The LS fits work out to be $\hat \beta_0 + \hat \beta_1$ is the mean for those in the group and $\hat \beta_0$ is the mean for those not in the group.
-* $\beta_1$ is interpretted as the increase or decrease in the mean comparing those in the group to those not.
-* Note including a binary variable that is 1 for those not in the group would be redundant. It would create three parameters to describe two means.
-
----
-## More than 2 levels
-* Consider a multilevel factor level. For didactic reasons, let's say a three level factor (example, US political party affiliation: Republican, Democrat, Independent)
-* $Y_i = \beta_0 + X_{i1} \beta_1 + X_{i2} \beta_2 + \epsilon_i$.
-* $X_{i1}$ is 1 for Republicans and 0 otherwise.
-* $X_{i2}$ is 1 for Democrats and 0 otherwise.
-* If $i$ is Republican $E[Y_i] = \beta_0 +\beta_1$
-* If $i$ is Democrat $E[Y_i] = \beta_0 + \beta_2$.
-* If $i$ is Independent $E[Y_i] = \beta_0$.
-* $\beta_1$ compares Republicans to Independents.
-* $\beta_2$ compares Democrats to Independents.
-* $\beta_1 - \beta_2$ compares Republicans to Democrats.
-* (Choice of reference category changes the interpretation.)
-
----
-## Insect Sprays
-```{r, echo = FALSE, fig.height=5, fig.width=5}
-require(datasets);data(InsectSprays)
-require(stats); require(graphics)
-boxplot(count ~ spray, data = InsectSprays,
- xlab = "Type of spray", ylab = "Insect count",
- main = "InsectSprays data", varwidth = TRUE, col = "lightgray")
-```
-
----
-## Linear model fit, group A is the reference
-```{r, echo= TRUE}
-summary(lm(count ~ spray, data = InsectSprays))$coef
-```
-
----
-## Hard coding the dummy variables
-```{r, echo= TRUE}
-summary(lm(count ~
- I(1 * (spray == 'B')) + I(1 * (spray == 'C')) +
- I(1 * (spray == 'D')) + I(1 * (spray == 'E')) +
- I(1 * (spray == 'F'))
- , data = InsectSprays))$coef
-```
-
----
-## What if we include all 6?
-```{r, echo= TRUE}
-lm(count ~
- I(1 * (spray == 'B')) + I(1 * (spray == 'C')) +
- I(1 * (spray == 'D')) + I(1 * (spray == 'E')) +
- I(1 * (spray == 'F')) + I(1 * (spray == 'A')), data = InsectSprays)
-```
-
----
-## What if we omit the intercept?
-```{r, echo= TRUE}
-summary(lm(count ~ spray - 1, data = InsectSprays))$coef
-unique(ave(InsectSprays$count, InsectSprays$spray))
-```
-
----
-## Summary
-* If we treat Spray as a factor, R includes an intercept and omits the alphabetically first level of the factor.
- * All t-tests are for comparisons of Sprays versus Spray A.
- * Emprirical mean for A is the intercept.
- * Other group means are the itc plus their coefficient.
-* If we omit an intercept, then it includes terms for all levels of the factor.
- * Group means are the coefficients.
- * Tests are tests of whether the groups are different than zero. (Are the expected counts zero for that spray.)
-* If we want comparisons between, Spray B and C, say we could refit the model with C (or B) as the reference level.
-
----
-## Reordering the levels
-```{r}
-spray2 <- relevel(InsectSprays$spray, "C")
-summary(lm(count ~ spray2, data = InsectSprays))$coef
-```
-
----
-## Doing it manually
-Equivalently
-$$Var(\hat \beta_B - \hat \beta_C) = Var(\hat \beta_B) + Var(\hat \beta_C) - 2 Cov(\hat \beta_B, \hat \beta_C)$$
-```{r}
-fit <- lm(count ~ spray, data = InsectSprays) #A is ref
-bbmbc <- coef(fit)[2] - coef(fit)[3] #B - C
-temp <- summary(fit)
-se <- temp$sigma * sqrt(temp$cov.unscaled[2, 2] + temp$cov.unscaled[3,3] - 2 *temp$cov.unscaled[2,3])
-t <- (bbmbc) / se
-p <- pt(-abs(t), df = fit$df)
-out <- c(bbmbc, se, t, p)
-names(out) <- c("B - C", "SE", "T", "P")
-round(out, 3)
-```
-
----
-## Other thoughts on this data
-* Counts are bounded from below by 0, violates the assumption of normality of the errors.
- * Also there are counts near zero, so both the actual assumption and the intent of the assumption are violated.
-* Variance does not appear to be constant.
-* Perhaps taking logs of the counts would help.
- * There are 0 counts, so maybe log(Count + 1)
-* Also, we'll cover Poisson GLMs for fitting count data.
-
----
-
-## Example - Millenium Development Goal 1
-
-[http://www.un.org/millenniumgoals/pdf/MDG_FS_1_EN.pdf](http://www.un.org/millenniumgoals/pdf/MDG_FS_1_EN.pdf)
-
-[http://apps.who.int/gho/athena/data/GHO/WHOSIS_000008.csv?profile=text&filter=COUNTRY:*;SEX:*](http://apps.who.int/gho/athena/data/GHO/WHOSIS_000008.csv?profile=text&filter=COUNTRY:*;SEX:*)
-
----
-
-## WHO childhood hunger data
-
-
-```{r whoDataLoad}
-#download.file("http://apps.who.int/gho/athena/data/GHO/WHOSIS_000008.csv?profile=text&filter=COUNTRY:*;SEX:*","hunger.csv",method="curl")
-hunger <- read.csv("hunger.csv")
-hunger <- hunger[hunger$Sex!="Both sexes",]
-head(hunger)
-```
-
----
-
-## Plot percent hungry versus time
-
-```{r, dependson="whoDataLoad",fig.height=4,fig.width=4}
-lm1 <- lm(hunger$Numeric ~ hunger$Year)
-plot(hunger$Year,hunger$Numeric,pch=19,col="blue")
-```
-
----
-
-## Remember the linear model
-
-$$Hu_i = b_0 + b_1 Y_i + e_i$$
-
-$b_0$ = percent hungry at Year 0
-
-$b_1$ = decrease in percent hungry per year
-
-$e_i$ = everything we didn't measure
-
----
-
-## Add the linear model
-
-```{r, dependson="whoDataLoad",fig.height=4,fig.width=4}
-lm1 <- lm(hunger$Numeric ~ hunger$Year)
-plot(hunger$Year,hunger$Numeric,pch=19,col="blue")
-lines(hunger$Year,lm1$fitted,lwd=3,col="darkgrey")
-```
-
-
----
-
-## Color by male/female
-
-```{r, dependson="whoDataLoad",fig.height=4,fig.width=4}
-plot(hunger$Year,hunger$Numeric,pch=19)
-points(hunger$Year,hunger$Numeric,pch=19,col=((hunger$Sex=="Male")*1+1))
-```
-
----
-
-## Now two lines
-
-$$HuF_i = bf_0 + bf_1 YF_i + ef_i$$
-
-$bf_0$ = percent of girls hungry at Year 0
-
-$bf_1$ = decrease in percent of girls hungry per year
-
-$ef_i$ = everything we didn't measure
-
-
-$$HuM_i = bm_0 + bm_1 YM_i + em_i$$
-
-$bm_0$ = percent of boys hungry at Year 0
-
-$bm_1$ = decrease in percent of boys hungry per year
-
-$em_i$ = everything we didn't measure
-
-
-
----
-
-## Color by male/female
-
-```{r, dependson="whoDataLoad",fig.height=3.5,fig.width=4}
-lmM <- lm(hunger$Numeric[hunger$Sex=="Male"] ~ hunger$Year[hunger$Sex=="Male"])
-lmF <- lm(hunger$Numeric[hunger$Sex=="Female"] ~ hunger$Year[hunger$Sex=="Female"])
-plot(hunger$Year,hunger$Numeric,pch=19)
-points(hunger$Year,hunger$Numeric,pch=19,col=((hunger$Sex=="Male")*1+1))
-lines(hunger$Year[hunger$Sex=="Male"],lmM$fitted,col="black",lwd=3)
-lines(hunger$Year[hunger$Sex=="Female"],lmF$fitted,col="red",lwd=3)
-```
-
-
----
-
-## Two lines, same slope
-
-$$Hu_i = b_0 + b_1 \mathbb{1}(Sex_i="Male") + b_2 Y_i + e^*_i$$
-
-$b_0$ - percent hungry at year zero for females
-
-$b_0 + b_1$ - percent hungry at year zero for males
-
-$b_2$ - change in percent hungry (for either males or females) in one year
-
-$e^*_i$ - everything we didn't measure
-
----
-
-## Two lines, same slope in R
-
-
-```{r, dependson="whoDataLoad",fig.height=4,fig.width=4}
-lmBoth <- lm(hunger$Numeric ~ hunger$Year + hunger$Sex)
-plot(hunger$Year,hunger$Numeric,pch=19)
-points(hunger$Year,hunger$Numeric,pch=19,col=((hunger$Sex=="Male")*1+1))
-abline(c(lmBoth$coeff[1],lmBoth$coeff[2]),col="red",lwd=3)
-abline(c(lmBoth$coeff[1] + lmBoth$coeff[3],lmBoth$coeff[2] ),col="black",lwd=3)
-```
-
----
-
-## Two lines, different slopes (interactions)
-
-$$Hu_i = b_0 + b_1 \mathbb{1}(Sex_i="Male") + b_2 Y_i + b_3 \mathbb{1}(Sex_i="Male")\times Y_i + e^+_i$$
-
-$b_0$ - percent hungry at year zero for females
-
-$b_0 + b_1$ - percent hungry at year zero for males
-
-$b_2$ - change in percent hungry (females) in one year
-
-$b_2 + b_3$ - change in percent hungry (males) in one year
-
-$e^+_i$ - everything we didn't measure
-
----
-
-## Two lines, different slopes in R
-
-
-```{r lmBothChunk, dependson="whoDataLoad",fig.height=4,fig.width=4}
-lmBoth <- lm(hunger$Numeric ~ hunger$Year + hunger$Sex + hunger$Sex*hunger$Year)
-plot(hunger$Year,hunger$Numeric,pch=19)
-points(hunger$Year,hunger$Numeric,pch=19,col=((hunger$Sex=="Male")*1+1))
-abline(c(lmBoth$coeff[1],lmBoth$coeff[2]),col="red",lwd=3)
-abline(c(lmBoth$coeff[1] + lmBoth$coeff[3],lmBoth$coeff[2] +lmBoth$coeff[4]),col="black",lwd=3)
-```
-
-
----
-
-## Two lines, different slopes in R
-
-
-```{r, dependson="lmBothChunk",fig.height=4,fig.width=4}
-summary(lmBoth)
-```
-
----
-## Interpretting a continuous interaction
-$$
-E[Y_i | X_{1i}=x_1, X_{2i}=x_2] = \beta_0 + \beta_1 x_{1} + \beta_2 x_{2} + \beta_3 x_{1}x_{2}
-$$
-Holding $X_2$ constant we have
-$$
-E[Y_i | X_{1i}=x_1+1, X_{2i}=x_2]-E[Y_i | X_{1i}=x_1, X_{2i}=x_2]
-= \beta_1 + \beta_3 x_{2}
-$$
-And thus the expected change in $Y$ per unit change in $X_1$ holding all else constant is not constant. $\beta_1$ is the slope when $x_{2} = 0$. Note further that:
-$$
-E[Y_i | X_{1i}=x_1+1, X_{2i}=x_2+1]-E[Y_i | X_{1i}=x_1, X_{2i}=x_2+1]
-$$
-$$
--E[Y_i | X_{1i}=x_1+1, X_{2i}=x_2]-E[Y_i | X_{1i}=x_1, X_{2i}=x_2]
-$$
-$$
-=\beta_3
-$$
-Thus, $\beta_3$ is the change in the expected change in $Y$ per unit change in $X_1$, per unit change in $X_2$.
-
-Or, the change in the slope relating $X_1$ and $Y$ per unit change in $X_2$.
-
----
-
-## Example
-
-$$Hu_i = b_0 + b_1 In_i + b_2 Y_i + b_3 In_i \times Y_i + e^+_i$$
-
-$b_0$ - percent hungry at year zero for children with whose parents have no income
-
-$b_1$ - change in percent hungry for each dollar of income in year zero
-
-$b_2$ - change in percent hungry in one year for children whose parents have no income
-
-$b_3$ - increased change in percent hungry by year for each dollar of income - e.g. if income is $10,000, then change in percent hungry in one year will be
-
-$$b_2 + 1e4 \times b_3$$
-
-$e^+_i$ - everything we didn't measure
-
-__Lot's of care/caution needed!__
-
-
+---
+title : Multivariable regression examples
+subtitle : Regression Models
+author : Brian Caffo, Jeff Leek and Roger Peng
+job : Johns Hopkins Bloomberg School of Public Health
+logo : bloomberg_shield.png
+framework : io2012 # {io2012, html5slides, shower, dzslides, ...}
+highlighter : highlight.js # {highlight.js, prettify, highlight}
+hitheme : tomorrow #
+url:
+ lib: ../../librariesNew
+ assets: ../../assets
+widgets : [mathjax] # {mathjax, quiz, bootstrap}
+mode : selfcontained # {standalone, draft}
+---
+```{r setup, cache = F, echo = F, message = F, warning = F, tidy = F, results='hide'}
+# Must explicitly include knitr or the chunks will fail, at least in Linux
+library(knitr)
+# make this an external chunk that can be included in any file
+options(width = 100)
+opts_chunk$set(message = F, error = F, warning = F, comment = NA, fig.align = 'center', dpi = 100, tidy = F, cache.path = '.cache/', fig.path = 'fig/')
+
+options(xtable.type = 'html')
+knit_hooks$set(inline = function(x) {
+ if(is.numeric(x)) {
+ round(x, getOption('digits'))
+ } else {
+ paste(as.character(x), collapse = ', ')
+ }
+})
+knit_hooks$set(plot = knitr:::hook_plot_html)
+runif(1)
+```
+## Swiss fertility data
+```{r, fig.height=4.5, fig.width=4.5}
+library(datasets); data(swiss); require(stats); require(graphics)
+pairs(swiss, panel = panel.smooth, main = "Swiss data", col = 3 + (swiss$Catholic > 50))
+```
+
+## `?swiss`
+### Description
+Standardized fertility measure and socio-economic indicators for each of 47 French-speaking provinces of Switzerland at about 1888.
+
+A data frame with 47 observations on 6 variables, each of which is in percent, i.e., in [0, 100].
+
+* [,1] Fertility Ig, ‘ common standardized fertility measure’
+* [,2] Agriculture % of males involved in agriculture as occupation
+* [,3] Examination % draftees receiving highest mark on army examination
+* [,4] Education % education beyond primary school for draftees.
+* [,5] Catholic % ‘catholic’ (as opposed to ‘protestant’).
+* [,6] Infant.Mortality live births who live less than 1 year.
+
+All variables but ‘Fertility’ give proportions of the population.
+
+
+## Calling `lm`
+`summary(lm(Fertility ~ . , data = swiss))`
+```{r, echo = FALSE}
+summary(lm(Fertility ~ . , data = swiss))$coefficients
+```
+
+## Example interpretation
+* Agriculture is expressed in percentages (0 - 100)
+* Estimate is -0.1721.
+* We estimate an expected 0.17 decrease in standardized fertility for every 1\% increase in the percentage of males involved in agriculture, holding the remaining variables constant.
+* The t-test for $H_0: \beta_{Agri} = 0$ versus $H_a: \beta_{Agri} \neq 0$ is significant.
+* Interestingly, the unadjusted estimate is
+```{r}
+summary(lm(Fertility ~ Agriculture, data = swiss))$coefficients
+```
+
+How can adjustment reverse the sign of an effect? Let's try a simulation.
+```{r, echo = TRUE}
+n <- 100; x2 <- 1 : n; x1 <- .01 * x2 + runif(n, -.1, .1); y = -x1 + x2 + rnorm(n, sd = .01)
+summary(lm(y ~ x1))$coef
+summary(lm(y ~ x1 + x2))$coef
+```
+
+```{r, echo = FALSE, fig.height=5, fig.width=10, results = 'show'}
+par(mfrow = c(1, 2))
+plot(x1, y, pch=21,col="black",bg=topo.colors(n)[x2], frame = FALSE, cex = 1.5)
+title('Unadjusted, color is X2')
+abline(lm(y ~ x1), lwd = 2)
+plot(resid(lm(x1 ~ x2)), resid(lm(y ~ x2)), pch = 21, col = "black", bg = "lightblue", frame = FALSE, cex = 1.5)
+title('Adjusted')
+abline(0, coef(lm(y ~ x1 + x2))[2], lwd = 2)
+```
+
+## Back to this data set
+* The sign reverses itself with the inclusion of Examination and Education, both of which are negatively correlated with Agriculture.
+* The percent of males in the province working in agriculture is negatively related to educational attainment (correlation of `r cor(swiss$Agriculture, swiss$Education)`) and Education and Examination (correlation of `r cor(swiss$Education, swiss$Examination)`) are obviously measuring similar things.
+ * Is the positive marginal effect an artifact of not having accounted for, say, Education level? (Education does have a stronger effect, by the way.)
+* At the minimum, anyone claiming that provinces that are more agricultural have higher fertility rates would immediately be open to criticism.
+
+## What if we include an unnecessary variable?
+z adds no new linear information, since it's a linear
+combination of variables already included. R just drops
+terms that are linear combinations of other terms.
+```{r, echo = TRUE}
+z <- swiss$Agriculture + swiss$Education
+lm(Fertility ~ . + z, data = swiss)
+```
+
+
+## Dummy variables are smart
+* Consider the linear model
+$$
+Y_i = \beta_0 + X_{i1} \beta_1 + \epsilon_{i}
+$$
+where each $X_{i1}$ is binary so that it is a 1 if measurement $i$ is in a group and 0 otherwise. (Treated versus not in a clinical trial, for example.)
+* Then for people in the group $E[Y_i] = \beta_0 + \beta_1$
+* And for people not in the group $E[Y_i] = \beta_0$
+* The LS fits work out so that $\hat \beta_0 + \hat \beta_1$ is the mean for those in the group and $\hat \beta_0$ is the mean for those not in the group.
+* $\beta_1$ is interpreted as the increase or decrease in the mean comparing those in the group to those not.
+* Note including a binary variable that is 1 for those not in the group would be redundant. It would create three parameters to describe two means.
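+
+As a quick check of these facts, here is a hedged sketch on simulated data (the names `trt` and `y` are made up for illustration):
+```{r, echo = TRUE}
+n <- 100; trt <- rbinom(n, 1, .5); y <- 5 + 2 * trt + rnorm(n)
+fit <- lm(y ~ trt)
+# intercept = mean of the trt == 0 group; intercept + slope = mean of the trt == 1 group
+c(coef(fit)[1], sum(coef(fit)), tapply(y, trt, mean))
+```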
+
+## More than 2 levels
+* Consider a multilevel factor. For didactic reasons, let's say a three level factor (for example, US political party affiliation: Republican, Democrat, Independent).
+* $Y_i = \beta_0 + X_{i1} \beta_1 + X_{i2} \beta_2 + \epsilon_i$.
+* $X_{i1}$ is 1 for Republicans and 0 otherwise.
+* $X_{i2}$ is 1 for Democrats and 0 otherwise.
+* If $i$ is Republican $E[Y_i] = \beta_0 +\beta_1$
+* If $i$ is Democrat $E[Y_i] = \beta_0 + \beta_2$.
+* If $i$ is Independent $E[Y_i] = \beta_0$.
+* $\beta_1$ compares Republicans to Independents.
+* $\beta_2$ compares Democrats to Independents.
+* $\beta_1 - \beta_2$ compares Republicans to Democrats.
+* (Choice of reference category changes the interpretation.)
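+
+To see the dummy coding that this scheme implies, here is a small sketch (the factor below is hypothetical illustration data):
+```{r, echo = TRUE}
+party <- factor(c("Republican", "Democrat", "Independent", "Democrat"))
+party <- relevel(party, ref = "Independent")  # make Independent the reference level
+model.matrix(~ party)  # columns: intercept, Democrat indicator, Republican indicator
+```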
+
+## Insect Sprays
+```{r, echo = FALSE, fig.height=5, fig.width=5}
+require(datasets);data(InsectSprays)
+require(stats); require(graphics)
+boxplot(count ~ spray, data = InsectSprays,
+ xlab = "Type of spray", ylab = "Insect count",
+ main = "InsectSprays data", varwidth = TRUE, col = "lightgray")
+```
+
+## Linear model fit, group A is the reference
+```{r, echo= TRUE}
+summary(lm(count ~ spray, data = InsectSprays))$coef
+```
+
+## Hard coding the dummy variables
+```{r, echo= TRUE}
+summary(lm(count ~
+ I(1 * (spray == 'B')) + I(1 * (spray == 'C')) +
+ I(1 * (spray == 'D')) + I(1 * (spray == 'E')) +
+ I(1 * (spray == 'F'))
+ , data = InsectSprays))$coef
+```
+
+## What if we include all 6?
+```{r, echo= TRUE}
+lm(count ~
+ I(1 * (spray == 'B')) + I(1 * (spray == 'C')) +
+ I(1 * (spray == 'D')) + I(1 * (spray == 'E')) +
+ I(1 * (spray == 'F')) + I(1 * (spray == 'A')), data = InsectSprays)
+```
+
+## What if we omit the intercept?
+```{r, echo= TRUE}
+summary(lm(count ~ spray - 1, data = InsectSprays))$coef
+unique(ave(InsectSprays$count, InsectSprays$spray))
+```
+
+## Summary
+* If we treat Spray as a factor, R includes an intercept and omits the alphabetically first level of the factor.
+ * All t-tests are for comparisons of Sprays versus Spray A.
+ * Empirical mean for A is the intercept.
+ * Other group means are the intercept plus their coefficient.
+* If we omit an intercept, then it includes terms for all levels of the factor.
+ * Group means are the coefficients.
+ * Tests are tests of whether the groups are different from zero. (Are the expected counts zero for that spray?)
+* If we want comparisons between, say, Spray B and C, we could refit the model with C (or B) as the reference level.
+
+## Reordering the levels
+```{r}
+spray2 <- relevel(InsectSprays$spray, "C")
+summary(lm(count ~ spray2, data = InsectSprays))$coef
+```
+
+## Doing it manually
+Equivalently
+$$Var(\hat \beta_B - \hat \beta_C) = Var(\hat \beta_B) + Var(\hat \beta_C) - 2 Cov(\hat \beta_B, \hat \beta_C)$$
+```{r}
+fit <- lm(count ~ spray, data = InsectSprays) #A is ref
+bbmbc <- coef(fit)[2] - coef(fit)[3] #B - C
+temp <- summary(fit)
+se <- temp$sigma * sqrt(temp$cov.unscaled[2, 2] + temp$cov.unscaled[3,3] - 2 *temp$cov.unscaled[2,3])
+t <- (bbmbc) / se
+p <- pt(-abs(t), df = fit$df)
+out <- c(bbmbc, se, t, p)
+names(out) <- c("B - C", "SE", "T", "P")
+round(out, 3)
+```
+
+## Other thoughts on this data
+* Counts are bounded from below by 0, which violates the assumption of normality of the errors.
+ * Also there are counts near zero, so both the actual assumption and the intent of the assumption are violated.
+* Variance does not appear to be constant.
+* Perhaps taking logs of the counts would help.
+ * There are 0 counts, so maybe log(Count + 1)
+* Also, we'll cover Poisson GLMs for fitting count data.
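+
+Hedged sketches of the two alternatives just mentioned (shown only to illustrate the calls; interpretation is deferred to the GLM lectures):
+```{r, echo = TRUE}
+# log of the counts, shifted by 1 since some counts are zero
+coef(lm(log(count + 1) ~ spray, data = InsectSprays))
+# Poisson GLM for the counts (coefficients are on the log scale)
+coef(glm(count ~ spray, family = poisson, data = InsectSprays))
+```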
+
+
+## Example - Millennium Development Goal 1
+
+[http://www.un.org/millenniumgoals/pdf/MDG_FS_1_EN.pdf](http://www.un.org/millenniumgoals/pdf/MDG_FS_1_EN.pdf)
+
+[http://apps.who.int/gho/athena/data/GHO/WHOSIS_000008.csv?profile=text&filter=COUNTRY:*;SEX:*](http://apps.who.int/gho/athena/data/GHO/WHOSIS_000008.csv?profile=text&filter=COUNTRY:*;SEX:*)
+
+
+## WHO childhood hunger data
+
+
+```{r whoDataLoad}
+#download.file("http://apps.who.int/gho/athena/data/GHO/WHOSIS_000008.csv?profile=text&filter=COUNTRY:*;SEX:*","hunger.csv",method="curl")
+hunger <- read.csv("hunger.csv")
+hunger <- hunger[hunger$Sex!="Both sexes",]
+head(hunger)
+```
+
+
+## Plot percent hungry versus time
+
+```{r, dependson="whoDataLoad",fig.height=4,fig.width=4}
+lm1 <- lm(hunger$Numeric ~ hunger$Year)
+plot(hunger$Year,hunger$Numeric,pch=19,col="blue")
+```
+
+
+## Remember the linear model
+
+$$Hu_i = b_0 + b_1 Y_i + e_i$$
+
+$b_0$ = percent hungry at Year 0
+
+$b_1$ = decrease in percent hungry per year
+
+$e_i$ = everything we didn't measure
+
+
+## Add the linear model
+
+```{r, dependson="whoDataLoad",fig.height=4,fig.width=4}
+lm1 <- lm(hunger$Numeric ~ hunger$Year)
+plot(hunger$Year,hunger$Numeric,pch=19,col="blue")
+lines(hunger$Year,lm1$fitted,lwd=3,col="darkgrey")
+```
+
+
+
+## Color by male/female
+
+```{r, dependson="whoDataLoad",fig.height=4,fig.width=4}
+plot(hunger$Year,hunger$Numeric,pch=19)
+points(hunger$Year,hunger$Numeric,pch=19,col=((hunger$Sex=="Male")*1+1))
+```
+
+
+## Now two lines
+
+$$HuF_i = bf_0 + bf_1 YF_i + ef_i$$
+
+$bf_0$ = percent of girls hungry at Year 0
+
+$bf_1$ = decrease in percent of girls hungry per year
+
+$ef_i$ = everything we didn't measure
+
+
+$$HuM_i = bm_0 + bm_1 YM_i + em_i$$
+
+$bm_0$ = percent of boys hungry at Year 0
+
+$bm_1$ = decrease in percent of boys hungry per year
+
+$em_i$ = everything we didn't measure
+
+
+
+
+## Color by male/female
+
+```{r, dependson="whoDataLoad",fig.height=3.5,fig.width=4}
+lmM <- lm(hunger$Numeric[hunger$Sex=="Male"] ~ hunger$Year[hunger$Sex=="Male"])
+lmF <- lm(hunger$Numeric[hunger$Sex=="Female"] ~ hunger$Year[hunger$Sex=="Female"])
+plot(hunger$Year,hunger$Numeric,pch=19)
+points(hunger$Year,hunger$Numeric,pch=19,col=((hunger$Sex=="Male")*1+1))
+lines(hunger$Year[hunger$Sex=="Male"],lmM$fitted,col="black",lwd=3)
+lines(hunger$Year[hunger$Sex=="Female"],lmF$fitted,col="red",lwd=3)
+```
+
+
+
+## Two lines, same slope
+
+$$Hu_i = b_0 + b_1 \mathbb{1}(Sex_i="Male") + b_2 Y_i + e^*_i$$
+
+$b_0$ - percent hungry at year zero for females
+
+$b_0 + b_1$ - percent hungry at year zero for males
+
+$b_2$ - change in percent hungry (for either males or females) in one year
+
+$e^*_i$ - everything we didn't measure
+
+
+## Two lines, same slope in R
+
+
+```{r, dependson="whoDataLoad",fig.height=4,fig.width=4}
+lmBoth <- lm(hunger$Numeric ~ hunger$Year + hunger$Sex)
+plot(hunger$Year,hunger$Numeric,pch=19)
+points(hunger$Year,hunger$Numeric,pch=19,col=((hunger$Sex=="Male")*1+1))
+abline(c(lmBoth$coeff[1],lmBoth$coeff[2]),col="red",lwd=3)
+abline(c(lmBoth$coeff[1] + lmBoth$coeff[3],lmBoth$coeff[2] ),col="black",lwd=3)
+```
+
+
+## Two lines, different slopes (interactions)
+
+$$Hu_i = b_0 + b_1 \mathbb{1}(Sex_i="Male") + b_2 Y_i + b_3 \mathbb{1}(Sex_i="Male")\times Y_i + e^+_i$$
+
+$b_0$ - percent hungry at year zero for females
+
+$b_0 + b_1$ - percent hungry at year zero for males
+
+$b_2$ - change in percent hungry (females) in one year
+
+$b_2 + b_3$ - change in percent hungry (males) in one year
+
+$e^+_i$ - everything we didn't measure
+
+
+## Two lines, different slopes in R
+
+
+```{r lmBothChunk, dependson="whoDataLoad",fig.height=4,fig.width=4}
+lmBoth <- lm(hunger$Numeric ~ hunger$Year + hunger$Sex + hunger$Sex*hunger$Year)
+plot(hunger$Year,hunger$Numeric,pch=19)
+points(hunger$Year,hunger$Numeric,pch=19,col=((hunger$Sex=="Male")*1+1))
+abline(c(lmBoth$coeff[1],lmBoth$coeff[2]),col="red",lwd=3)
+abline(c(lmBoth$coeff[1] + lmBoth$coeff[3],lmBoth$coeff[2] +lmBoth$coeff[4]),col="black",lwd=3)
+```
+
+
+
+## Two lines, different slopes in R
+
+
+```{r, dependson="lmBothChunk",fig.height=4,fig.width=4}
+summary(lmBoth)
+```
+
+## Interpreting a continuous interaction
+$$
+E[Y_i | X_{1i}=x_1, X_{2i}=x_2] = \beta_0 + \beta_1 x_{1} + \beta_2 x_{2} + \beta_3 x_{1}x_{2}
+$$
+Holding $X_2$ constant we have
+$$
+E[Y_i | X_{1i}=x_1+1, X_{2i}=x_2]-E[Y_i | X_{1i}=x_1, X_{2i}=x_2]
+= \beta_1 + \beta_3 x_{2}
+$$
+And thus the expected change in $Y$ per unit change in $X_1$ holding all else constant is not constant. $\beta_1$ is the slope when $x_{2} = 0$. Note further that:
+$$
+E[Y_i | X_{1i}=x_1+1, X_{2i}=x_2+1]-E[Y_i | X_{1i}=x_1, X_{2i}=x_2+1]
+$$
+$$
+-\left(E[Y_i | X_{1i}=x_1+1, X_{2i}=x_2]-E[Y_i | X_{1i}=x_1, X_{2i}=x_2]\right)
+$$
+$$
+=\beta_3
+$$
+Thus, $\beta_3$ is the change in the expected change in $Y$ per unit change in $X_1$, per unit change in $X_2$.
+
+Or, the change in the slope relating $X_1$ and $Y$ per unit change in $X_2$.
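+
+A minimal simulated sketch of this interpretation (the variables `x1`, `x2`, `y` below are made up): the fitted slope in `x1` at a given value of `x2` is approximately $\beta_1 + \beta_3 x_2$.
+```{r, echo = TRUE}
+n <- 500; x1 <- rnorm(n); x2 <- rnorm(n)
+y <- 1 + 2 * x1 + 3 * x2 + 1.5 * x1 * x2 + rnorm(n)
+fit <- lm(y ~ x1 * x2)
+# estimated slope of x1 when x2 = 2: roughly 2 + 1.5 * 2 = 5
+unname(coef(fit)["x1"] + coef(fit)["x1:x2"] * 2)
+```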
+
+
+## Example
+
+$$Hu_i = b_0 + b_1 In_i + b_2 Y_i + b_3 In_i \times Y_i + e^+_i$$
+
+$b_0$ - percent hungry at year zero for children whose parents have no income
+
+$b_1$ - change in percent hungry for each dollar of income in year zero
+
+$b_2$ - change in percent hungry in one year for children whose parents have no income
+
+$b_3$ - increased change in percent hungry per year for each dollar of income - e.g. if income is $10,000, then the change in percent hungry in one year will be
+
+$$b_2 + 1e4 \times b_3$$
+
+$e^+_i$ - everything we didn't measure
+
+__Lots of care/caution needed!__
+
+
diff --git a/07_RegressionModels/02_02_multivariateExamples/index.html b/07_RegressionModels/02_02_multivariateExamples/index.html
index ecc8a4389..b29337539 100644
--- a/07_RegressionModels/02_02_multivariateExamples/index.html
+++ b/07_RegressionModels/02_02_multivariateExamples/index.html
@@ -1,1024 +1,554 @@
-
-
-
- Multivariable regression examples
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
Multivariable regression examples
-
Regression Models
-
Brian Caffo, Jeff Leek and Roger Peng Johns Hopkins Bloomberg School of Public Health
-
-
-
-
-
-
-
-
-
Swiss fertility data
-
-
-
library(datasets); data(swiss); require(stats); require(graphics)
-pairs(swiss, panel = panel.smooth, main = "Swiss data", col = 3 + (swiss$Catholic > 50))
-
-
-
-
-
-
-
-
-
-
-
?swiss
-
-
-
Description
-
-
Standardized fertility measure and socio-economic indicators for each of 47 French-speaking provinces of Switzerland at about 1888.
-
-
A data frame with 47 observations on 6 variables, each of which is in percent, i.e., in [0, 100].
-
-
-
[,1] Fertility Ig, ‘ common standardized fertility measure’
-
[,2] Agriculture % of males involved in agriculture as occupation
-
[,3] Examination % draftees receiving highest mark on army examination
-
[,4] Education % education beyond primary school for draftees.
-
[,5] Catholic % ‘catholic’ (as opposed to ‘protestant’).
-
[,6] Infant.Mortality live births who live less than 1 year.
-
-
-
All variables but ‘Fertility’ give proportions of the population.
We estimate an expected 0.17 decrease in standardized fertility for every 1\% increase in percentage of males involved in agriculture in holding the remaining variables constant.
-
The t-test for \(H_0: \beta_{Agri} = 0\) versus \(H_a: \beta_{Agri} \neq 0\) is significant.
-
Interestingly, the unadjusted estimate is
-
-
-
summary(lm(Fertility ~ Agriculture, data = swiss))$coefficients
-
-
-
Estimate Std. Error t value Pr(>|t|)
-(Intercept) 60.3044 4.25126 14.185 3.216e-18
-Agriculture 0.1942 0.07671 2.532 1.492e-02
-
-
-
-
-
-
-
-
-
How can adjustment reverse the sign of an effect? Let's try a simulation.
The sign reverses itself with the inclusion of Examination and Education, but of which are negatively correlated with Agriculture.
-
The percent of males in the province working in agriculture is negatively related to educational attainment (correlation of -0.6395) and Education and Examination (correlation of 0.6984) are obviously measuring similar things.
-
-
-
Is the positive marginal an artifact for not having accounted for, say, Education level? (Education does have a stronger effect, by the way.)
-
-
At the minimum, anyone claiming that provinces that are more agricultural have higher fertility rates would immediately be open to criticism.
-
-
-
-
-
-
-
-
-
What if we include an unnecessary variable?
-
-
-
z adds no new linear information, since it's a linear
-combination of variables already included. R just drops
-terms that are linear combinations of other terms.
-
-
z <- swiss$Agriculture + swiss$Education
-lm(Fertility ~ . + z, data = swiss)
-
Consider the linear model
-\[
-Y_i = \beta_0 + X_{i1} \beta_1 + \epsilon_{i}
-\]
-where each \(X_{i1}\) is binary so that it is a 1 if measurement \(i\) is in a group and 0 otherwise. (Treated versus not in a clinical trial, for example.)
-
Then for people in the group \(E[Y_i] = \beta_0 + \beta_1\)
-
And for people not in the group \(E[Y_i] = \beta_0\)
-
The LS fits work out to be \(\hat \beta_0 + \hat \beta_1\) is the mean for those in the group and \(\hat \beta_0\) is the mean for those not in the group.
-
\(\beta_1\) is interpretted as the increase or decrease in the mean comparing those in the group to those not.
-
Note including a binary variable that is 1 for those not in the group would be redundant. It would create three parameters to describe two means.
-
-
-
-
-
-
-
-
-
More than 2 levels
-
-
-
-
Consider a multilevel factor level. For didactic reasons, let's say a three level factor (example, US political party affiliation: Republican, Democrat, Independent)
Indicator Data.Source PUBLISH.STATES Year WHO.region
-1 Children aged <5 years underweight (%) NLIS_310044 Published 1986 Africa
-2 Children aged <5 years underweight (%) NLIS_310233 Published 1990 Americas
-3 Children aged <5 years underweight (%) NLIS_312902 Published 2005 Americas
-5 Children aged <5 years underweight (%) NLIS_312522 Published 2002 Eastern Mediterranean
-6 Children aged <5 years underweight (%) NLIS_312955 Published 2008 Africa
-8 Children aged <5 years underweight (%) NLIS_312963 Published 2008 Africa
- Country Sex Display.Value Numeric Low High Comments
-1 Senegal Male 19.3 19.3 NA NA NA
-2 Paraguay Male 2.2 2.2 NA NA NA
-3 Nicaragua Male 5.3 5.3 NA NA NA
-5 Jordan Female 3.2 3.2 NA NA NA
-6 Guinea-Bissau Female 17.0 17.0 NA NA NA
-8 Ghana Male 15.7 15.7 NA NA NA
-
-Call:
-lm(formula = hunger$Numeric ~ hunger$Year + hunger$Sex + hunger$Sex *
- hunger$Year)
-
-Residuals:
- Min 1Q Median 3Q Max
--25.91 -11.25 -1.85 7.09 46.15
-
-Coefficients:
- Estimate Std. Error t value Pr(>|t|)
-(Intercept) 603.5058 171.0552 3.53 0.00044 ***
-hunger$Year -0.2934 0.0855 -3.43 0.00062 ***
-hunger$SexMale 61.9477 241.9086 0.26 0.79795
-hunger$Year:hunger$SexMale -0.0300 0.1209 -0.25 0.80402
----
-Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
-
-Residual standard error: 13.2 on 944 degrees of freedom
-Multiple R-squared: 0.0318, Adjusted R-squared: 0.0287
-F-statistic: 10.3 on 3 and 944 DF, p-value: 1.06e-06
-
-
-
-
-
-
-
-
-
Interpretting a continuous interaction
-
-
-
\[
-E[Y_i | X_{1i}=x_1, X_{2i}=x_2] = \beta_0 + \beta_1 x_{1} + \beta_2 x_{2} + \beta_3 x_{1}x_{2}
-\]
-Holding \(X_2\) constant we have
-\[
-E[Y_i | X_{1i}=x_1+1, X_{2i}=x_2]-E[Y_i | X_{1i}=x_1, X_{2i}=x_2]
-= \beta_1 + \beta_3 x_{2}
-\]
-And thus the expected change in \(Y\) per unit change in \(X_1\) holding all else constant is not constant. \(\beta_1\) is the slope when \(x_{2} = 0\). Note further that:
-\[
-E[Y_i | X_{1i}=x_1+1, X_{2i}=x_2+1]-E[Y_i | X_{1i}=x_1, X_{2i}=x_2+1]
-\]
-\[
--E[Y_i | X_{1i}=x_1+1, X_{2i}=x_2]-E[Y_i | X_{1i}=x_1, X_{2i}=x_2]
-\]
-\[
-=\beta_3
-\]
-Thus, \(\beta_3\) is the change in the expected change in \(Y\) per unit change in \(X_1\), per unit change in \(X_2\).
-
-
Or, the change in the slope relating \(X_1\) and \(Y\) per unit change in \(X_2\).
\(b_0\) - percent hungry at year zero for children with whose parents have no income
-
-
\(b_1\) - change in percent hungry for each dollar of income in year zero
-
-
\(b_2\) - change in percent hungry in one year for children whose parents have no income
-
-
\(b_3\) - increased change in percent hungry by year for each dollar of income - e.g. if income is $10,000, then change in percent hungry in one year will be
We estimate an expected 0.17 decrease in standardized fertility for every 1% increase in the percentage of males involved in agriculture, holding the remaining variables constant.
+
The t-test for \(H_0: \beta_{Agri} = 0\) versus \(H_a: \beta_{Agri} \neq 0\) is significant.
+
Interestingly, the unadjusted estimate is
+
+
summary(lm(Fertility ~ Agriculture, data = swiss))$coefficients
+
Estimate Std. Error t value Pr(>|t|)
+(Intercept) 60.3043752 4.25125562 14.185074 3.216304e-18
+Agriculture 0.1942017 0.07671176 2.531577 1.491720e-02
+
How can adjustment reverse the sign of an effect? Let’s try a simulation.
The sign reverses itself with the inclusion of Examination and Education, both of which are negatively correlated with Agriculture.
+
The percent of males in the province working in agriculture is negatively related to educational attainment (correlation of -0.6395225) and Education and Examination (correlation of 0.6984153) are obviously measuring similar things.
+
Is the positive marginal an artifact for not having accounted for, say, Education level? (Education does have a stronger effect, by the way.)
+
At the minimum, anyone claiming that provinces that are more agricultural have higher fertility rates would immediately be open to criticism.
+
+
+
+
What if we include an unnecessary variable?
+
z adds no new linear information, since it’s a linear combination of variables already included. R just drops terms that are linear combinations of other terms.
+
z <- swiss$Agriculture + swiss$Education
+lm(Fertility ~ . + z, data = swiss)
+
+Call:
+lm(formula = Fertility ~ . + z, data = swiss)
+
+Coefficients:
+ (Intercept) Agriculture Examination Education Catholic
+ 66.9152 -0.1721 -0.2580 -0.8709 0.1041
+Infant.Mortality z
+ 1.0770 NA
+
+
+
Dummy variables are smart
+
+
Consider the linear model \[
+Y_i = \beta_0 + X_{i1} \beta_1 + \epsilon_{i}
+\] where each \(X_{i1}\) is binary so that it is a 1 if measurement \(i\) is in a group and 0 otherwise. (Treated versus not in a clinical trial, for example.)
+
Then for people in the group \(E[Y_i] = \beta_0 + \beta_1\)
+
And for people not in the group \(E[Y_i] = \beta_0\)
+
The LS fits work out to be \(\hat \beta_0 + \hat \beta_1\) is the mean for those in the group and \(\hat \beta_0\) is the mean for those not in the group.
+
+\(\beta_1\) is interpreted as the increase or decrease in the mean comparing those in the group to those not.
+
Note including a binary variable that is 1 for those not in the group would be redundant. It would create three parameters to describe two means.
+
+
+
+
More than 2 levels
+
+
+Consider a multilevel factor. For didactic reasons, let’s say a three level factor (for example, US political party affiliation: Republican, Democrat, Independent).
Indicator Data.Source PUBLISH.STATES Year WHO.region
+1 Children aged <5 years underweight (%) NLIS_310044 Published 1986 Africa
+2 Children aged <5 years underweight (%) NLIS_310233 Published 1990 Americas
+3 Children aged <5 years underweight (%) NLIS_312902 Published 2005 Americas
+5 Children aged <5 years underweight (%) NLIS_312522 Published 2002 Eastern Mediterranean
+6 Children aged <5 years underweight (%) NLIS_312955 Published 2008 Africa
+8 Children aged <5 years underweight (%) NLIS_312963 Published 2008 Africa
+ Country Sex Display.Value Numeric Low High Comments
+1 Senegal Male 19.3 19.3 NA NA NA
+2 Paraguay Male 2.2 2.2 NA NA NA
+3 Nicaragua Male 5.3 5.3 NA NA NA
+5 Jordan Female 3.2 3.2 NA NA NA
+6 Guinea-Bissau Female 17.0 17.0 NA NA NA
+8 Ghana Male 15.7 15.7 NA NA NA
+Call:
+lm(formula = hunger$Numeric ~ hunger$Year + hunger$Sex + hunger$Sex *
+ hunger$Year)
+
+Residuals:
+ Min 1Q Median 3Q Max
+-25.913 -11.248 -1.853 7.087 46.146
+
+Coefficients:
+ Estimate Std. Error t value Pr(>|t|)
+(Intercept) 603.50580 171.05519 3.528 0.000439 ***
+hunger$Year -0.29340 0.08547 -3.433 0.000623 ***
+hunger$SexMale 61.94772 241.90858 0.256 0.797946
+hunger$Year:hunger$SexMale -0.03000 0.12087 -0.248 0.804022
+---
+Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+
+Residual standard error: 13.21 on 944 degrees of freedom
+Multiple R-squared: 0.03181, Adjusted R-squared: 0.02874
+F-statistic: 10.34 on 3 and 944 DF, p-value: 1.064e-06
+
+
+
Interpreting a continuous interaction
+
\[
+E[Y_i | X_{1i}=x_1, X_{2i}=x_2] = \beta_0 + \beta_1 x_{1} + \beta_2 x_{2} + \beta_3 x_{1}x_{2}
+\] Holding \(X_2\) constant we have \[
+E[Y_i | X_{1i}=x_1+1, X_{2i}=x_2]-E[Y_i | X_{1i}=x_1, X_{2i}=x_2]
+= \beta_1 + \beta_3 x_{2}
+\] And thus the expected change in \(Y\) per unit change in \(X_1\) holding all else constant is not constant. \(\beta_1\) is the slope when \(x_{2} = 0\). Note further that: \[
+E[Y_i | X_{1i}=x_1+1, X_{2i}=x_2+1]-E[Y_i | X_{1i}=x_1, X_{2i}=x_2+1]
+\]\[
+-\left(E[Y_i | X_{1i}=x_1+1, X_{2i}=x_2]-E[Y_i | X_{1i}=x_1, X_{2i}=x_2]\right)
+\]\[
+=\beta_3
+\] Thus, \(\beta_3\) is the change in the expected change in \(Y\) per unit change in \(X_1\), per unit change in \(X_2\).
+
Or, the change in the slope relating \(X_1\) and \(Y\) per unit change in \(X_2\).
\(b_0\) - percent hungry at year zero for children whose parents have no income
+
\(b_1\) - change in percent hungry for each dollar of income in year zero
+
\(b_2\) - change in percent hungry in one year for children whose parents have no income
+
\(b_3\) - increased change in percent hungry by year for each dollar of income - e.g. if income is $10,000, then change in percent hungry in one year will be
+
\[b_2 + 1e4 \times b_3\]
+
\(e^+_i\) - everything we didn’t measure
+
+Lots of care/caution needed!
+
+
+
+
+
+
+
+
+
+
+
+
+
+
diff --git a/07_RegressionModels/02_03_adjustment/index.Rmd b/07_RegressionModels/02_03_adjustment/index.Rmd
index 3755b45c9..a5e795360 100644
--- a/07_RegressionModels/02_03_adjustment/index.Rmd
+++ b/07_RegressionModels/02_03_adjustment/index.Rmd
@@ -14,6 +14,8 @@ widgets : [mathjax] # {mathjax, quiz, bootstrap}
mode : selfcontained # {standalone, draft}
---
```{r setup, cache = F, echo = F, message = F, warning = F, tidy = F, results='hide'}
+# Must explicitly include knitr or the chunks will fail, at least in Linux
+library(knitr)
# make this an external chunk that can be included in any file
options(width = 100)
opts_chunk$set(message = F, error = F, warning = F, comment = NA, fig.align = 'center', dpi = 100, tidy = F, cache.path = '.cache/', fig.path = 'fig/')
@@ -48,7 +50,7 @@ points(x[(n/2 + 1) : n], y[(n/2 + 1) : n], pch = 21, col = "black", bg = "salmon
```
----
+
## Simulation 1
```{r, fig.height=5, fig.width=5, echo = FALSE, results='hide'}
n <- 100; t <- rep(c(0, 1), c(n/2, n/2)); x <- c(runif(n/2), runif(n/2));
@@ -65,7 +67,7 @@ points(x[1 : (n/2)], y[1 : (n/2)], pch = 21, col = "black", bg = "lightblue", ce
points(x[(n/2 + 1) : n], y[(n/2 + 1) : n], pch = 21, col = "black", bg = "salmon", cex = 2)
```
----
+
## Discussion
### Some things to note in this simulation
* The X variable is unrelated to group status
@@ -75,7 +77,7 @@ points(x[(n/2 + 1) : n], y[(n/2 + 1) : n], pch = 21, col = "black", bg = "salmon
* The relationship between group status and Y is constant depending on X.
* The relationship between group and Y disregarding X is about the same as holding X constant
----
+
## Simulation 2
```{r, fig.height=5, fig.width=5, echo = FALSE, results='hide'}
n <- 100; t <- rep(c(0, 1), c(n/2, n/2)); x <- c(runif(n/2), 1.5 + runif(n/2));
@@ -93,7 +95,7 @@ points(x[(n/2 + 1) : n], y[(n/2 + 1) : n], pch = 21, col = "black", bg = "salmon
```
----
+
## Discussion
### Some things to note in this simulation
* The X variable is highly related to group status
@@ -106,7 +108,7 @@ points(x[(n/2 + 1) : n], y[(n/2 + 1) : n], pch = 21, col = "black", bg = "salmon
group and Y.
* This conclusion is entirely based on the model.
----
+
## Simulation 3
```{r, fig.height=5, fig.width=5, echo = FALSE, results='hide'}
n <- 100; t <- rep(c(0, 1), c(n/2, n/2)); x <- c(runif(n/2), .9 + runif(n/2));
@@ -123,7 +125,7 @@ points(x[1 : (n/2)], y[1 : (n/2)], pch = 21, col = "black", bg = "lightblue", ce
points(x[(n/2 + 1) : n], y[(n/2 + 1) : n], pch = 21, col = "black", bg = "salmon", cex = 2)
```
----
+
## Discussion
### Some things to note in this simulation
* Marginal association has red group higher than blue.
@@ -134,7 +136,7 @@ holding X fixed.
----
+
## Simulation 4
```{r, fig.height=5, fig.width=5, echo = FALSE, results='hide'}
n <- 100; t <- rep(c(0, 1), c(n/2, n/2)); x <- c(.5 + runif(n/2), runif(n/2));
@@ -151,7 +153,7 @@ points(x[1 : (n/2)], y[1 : (n/2)], pch = 21, col = "black", bg = "lightblue", ce
points(x[(n/2 + 1) : n], y[(n/2 + 1) : n], pch = 21, col = "black", bg = "salmon", cex = 2)
```
----
+
## Discussion
### Some things to note in this simulation
* No marginal association between group status and Y.
@@ -160,7 +162,7 @@ points(x[(n/2 + 1) : n], y[(n/2 + 1) : n], pch = 21, col = "black", bg = "salmon
* There is lots of direct evidence for comparing red and blue
holding X fixed.
----
+
## Simulation 5
```{r, fig.height=5, fig.width=5, echo = FALSE, results='hide'}
n <- 100; t <- rep(c(0, 1), c(n/2, n/2)); x <- c(runif(n/2, -1, 1), runif(n/2, -1, 1));
@@ -177,7 +179,7 @@ points(x[1 : (n/2)], y[1 : (n/2)], pch = 21, col = "black", bg = "lightblue", ce
points(x[(n/2 + 1) : n], y[(n/2 + 1) : n], pch = 21, col = "black", bg = "salmon", cex = 2)
```
----
+
## Discussion
### Some things to note from this simulation
* There is no such thing as a group effect here.
@@ -186,7 +188,7 @@ points(x[(n/2 + 1) : n], y[(n/2 + 1) : n], pch = 21, col = "black", bg = "salmon
* Group status and X unrelated.
* There's lots of information about group effects holding X fixed.
----
+
### Simulation 6
```{r, fig.height=5, fig.width=5, echo = FALSE, results='hide'}
p <- 1
@@ -199,14 +201,14 @@ co.pal <- heat.colors(n)
points(x1, y, pch = 21, col = "black", bg = co.pal[round((n - 1) * x2 + 1)], cex = 2)
```
----
+
### Do this to investigate the bivariate relationship
```
library(rgl)
plot3d(x1, x2, y)
```
----
+
### Residual relationship
```{r, fig.height=5, fig.width=5, echo = FALSE, results='hide'}
plot(resid(lm(x1 ~ x2)), resid(lm(y ~ x2)), frame = FALSE, col = "black", bg = "lightblue", pch = 21, cex = 2)
@@ -214,7 +216,7 @@ abline(lm(I(resid(lm(x1 ~ x2))) ~ I(resid(lm(y ~ x2)))), lwd = 2)
```
----
+
## Discussion
### Some things to note from this simulation
@@ -224,7 +226,7 @@ abline(lm(I(resid(lm(x1 ~ x2))) ~ I(resid(lm(y ~ x2)))), lwd = 2)
by considering X2.
* Almost no residual variability after accounting for X2.
----
+
## Some final thoughts
* Modeling multivariate relationships is difficult.
* Play around with simulations to see how the
diff --git a/07_RegressionModels/02_03_adjustment/index.html b/07_RegressionModels/02_03_adjustment/index.html
index 433c18736..883cbc91f 100644
--- a/07_RegressionModels/02_03_adjustment/index.html
+++ b/07_RegressionModels/02_03_adjustment/index.html
@@ -1,61 +1,130 @@
-
+
+
+
- Multivariable regression
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+
+
+
+
+
+
+
+
+
+Multivariable regression
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
-
-
-
-
-
-
-
-
-
-
-
-
-
Multivariable regression
-
Regression
-
Brian Caffo, Jeff Leek, Roger Peng Johns Hopkins Bloomberg School of Public Health
-
-
-
-
-
-
-
Consider the following simulated data
-
-
-
Code for the first plot, rest omitted
-(See the git repo for the rest of the code.)
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Multivariable regression
+
Regression
+
Brian Caffo, Jeff Leek, Roger Peng
+
+
+
+
+
+
Consider the following simulated data
+
Code for the first plot, rest omitted (See the git repo for the rest of the code.)
n <- 100; t <- rep(c(0, 1), c(n/2, n/2)); x <- c(runif(n/2), runif(n/2));
beta0 <- 0; beta1 <- 2; tau <- 1; sigma <- .2
y <- beta0 + x * beta1 + t * tau + rnorm(n, sd = sigma)
@@ -67,293 +136,165 @@
The X variable is related to Y, but the intercept depends
-on group status.
-
The group variable is related to Y.
-
-
+
The X variable is related to Y, but the intercept depends on group status.
+
The group variable is related to Y.
The relationship between group status and Y is constant depending on X.
The relationship between group and Y disregarding X is about the same as holding X constant
-
-
-
-
-
-
-
-
-
Simulation 2
-
-
-
-
-
-
-
-
-
-
-
Discussion
-
-
-
Some things to note in this simulation
-
+
+
+
+
Simulation 2
+
+
+
+
+
+
Discussion
+
+
Some things to note in this simulation
The X variable is highly related to group status
-
The X variable is related to Y, the intercept
-doesn't depend on the group variable.
-
-
+
The X variable is related to Y, the intercept doesn’t depend on the group variable.
The X variable remains related to Y holding group status constant
-
The group variable is marginally related to Y disregarding X.
-
The model would estimate no adjusted effect due to group.
-
-
-
There isn't any data to inform the relationship between
-group and Y.
-
This conclusion is entirely based on the model.
-
+
The model would estimate no adjusted effect due to group.
+
There isn’t any data to inform the relationship between group and Y.
+
This conclusion is entirely based on the model.
-
-
-
-
-
-
-
-
Simulation 3
-
-
-
-
-
-
-
-
-
-
-
Discussion
-
-
-
Some things to note in this simulation
-
+
+
+
+
Simulation 3
+
+
+
+
+
+
Discussion
+
+
Some things to note in this simulation
Marginal association has red group higher than blue.
Adjusted relationship has blue group higher than red.
Group status related to X.
-
There is some direct evidence for comparing red and blue
-holding X fixed.
+
There is some direct evidence for comparing red and blue holding X fixed.
-
-
-
-
-
-
-
-
Simulation 4
-
-
-
-
-
-
-
-
-
-
-
Discussion
-
-
-
Some things to note in this simulation
-
+
+
+
+
Simulation 4
+
+
+
+
+
+
Discussion
+
+
Some things to note in this simulation
No marginal association between group status and Y.
Strong adjusted relationship.
Group status not related to X.
-
There is lots of direct evidence for comparing red and blue
-holding X fixed.
+
There is lots of direct evidence for comparing red and blue holding X fixed.
-
-
-
-
-
-
-
-
Simulation 5
-
-
-
-
-
-
-
-
-
-
-
Discussion
-
-
-
Some things to note from this simulation
-
-
-
There is no such thing as a group effect here.
-
+
+
+
+
Simulation 5
+
+
+
+
+
+
Discussion
+
+
Some things to note from this simulation
+
There is no such thing as a group effect here.
The impact of group reverses itself depending on X.
Both intercept and slope depends on group.
-
-
Group status and X unrelated.
-
-
-
There's lots of information about group effects holding X fixed.
-
+
Group status and X unrelated.
+
There’s lots of information about group effects holding X fixed.
-
-
-
-
-
-
-
-
Simulation 6
-
-
-
-
-
-
-
-
-
-
-
Do this to investigate the bivariate relationship
-
-
-
library(rgl)
-plot3d(x1, x2, y)
-
-
-
-
-
-
-
-
-
Residual relationship
-
-
-
-
-
-
-
-
-
-
-
Discussion
-
-
-
Some things to note from this simulation
-
+
+
+
Simulation 6
+
+
+
+
+
+
Do this to investigate the bivariate relationship
+
library(rgl)
+plot3d(x1, x2, y)
+
+
+
Residual relationship
+
+
+
+
+
+
+
Discussion
+
+
Some things to note from this simulation
X1 unrelated to X2
X2 strongly related to Y
-
Adjusted relationship between X1 and Y largely unchanged
-by considering X2.
-
-
+
Adjusted relationship between X1 and Y largely unchanged by considering X2.
Almost no residual variability after accounting for X2.
-
+
+
+
+
Some final thoughts
+
+
Modeling multivariate relationships is difficult.
+
+Play around with simulations to see how the inclusion or exclusion of another variable can change analyses.
+
The results of these analyses deal with the impact of variables on associations.
+
Ascertaining mechanisms or cause are difficult subjects to be added on top of difficulty in understanding multivariate associations.
+
+
-
-
-
-
-
-
Some final thoughts
-
-
-
-
Modeling multivariate relationships is difficult.
-
Play around with simulations to see how the
-inclusion or exclustion of another variable can
-change analyses.
-
The results of these analyses deal with the
-impact of variables on associations.
-
-
Ascertaining mechanisms or cause are difficult subjects
-to be added on top of difficulty in understanding multivariate associations.
-
-
-
-
-
+
-
-
+
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
\ No newline at end of file
+
+
+
diff --git a/07_RegressionModels/02_04_residuals_variation_diagnostics/fig/unnamed-chunk-1-1.png b/07_RegressionModels/02_04_residuals_variation_diagnostics/fig/unnamed-chunk-1-1.png
new file mode 100644
index 000000000..1a7cc1b28
Binary files /dev/null and b/07_RegressionModels/02_04_residuals_variation_diagnostics/fig/unnamed-chunk-1-1.png differ
diff --git a/07_RegressionModels/02_04_residuals_variation_diagnostics/fig/unnamed-chunk-10-1.png b/07_RegressionModels/02_04_residuals_variation_diagnostics/fig/unnamed-chunk-10-1.png
new file mode 100644
index 000000000..1a7cc1b28
Binary files /dev/null and b/07_RegressionModels/02_04_residuals_variation_diagnostics/fig/unnamed-chunk-10-1.png differ
diff --git a/07_RegressionModels/02_04_residuals_variation_diagnostics/fig/unnamed-chunk-2-1.png b/07_RegressionModels/02_04_residuals_variation_diagnostics/fig/unnamed-chunk-2-1.png
new file mode 100644
index 000000000..dc1cd4783
Binary files /dev/null and b/07_RegressionModels/02_04_residuals_variation_diagnostics/fig/unnamed-chunk-2-1.png differ
diff --git a/07_RegressionModels/02_04_residuals_variation_diagnostics/fig/unnamed-chunk-3-1.png b/07_RegressionModels/02_04_residuals_variation_diagnostics/fig/unnamed-chunk-3-1.png
new file mode 100644
index 000000000..dfb959b80
Binary files /dev/null and b/07_RegressionModels/02_04_residuals_variation_diagnostics/fig/unnamed-chunk-3-1.png differ
diff --git a/07_RegressionModels/02_04_residuals_variation_diagnostics/fig/unnamed-chunk-5-1.png b/07_RegressionModels/02_04_residuals_variation_diagnostics/fig/unnamed-chunk-5-1.png
new file mode 100644
index 000000000..656dde1c6
Binary files /dev/null and b/07_RegressionModels/02_04_residuals_variation_diagnostics/fig/unnamed-chunk-5-1.png differ
diff --git a/07_RegressionModels/02_04_residuals_variation_diagnostics/fig/unnamed-chunk-7-1.png b/07_RegressionModels/02_04_residuals_variation_diagnostics/fig/unnamed-chunk-7-1.png
new file mode 100644
index 000000000..548321efc
Binary files /dev/null and b/07_RegressionModels/02_04_residuals_variation_diagnostics/fig/unnamed-chunk-7-1.png differ
diff --git a/07_RegressionModels/02_04_residuals_variation_diagnostics/fig/unnamed-chunk-9-1.png b/07_RegressionModels/02_04_residuals_variation_diagnostics/fig/unnamed-chunk-9-1.png
new file mode 100644
index 000000000..5934547ec
Binary files /dev/null and b/07_RegressionModels/02_04_residuals_variation_diagnostics/fig/unnamed-chunk-9-1.png differ
diff --git a/07_RegressionModels/02_04_residuals_variation_diagnostics/index.Rmd b/07_RegressionModels/02_04_residuals_variation_diagnostics/index.Rmd
index b1fead2bb..f2cba8c1b 100644
--- a/07_RegressionModels/02_04_residuals_variation_diagnostics/index.Rmd
+++ b/07_RegressionModels/02_04_residuals_variation_diagnostics/index.Rmd
@@ -1,163 +1,165 @@
----
-title : Residuals, diagnostics, variation
-subtitle : Regression
-author : Brian Caffo, Jeff Leek, Roger Peng
-job : Johns Hopkins Bloomberg School of Public Health
-logo : bloomberg_shield.png
-framework : io2012 # {io2012, html5slides, shower, dzslides, ...}
-highlighter : highlight.js # {highlight.js, prettify, highlight}
-hitheme : tomorrow #
-url:
- lib: ../../librariesNew
- assets: ../../assets
-widgets : [mathjax] # {mathjax, quiz, bootstrap}
-mode : selfcontained # {standalone, draft}
----
-```{r setup, cache = F, echo = F, message = F, warning = F, tidy = F, results='hide'}
-# make this an external chunk that can be included in any file
-options(width = 100)
-opts_chunk$set(message = F, error = F, warning = F, comment = NA, fig.align = 'center', dpi = 100, tidy = F, cache.path = '.cache/', fig.path = 'fig/')
-
-options(xtable.type = 'html')
-knit_hooks$set(inline = function(x) {
- if(is.numeric(x)) {
- round(x, getOption('digits'))
- } else {
- paste(as.character(x), collapse = ', ')
- }
-})
-knit_hooks$set(plot = knitr:::hook_plot_html)
-runif(1)
-```
-## The linear model
-* Specified as $Y_i = \sum_{k=1}^p X_{ik} \beta_j + \epsilon_{i}$
-* We'll also assume here that $\epsilon_i \stackrel{iid}{\sim} N(0, \sigma^2)$
-* Define the residuals as
-$e_i = Y_i - \hat Y_i = Y_i - \sum_{k=1}^p X_{ik} \hat \beta_j$
-* Our estimate of residual variation is $\hat \sigma^2 = \frac{\sum_{i=1}^n e_i^2}{n-p}$, the $n-p$ so that $E[\hat \sigma^2] = \sigma^2$
-
----
-```{r, fig.height = 5, fig.width = 5}
-data(swiss); par(mfrow = c(2, 2))
-fit <- lm(Fertility ~ . , data = swiss); plot(fit)
-```
-
----
-## Influential, high leverage and outlying points
-```{r, fig.height = 5, fig.width=5, echo = FALSE, results='hide'}
-n <- 100; x <- rnorm(n); y <- x + rnorm(n, sd = .3)
-plot(c(-3, 6), c(-3, 6), type = "n", frame = FALSE, xlab = "X", ylab = "Y")
-abline(lm(y ~ x), lwd = 2)
-points(x, y, cex = 2, bg = "lightblue", col = "black", pch = 21)
-points(0, 0, cex = 2, bg = "darkorange", col = "black", pch = 21)
-points(0, 5, cex = 2, bg = "darkorange", col = "black", pch = 21)
-points(5, 5, cex = 2, bg = "darkorange", col = "black", pch = 21)
-points(5, 0, cex = 2, bg = "darkorange", col = "black", pch = 21)
-```
-
----
-## Summary of the plot
-Calling a point an outlier is vague.
- * Outliers can be the result of spurious or real processes.
- * Outliers can have varying degrees of influence.
- * Outliers can conform to the regression relationship (i.e being marginally outlying in X or Y, but not outlying given the regression relationship).
-* Upper left hand point has low leverage, low influence, outlies in a way not conforming to the regression relationship.
-* Lower left hand point has low leverage, low influence and is not to be an outlier in any sense.
-* Upper right hand point has high leverage, but chooses not to extert it and thus would have low actual influence by conforming to the regresison relationship of the other points.
-* Lower right hand point has high leverage and would exert it if it were included in the fit.
-
----
-## Influence measures
-* Do `?influence.measures` to see the full suite of influence measures in stats. The measures include
- * `rstandard` - standardized residuals, residuals divided by their standard deviations)
- * `rstudent` - standardized residuals, residuals divided by their standard deviations, where the ith data point was deleted in the calculation of the standard deviation for the residual to follow a t distribution
- * `hatvalues` - measures of leverage
- * `dffits` - change in the predicted response when the $i^{th}$ point is deleted in fitting the model.
- * `dfbetas` - change in individual coefficients when the $i^{th}$ point is deleted in fitting the model.
- * `cooks.distance` - overall change in the coefficients when the $i^{th}$ point is deleted.
- * `resid` - returns the ordinary residuals
- * `resid(fit) / (1 - hatvalues(fit))` where `fit` is the linear model fit returns the PRESS residuals, i.e. the leave one out cross validation residuals - the difference in the response and the predicted response at data point $i$, where it was not included in the model fitting.
-
----
-## How do I use all of these things?
-* Be wary of simplistic rules for diagnostic plots and measures. The use of these tools is context specific. It's better to understand what they are trying to accomplish and use them judiciously.
-* Not all of the measures have meaningful absolute scales. You can look at them relative to the values across the data.
-* They probe your data in different ways to diagnose different problems.
-* Patterns in your residual plots generally indicate some poor aspect of model fit. These can include:
- * Heteroskedasticity (non constant variance).
- * Missing model terms.
- * Temporal patterns (plot residuals versus collection order).
-* Residual QQ plots investigate normality of the errors.
-* Leverage measures (hat values) can be useful for diagnosing data entry errors.
-* Influence measures get to the bottom line, 'how does deleting or including this point impact a particular aspect of the model'.
-
----
-## Case 1
-```{r, fig.height=5, fig.width=5, echo=FALSE}
-x <- c(10, rnorm(n)); y <- c(10, c(rnorm(n)))
-plot(x, y, frame = FALSE, cex = 2, pch = 21, bg = "lightblue", col = "black")
-abline(lm(y ~ x))
-```
-
----
-## The code
-```
-n <- 100; x <- c(10, rnorm(n)); y <- c(10, c(rnorm(n)))
-plot(x, y, frame = FALSE, cex = 2, pch = 21, bg = "lightblue", col = "black")
-abline(lm(y ~ x))
-```
-* The point `c(10, 10)` has created a strong regression relationship where there shouldn't be one.
-
----
-## Showing a couple of the diagnostic values
-```{r}
-fit <- lm(y ~ x)
-round(dfbetas(fit)[1 : 10, 2], 3)
-round(hatvalues(fit)[1 : 10], 3)
-```
-
----
-## Case 2
-```{r, fig.height=5, fig.width=5, echo=FALSE}
-x <- rnorm(n); y <- x + rnorm(n, sd = .3)
-x <- c(5, x); y <- c(5, y)
-plot(x, y, frame = FALSE, cex = 2, pch = 21, bg = "lightblue", col = "black")
-fit2 <- lm(y ~ x)
-abline(fit2)
-```
-
----
-## Looking at some of the diagnostics
-```{r, echo = TRUE}
-round(dfbetas(fit2)[1 : 10, 2], 3)
-round(hatvalues(fit2)[1 : 10], 3)
-```
-
----
-## Example described by Stefanski TAS 2007 Vol 61.
-```{r, fig.height=4, fig.width=4}
-## Don't everyone hit this server at once. Read the paper first.
-dat <- read.table('http://www4.stat.ncsu.edu/~stefanski/NSF_Supported/Hidden_Images/orly_owl_files/orly_owl_Lin_4p_5_flat.txt', header = FALSE)
-pairs(dat)
-```
-
----
-## Got our P-values, should we bother to do a residual plot?
-```{r}
-summary(lm(V1 ~ . -1, data = dat))$coef
-```
-
----
-## Residual plot
-### P-values significant, O RLY?
-```{r, fig.height=4, fig.width=4, echo = TRUE}
-fit <- lm(V1 ~ . - 1, data = dat); plot(predict(fit), resid(fit), pch = '.')
-```
-
----
-## Back to the Swiss data
-```{r, fig.height = 5, fig.width = 5, echo=FALSE}
-data(swiss); par(mfrow = c(2, 2))
-fit <- lm(Fertility ~ . , data = swiss); plot(fit)
-```
-
+---
+title : Residuals, diagnostics, variation
+subtitle : Regression
+author : Brian Caffo, Jeff Leek, Roger Peng
+job : Johns Hopkins Bloomberg School of Public Health
+logo : bloomberg_shield.png
+framework : io2012 # {io2012, html5slides, shower, dzslides, ...}
+highlighter : highlight.js # {highlight.js, prettify, highlight}
+hitheme : tomorrow #
+url:
+ lib: ../../librariesNew
+ assets: ../../assets
+widgets : [mathjax] # {mathjax, quiz, bootstrap}
+mode : selfcontained # {standalone, draft}
+---
+```{r setup, cache = F, echo = F, message = F, warning = F, tidy = F, results='hide'}
+# Must explicitly include knitr or the chunks will fail, at least in Linux
+library(knitr)
+# make this an external chunk that can be included in any file
+options(width = 100)
+opts_chunk$set(message = F, error = F, warning = F, comment = NA, fig.align = 'center', dpi = 100, tidy = F, cache.path = '.cache/', fig.path = 'fig/')
+
+options(xtable.type = 'html')
+knit_hooks$set(inline = function(x) {
+ if(is.numeric(x)) {
+ round(x, getOption('digits'))
+ } else {
+ paste(as.character(x), collapse = ', ')
+ }
+})
+knit_hooks$set(plot = knitr:::hook_plot_html)
+runif(1)
+```
+## The linear model
+* Specified as $Y_i = \sum_{k=1}^p X_{ik} \beta_k + \epsilon_{i}$
+* We'll also assume here that $\epsilon_i \stackrel{iid}{\sim} N(0, \sigma^2)$
+* Define the residuals as
+$e_i = Y_i - \hat Y_i = Y_i - \sum_{k=1}^p X_{ik} \hat \beta_k$
+* Our estimate of residual variation is $\hat \sigma^2 = \frac{\sum_{i=1}^n e_i^2}{n-p}$, where dividing by $n-p$ rather than $n$ makes $E[\hat \sigma^2] = \sigma^2$ (a quick numerical check follows below)
+
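+A quick numerical check (a sketch, not part of the original slides) of this formula, using the same `swiss` fit as the next chunk:
+```{r}
+data(swiss)
+fit <- lm(Fertility ~ . , data = swiss)
+n <- nrow(swiss); p <- length(coef(fit))
+## summary()'s residual standard error squared should match the definition above
+c(summary(fit)$sigma^2, sum(resid(fit)^2) / (n - p))
+```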
+
+```{r, fig.height = 5, fig.width = 5}
+data(swiss); par(mfrow = c(2, 2))
+fit <- lm(Fertility ~ . , data = swiss); plot(fit)
+```
+
+
+## Influential, high leverage and outlying points
+```{r, fig.height = 5, fig.width=5, echo = FALSE, results='hide'}
+n <- 100; x <- rnorm(n); y <- x + rnorm(n, sd = .3)
+plot(c(-3, 6), c(-3, 6), type = "n", frame = FALSE, xlab = "X", ylab = "Y")
+abline(lm(y ~ x), lwd = 2)
+points(x, y, cex = 2, bg = "lightblue", col = "black", pch = 21)
+points(0, 0, cex = 2, bg = "darkorange", col = "black", pch = 21)
+points(0, 5, cex = 2, bg = "darkorange", col = "black", pch = 21)
+points(5, 5, cex = 2, bg = "darkorange", col = "black", pch = 21)
+points(5, 0, cex = 2, bg = "darkorange", col = "black", pch = 21)
+```
+
+
+## Summary of the plot
+Calling a point an outlier is vague.
+ * Outliers can be the result of spurious or real processes.
+ * Outliers can have varying degrees of influence.
+ * Outliers can conform to the regression relationship (i.e., be marginally outlying in X or Y, but not outlying given the regression relationship).
+* The upper left hand point has low leverage and low influence; it outlies in a way that does not conform to the regression relationship.
+* The lower left hand point has low leverage and low influence, and is not an outlier in any sense.
+* The upper right hand point has high leverage but does not exert it, and thus would have low actual influence since it conforms to the regression relationship of the other points.
+* The lower right hand point has high leverage and would exert it if it were included in the fit.
+
+
+## Influence measures
+* Do `?influence.measures` to see the full suite of influence measures in stats. The measures include (a short sketch pulling several of them follows this list):
+ * `rstandard` - standardized residuals, i.e. residuals divided by their estimated standard deviations.
+ * `rstudent` - externally studentized residuals; the $i^{th}$ data point is deleted when estimating the standard deviation used to scale the $i^{th}$ residual, so that it follows a t distribution.
+ * `hatvalues` - measures of leverage.
+ * `dffits` - change in the predicted response when the $i^{th}$ point is deleted in fitting the model.
+ * `dfbetas` - change in individual coefficients when the $i^{th}$ point is deleted in fitting the model.
+ * `cooks.distance` - overall change in the coefficients when the $i^{th}$ point is deleted.
+ * `resid` - returns the ordinary residuals.
+ * `resid(fit) / (1 - hatvalues(fit))`, where `fit` is the linear model fit, returns the PRESS residuals, i.e. the leave-one-out cross validation residuals: the difference between the response and the predicted response at data point $i$, where point $i$ was not included in the model fitting.
+
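+A minimal sketch (not part of the original slides) pulling several of these measures for one fit of the `swiss` data:
+```{r}
+fit <- lm(Fertility ~ . , data = swiss)
+measures <- data.frame(rstandard = rstandard(fit),
+                       rstudent  = rstudent(fit),
+                       hat       = hatvalues(fit),
+                       cooks     = cooks.distance(fit),
+                       press     = resid(fit) / (1 - hatvalues(fit)))
+## look at the first few observations; compare values within a column across observations
+head(round(measures, 3))
+```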
+
+## How do I use all of these things?
+* Be wary of simplistic rules for diagnostic plots and measures. The use of these tools is context specific. It's better to understand what they are trying to accomplish and use them judiciously.
+* Not all of the measures have meaningful absolute scales. You can look at them relative to the values across the data.
+* They probe your data in different ways to diagnose different problems.
+* Patterns in your residual plots generally indicate some poor aspect of model fit (a small simulated example of one such pattern follows this list). These can include:
+ * Heteroskedasticity (non-constant variance).
+ * Missing model terms.
+ * Temporal patterns (plot residuals versus collection order).
+* Residual QQ plots investigate normality of the errors.
+* Leverage measures (hat values) can be useful for diagnosing data entry errors.
+* Influence measures get to the bottom line: 'how does deleting or including this point impact a particular aspect of the model?'
+
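+A small simulated example (assumed values, not part of the original slides) of a residual pattern indicating heteroskedasticity:
+```{r, fig.height=4, fig.width=4}
+x <- runif(100); y <- x + rnorm(100, sd = x)  # error standard deviation grows with x
+fit <- lm(y ~ x)
+plot(predict(fit), resid(fit))                # fan shape: non-constant variance
+```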
+
+## Case 1
+```{r, fig.height=5, fig.width=5, echo=FALSE}
+x <- c(10, rnorm(n)); y <- c(10, c(rnorm(n)))
+plot(x, y, frame = FALSE, cex = 2, pch = 21, bg = "lightblue", col = "black")
+abline(lm(y ~ x))
+```
+
+
+## The code
+```
+n <- 100; x <- c(10, rnorm(n)); y <- c(10, c(rnorm(n)))
+plot(x, y, frame = FALSE, cex = 2, pch = 21, bg = "lightblue", col = "black")
+abline(lm(y ~ x))
+```
+* The point `c(10, 10)` has created a strong regression relationship where there shouldn't be one.
+
+
+## Showing a couple of the diagnostic values
+```{r}
+fit <- lm(y ~ x)
+round(dfbetas(fit)[1 : 10, 2], 3)
+round(hatvalues(fit)[1 : 10], 3)
+```
+
+
+## Case 2
+```{r, fig.height=5, fig.width=5, echo=FALSE}
+x <- rnorm(n); y <- x + rnorm(n, sd = .3)
+x <- c(5, x); y <- c(5, y)
+plot(x, y, frame = FALSE, cex = 2, pch = 21, bg = "lightblue", col = "black")
+fit2 <- lm(y ~ x)
+abline(fit2)
+```
+
+
+## Looking at some of the diagnostics
+```{r, echo = TRUE}
+round(dfbetas(fit2)[1 : 10, 2], 3)
+round(hatvalues(fit2)[1 : 10], 3)
+```
+
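+A short sketch (not part of the original slides) of letting `stats` flag influential observations in `fit2` via its own summary function:
+```{r}
+infl <- influence.measures(fit2)
+## observations flagged as influential by any of the standard criteria
+which(apply(infl$is.inf, 1, any))
+```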
+
+## Example described by Stefanski TAS 2007 Vol 61.
+```{r, fig.height=4, fig.width=4}
+## Don't everyone hit this server at once. Read the paper first.
+dat <- read.table('http://www4.stat.ncsu.edu/~stefanski/NSF_Supported/Hidden_Images/orly_owl_files/orly_owl_Lin_4p_5_flat.txt', header = FALSE)
+pairs(dat)
+```
+
+
+## Got our P-values, should we bother to do a residual plot?
+```{r}
+summary(lm(V1 ~ . -1, data = dat))$coef
+```
+
+
+## Residual plot
+### P-values significant, O RLY?
+```{r, fig.height=4, fig.width=4, echo = TRUE}
+fit <- lm(V1 ~ . - 1, data = dat); plot(predict(fit), resid(fit), pch = '.')
+```
+
+
+## Back to the Swiss data
+```{r, fig.height = 5, fig.width = 5, echo=FALSE}
+data(swiss); par(mfrow = c(2, 2))
+fit <- lm(Fertility ~ . , data = swiss); plot(fit)
+```
+
diff --git a/07_RegressionModels/02_04_residuals_variation_diagnostics/index.html b/07_RegressionModels/02_04_residuals_variation_diagnostics/index.html
index 8dcc1f168..3f73bcffd 100644
--- a/07_RegressionModels/02_04_residuals_variation_diagnostics/index.html
+++ b/07_RegressionModels/02_04_residuals_variation_diagnostics/index.html
diff --git a/07_RegressionModels/02_05_multipleVariables/index.Rmd b/07_RegressionModels/02_05_multipleVariables/index.Rmd
index 043b7dbf3..a95950d41 100644
--- a/07_RegressionModels/02_05_multipleVariables/index.Rmd
+++ b/07_RegressionModels/02_05_multipleVariables/index.Rmd
@@ -1,176 +1,178 @@
----
-title : Multiple variables
-subtitle : Regression
-author : Brian Caffo, Jeff Leek, Roger Peng
-job : Johns Hopkins Bloomberg School of Public Health
-logo : bloomberg_shield.png
-framework : io2012 # {io2012, html5slides, shower, dzslides, ...}
-highlighter : highlight.js # {highlight.js, prettify, highlight}
-hitheme : tomorrow #
-url:
- lib: ../../librariesNew
- assets: ../../assets
-widgets : [mathjax] # {mathjax, quiz, bootstrap}
-mode : selfcontained # {standalone, draft}
----
-```{r setup, cache = F, echo = F, message = F, warning = F, tidy = F, results='hide'}
-# make this an external chunk that can be included in any file
-options(width = 100)
-opts_chunk$set(message = F, error = F, warning = F, comment = NA, fig.align = 'center', dpi = 100, tidy = F, cache.path = '.cache/', fig.path = 'fig/')
-
-options(xtable.type = 'html')
-knit_hooks$set(inline = function(x) {
- if(is.numeric(x)) {
- round(x, getOption('digits'))
- } else {
- paste(as.character(x), collapse = ', ')
- }
-})
-knit_hooks$set(plot = knitr:::hook_plot_html)
-runif(1)
-```
-## Multivariable regression
-* We have an entire class on prediction and machine learning, so we'll focus on modeling.
- * Prediction has a different set of criteria, needs for interpretability and standards for generalizability.
- * In modeling, our interest lies in parsimonious, interpretable representations of the data that enhance our understanding of the phenomena under study.
- * A model is a lense through which to look at your data. (I attribute this quote to Scott Zeger)
- * Under this philosophy, what's the right model? Whatever model connects the data to a true, parsimonious statement about what you're studying.
-* There are nearly uncontable ways that a model can be wrong, in this lecture, we'll focus on variable inclusion and exclusion.
-* Like nearly all aspects of statistics, good modeling decisions are context dependent.
- * A good model for prediction versus one for studying mechanisms versus one for trying to establish causal effects may not be the same.
-
----
-## The Rumsfeldian triplet
-
-*There are known knowns. These are things we know that we know. There are known unknowns. That is to say, there are things that we know we don't know. But there are also unknown unknowns. There are things we don't know we don't know.* Donald Rumsfeld
-
-In our context
-* (Known knowns) Regressors that we know we should check to include in the model and have.
-* (Known Unknowns) Regressors that we would like to include in the model, but don't have.
-* (Unknown Unknowns) Regressors that we don't even know about that we should have included in the model.
-
----
-## General rules
-* Omitting variables results in bias in the coeficients of interest - unless their regressors are uncorrelated with the omitted ones.
- * This is why we randomize treatments, it attempts to uncorrelate our treatment indicator with variables that we don't have to put in the model.
- * (If there's too many unobserved confounding variables, even randomization won't help you.)
-* Including variables that we shouldn't have increases standard errors of the regression variables.
- * Actually, including any new variables increasese (actual, not estimated) standard errors of other regressors. So we don't want to idly throw variables into the model.
-* The model must tend toward perfect fit as the number of non-redundant regressors approaches $n$.
-* $R^2$ increases monotonically as more regressors are included.
-* The SSE decreases monotonically as more regressors are included.
-
----
-## Plot of $R^2$ versus $n$
-For simulations as the number of variables included equals increases to $n=100$.
-No actual regression relationship exist in any simulation
-```{r, fig.height=5, fig.width=5, echo=FALSE}
- n <- 100
-plot(c(1, n), 0 : 1, type = "n", frame = FALSE, xlab = "p", ylab = "R^2")
-r <- sapply(1 : n, function(p)
- {
- y <- rnorm(n); x <- matrix(rnorm(n * p), n, p)
- summary(lm(y ~ x))$r.squared
- }
- )
-lines(1 : n, r, lwd = 2)
-abline(h = 1)
-```
-
----
-## Variance inflation
-```{r, echo = TRUE}
-n <- 100; nosim <- 1000
-x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n);
-betas <- sapply(1 : nosim, function(i){
- y <- x1 + rnorm(n, sd = .3)
- c(coef(lm(y ~ x1))[2],
- coef(lm(y ~ x1 + x2))[2],
- coef(lm(y ~ x1 + x2 + x3))[2])
-})
-round(apply(betas, 1, sd), 5)
-```
-
----
-## Variance inflation
-```{r, echo = TRUE}
-n <- 100; nosim <- 1000
-x1 <- rnorm(n); x2 <- x1/sqrt(2) + rnorm(n) /sqrt(2)
-x3 <- x1 * 0.95 + rnorm(n) * sqrt(1 - 0.95^2);
-betas <- sapply(1 : nosim, function(i){
- y <- x1 + rnorm(n, sd = .3)
- c(coef(lm(y ~ x1))[2],
- coef(lm(y ~ x1 + x2))[2],
- coef(lm(y ~ x1 + x2 + x3))[2])
-})
-round(apply(betas, 1, sd), 5)
-```
-
-
----
-## Variance inflation factors
-* Notice variance inflation was much worse when we included a variable that
-was highly related to `x1`.
-* We don't know $\sigma$, so we can only estimate the increase in the actual standard error of the coefficients for including a regressor.
-* However, $\sigma$ drops out of the relative standard errors. If one sequentially adds variables, one can check the variance (or sd) inflation for including each one.
-* When the other regressors are actually orthogonal to the regressor of interest, then there is no variance inflation.
-* The variance inflation factor (VIF) is the increase in the variance for the ith regressor compared to the ideal setting where it is orthogonal to the other regressors.
- * (The square root of the VIF is the increase in the sd ...)
-* Remember, variance inflation is only part of the picture. We want to include certain variables, even if they dramatically inflate our variance.
-
----
-## Revisting our previous simulation
-```{r, echo = TRUE}
-##doesn't depend on which y you use,
-y <- x1 + rnorm(n, sd = .3)
-a <- summary(lm(y ~ x1))$cov.unscaled[2,2]
-c(summary(lm(y ~ x1 + x2))$cov.unscaled[2,2],
- summary(lm(y~ x1 + x2 + x3))$cov.unscaled[2,2]) / a
-temp <- apply(betas, 1, var); temp[2 : 3] / temp[1]
-```
----
-## Swiss data
-```{r}
-data(swiss);
-fit1 <- lm(Fertility ~ Agriculture, data = swiss)
-a <- summary(fit1)$cov.unscaled[2,2]
-fit2 <- update(fit, Fertility ~ Agriculture + Examination)
-fit3 <- update(fit, Fertility ~ Agriculture + Examination + Education)
- c(summary(fit2)$cov.unscaled[2,2],
- summary(fit3)$cov.unscaled[2,2]) / a
-```
-
----
-## Swiss data VIFs,
-```{r}
-library(car)
-fit <- lm(Fertility ~ . , data = swiss)
-vif(fit)
-sqrt(vif(fit)) #I prefer sd
-```
-
----
-## What about residual variance estimation?
-* Assuming that the model is linear with additive iid errors (with finite variance), we can mathematically describe the impact of omitting necessary variables or including unnecessary ones.
- * If we underfit the model, the variance estimate is biased.
- * If we correctly or overfit the model, including all necessary covariates and/or unnecessary covariates, the variance estimate is unbiased.
- * However, the variance of the variance is larger if we include unnecessary variables.
-
----
-## Covariate model selection
-* Automated covariate selection is a difficult topic. It depends heavily on how rich of a covariate space one wants to explore.
- * The space of models explodes quickly as you add interactions and polynomial terms.
-* In the prediction class, we'll cover many modern methods for traversing large model spaces for the purposes of prediction.
-* Principal components or factor analytic models on covariates are often useful for reducing complex covariate spaces.
-* Good design can often eliminate the need for complex model searches at analyses; though often control over the design is limited.
-* If the models of interest are nested and without lots of parameters differentiating them, it's fairly uncontroversial to use nested likelihood ratio tests. (Example to follow.)
-* My favoriate approach is as follows. Given a coefficient that I'm interested in, I like to use covariate adjustment and multiple models to probe that effect to evaluate it for robustness and to see what other covariates knock it out. This isn't a terribly systematic approach, but it tends to teach you a lot about the the data as you get your hands dirty.
-
----
-## How to do nested model testing in R
-```{r}
-fit1 <- lm(Fertility ~ Agriculture, data = swiss)
-fit3 <- update(fit, Fertility ~ Agriculture + Examination + Education)
-fit5 <- update(fit, Fertility ~ Agriculture + Examination + Education + Catholic + Infant.Mortality)
-anova(fit1, fit3, fit5)
-```
+---
+title : Multiple variables
+subtitle : Regression
+author : Brian Caffo, Jeff Leek, Roger Peng
+job : Johns Hopkins Bloomberg School of Public Health
+logo : bloomberg_shield.png
+framework : io2012 # {io2012, html5slides, shower, dzslides, ...}
+highlighter : highlight.js # {highlight.js, prettify, highlight}
+hitheme : tomorrow #
+url:
+ lib: ../../librariesNew
+ assets: ../../assets
+widgets : [mathjax] # {mathjax, quiz, bootstrap}
+mode : selfcontained # {standalone, draft}
+---
+```{r setup, cache = F, echo = F, message = F, warning = F, tidy = F, results='hide'}
+# Must explicitly include knitr or the chunks will fail, at least in Linux
+library(knitr)
+# make this an external chunk that can be included in any file
+options(width = 100)
+opts_chunk$set(message = F, error = F, warning = F, comment = NA, fig.align = 'center', dpi = 100, tidy = F, cache.path = '.cache/', fig.path = 'fig/')
+
+options(xtable.type = 'html')
+knit_hooks$set(inline = function(x) {
+ if(is.numeric(x)) {
+ round(x, getOption('digits'))
+ } else {
+ paste(as.character(x), collapse = ', ')
+ }
+})
+knit_hooks$set(plot = knitr:::hook_plot_html)
+runif(1)
+```
+## Multivariable regression
+* We have an entire class on prediction and machine learning, so we'll focus on modeling.
+ * Prediction has a different set of criteria, needs for interpretability and standards for generalizability.
+ * In modeling, our interest lies in parsimonious, interpretable representations of the data that enhance our understanding of the phenomena under study.
+ * A model is a lens through which to look at your data. (I attribute this quote to Scott Zeger)
+ * Under this philosophy, what's the right model? Whatever model connects the data to a true, parsimonious statement about what you're studying.
+* There are nearly uncountable ways that a model can be wrong; in this lecture, we'll focus on variable inclusion and exclusion.
+* Like nearly all aspects of statistics, good modeling decisions are context dependent.
+ * A good model for prediction versus one for studying mechanisms versus one for trying to establish causal effects may not be the same.
+
+
+## The Rumsfeldian triplet
+
+*There are known knowns. These are things we know that we know. There are known unknowns. That is to say, there are things that we know we don't know. But there are also unknown unknowns. There are things we don't know we don't know.* Donald Rumsfeld
+
+In our context
+* (Known knowns) Regressors that we know we should check to include in the model and have.
+* (Known Unknowns) Regressors that we would like to include in the model, but don't have.
+* (Unknown Unknowns) Regressors that we don't even know about that we should have included in the model.
+
+
+## General rules
+* Omitting variables results in bias in the coefficients of interest - unless the included regressors are uncorrelated with the omitted ones (a small simulated sketch follows this list).
+ * This is why we randomize treatments; it attempts to uncorrelate our treatment indicator with variables that we don't have to put in the model.
+ * (If there are too many unobserved confounding variables, even randomization won't help you.)
+* Including variables that we shouldn't have increases the standard errors of the other regression variables.
+ * Actually, including any new variables increases the (actual, not estimated) standard errors of other regressors. So we don't want to idly throw variables into the model.
+* The model must tend toward a perfect fit as the number of non-redundant regressors approaches $n$.
+* $R^2$ increases monotonically as more regressors are included.
+* The SSE decreases monotonically as more regressors are included.
+
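+A small simulated sketch (assumed coefficients, not part of the original slides) of the omission bias above: `x2` belongs in the model and is correlated with `x1`, so omitting it biases the `x1` coefficient.
+```{r}
+n <- 1000
+x1 <- rnorm(n); x2 <- x1 + rnorm(n)     # x2 is correlated with x1
+y <- x1 + 2 * x2 + rnorm(n)             # true coefficient on x1 is 1
+c(coef(lm(y ~ x1))[2],                  # omits x2; biased, absorbs part of x2's effect
+  coef(lm(y ~ x1 + x2))[2])             # adjusted; close to the true value of 1
+```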
+
+## Plot of $R^2$ versus $n$
+For simulations in which the number of variables included increases to $n=100$.
+No actual regression relationship exists in any simulation.
+```{r, fig.height=5, fig.width=5, echo=FALSE}
+ n <- 100
+plot(c(1, n), 0 : 1, type = "n", frame = FALSE, xlab = "p", ylab = "R^2")
+r <- sapply(1 : n, function(p)
+ {
+ y <- rnorm(n); x <- matrix(rnorm(n * p), n, p)
+ summary(lm(y ~ x))$r.squared
+ }
+ )
+lines(1 : n, r, lwd = 2)
+abline(h = 1)
+```
+
+
+## Variance inflation
+```{r, echo = TRUE}
+n <- 100; nosim <- 1000
+x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n);
+betas <- sapply(1 : nosim, function(i){
+ y <- x1 + rnorm(n, sd = .3)
+ c(coef(lm(y ~ x1))[2],
+ coef(lm(y ~ x1 + x2))[2],
+ coef(lm(y ~ x1 + x2 + x3))[2])
+})
+round(apply(betas, 1, sd), 5)
+```
+
+
+## Variance inflation
+```{r, echo = TRUE}
+n <- 100; nosim <- 1000
+x1 <- rnorm(n); x2 <- x1/sqrt(2) + rnorm(n) /sqrt(2)
+x3 <- x1 * 0.95 + rnorm(n) * sqrt(1 - 0.95^2);
+betas <- sapply(1 : nosim, function(i){
+ y <- x1 + rnorm(n, sd = .3)
+ c(coef(lm(y ~ x1))[2],
+ coef(lm(y ~ x1 + x2))[2],
+ coef(lm(y ~ x1 + x2 + x3))[2])
+})
+round(apply(betas, 1, sd), 5)
+```
+
+
+
+## Variance inflation factors
+* Notice variance inflation was much worse when we included a variable that
+was highly related to `x1`.
+* We don't know $\sigma$, so we can only estimate the increase in the actual standard error of the coefficients for including a regressor.
+* However, $\sigma$ drops out of the relative standard errors. If one sequentially adds variables, one can check the variance (or sd) inflation for including each one.
+* When the other regressors are actually orthogonal to the regressor of interest, then there is no variance inflation.
+* The variance inflation factor (VIF) is the increase in the variance of the $i^{th}$ regression coefficient compared to the ideal setting where that regressor is orthogonal to the other regressors. (A numerical check of this definition appears after the Swiss VIF example below.)
+ * (The square root of the VIF is the corresponding increase in the standard error.)
+* Remember, variance inflation is only part of the picture. We want to include certain variables, even if they dramatically inflate our variance.
+
+
+## Revisiting our previous simulation
+```{r, echo = TRUE}
+##doesn't depend on which y you use,
+y <- x1 + rnorm(n, sd = .3)
+a <- summary(lm(y ~ x1))$cov.unscaled[2,2]
+c(summary(lm(y ~ x1 + x2))$cov.unscaled[2,2],
+ summary(lm(y~ x1 + x2 + x3))$cov.unscaled[2,2]) / a
+temp <- apply(betas, 1, var); temp[2 : 3] / temp[1]
+```
+
+## Swiss data
+```{r}
+data(swiss);
+fit <- fit1 <- lm(Fertility ~ Agriculture, data = swiss)
+a <- summary(fit1)$cov.unscaled[2,2]
+fit2 <- update(fit, Fertility ~ Agriculture + Examination)
+fit3 <- update(fit, Fertility ~ Agriculture + Examination + Education)
+ c(summary(fit2)$cov.unscaled[2,2],
+ summary(fit3)$cov.unscaled[2,2]) / a
+```
+
+
+## Swiss data VIFs
+```{r}
+library(car)
+fit <- lm(Fertility ~ . , data = swiss)
+vif(fit)
+sqrt(vif(fit)) #I prefer sd
+```
+
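+A quick numerical check (a sketch, not part of the original slides) of the VIF definition: for regressor $j$, $VIF_j = 1 / (1 - R^2_j)$, where $R^2_j$ comes from regressing that regressor on all of the others.
+```{r}
+## by-hand VIF for Agriculture versus car's vif(); uses `fit` and library(car) from above
+vif(fit)["Agriculture"]
+1 / (1 - summary(lm(Agriculture ~ . - Fertility, data = swiss))$r.squared)
+```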
+
+## What about residual variance estimation?
+* Assuming that the model is linear with additive iid errors (with finite variance), we can mathematically describe the impact of omitting necessary variables or including unnecessary ones.
+ * If we underfit the model, the variance estimate is biased.
+ * If we correctly or overfit the model, including all necessary covariates and/or unnecessary covariates, the variance estimate is unbiased.
+ * However, the variance of the variance estimate is larger if we include unnecessary variables (a small simulated check follows below).
+
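+A small simulated check (assumed values, not part of the original slides): underfitting biases $\hat \sigma^2$ upward, while fitting correctly or overfitting leaves it roughly unbiased.
+```{r}
+n <- 100; nosim <- 1000
+x1 <- rnorm(n); x2 <- rnorm(n); junk <- rnorm(n)   # junk is an unnecessary regressor
+sigmas <- sapply(1 : nosim, function(i){
+  y <- x1 + x2 + rnorm(n)                          # true residual variance is 1
+  c(under = summary(lm(y ~ x1))$sigma^2,           # omits the needed x2
+    right = summary(lm(y ~ x1 + x2))$sigma^2,
+    over  = summary(lm(y ~ x1 + x2 + junk))$sigma^2)
+})
+round(apply(sigmas, 1, mean), 3)
+```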
+
+## Covariate model selection
+* Automated covariate selection is a difficult topic. It depends heavily on how rich a covariate space one wants to explore.
+ * The space of models explodes quickly as you add interactions and polynomial terms.
+* In the prediction class, we'll cover many modern methods for traversing large model spaces for the purposes of prediction.
+* Principal components or factor analytic models on covariates are often useful for reducing complex covariate spaces.
+* Good design can often eliminate the need for complex model searches at the analysis stage, though control over the design is often limited.
+* If the models of interest are nested and without lots of parameters differentiating them, it's fairly uncontroversial to use nested likelihood ratio tests. (Example to follow.)
+* My favorite approach is as follows. Given a coefficient that I'm interested in, I like to use covariate adjustment and multiple models to probe that effect to evaluate it for robustness and to see what other covariates knock it out. This isn't a terribly systematic approach, but it tends to teach you a lot about the data as you get your hands dirty (a small sketch of this kind of probing follows below).
+
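+A small sketch of this kind of probing (not part of the original slides), tracking the Agriculture coefficient in the `swiss` data as covariates are added:
+```{r}
+fits <- list(lm(Fertility ~ Agriculture, data = swiss),
+             lm(Fertility ~ Agriculture + Examination + Education, data = swiss),
+             lm(Fertility ~ . , data = swiss))
+## how the coefficient of interest moves across successively adjusted models
+round(sapply(fits, function(f) coef(f)["Agriculture"]), 3)
+```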
+
+## How to do nested model testing in R
+```{r}
+fit1 <- lm(Fertility ~ Agriculture, data = swiss)
+fit3 <- update(fit, Fertility ~ Agriculture + Examination + Education)
+fit5 <- update(fit, Fertility ~ Agriculture + Examination + Education + Catholic + Infant.Mortality)
+anova(fit1, fit3, fit5)
+```
diff --git a/07_RegressionModels/02_05_multipleVariables/index.html b/07_RegressionModels/02_05_multipleVariables/index.html
index f87a4d0e3..d410c0c99 100644
--- a/07_RegressionModels/02_05_multipleVariables/index.html
+++ b/07_RegressionModels/02_05_multipleVariables/index.html
diff --git a/07_RegressionModels/03_01_glms/index.Rmd b/07_RegressionModels/03_01_glms/index.Rmd
index ce2e4410c..74ef1f0db 100644
--- a/07_RegressionModels/03_01_glms/index.Rmd
+++ b/07_RegressionModels/03_01_glms/index.Rmd
@@ -1,140 +1,142 @@
----
-title : Generalized linear models
-subtitle : Regression Models
-author : Brian Caffo, Jeff Leek, Roger Peng
-job : Johns Hopkins Bloomberg School of Public Health
-logo : bloomberg_shield.png
-framework : io2012 # {io2012, html5slides, shower, dzslides, ...}
-highlighter : highlight.js # {highlight.js, prettify, highlight}
-hitheme : tomorrow #
-url:
- lib: ../../librariesNew
- assets: ../../assets
-widgets : [mathjax] # {mathjax, quiz, bootstrap}
-mode : selfcontained # {standalone, draft}
----
-```{r setup, cache = F, echo = F, message = F, warning = F, tidy = F, results='hide'}
-# make this an external chunk that can be included in any file
-options(width = 100)
-opts_chunk$set(message = F, error = F, warning = F, comment = NA, fig.align = 'center', dpi = 100, tidy = F, cache.path = '.cache/', fig.path = 'fig/')
-
-options(xtable.type = 'html')
-knit_hooks$set(inline = function(x) {
- if(is.numeric(x)) {
- round(x, getOption('digits'))
- } else {
- paste(as.character(x), collapse = ', ')
- }
-})
-knit_hooks$set(plot = knitr:::hook_plot_html)
-runif(1)
-```
-## Linear models
-* Linear models are the most useful applied statistical technique. However, they are not without their limitations.
- * Additive response models don't make much sense if the response is discrete, or stricly positive.
- * Additive error models often don't make sense, for example if the outcome has to be positive.
- * Transformations are often hard to interpret.
- * There's value in modeling the data on the scale that it was collected.
- * Particularly interpetable transformations, natural logarithms in specific, aren't applicable for negative or zero values.
-
-
-
----
-## Generalized linear models
-* Introduced in a 1972 RSSB paper by Nelder and Wedderburn.
-* Involves three components
- * An *exponential family* model for the response.
- * A systematic component via a linear predictor.
- * A link function that connects the means of the response to the linear predictor.
-
-
----
-## Example, linear models
-* Assume that $Y_i \sim N(\mu_i, \sigma^2)$ (the Gaussian distribution is an exponential family distribution.)
-* Define the linear predictor to be $\eta_i = \sum_{k=1}^p X_{ik} \beta_k$.
-* The link function as $g$ so that $g(\mu) = \eta$.
- * For linear models $g(\mu) = \mu$ so that $\mu_i = \eta_i$
-* This yields the same likelihood model as our additive error Gaussian linear model
-$$
-Y_i = \sum_{k=1}^p X_{ik} \beta_k + \epsilon_{i}
-$$
-where $\epsilon_i \stackrel{iid}{\sim} N(0, \sigma^2)$
-
----
-## Example, logistic regression
-* Assume that $Y_i \sim Bernoulli(\mu_i)$ so that $E[Y_i] = \mu_i$ where $0\leq \mu_i \leq 1$.
-* Linear predictor $\eta_i = \sum_{k=1}^p X_{ik} \beta_k$
-* Link function
-$g(\mu) = \eta = \log\left( \frac{\mu}{1 - \mu}\right)$
-$g$ is the (natural) log odds, referred to as the **logit**.
-* Note then we can invert the logit function as
-$$
-\mu_i = \frac{\exp(\eta_i)}{1 + \exp(\eta_i)} ~~~\mbox{and}~~~
-1 - \mu_i = \frac{1}{1 + \exp(\eta_i)}
-$$
-Thus the likelihood is
-$$
-\prod_{i=1}^n \mu_i^{y_i} (1 - \mu_i)^{1-y_i}
-= \exp\left(\sum_{i=1}^n y_i \eta_i \right)
-\prod_{i=1}^n (1 + \exp(\eta_i))^{-1}
-$$
-
----
-## Example, Poisson regression
-* Assume that $Y_i \sim Poisson(\mu_i)$ so that $E[Y_i] = \mu_i$ where $0\leq \mu_i$
-* Linear predictor $\eta_i = \sum_{k=1}^p X_{ik} \beta_k$
-* Link function
-$g(\mu) = \eta = \log(\mu)$
-* Recall that $e^x$ is the inverse of $\log(x)$ so that
-$$
-\mu_i = e^{\eta_i}
-$$
-Thus, the likelihood is
-$$
-\prod_{i=1}^n (y_i !)^{-1} \mu_i^{y_i}e^{-\mu_i}
-\propto \exp\left(\sum_{i=1}^n y_i \eta_i - \sum_{i=1}^n \mu_i\right)
-$$
-
----
-## Some things to note
-* In each case, the only way in which the likelihood depends on the data is through
-$$\sum_{i=1}^n y_i \eta_i =
-\sum_{i=1}^n y_i\sum_{k=1}^p X_{ik} \beta_k =
-\sum_{k=1}^p \beta_k\sum_{i=1}^n X_{ik} y_i
-$$
-Thus if we don't need the full data, only $\sum_{i=1}^n X_{ik} y_i$. This simplification is a consequence of chosing so-called 'canonical' link functions.
-* (This has to be derived). All models acheive their maximum at the root of the so called normal equations
-$$
-0=\sum_{i=1}^n \frac{(Y_i - \mu_i)}{Var(Y_i)}W_i
-$$
-where $W_i$ are the derivative of the inverse of the link function.
-
----
-## About variances
-$$
-0=\sum_{i=1}^n \frac{(Y_i - \mu_i)}{Var(Y_i)}W_i
-$$
-* For the linear model $Var(Y_i) = \sigma^2$ is constant.
-* For Bernoulli case $Var(Y_i) = \mu_i (1 - \mu_i)$
-* For the Poisson case $Var(Y_i) = \mu_i$.
-* In the latter cases, it is often relevant to have a more flexible variance model, even if it doesn't correspond to an actual likelihood
-$$
-0=\sum_{i=1}^n \frac{(Y_i - \mu_i)}{\phi \mu_i (1 - \mu_i ) } W_i ~~~\mbox{and}~~~
-0=\sum_{i=1}^n \frac{(Y_i - \mu_i)}{\phi \mu_i} W_i
-$$
-* These are called 'quasi-likelihood' normal equations
-
----
-## Odds and ends
-* The normal equations have to be solved iteratively. Resulting in
-$\hat \beta_k$ and, if included, $\hat \phi$.
-* Predicted linear predictor responses can be obtained as $\hat \eta = \sum_{k=1}^p X_k \hat \beta_k$
-* Predicted mean responses as $\hat \mu = g^{-1}(\hat \eta)$
-* Coefficients are interpretted as
-$$
-g(E[Y | X_k = x_k + 1, X_{\sim k} = x_{\sim k}]) - g(E[Y | X_k = x_k, X_{\sim k}=x_{\sim k}]) = \beta_k
-$$
-or the change in the link function of the expected response per unit change in $X_k$ holding other regressors constant.
-* Variations on Newon/Raphson's algorithm are used to do it.
-* Asymptotics are used for inference usually.
-* Many of the ideas from linear models can be brought over to GLMs.
+---
+title : Generalized linear models
+subtitle : Regression Models
+author : Brian Caffo, Jeff Leek, Roger Peng
+job : Johns Hopkins Bloomberg School of Public Health
+logo : bloomberg_shield.png
+framework : io2012 # {io2012, html5slides, shower, dzslides, ...}
+highlighter : highlight.js # {highlight.js, prettify, highlight}
+hitheme : tomorrow #
+url:
+ lib: ../../librariesNew
+ assets: ../../assets
+widgets : [mathjax] # {mathjax, quiz, bootstrap}
+mode : selfcontained # {standalone, draft}
+---
+```{r setup, cache = F, echo = F, message = F, warning = F, tidy = F, results='hide'}
+# Must explicitly include knitr or the chunks will fail, at least in Linux
+library(knitr)
+# make this an external chunk that can be included in any file
+options(width = 100)
+opts_chunk$set(message = F, error = F, warning = F, comment = NA, fig.align = 'center', dpi = 100, tidy = F, cache.path = '.cache/', fig.path = 'fig/')
+
+options(xtable.type = 'html')
+knit_hooks$set(inline = function(x) {
+ if(is.numeric(x)) {
+ round(x, getOption('digits'))
+ } else {
+ paste(as.character(x), collapse = ', ')
+ }
+})
+knit_hooks$set(plot = knitr:::hook_plot_html)
+runif(1)
+```
+## Linear models
+* Linear models are the most useful applied statistical technique. However, they are not without their limitations.
+  * Additive response models don't make much sense if the response is discrete or strictly positive.
+  * Additive error models often don't make sense, for example if the outcome has to be positive.
+  * Transformations are often hard to interpret.
+  * There's value in modeling the data on the scale at which it was collected.
+  * The most interpretable transformations, such as the natural logarithm, aren't applicable for negative or zero values.
+
+
+
+
+## Generalized linear models
+* Introduced in a 1972 RSSB paper by Nelder and Wedderburn.
+* Involves three components
+ * An *exponential family* model for the response.
+ * A systematic component via a linear predictor.
+ * A link function that connects the means of the response to the linear predictor.
+
+
+
+## Example, linear models
+* Assume that $Y_i \sim N(\mu_i, \sigma^2)$ (the Gaussian distribution is an exponential family distribution).
+* Define the linear predictor to be $\eta_i = \sum_{k=1}^p X_{ik} \beta_k$.
+* Define the link function $g$ so that $g(\mu) = \eta$.
+ * For linear models $g(\mu) = \mu$ so that $\mu_i = \eta_i$
+* This yields the same likelihood model as our additive error Gaussian linear model
+$$
+Y_i = \sum_{k=1}^p X_{ik} \beta_k + \epsilon_{i}
+$$
+where $\epsilon_i \stackrel{iid}{\sim} N(0, \sigma^2)$
+
+
+## Example, logistic regression
+* Assume that $Y_i \sim Bernoulli(\mu_i)$ so that $E[Y_i] = \mu_i$ where $0\leq \mu_i \leq 1$.
+* Linear predictor $\eta_i = \sum_{k=1}^p X_{ik} \beta_k$
+* Link function
+$g(\mu) = \eta = \log\left( \frac{\mu}{1 - \mu}\right)$
+$g$ is the (natural) log odds, referred to as the **logit**.
+* Note then we can invert the logit function as
+$$
+\mu_i = \frac{\exp(\eta_i)}{1 + \exp(\eta_i)} ~~~\mbox{and}~~~
+1 - \mu_i = \frac{1}{1 + \exp(\eta_i)}
+$$
+Thus the likelihood is
+$$
+\prod_{i=1}^n \mu_i^{y_i} (1 - \mu_i)^{1-y_i}
+= \exp\left(\sum_{i=1}^n y_i \eta_i \right)
+\prod_{i=1}^n (1 + \exp(\eta_i))^{-1}
+$$
+
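+* A quick numerical sketch (not from the original slides; `eta` is just an illustrative value) using R's built-in `plogis`/`qlogis`, which implement the inverse logit and the logit:
+
+```{r logitSketch}
+eta <- 1.5
+mu <- exp(eta) / (1 + exp(eta))   # inverse logit computed by hand
+c(mu, plogis(eta))                # plogis() is the built-in inverse logit; same value
+qlogis(mu)                        # qlogis() is the logit; recovers eta
+```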
+
+## Example, Poisson regression
+* Assume that $Y_i \sim Poisson(\mu_i)$ so that $E[Y_i] = \mu_i$ where $0\leq \mu_i$
+* Linear predictor $\eta_i = \sum_{k=1}^p X_{ik} \beta_k$
+* Link function
+$g(\mu) = \eta = \log(\mu)$
+* Recall that $e^x$ is the inverse of $\log(x)$ so that
+$$
+\mu_i = e^{\eta_i}
+$$
+Thus, the likelihood is
+$$
+\prod_{i=1}^n (y_i !)^{-1} \mu_i^{y_i}e^{-\mu_i}
+\propto \exp\left(\sum_{i=1}^n y_i \eta_i - \sum_{i=1}^n \mu_i\right)
+$$
+
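+* For concreteness, a minimal simulated-data sketch (not lecture code) fitting this model with `glm` and the log link:
+
+```{r poisSketch}
+# simulate counts whose log mean is linear in x, then fit a Poisson GLM
+set.seed(1)
+x <- rnorm(500)
+y <- rpois(500, lambda = exp(0.5 + 0.3 * x))
+fit <- glm(y ~ x, family = poisson)
+coef(fit)        # roughly (0.5, 0.3), the values used in the simulation
+exp(coef(fit))   # multiplicative effects on the mean
+```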
+
+## Some things to note
+* In each case, the only way in which the likelihood depends on the data is through
+$$\sum_{i=1}^n y_i \eta_i =
+\sum_{i=1}^n y_i\sum_{k=1}^p X_{ik} \beta_k =
+\sum_{k=1}^p \beta_k\sum_{i=1}^n X_{ik} y_i
+$$
+Thus we don't need the full data, only $\sum_{i=1}^n X_{ik} y_i$. This simplification is a consequence of choosing so-called 'canonical' link functions.
+* (This has to be derived.) All of these models achieve their maximum at the root of the so-called normal equations
+$$
+0=\sum_{i=1}^n \frac{(Y_i - \mu_i)}{Var(Y_i)}W_i
+$$
+where $W_i$ is the derivative of the inverse of the link function (a quick numerical check follows below).
+
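+* A small simulated check (purely illustrative): with the canonical logit link the normal equations reduce to $\sum_{i=1}^n (Y_i - \hat \mu_i) X_{ik} = 0$, and the fitted model satisfies this:
+
+```{r scoreSketch}
+# with a canonical link the score equations reduce to X'(y - mu) = 0
+set.seed(2)
+x <- rnorm(200)
+y <- rbinom(200, 1, plogis(-0.5 + x))
+fit <- glm(y ~ x, family = binomial)
+X <- model.matrix(fit)
+colSums(X * (y - fitted(fit)))   # both sums are numerically zero
+```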
+
+## About variances
+$$
+0=\sum_{i=1}^n \frac{(Y_i - \mu_i)}{Var(Y_i)}W_i
+$$
+* For the linear model $Var(Y_i) = \sigma^2$ is constant.
+* For the Bernoulli case $Var(Y_i) = \mu_i (1 - \mu_i)$
+* For the Poisson case $Var(Y_i) = \mu_i$.
+* In the latter cases, it is often relevant to have a more flexible variance model, even if it doesn't correspond to an actual likelihood
+$$
+0=\sum_{i=1}^n \frac{(Y_i - \mu_i)}{\phi \mu_i (1 - \mu_i ) } W_i ~~~\mbox{and}~~~
+0=\sum_{i=1}^n \frac{(Y_i - \mu_i)}{\phi \mu_i} W_i
+$$
+* These are called 'quasi-likelihood' normal equations; $\phi$ is an extra dispersion parameter to be estimated (see the sketch below).
+
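+* In R this flexibility is available through the `quasibinomial` and `quasipoisson` families; a minimal sketch with simulated overdispersed counts (an assumption made for illustration):
+
+```{r quasiSketch}
+# overdispersed counts: negative binomial draws with mean exp(1 + 0.2 x)
+set.seed(3)
+x <- rnorm(300)
+y <- rnbinom(300, mu = exp(1 + 0.2 * x), size = 2)
+fit <- glm(y ~ x, family = quasipoisson)
+summary(fit)$dispersion   # estimate of phi, well above 1 for these data
+```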
+
+## Odds and ends
+* The normal equations have to be solved iteratively, resulting in
+$\hat \beta_k$ and, if included, $\hat \phi$.
+* Predicted linear predictor responses can be obtained as $\hat \eta = \sum_{k=1}^p X_k \hat \beta_k$
+* Predicted mean responses as $\hat \mu = g^{-1}(\hat \eta)$
+* Coefficients are interpreted as
+$$
+g(E[Y | X_k = x_k + 1, X_{\sim k} = x_{\sim k}]) - g(E[Y | X_k = x_k, X_{\sim k}=x_{\sim k}]) = \beta_k
+$$
+or the change in the link function of the expected response per unit change in $X_k$ holding other regressors constant.
+* Variations on the Newton-Raphson algorithm are used to solve them.
+* Inference usually relies on asymptotics.
+* Many of the ideas from linear models can be brought over to GLMs.
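+* As a usage sketch (simulated data, not lecture code), `predict` returns $\hat \eta$ on the link scale and $\hat \mu = g^{-1}(\hat \eta)$ on the response scale:
+
+```{r predictSketch}
+set.seed(4)
+x <- rnorm(100)
+y <- rbinom(100, 1, plogis(0.2 + 0.8 * x))
+fit <- glm(y ~ x, family = binomial)
+eta_hat <- predict(fit, type = "link")       # the fitted linear predictor
+mu_hat  <- predict(fit, type = "response")   # the fitted mean, g^{-1}(eta_hat)
+all.equal(as.numeric(plogis(eta_hat)), as.numeric(mu_hat))
+```
+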
diff --git a/07_RegressionModels/03_01_glms/index.html b/07_RegressionModels/03_01_glms/index.html
index 2bdb47aca..947240219 100644
--- a/07_RegressionModels/03_01_glms/index.html
+++ b/07_RegressionModels/03_01_glms/index.html
@@ -1,317 +1,251 @@
-
-
-
- Generalized linear models
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
Generalized linear models
-
Regression Models
-
Brian Caffo, Jeff Leek, Roger Peng Johns Hopkins Bloomberg School of Public Health
-
-
-
-
-
-
-
-
-
Linear models
-
-
-
-
Linear models are the most useful applied statistical technique. However, they are not without their limitations.
-
-
-
Additive response models don't make much sense if the response is discrete, or stricly positive.
-
Additive error models often don't make sense, for example if the outcome has to be positive.
-
Transformations are often hard to interpret.
-
There's value in modeling the data on the scale that it was collected.
-
Particularly interpetable transformations, natural logarithms in specific, aren't applicable for negative or zero values.
-
-
-
-
-
-
-
-
-
-
Generalized linear models
-
-
-
-
Introduced in a 1972 RSSB paper by Nelder and Wedderburn.
-
Involves three components
-
-
-
An exponential family model for the response.
-
A systematic component via a linear predictor.
-
A link function that connects the means of the response to the linear predictor.
-
-
-
-
-
-
-
-
-
-
Example, linear models
-
-
-
-
Assume that \(Y_i \sim N(\mu_i, \sigma^2)\) (the Gaussian distribution is an exponential family distribution.)
-
Define the linear predictor to be \(\eta_i = \sum_{k=1}^p X_{ik} \beta_k\).
-
The link function as \(g\) so that \(g(\mu) = \eta\).
-
-
-
For linear models \(g(\mu) = \mu\) so that \(\mu_i = \eta_i\)
-
-
This yields the same likelihood model as our additive error Gaussian linear model
-\[
-Y_i = \sum_{k=1}^p X_{ik} \beta_k + \epsilon_{i}
-\]
-where \(\epsilon_i \stackrel{iid}{\sim} N(0, \sigma^2)\)
-
-
-
-
-
-
-
-
-
Example, logistic regression
-
-
-
-
Assume that \(Y_i \sim Bernoulli(\mu_i)\) so that \(E[Y_i] = \mu_i\) where \(0\leq \mu_i \leq 1\).
-
Linear predictor \(\eta_i = \sum_{k=1}^p X_{ik} \beta_k\)
-
Link function
-\(g(\mu) = \eta = \log\left( \frac{\mu}{1 - \mu}\right)\)
-\(g\) is the (natural) log odds, referred to as the logit.
-
Note then we can invert the logit function as
-\[
-\mu_i = \frac{\exp(\eta_i)}{1 + \exp(\eta_i)} ~~~\mbox{and}~~~
-1 - \mu_i = \frac{1}{1 + \exp(\eta_i)}
-\]
-Thus the likelihood is
-\[
-\prod_{i=1}^n \mu_i^{y_i} (1 - \mu_i)^{1-y_i}
-= \exp\left(\sum_{i=1}^n y_i \eta_i \right)
-\prod_{i=1}^n (1 + \eta_i)^{-1}
-\]
-
-
-
-
-
-
-
-
-
Example, Poisson regression
-
-
-
-
Assume that \(Y_i \sim Poisson(\mu_i)\) so that \(E[Y_i] = \mu_i\) where \(0\leq \mu_i\)
-
Linear predictor \(\eta_i = \sum_{k=1}^p X_{ik} \beta_k\)
-
Link function
-\(g(\mu) = \eta = \log(\mu)\)
-
Recall that \(e^x\) is the inverse of \(\log(x)\) so that
-\[
-\mu_i = e^{\eta_i}
-\]
-Thus, the likelihood is
-\[
-\prod_{i=1}^n (y_i !)^{-1} \mu_i^{y_i}e^{-\mu_i}
-\propto \exp\left(\sum_{i=1}^n y_i \eta_i - \sum_{i=1}^n \mu_i\right)
-\]
-
-
-
-
-
-
-
-
-
Some things to note
-
-
-
-
In each case, the only way in which the likelihood depends on the data is through
-\[\sum_{i=1}^n y_i \eta_i =
-\sum_{i=1}^n y_i\sum_{k=1}^p X_{ik} \beta_k =
-\sum_{k=1}^p \beta_k\sum_{i=1}^n X_{ik} y_i
-\]
-Thus if we don't need the full data, only \(\sum_{i=1}^n X_{ik} y_i\). This simplification is a consequence of chosing so-called 'canonical' link functions.
-
(This has to be derived). All models acheive their maximum at the root of the so called normal equations
-\[
-0=\sum_{i=1}^n \frac{(Y_i - \mu_i)}{Var(Y_i)}W_i
-\]
-where \(W_i\) are the derivative of the inverse of the link function.
For the linear model \(Var(Y_i) = \sigma^2\) is constant.
-
For Bernoulli case \(Var(Y_i) = \mu_i (1 - \mu_i)\)
-
For the Poisson case \(Var(Y_i) = \mu_i\).
-
In the latter cases, it is often relevant to have a more flexible variance model, even if it doesn't correspond to an actual likelihood
-\[
-0=\sum_{i=1}^n \frac{(Y_i - \mu_i)}{\phi \mu_i (1 - \mu_i ) } W_i ~~~\mbox{and}~~~
-0=\sum_{i=1}^n \frac{(Y_i - \mu_i)}{\phi \mu_i} W_i
-\]
-
These are called 'quasi-likelihood' normal equations
-
-
-
-
-
-
-
-
-
Odds and ends
-
-
-
-
The normal equations have to be solved iteratively. Resulting in
-\(\hat \beta_k\) and, if included, \(\hat \phi\).
-
Predicted linear predictor responses can be obtained as \(\hat \eta = \sum_{k=1}^p X_k \hat \beta_k\)
-
Predicted mean responses as \(\hat \mu = g^{-1}(\hat \eta)\)
-
Coefficients are interpretted as
-\[
-g(E[Y | X_k = x_k + 1, X_{\sim k} = x_{\sim k}]) - g(E[Y | X_k = x_k, X_{\sim k}=x_{\sim k}]) = \beta_k
-\]
-or the change in the link function of the expected response per unit change in \(X_k\) holding other regressors constant.
-
Variations on Newon/Raphson's algorithm are used to do it.
-
Asymptotics are used for inference usually.
-
Many of the ideas from linear models can be brought over to GLMs.
Linear models are the most useful applied statistical technique. However, they are not without their limitations.
+
Additive response models don’t make much sense if the response is discrete, or stricly positive.
+
Additive error models often don’t make sense, for example if the outcome has to be positive.
+
Transformations are often hard to interpret.
+
+
There’s value in modeling the data on the scale that it was collected.
+
Particularly interpetable transformations, natural logarithms in specific, aren’t applicable for negative or zero values.
+
+
+
+
+
Generalized linear models
+
+
Introduced in a 1972 RSSB paper by Nelder and Wedderburn.
+
Involves three components
+
An exponential family model for the response.
+
A systematic component via a linear predictor.
+
A link function that connects the means of the response to the linear predictor.
+
+
+
+
Example, linear models
+
+
Assume that \(Y_i \sim N(\mu_i, \sigma^2)\) (the Gaussian distribution is an exponential family distribution.)
+
Define the linear predictor to be \(\eta_i = \sum_{k=1}^p X_{ik} \beta_k\).
+
The link function as \(g\) so that \(g(\mu) = \eta\).
+
For linear models \(g(\mu) = \mu\) so that \(\mu_i = \eta_i\)
+
This yields the same likelihood model as our additive error Gaussian linear model \[
+Y_i = \sum_{k=1}^p X_{ik} \beta_k + \epsilon_{i}
+\] where \(\epsilon_i \stackrel{iid}{\sim} N(0, \sigma^2)\)
+
+
+
+
Example, logistic regression
+
+
Assume that \(Y_i \sim Bernoulli(\mu_i)\) so that \(E[Y_i] = \mu_i\) where \(0\leq \mu_i \leq 1\).
+
Linear predictor \(\eta_i = \sum_{k=1}^p X_{ik} \beta_k\)
+
+Link function \(g(\mu) = \eta = \log\left( \frac{\mu}{1 - \mu}\right)\). \(g\) is the (natural) log odds, referred to as the logit.
+
Note then we can invert the logit function as \[
+\mu_i = \frac{\exp(\eta_i)}{1 + \exp(\eta_i)} ~~~\mbox{and}~~~
+1 - \mu_i = \frac{1}{1 + \exp(\eta_i)}
+\] Thus the likelihood is \[
+\prod_{i=1}^n \mu_i^{y_i} (1 - \mu_i)^{1-y_i}
+= \exp\left(\sum_{i=1}^n y_i \eta_i \right)
+\prod_{i=1}^n (1 + \exp(\eta_i))^{-1}
+\]
+
+
+
+
Example, Poisson regression
+
+
Assume that \(Y_i \sim Poisson(\mu_i)\) so that \(E[Y_i] = \mu_i\) where \(0\leq \mu_i\)
+
Linear predictor \(\eta_i = \sum_{k=1}^p X_{ik} \beta_k\)
+
Link function \(g(\mu) = \eta = \log(\mu)\)
+
Recall that \(e^x\) is the inverse of \(\log(x)\) so that \[
+\mu_i = e^{\eta_i}
+\] Thus, the likelihood is \[
+\prod_{i=1}^n (y_i !)^{-1} \mu_i^{y_i}e^{-\mu_i}
+\propto \exp\left(\sum_{i=1}^n y_i \eta_i - \sum_{i=1}^n \mu_i\right)
+\]
+
+
+
+
Some things to note
+
+
In each case, the only way in which the likelihood depends on the data is through \[\sum_{i=1}^n y_i \eta_i =
+\sum_{i=1}^n y_i\sum_{k=1}^p X_{ik} \beta_k =
+\sum_{k=1}^p \beta_k\sum_{i=1}^n X_{ik} y_i
+\] Thus if we don’t need the full data, only \(\sum_{i=1}^n X_{ik} y_i\). This simplification is a consequence of chosing so-called ‘canonical’ link functions.
+
(This has to be derived). All models acheive their maximum at the root of the so called normal equations \[
+0=\sum_{i=1}^n \frac{(Y_i - \mu_i)}{Var(Y_i)}W_i
+\] where \(W_i\) are the derivative of the inverse of the link function.
+
+
+
+
About variances
+
\[
+0=\sum_{i=1}^n \frac{(Y_i - \mu_i)}{Var(Y_i)}W_i
+\]
+For the linear model \(Var(Y_i) = \sigma^2\) is constant.
+For the Bernoulli case \(Var(Y_i) = \mu_i (1 - \mu_i)\)
+For the Poisson case \(Var(Y_i) = \mu_i\).
+In the latter cases, it is often relevant to have a more flexible variance model, even if it doesn’t correspond to an actual likelihood \[
+0=\sum_{i=1}^n \frac{(Y_i - \mu_i)}{\phi \mu_i (1 - \mu_i ) } W_i ~~~\mbox{and}~~~
+0=\sum_{i=1}^n \frac{(Y_i - \mu_i)}{\phi \mu_i} W_i
+\]
+These are called ‘quasi-likelihood’ normal equations
+
+
+
Odds and ends
+
+
The normal equations have to be solved iteratively. Resulting in \(\hat \beta_k\) and, if included, \(\hat \phi\).
+
Predicted linear predictor responses can be obtained as \(\hat \eta = \sum_{k=1}^p X_k \hat \beta_k\)
+
Predicted mean responses as \(\hat \mu = g^{-1}(\hat \eta)\)
+
Coefficients are interpretted as \[
+g(E[Y | X_k = x_k + 1, X_{\sim k} = x_{\sim k}]) - g(E[Y | X_k = x_k, X_{\sim k}=x_{\sim k}]) = \beta_k
+\] or the change in the link function of the expected response per unit change in \(X_k\) holding other regressors constant.
+
Variations on Newon/Raphson’s algorithm are used to do it.
+
Asymptotics are used for inference usually.
+
Many of the ideas from linear models can be brought over to GLMs.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
diff --git a/07_RegressionModels/03_02_binaryOutcomes/index.Rmd b/07_RegressionModels/03_02_binaryOutcomes/index.Rmd
index 6945c3e54..2df88e0b1 100644
--- a/07_RegressionModels/03_02_binaryOutcomes/index.Rmd
+++ b/07_RegressionModels/03_02_binaryOutcomes/index.Rmd
@@ -1,220 +1,222 @@
----
-title : Generalized linear models, binary data
-subtitle : Regression models
-author : Brian Caffo, Jeff Leek and Roger Peng
-job : Johns Hopkins Bloomberg School of Public Health
-logo : bloomberg_shield.png
-framework : io2012 # {io2012, html5slides, shower, dzslides, ...}
-highlighter : highlight.js # {highlight.js, prettify, highlight}
-hitheme : tomorrow #
-url:
- lib: ../../librariesNew
- assets: ../../assets
-widgets : [mathjax] # {mathjax, quiz, bootstrap}
-mode : selfcontained # {standalone, draft}
----
-```{r setup, cache = F, echo = F, message = F, warning = F, tidy = F}
-# make this an external chunk that can be included in any file
-options(width = 100)
-opts_chunk$set(message = F, error = F, warning = F, comment = NA, fig.align = 'center', dpi = 100, tidy = F, cache.path = '.cache/', fig.path = 'fig/')
-
-options(xtable.type = 'html')
-knit_hooks$set(inline = function(x) {
- if(is.numeric(x)) {
- round(x, getOption('digits'))
- } else {
- paste(as.character(x), collapse = ', ')
- }
-})
-knit_hooks$set(plot = knitr:::hook_plot_html)
-```
-
-## Key ideas
-
-* Frequently we care about outcomes that have two values
- * Alive/dead
- * Win/loss
- * Success/Failure
- * etc
-* Called binary, Bernoulli or 0/1 outcomes
-* Collection of exchangeable binary outcomes for the same covariate data are called binomial outcomes.
-
----
-
-## Example Baltimore Ravens win/loss
-### Ravens Data
-
-```{r loadRavens,cache=TRUE}
-download.file("https://dl.dropboxusercontent.com/u/7710864/data/ravensData.rda"
- , destfile="./data/ravensData.rda",method="curl")
-load("./data/ravensData.rda")
-head(ravensData)
-```
-
----
-
-## Linear regression
-
-$$ RW_i = b_0 + b_1 RS_i + e_i $$
-
-$RW_i$ - 1 if a Ravens win, 0 if not
-
-$RS_i$ - Number of points Ravens scored
-
-$b_0$ - probability of a Ravens win if they score 0 points
-
-$b_1$ - increase in probability of a Ravens win for each additional point
-
-$e_i$ - residual variation due
-
----
-
-## Linear regression in R
-
-```{r linearReg, dependson = "loadRavens", cache=TRUE}
-lmRavens <- lm(ravensData$ravenWinNum ~ ravensData$ravenScore)
-summary(lmRavens)$coef
-```
----
-
-## Odds
-
-__Binary Outcome 0/1__
-
-$$RW_i$$
-
-__Probability (0,1)__
-
-$$\rm{Pr}(RW_i | RS_i, b_0, b_1 )$$
-
-
-__Odds $(0,\infty)$__
-$$\frac{\rm{Pr}(RW_i | RS_i, b_0, b_1 )}{1-\rm{Pr}(RW_i | RS_i, b_0, b_1)}$$
-
-__Log odds $(-\infty,\infty)$__
-
-$$\log\left(\frac{\rm{Pr}(RW_i | RS_i, b_0, b_1 )}{1-\rm{Pr}(RW_i | RS_i, b_0, b_1)}\right)$$
-
-
----
-
-## Linear vs. logistic regression
-
-__Linear__
-
-$$ RW_i = b_0 + b_1 RS_i + e_i $$
-
-or
-
-$$ E[RW_i | RS_i, b_0, b_1] = b_0 + b_1 RS_i$$
-
-__Logistic__
-
-$$ \rm{Pr}(RW_i | RS_i, b_0, b_1) = \frac{\exp(b_0 + b_1 RS_i)}{1 + \exp(b_0 + b_1 RS_i)}$$
-
-or
-
-$$ \log\left(\frac{\rm{Pr}(RW_i | RS_i, b_0, b_1 )}{1-\rm{Pr}(RW_i | RS_i, b_0, b_1)}\right) = b_0 + b_1 RS_i $$
-
----
-
-## Interpreting Logistic Regression
-
-$$ \log\left(\frac{\rm{Pr}(RW_i | RS_i, b_0, b_1 )}{1-\rm{Pr}(RW_i | RS_i, b_0, b_1)}\right) = b_0 + b_1 RS_i $$
-
-
-$b_0$ - Log odds of a Ravens win if they score zero points
-
-$b_1$ - Log odds ratio of win probability for each point scored (compared to zero points)
-
-$\exp(b_1)$ - Odds ratio of win probability for each point scored (compared to zero points)
-
----
-## Odds
-- Imagine that you are playing a game where you flip a coin with success probability $p$.
-- If it comes up heads, you win $X$. If it comes up tails, you lose $Y$.
-- What should we set $X$ and $Y$ for the game to be fair?
-
- $$E[earnings]= X p - Y (1 - p) = 0$$
-- Implies
- $$\frac{Y}{X} = \frac{p}{1 - p}$$
-- The odds can be said as "How much should you be willing to pay for a $p$ probability of winning a dollar?"
- - (If $p > 0.5$ you have to pay more if you lose than you get if you win.)
- - (If $p < 0.5$ you have to pay less if you lose than you get if you win.)
-
----
-## Visualizing fitting logistic regression curves
-```
-x <- seq(-10, 10, length = 1000)
-manipulate(
- plot(x, exp(beta0 + beta1 * x) / (1 + exp(beta0 + beta1 * x)),
- type = "l", lwd = 3, frame = FALSE),
- beta1 = slider(-2, 2, step = .1, initial = 2),
- beta0 = slider(-2, 2, step = .1, initial = 0)
- )
-```
-
----
-
-## Ravens logistic regression
-
-```{r logReg, dependson = "loadRavens"}
-logRegRavens <- glm(ravensData$ravenWinNum ~ ravensData$ravenScore,family="binomial")
-summary(logRegRavens)
-```
-
-
----
-
-## Ravens fitted values
-
-```{r dependson = "logReg",fig.height=4,fig.width=4}
-plot(ravensData$ravenScore,logRegRavens$fitted,pch=19,col="blue",xlab="Score",ylab="Prob Ravens Win")
-```
-
-
----
-
-## Odds ratios and confidence intervals
-
-```{r dependson = "logReg",fig.height=4,fig.width=4}
-exp(logRegRavens$coeff)
-exp(confint(logRegRavens))
-```
-
-
----
-
-## ANOVA for logistic regression
-
-```{r dependson = "logReg",fig.height=4,fig.width=4}
-anova(logRegRavens,test="Chisq")
-```
-
-
-
----
-
-## Interpreting Odds Ratios
-
-* Not probabilities
-* Odds ratio of 1 = no difference in odds
-* Log odds ratio of 0 = no difference in odds
-* Odds ratio < 0.5 or > 2 commonly a "moderate effect"
-* Relative risk $\frac{\rm{Pr}(RW_i | RS_i = 10)}{\rm{Pr}(RW_i | RS_i = 0)}$ often easier to interpret, harder to estimate
-* For small probabilities RR $\approx$ OR but __they are not the same__!
-
-[Wikipedia on Odds Ratio](http://en.wikipedia.org/wiki/Odds_ratio)
-
-
----
-
-## Further resources
-
-* [Wikipedia on Logistic Regression](http://en.wikipedia.org/wiki/Logistic_regression)
-* [Logistic regression and glms in R](http://data.princeton.edu/R/glms.html)
-* Brian Caffo's lecture notes on: [Simpson's paradox](http://ocw.jhsph.edu/courses/MethodsInBiostatisticsII/PDFs/lecture23.pdf), [Case-control studies](http://ocw.jhsph.edu/courses/MethodsInBiostatisticsII/PDFs/lecture24.pdf)
-* [Open Intro Chapter on Logistic Regression](http://www.openintro.org/stat/down/oiStat2_08.pdf)
-
-
-
+---
+title : Generalized linear models, binary data
+subtitle : Regression models
+author : Brian Caffo, Jeff Leek and Roger Peng
+job : Johns Hopkins Bloomberg School of Public Health
+logo : bloomberg_shield.png
+framework : io2012 # {io2012, html5slides, shower, dzslides, ...}
+highlighter : highlight.js # {highlight.js, prettify, highlight}
+hitheme : tomorrow #
+url:
+ lib: ../../librariesNew
+ assets: ../../assets
+widgets : [mathjax] # {mathjax, quiz, bootstrap}
+mode : selfcontained # {standalone, draft}
+---
+```{r setup, cache = F, echo = F, message = F, warning = F, tidy = F}
+# Must explicitly include knitr or the chunks will fail, at least in Linux
+library(knitr)
+# make this an external chunk that can be included in any file
+options(width = 100)
+opts_chunk$set(message = F, error = F, warning = F, comment = NA, fig.align = 'center', dpi = 100, tidy = F, cache.path = '.cache/', fig.path = 'fig/')
+
+options(xtable.type = 'html')
+knit_hooks$set(inline = function(x) {
+ if(is.numeric(x)) {
+ round(x, getOption('digits'))
+ } else {
+ paste(as.character(x), collapse = ', ')
+ }
+})
+knit_hooks$set(plot = knitr:::hook_plot_html)
+```
+
+## Key ideas
+
+* Frequently we care about outcomes that have two values
+ * Alive/dead
+ * Win/loss
+ * Success/Failure
+ * etc
+* Called binary, Bernoulli, or 0/1 outcomes
+* A collection of exchangeable binary outcomes for the same covariate data is called a binomial outcome.
+
+
+
+## Example Baltimore Ravens win/loss
+### Ravens Data
+
+```{r loadRavens,cache=TRUE}
+download.file("https://dl.dropboxusercontent.com/u/7710864/data/ravensData.rda"
+ , destfile="./data/ravensData.rda",method="curl")
+load("./data/ravensData.rda")
+head(ravensData)
+```
+
+
+
+## Linear regression
+
+$$ RW_i = b_0 + b_1 RS_i + e_i $$
+
+$RW_i$ - 1 if the Ravens win, 0 if not
+
+$RS_i$ - Number of points Ravens scored
+
+$b_0$ - probability of a Ravens win if they score 0 points
+
+$b_1$ - increase in probability of a Ravens win for each additional point
+
+$e_i$ - residual variation due to everything we didn't measure
+
+
+
+## Linear regression in R
+
+```{r linearReg, dependson = "loadRavens", cache=TRUE}
+lmRavens <- lm(ravensData$ravenWinNum ~ ravensData$ravenScore)
+summary(lmRavens)$coef
+```
+
+
+## Odds
+
+__Binary Outcome 0/1__
+
+$$RW_i$$
+
+__Probability (0,1)__
+
+$$\rm{Pr}(RW_i | RS_i, b_0, b_1 )$$
+
+
+__Odds $(0,\infty)$__
+$$\frac{\rm{Pr}(RW_i | RS_i, b_0, b_1 )}{1-\rm{Pr}(RW_i | RS_i, b_0, b_1)}$$
+
+__Log odds $(-\infty,\infty)$__
+
+$$\log\left(\frac{\rm{Pr}(RW_i | RS_i, b_0, b_1 )}{1-\rm{Pr}(RW_i | RS_i, b_0, b_1)}\right)$$
+
+
+
+
+## Linear vs. logistic regression
+
+__Linear__
+
+$$ RW_i = b_0 + b_1 RS_i + e_i $$
+
+or
+
+$$ E[RW_i | RS_i, b_0, b_1] = b_0 + b_1 RS_i$$
+
+__Logistic__
+
+$$ \rm{Pr}(RW_i | RS_i, b_0, b_1) = \frac{\exp(b_0 + b_1 RS_i)}{1 + \exp(b_0 + b_1 RS_i)}$$
+
+or
+
+$$ \log\left(\frac{\rm{Pr}(RW_i | RS_i, b_0, b_1 )}{1-\rm{Pr}(RW_i | RS_i, b_0, b_1)}\right) = b_0 + b_1 RS_i $$
+
+
+
+## Interpreting Logistic Regression
+
+$$ \log\left(\frac{\rm{Pr}(RW_i | RS_i, b_0, b_1 )}{1-\rm{Pr}(RW_i | RS_i, b_0, b_1)}\right) = b_0 + b_1 RS_i $$
+
+
+$b_0$ - Log odds of a Ravens win if they score zero points
+
+$b_1$ - Log odds ratio of win probability for each point scored (compared to zero points)
+
+$\exp(b_1)$ - Odds ratio of win probability for each point scored (compared to zero points)
+
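+A small numerical sketch of these interpretations, plugging in hand-copied estimates from the `glm` output shown later in this lecture (treat the values as approximate):
+
+```{r interpretSketch}
+# hand-copied estimates from the Ravens fit shown later: b0 = -1.68, b1 = 0.10658
+b0 <- -1.68; b1 <- 0.10658
+exp(b1)      # ~1.11: each additional point multiplies the odds of a win by about 1.11
+plogis(b0)   # ~0.16: estimated probability of a win when the Ravens score 0 points
+```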
+
+## Odds
+- Imagine that you are playing a game where you flip a coin with success probability $p$.
+- If it comes up heads, you win $X$. If it comes up tails, you lose $Y$.
+- What should we set $X$ and $Y$ to for the game to be fair?
+
+ $$E[earnings]= X p - Y (1 - p) = 0$$
+- Implies
+ $$\frac{Y}{X} = \frac{p}{1 - p}$$
+- The odds can be interpreted as answering: "How much should you be willing to pay for a $p$ probability of winning a dollar?"
+ - (If $p > 0.5$ you have to pay more if you lose than you get if you win.)
+ - (If $p < 0.5$ you have to pay less if you lose than you get if you win.)
+
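+- A one-line check of the fair-game algebra (the value of $p$ below is arbitrary):
+
+```{r fairGameSketch}
+p <- 0.75
+odds <- p / (1 - p)        # you should risk 3 dollars to win 1
+p * 1 - (1 - p) * odds     # expected earnings: exactly zero
+```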
+
+## Visualizing fitting logistic regression curves
+```
+## requires library(manipulate) and an interactive RStudio session,
+## so this block is shown but not run by knitr
+library(manipulate)
+x <- seq(-10, 10, length = 1000)
+manipulate(
+    plot(x, exp(beta0 + beta1 * x) / (1 + exp(beta0 + beta1 * x)),
+         type = "l", lwd = 3, frame = FALSE),
+    beta1 = slider(-2, 2, step = .1, initial = 2),
+    beta0 = slider(-2, 2, step = .1, initial = 0)
+    )
+```
+
+
+
+## Ravens logistic regression
+
+```{r logReg, dependson = "loadRavens"}
+logRegRavens <- glm(ravensData$ravenWinNum ~ ravensData$ravenScore,family="binomial")
+summary(logRegRavens)
+```
+
+
+
+
+## Ravens fitted values
+
+```{r dependson = "logReg",fig.height=4,fig.width=4}
+plot(ravensData$ravenScore,logRegRavens$fitted,pch=19,col="blue",xlab="Score",ylab="Prob Ravens Win")
+```
+
+
+
+
+## Odds ratios and confidence intervals
+
+```{r dependson = "logReg",fig.height=4,fig.width=4}
+exp(logRegRavens$coeff)
+exp(confint(logRegRavens))
+```
+
+
+
+
+## ANOVA for logistic regression
+
+```{r dependson = "logReg",fig.height=4,fig.width=4}
+anova(logRegRavens,test="Chisq")
+```
+
+
+
+
+
+## Interpreting Odds Ratios
+
+* Not probabilities
+* Odds ratio of 1 = no difference in odds
+* Log odds ratio of 0 = no difference in odds
+* Odds ratio < 0.5 or > 2 commonly a "moderate effect"
+* Relative risk $\frac{\rm{Pr}(RW_i | RS_i = 10)}{\rm{Pr}(RW_i | RS_i = 0)}$ often easier to interpret, harder to estimate
+* For small probabilities RR $\approx$ OR but __they are not the same__!
+
+[Wikipedia on Odds Ratio](http://en.wikipedia.org/wiki/Odds_ratio)
+
+
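+A small sketch (made-up probabilities) of why RR $\approx$ OR only when the probabilities are small:
+
+```{r rrVsOrSketch}
+rr_or <- function(p1, p0) c(RR = p1 / p0, OR = (p1 / (1 - p1)) / (p0 / (1 - p0)))
+rr_or(0.02, 0.01)   # small probabilities: RR and OR nearly identical
+rr_or(0.60, 0.30)   # larger probabilities: RR = 2 but OR = 3.5
+```
+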
+
+
+## Further resources
+
+* [Wikipedia on Logistic Regression](http://en.wikipedia.org/wiki/Logistic_regression)
+* [Logistic regression and glms in R](http://data.princeton.edu/R/glms.html)
+* Brian Caffo's lecture notes on: [Simpson's paradox](http://ocw.jhsph.edu/courses/MethodsInBiostatisticsII/PDFs/lecture23.pdf), [Case-control studies](http://ocw.jhsph.edu/courses/MethodsInBiostatisticsII/PDFs/lecture24.pdf)
+* [Open Intro Chapter on Logistic Regression](http://www.openintro.org/stat/down/oiStat2_08.pdf)
+
+
+
diff --git a/07_RegressionModels/03_02_binaryOutcomes/index.html b/07_RegressionModels/03_02_binaryOutcomes/index.html
index d1cffc5c9..8dda406bb 100644
--- a/07_RegressionModels/03_02_binaryOutcomes/index.html
+++ b/07_RegressionModels/03_02_binaryOutcomes/index.html
@@ -1,502 +1,331 @@
-
-
-
- Generalized linear models, binary data
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
Generalized linear models, binary data
-
Regression models
-
Brian Caffo, Jeff Leek and Roger Peng Johns Hopkins Bloomberg School of Public Health
-
-
-
-
-
-
-
-
-
Key ideas
-
-
-
-
Frequently we care about outcomes that have two values
-
-
-
Alive/dead
-
Win/loss
-
Success/Failure
-
etc
-
-
Called binary, Bernoulli or 0/1 outcomes
-
Collection of exchangeable binary outcomes for the same covariate data are called binomial outcomes.
+Call:
+glm(formula = ravensData$ravenWinNum ~ ravensData$ravenScore,
+ family = "binomial")
+
+Deviance Residuals:
+ Min 1Q Median 3Q Max
+-1.7575 -1.0999 0.5305 0.8060 1.4947
+
+Coefficients:
+ Estimate Std. Error z value Pr(>|z|)
+(Intercept) -1.68001 1.55412 -1.081 0.28
+ravensData$ravenScore 0.10658 0.06674 1.597 0.11
+
+(Dispersion parameter for binomial family taken to be 1)
+
+ Null deviance: 24.435 on 19 degrees of freedom
+Residual deviance: 20.895 on 18 degrees of freedom
+AIC: 24.895
+
+Number of Fisher Scoring iterations: 5
+
+
+
+
+
+
+
+
diff --git a/07_RegressionModels/03_03_countOutcomes/index.Rmd b/07_RegressionModels/03_03_countOutcomes/index.Rmd
index 15c5bf301..54ef75a9e 100644
--- a/07_RegressionModels/03_03_countOutcomes/index.Rmd
+++ b/07_RegressionModels/03_03_countOutcomes/index.Rmd
@@ -1,316 +1,318 @@
----
-title : Count outcomes, Poisson GLMs
-subtitle : Regression Models
-author : Brian Caffo, Jeffrey Leek, Roger Peng
-job : Johns Hopkins Bloomberg School of Public Health
-logo : bloomberg_shield.png
-framework : io2012 # {io2012, html5slides, shower, dzslides, ...}
-highlighter : highlight.js # {highlight.js, prettify, highlight}
-hitheme : tomorrow #
-url:
- lib: ../../librariesNew
- assets: ../../assets
-widgets : [mathjax] # {mathjax, quiz, bootstrap}
-mode : selfcontained # {standalone, draft}
----
-
-
-```{r setup, cache = F, echo = F, message = F, warning = F, tidy = F}
-# make this an external chunk that can be included in any file
-options(width = 100)
-opts_chunk$set(message = F, error = F, warning = F, comment = NA, fig.align = 'center', dpi = 100, tidy = F, cache.path = '.cache/', fig.path = 'fig/')
-
-options(xtable.type = 'html')
-knit_hooks$set(inline = function(x) {
- if(is.numeric(x)) {
- round(x, getOption('digits'))
- } else {
- paste(as.character(x), collapse = ', ')
- }
-})
-knit_hooks$set(plot = knitr:::hook_plot_html)
-```
-
-## Key ideas
-
-* Many data take the form of counts
- * Calls to a call center
- * Number of flu cases in an area
- * Number of cars that cross a bridge
-* Data may also be in the form of rates
- * Percent of children passing a test
- * Percent of hits to a website from a country
-* Linear regression with transformation is an option
-
----
-
-## Poisson distribution
-- The Poisson distribution is a useful model for counts and rates
-- Here a rate is count per some monitoring time
-- Some examples uses of the Poisson distribution
- - Modeling web traffic hits
- - Incidence rates
- - Approximating binomial probabilities with small $p$ and large $n$
- - Analyzing contigency table data
-
----
-## The Poisson mass function
-- $X \sim Poisson(t\lambda)$ if
-$$
-P(X = x) = \frac{(t\lambda)^x e^{-t\lambda}}{x!}
-$$
-For $x = 0, 1, \ldots$.
-- The mean of the Poisson is $E[X] = t\lambda$, thus $E[X / t] = \lambda$
-- The variance of the Poisson is $Var(X) = t\lambda$.
-- The Poisson tends to a normal as $t\lambda$ gets large.
-
----
-
-```{r simPois,fig.height=4,fig.width=8, cache=TRUE}
-par(mfrow = c(1, 3))
-plot(0 : 10, dpois(0 : 10, lambda = 2), type = "h", frame = FALSE)
-plot(0 : 20, dpois(0 : 20, lambda = 10), type = "h", frame = FALSE)
-plot(0 : 200, dpois(0 : 200, lambda = 100), type = "h", frame = FALSE)
-```
-
----
-
-## Poisson distribution
-### Sort of, showing that the mean and variance are equal
-```{r}
-x <- 0 : 10000; lambda = 3
-mu <- sum(x * dpois(x, lambda = lambda))
-sigmasq <- sum((x - mu)^2 * dpois(x, lambda = lambda))
-c(mu, sigmasq)
-```
-
----
-
-## Example: Leek Group Website Traffic
-* Consider the daily counts to Jeff Leek's web site
-
-[http://biostat.jhsph.edu/~jleek/](http://biostat.jhsph.edu/~jleek/)
-
-* Since the unit of time is always one day, set $t = 1$ and then
-the Poisson mean is interpretted as web hits per day. (If we set $t = 24$, it would
-be web hits per hour).
-
----
-
-## Website data
-
-```{r leekLoad,cache=TRUE}
-download.file("https://dl.dropboxusercontent.com/u/7710864/data/gaData.rda",destfile="./data/gaData.rda",method="curl")
-load("./data/gaData.rda")
-gaData$julian <- julian(gaData$date)
-head(gaData)
-```
-
-[http://skardhamar.github.com/rga/](http://skardhamar.github.com/rga/)
-
-
----
-
-## Plot data
-
-```{r, dependson="leekLoad",fig.height=4.5,fig.width=4.5}
-plot(gaData$julian,gaData$visits,pch=19,col="darkgrey",xlab="Julian",ylab="Visits")
-```
-
-
----
-
-## Linear regression
-
-$$ NH_i = b_0 + b_1 JD_i + e_i $$
-
-$NH_i$ - number of hits to the website
-
-$JD_i$ - day of the year (Julian day)
-
-$b_0$ - number of hits on Julian day 0 (1970-01-01)
-
-$b_1$ - increase in number of hits per unit day
-
-$e_i$ - variation due to everything we didn't measure
-
-
----
-
-## Linear regression line
-
-```{r linReg, dependson="leekLoad",fig.height=4,fig.width=4, cache=TRUE}
-plot(gaData$julian,gaData$visits,pch=19,col="darkgrey",xlab="Julian",ylab="Visits")
-lm1 <- lm(gaData$visits ~ gaData$julian)
-abline(lm1,col="red",lwd=3)
-```
----
-
-## Aside, taking the log of the outcome
-- Taking the natural log of the outcome has a specific interpretation.
-- Consider the model
-
-$$ \log(NH_i) = b_0 + b_1 JD_i + e_i $$
-
-$NH_i$ - number of hits to the website
-
-$JD_i$ - day of the year (Julian day)
-
-$b_0$ - log number of hits on Julian day 0 (1970-01-01)
-
-$b_1$ - increase in log number of hits per unit day
-
-$e_i$ - variation due to everything we didn't measure
-
----
-## Exponentiating coefficients
-- $e^{E[\log(Y)]}$ geometric mean of $Y$.
- - With no covariates, this is estimated by $e^{\frac{1}{n}\sum_{i=1}^n \log(y_i)} = (\prod_{i=1}^n y_i)^{1/n}$
-- When you take the natural log of outcomes and fit a regression model, your exponentiated coefficients
-estimate things about geometric means.
-- $e^{\beta_0}$ estimated geometric mean hits on day 0
-- $e^{\beta_1}$ estimated relative increase or decrease in geometric mean hits per day
-- There's a problem with logs with you have zero counts, adding a constant works
-```{r}
-round(exp(coef(lm(I(log(gaData$visits + 1)) ~ gaData$julian))), 5)
-```
-
----
-
-## Linear vs. Poisson regression
-
-__Linear__
-
-$$ NH_i = b_0 + b_1 JD_i + e_i $$
-
-or
-
-$$ E[NH_i | JD_i, b_0, b_1] = b_0 + b_1 JD_i$$
-
-__Poisson/log-linear__
-
-$$ \log\left(E[NH_i | JD_i, b_0, b_1]\right) = b_0 + b_1 JD_i $$
-
-or
-
-$$ E[NH_i | JD_i, b_0, b_1] = \exp\left(b_0 + b_1 JD_i\right) $$
-
-
----
-
-## Multiplicative differences
-
-
-
-$$ \log\left(E[NHSS_i | JD_i, b_0, b_1]\right) = \log(NH_i) + b_0 + b_1 JD_i $$
-
----
-
-## Fitting rates in R
-
-```{r ratesFit,dependson="agnostic", cache=TRUE,fig.height=4,fig.width=4}
-glm2 <- glm(gaData$simplystats ~ julian(gaData$date),offset=log(visits+1),
- family="poisson",data=gaData)
-plot(julian(gaData$date),glm2$fitted,col="blue",pch=19,xlab="Date",ylab="Fitted Counts")
-points(julian(gaData$date),glm1$fitted,col="red",pch=19)
-```
-
----
-
-## Fitting rates in R
-
-```{r,dependson="ratesFit",fig.height=4,fig.width=4}
-glm2 <- glm(gaData$simplystats ~ julian(gaData$date),offset=log(visits+1),
- family="poisson",data=gaData)
-plot(julian(gaData$date),gaData$simplystats/(gaData$visits+1),col="grey",xlab="Date",
- ylab="Fitted Rates",pch=19)
-lines(julian(gaData$date),glm2$fitted/(gaData$visits+1),col="blue",lwd=3)
-```
-
----
-
-## More information
-
-* [Log-linear models and multiway tables](http://ww2.coastal.edu/kingw/statistics/R-tutorials/loglin.html)
-* [Wikipedia on Poisson regression](http://en.wikipedia.org/wiki/Poisson_regression), [Wikipedia on overdispersion](http://en.wikipedia.org/wiki/Overdispersion)
-* [Regression models for count data in R](http://cran.r-project.org/web/packages/pscl/vignettes/countreg.pdf)
-* [pscl package](http://cran.r-project.org/web/packages/pscl/index.html) - the function _zeroinfl_ fits zero inflated models.
+---
+title : Count outcomes, Poisson GLMs
+subtitle : Regression Models
+author : Brian Caffo, Jeffrey Leek, Roger Peng
+job : Johns Hopkins Bloomberg School of Public Health
+logo : bloomberg_shield.png
+framework : io2012 # {io2012, html5slides, shower, dzslides, ...}
+highlighter : highlight.js # {highlight.js, prettify, highlight}
+hitheme : tomorrow #
+url:
+ lib: ../../librariesNew
+ assets: ../../assets
+widgets : [mathjax] # {mathjax, quiz, bootstrap}
+mode : selfcontained # {standalone, draft}
+---
+
+
+```{r setup, cache = F, echo = F, message = F, warning = F, tidy = F}
+# Must explicitly include knitr or the chunks will fail, at least in Linux
+library(knitr)
+# make this an external chunk that can be included in any file
+options(width = 100)
+opts_chunk$set(message = F, error = F, warning = F, comment = NA, fig.align = 'center', dpi = 100, tidy = F, cache.path = '.cache/', fig.path = 'fig/')
+
+options(xtable.type = 'html')
+knit_hooks$set(inline = function(x) {
+ if(is.numeric(x)) {
+ round(x, getOption('digits'))
+ } else {
+ paste(as.character(x), collapse = ', ')
+ }
+})
+knit_hooks$set(plot = knitr:::hook_plot_html)
+```
+
+## Key ideas
+
+* Many data take the form of counts
+ * Calls to a call center
+ * Number of flu cases in an area
+ * Number of cars that cross a bridge
+* Data may also be in the form of rates
+ * Percent of children passing a test
+ * Percent of hits to a website from a country
+* Linear regression with transformation is an option
+
+
+
+## Poisson distribution
+- The Poisson distribution is a useful model for counts and rates
+- Here a rate is count per some monitoring time
+- Some example uses of the Poisson distribution
+  - Modeling web traffic hits
+  - Incidence rates
+  - Approximating binomial probabilities with small $p$ and large $n$
+  - Analyzing contingency table data
+
+
+## The Poisson mass function
+- $X \sim Poisson(t\lambda)$ if
+$$
+P(X = x) = \frac{(t\lambda)^x e^{-t\lambda}}{x!}
+$$
+For $x = 0, 1, \ldots$.
+- The mean of the Poisson is $E[X] = t\lambda$, thus $E[X / t] = \lambda$
+- The variance of the Poisson is $Var(X) = t\lambda$.
+- The Poisson tends to a normal as $t\lambda$ gets large.
+
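+- A quick sketch of the normal approximation (the value $t\lambda = 100$ is chosen arbitrarily); the two probabilities are close:
+
+```{r poisNormalSketch}
+ppois(110, lambda = 100)                  # exact Poisson probability P(X <= 110)
+pnorm(110, mean = 100, sd = sqrt(100))    # normal approximation with mean and variance t*lambda
+```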
+
+
+```{r simPois,fig.height=4,fig.width=8, cache=TRUE}
+par(mfrow = c(1, 3))
+plot(0 : 10, dpois(0 : 10, lambda = 2), type = "h", frame = FALSE)
+plot(0 : 20, dpois(0 : 20, lambda = 10), type = "h", frame = FALSE)
+plot(0 : 200, dpois(0 : 200, lambda = 100), type = "h", frame = FALSE)
+```
+
+
+
+## Poisson distribution
+### Sort of, showing that the mean and variance are equal
+```{r}
+x <- 0 : 10000; lambda = 3
+mu <- sum(x * dpois(x, lambda = lambda))
+sigmasq <- sum((x - mu)^2 * dpois(x, lambda = lambda))
+c(mu, sigmasq)
+```
+
+
+
+## Example: Leek Group Website Traffic
+* Consider the daily counts to Jeff Leek's web site
+
+[http://biostat.jhsph.edu/~jleek/](http://biostat.jhsph.edu/~jleek/)
+
+* Since the unit of time is always one day, set $t = 1$ and then
+the Poisson mean is interpreted as web hits per day. (If we set $t = 24$, it would
+be web hits per hour.)
+
+
+
+## Website data
+
+```{r leekLoad,cache=TRUE}
+download.file("https://dl.dropboxusercontent.com/u/7710864/data/gaData.rda",destfile="./data/gaData.rda",method="curl")
+load("./data/gaData.rda")
+gaData$julian <- julian(gaData$date)
+head(gaData)
+```
+
+[http://skardhamar.github.com/rga/](http://skardhamar.github.com/rga/)
+
+
+
+
+## Plot data
+
+```{r, dependson="leekLoad",fig.height=4.5,fig.width=4.5}
+plot(gaData$julian,gaData$visits,pch=19,col="darkgrey",xlab="Julian",ylab="Visits")
+```
+
+
+
+
+## Linear regression
+
+$$ NH_i = b_0 + b_1 JD_i + e_i $$
+
+$NH_i$ - number of hits to the website
+
+$JD_i$ - day of the year (Julian day)
+
+$b_0$ - number of hits on Julian day 0 (1970-01-01)
+
+$b_1$ - increase in number of hits per unit day
+
+$e_i$ - variation due to everything we didn't measure
+
+
+
+
+## Linear regression line
+
+```{r linReg, dependson="leekLoad",fig.height=4,fig.width=4, cache=TRUE}
+plot(gaData$julian,gaData$visits,pch=19,col="darkgrey",xlab="Julian",ylab="Visits")
+lm1 <- lm(gaData$visits ~ gaData$julian)
+abline(lm1,col="red",lwd=3)
+```
+
+
+## Aside, taking the log of the outcome
+- Taking the natural log of the outcome has a specific interpretation.
+- Consider the model
+
+$$ \log(NH_i) = b_0 + b_1 JD_i + e_i $$
+
+$NH_i$ - number of hits to the website
+
+$JD_i$ - day of the year (Julian day)
+
+$b_0$ - log number of hits on Julian day 0 (1970-01-01)
+
+$b_1$ - increase in log number of hits per unit day
+
+$e_i$ - variation due to everything we didn't measure
+
+
+## Exponentiating coefficients
+- $e^{E[\log(Y)]}$ is the geometric mean of $Y$.
+ - With no covariates, this is estimated by $e^{\frac{1}{n}\sum_{i=1}^n \log(y_i)} = (\prod_{i=1}^n y_i)^{1/n}$
+- When you take the natural log of outcomes and fit a regression model, your exponentiated coefficients
+estimate things about geometric means.
+- $e^{\beta_0}$ estimated geometric mean hits on day 0
+- $e^{\beta_1}$ estimated relative increase or decrease in geometric mean hits per day
+- There's a problem with logs when you have zero counts; adding a constant works
+```{r}
+round(exp(coef(lm(I(log(gaData$visits + 1)) ~ gaData$julian))), 5)
+```
+
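+- A tiny check (made-up positive values) of the geometric mean identity above:
+
+```{r geomMeanSketch}
+y <- c(2, 8, 4, 16)                              # arbitrary positive outcomes
+c(exp(mean(log(y))), prod(y)^(1 / length(y)))    # both forms give the geometric mean
+```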
+
+
+## Linear vs. Poisson regression
+
+__Linear__
+
+$$ NH_i = b_0 + b_1 JD_i + e_i $$
+
+or
+
+$$ E[NH_i | JD_i, b_0, b_1] = b_0 + b_1 JD_i$$
+
+__Poisson/log-linear__
+
+$$ \log\left(E[NH_i | JD_i, b_0, b_1]\right) = b_0 + b_1 JD_i $$
+
+or
+
+$$ E[NH_i | JD_i, b_0, b_1] = \exp\left(b_0 + b_1 JD_i\right) $$
+
+
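+A sketch of the corresponding Poisson fit, assuming the `gaData` object loaded earlier in this lecture (the chunk label and the `glm1` name are illustrative, not original slide code):
+
+```{r poissonFitSketch, dependson="leekLoad"}
+# Poisson/log-linear fit of daily hits on Julian day
+glm1 <- glm(visits ~ julian, family = "poisson", data = gaData)
+round(exp(coef(glm1)), 5)   # e^b0 and e^b1, i.e. effects on the multiplicative scale
+```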
+
+
+## Multiplicative differences
+- Exponentiating the log-linear model gives
+$$ E[NH_i | JD_i, b_0, b_1] = \exp(b_0)\exp(b_1 JD_i) $$
+- So $e^{b_1}$ is the multiplicative change in the expected number of hits per one-day increase in $JD_i$, and $e^{b_0}$ is the expected count on day 0.
+
Since the unit of time is always one day, set \(t = 1\) and then
-the Poisson mean is interpretted as web hits per day. (If we set \(t = 24\), it would
-be web hits per hour).
Since the unit of time is always one day, set \(t = 1\) and then the Poisson mean is interpretted as web hits per day. (If we set \(t = 24\), it would be web hits per hour).
People have spent a lot of time thinking about bases for this kind of problem. So, consider this as just a teaser.
+
Single knot point terms can fit hockey stick like processes.
+
These bases can be used in GLMs as well.
+
An issue with these approaches is the large number of parameters introduced.
+
Requires some method of so called regularization.
+
+
+
+
Harmonics using linear models
+
##Chord finder, playing the white keys on a piano from octave c4 - c5
+notes4 <- c(261.63, 293.66, 329.63, 349.23, 392.00, 440.00, 493.88, 523.25)
+t <- seq(0, 2, by = .001); n <- length(t)
+c4 <- sin(2 * pi * notes4[1] * t); e4 <- sin(2 * pi * notes4[3] * t);
+g4 <- sin(2 * pi * notes4[5] * t)
+chord <- c4 + e4 + g4 + rnorm(n, 0, 0.3)
+x <- sapply(notes4, function(freq) sin(2 * pi * freq * t))
+fit <- lm(chord ~ x - 1)
+
+
+
+
##(How you would really do it)
+a <- fft(chord); plot(Re(a)^2, type = "l")
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
diff --git a/07_RegressionModels/originalContent/002basicLeastSquares/index.Rmd b/07_RegressionModels/originalContent/002basicLeastSquares/index.Rmd
index 09ea8caf0..cb873069f 100644
--- a/07_RegressionModels/originalContent/002basicLeastSquares/index.Rmd
+++ b/07_RegressionModels/originalContent/002basicLeastSquares/index.Rmd
@@ -16,6 +16,8 @@ mode : selfcontained # {standalone, draft}
```{r setup, cache = F, echo = F, message = F, warning = F, tidy = F}
+# Must explicitly include knitr or the chunks will fail, at least in Linux
+library(knitr)
# make this an external chunk that can be included in any file
options(width = 100)
opts_chunk$set(message = F, error = F, warning = F, comment = NA, fig.align = 'center', dpi = 100, tidy = F, cache.path = '.cache/', fig.path = 'fig/')
@@ -38,7 +40,7 @@ knit_hooks$set(plot = knitr:::hook_plot_html)
* Make inferences about distributions or relationships
----
+
## Example: Average parent and child heights
@@ -49,7 +51,7 @@ knit_hooks$set(plot = knitr:::hook_plot_html)
[http://www.nature.com/ejhg/journal/v17/n8/full/ejhg20095a.html](http://www.nature.com/ejhg/journal/v17/n8/full/ejhg20095a.html)
----
+
## Still relevant
@@ -62,7 +64,7 @@ knit_hooks$set(plot = knitr:::hook_plot_html)
[Predicting height: the Victorian approach beats modern genomics](http://www.wired.com/wiredscience/2009/03/predicting-height-the-victorian-approach-beats-modern-genomics/)
----
+
## Load Galton Data
@@ -75,7 +77,7 @@ hist(galton$child,col="blue",breaks=100)
hist(galton$parent,col="blue",breaks=100)
```
----
+
## The distribution of child heights
@@ -84,7 +86,7 @@ hist(galton$child,col="blue",breaks=100)
```
----
+
## Only know the child - average height
@@ -94,7 +96,7 @@ meanChild <- mean(galton$child)
lines(rep(meanChild,100),seq(0,150,length=100),col="red",lwd=5)
```
----
+
## Only know the child - why average?
@@ -102,7 +104,7 @@ If $C_i$ is the height of child $i$ then the average is the value of $\mu$ that
$$ \sum_{i=1}^{928}(C_i - \mu)^2 $$
----
+
## What if we plot child versus average parent
@@ -110,7 +112,7 @@ $$ \sum_{i=1}^{928}(C_i - \mu)^2 $$
plot(galton$parent,galton$child,pch=19,col="blue")
```
----
+
## Jittered plot
@@ -119,7 +121,7 @@ set.seed(1234)
plot(jitter(galton$parent,factor=2),jitter(galton$child,factor=2),pch=19,col="blue")
```
----
+
## Average parent = 65 inches tall
@@ -130,7 +132,7 @@ points(near65$parent,near65$child,pch=19,col="red")
lines(seq(64,66,length=100),rep(mean(near65$child),100),col="red",lwd=4)
```
----
+
## Average parent = 71 inches tall
@@ -142,7 +144,7 @@ lines(seq(70,72,length=100),rep(mean(near71$child),100),col="red",lwd=4)
```
----
+
## Fitting a line
@@ -154,7 +156,7 @@ lm1 <- lm(galton$child ~ galton$parent)
lines(galton$parent,lm1$fitted,col="red",lwd=3)
```
----
+
## Why not this line?
@@ -164,7 +166,7 @@ plot(galton$parent,galton$child,pch=19,col="blue")
lines(galton$parent, 26 + 0.646*galton$parent)
```
----
+
## The equation for a line
@@ -173,7 +175,7 @@ If $C_i$ is the height of child $i$ and $P_i$ is the height of the average paren
$$C_i = b_0 + b_1 P_i$$
----
+
## Not all points are on the line
@@ -183,7 +185,7 @@ lines(galton$parent,lm1$fitted,col="red",lwd=3)
```
----
+
## Allowing for variation
@@ -193,7 +195,7 @@ $$C_i = b_0 + b_1 P_i + e_i$$
$e_i$ is everything we didn't measure (how much they eat, where they live, do they stretch in the morning...)
----
+
## How do we pick best?
@@ -202,7 +204,7 @@ If $C_i$ is the height of child $i$ and $P_i$ is the height of the average paren
$$ \sum_{i=1}^{928}(C_i - \{b_0 + b_1 P_i\})^2 $$
----
+
## Plot what is leftover
diff --git a/07_RegressionModels/project/project.html b/07_RegressionModels/project/project.html
index 0d9330aa2..7dae5a82f 100644
--- a/07_RegressionModels/project/project.html
+++ b/07_RegressionModels/project/project.html
@@ -1,188 +1,144 @@
-
+
+
+
-
-Regression models final project
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
-body {
- max-width: 800px;
- margin: auto;
- line-height: 20px;
-}
-tt, code, pre {
- font-family: 'DejaVu Sans Mono', 'Droid Sans Mono', 'Lucida Console', Consolas, Monaco, monospace;
-}
-h1 {
- font-size:2.2em;
+
-pre code {
- display: block; padding: 0.5em;
-}
-code {
- font-size: 92%;
- border: 1px solid #ccc;
-}
+
-code[class] {
- background-color: #F8F8F8;
-}
+
-table, td, th {
- border: none;
+
+
-
+
+
+
-
-
Regression models final project
+
-
Context
-
You work for Motor Trend magazine. Looking at a data set of a collection of cars, they are interested in exploring the relationship between a set of variables and MPG (outcome). The are particurlarly interested
-in the questions
-
-
“Is an automatic or manual transmission better for MPG”
-
“How different is the MPG between automatic manual transmission?”
-
-
Question
-
Taket the mtcars data set and write up an analysis to answer their question using regression models and exploratory data analyses.
-
Form of report
+
-
Your report must be
-
-
Written as a pdf printout of a compiled (using knitr) R markdown document.
+
+
+
+
+
+
Regression models final project
+
+
Context
+
+You work for Motor Trend magazine. Looking at a data set of a collection of cars, they are interested in exploring the relationship between a set of variables and MPG (outcome). They are particularly interested in the questions
-
Do not use any packages that are not in R-base or the library datasets.
-
-
Brief. roughly the equivalent of 2 pages long or less.
-
Include a first paragraph executive summary.
+
+“Is an automatic or manual transmission better for MPG?”
+
+“How different is the MPG between automatic and manual transmissions?”
-
+
+
+
Question
+
+Take the mtcars data set and write up an analysis to answer their questions using regression models and exploratory data analyses.
+
+
+
Form of report
+
+Your report must be
+Written as a pdf printout of a compiled (using knitr) R markdown document.
+Do not use any packages that are not in R-base or the library datasets.
+Brief, roughly the equivalent of 2 pages long or less.
+Include a first paragraph executive summary.
+
Peer grading
-
Your report will be graded on the following equally weighted criteria:
-
-
+
Did you interpret your coefficients correctly?
Did you do some exploratory data analyses?
Did you fit multiple models and detail your strategy for model selection?