Due: 11:59pm, July 2, 2020
You all should have R and RStudio installed on your computers by now. If you do not, first install the latest version of R here: https://cran.rstudio.com (remember to select the right installer for your operating system). Next, install the latest version of RStudio here: https://www.rstudio.com/products/rstudio/download/. Scroll down to the “Installers for Supported Platforms” section and find the right installer for your operating system.
You are required to use R Markdown to type up this lab report. If you do not already know how to use R markdown, here is a very basic R Markdown template: https://sta-360-602l-su20.github.io/Course-Website/labs/resources/LabReport.Rmd. Refer to the resources tab of the course website (here: https://sta-360-602l-su20.github.io/Course-Website/resources/ ) for links to help you learn how to use R markdown.
You MUST submit both your .Rmd and .pdf files to the course site on Gradescope here: https://www.gradescope.com/courses/141314//assignments. Make sure to knit to pdf and not html; ask the TA about knitting to pdf if you cannot figure it out. Be sure to submit under the right assignment entry.
You will need the following R packages. If you do not already have them installed, please do so first using the install.packages
function.
While you will be expected to be able to write your own samplers for homework assignments (for the most part), we will rely on stan a lot for the labs. We will look at writing sampling methods in stan later on, but for now, you will be provided with the stan scripts needed for inference and the code needed to grab posterior samples from stan objects. Your focus then should be on using those posterior samples to answer several kinds of questions. For this lab, you will need two stan files lab-02-pool.stan
and lab-02-nopool.stan
, which you can download here:
Download both and make sure to save them in the same folder as the R script or R markdown file you are working from.
Also, start browsing some of the references on programming with Stan to gain some familiarity with stan.
By this point, you are familiar with binomial data: If \(Y \sim Binom(n,\theta)\), we assume the data are such that over \(n\) trials with success probability \(\theta\), we observe \(y\) successes. Let us consider multiple binomial realizations. We have data on rat tumor development from Tarone (1982). Specifically, we have the number of incidences of endometrial stromal polyps in 71 different groups of female lab rats of type F344. We begin by loading in the data:
tumors <- read.csv(file = url("http://www.stat.columbia.edu/~gelman/book/data/rats.asc"),
skip = 2, header = T, sep = " ")[,c(1,2)]
y <- tumors$y
N <- tumors$N
n <- length(y)
Each row represents a group, or a draw from a binomial distribution. The \(y\) variable denotes the number of succcesses and the \(N\) variable denotes the total number of rats in that control group. For example the first group consists of \(20\) rats, with \(0\) of these \(20\) having developed a tumor. \(n\) is the number of groups.
If we assume that the probability of developing a tumor is the same across groups, then for each of the \(i = 1,2,\ldots, n\) groups, we have \(y_i \sim Binom(N_i, \theta)\). We have learned that the Beta distribution is conjugate for Binomial data. For now, we place a \(Beta(1,1)\) prior on \(\theta\), which corresponds to a uniform density on the interval \([0,1]\).
plot(seq(0, 1, length.out = 1000),
dbeta(seq(0, 1, length.out = 1000), 1, 1),
type = 'l',
xlab = expression(theta), ylab = "Density",
main = "The Beta(1, 1) density")
stan_dat <- list(n = n, N = N, y =y, a = 1, b = 1)
fit_pool <- stan('lab-02-pool.stan', data = stan_dat, chains = 2, refresh = 0)
pool_output <- rstan::extract(fit_pool)
mean(pool_output$theta)
## [1] 0.15
rstan
object called pool_output
. Describe the distribution.Alternatively, we may not have reason to believe that the probability of a rat developing a tumor should be the same across groups. Then we have the model \(y_i \sim Binom(N_i, \theta_i)\) for \(i = 1,2,\ldots,n\). If we had expert knowledge about the different groups of rats, we might place different priors on each of the \(n\) \(\theta_i\)’s. However, for simplicity we choose to model the \(\theta_i\) as i.i.d. \(Beta(1,1)\).
stan_dat <- list(n = n, N = N, y =y, a = 1, b = 1)
fit_nopool <- stan('lab-02-nopool.stan', data = stan_dat, chains = 2, refresh = 0)
nopool_output <- rstan::extract(fit_nopool)
apply(nopool_output$theta,2,mean)
## [1] 0.045 0.046 0.046 0.047 0.044 0.045 0.047 0.048 0.048 0.048 0.048 0.050
## [13] 0.050 0.053 0.092 0.092 0.091 0.090 0.095 0.098 0.099 0.099 0.139 0.111
## [25] 0.113 0.119 0.136 0.135 0.138 0.135 0.138 0.136 0.168 0.116 0.144 0.123
## [37] 0.157 0.157 0.163 0.181 0.181 0.200 0.201 0.213 0.227 0.228 0.229 0.226
## [49] 0.224 0.228 0.227 0.219 0.239 0.239 0.238 0.249 0.248 0.255 0.272 0.274
## [61] 0.280 0.284 0.290 0.319 0.320 0.318 0.313 0.334 0.328 0.382 0.312
What is actually being plotted here? What does each point represent?
lab-02-pool.stan
and lab-02-nopool.stan
. How are they different?With the Beta-Binomial model, we know that the posterior is \(\theta | Y \sim Beta(a + \sum y_i, b + \sum N_i- \sum y_i)\). Therefore, the posterior mean is \[E[\theta | Y] = \frac{a + \sum y_i}{a + b + \sum N_i}\] We fit the above models with \(a=1, b=1\), but it is good practice to perform an analysis to determine how sensitive the posterior is to the choice of prior. Considering the first model where we assumed the same success probability \(\theta\) across groups, let us sample from the posterior distribution of \(\theta\) over a range of \(a\) and \(b\) values. These parameter settings produce very different pictures of our prior beliefs about \(\theta\):
par(mfrow = c(4, 4))
par(mar=c(2,2,2,2))
for(a_val in c(1, 10, 25, 100)){
for(b_val in rev(c(1, 10, 25, 100))){
plot(seq(0, 1, length.out = 1000),
dbeta(seq(0, 1, length.out = 1000), a_val, b_val),
type = 'l',
xlab = expression(theta), ylab = "Density",
main = paste0("Beta(", a_val, ", ", b_val, ")"))
}
}
To get samples from the posterior distribution of \(\theta\) for each one of the prior distributions above, we run:
output_list <- list()
for(a_val in c(1, 10, 25, 100)){
for(b_val in c(1, 10, 25, 100)){
stan_dat <- list(n = n, N = N, y = y, a = a_val, b = b_val)
fit_pool <- stan('lab-02-pool.stan', data = stan_dat, chains = 2, refresh = 0)
output_list[[paste0("a_", a_val, ":b_", b_val)]] <- rstan::extract(fit_pool)[["theta"]]
}
}
We then compile the samples from the different prior specifications into a data.frame, which will help us visualize the results.
output_list %>%
plyr::ldply(function(theta){
reshape2::melt(theta) %>%
dplyr::mutate(post_mean = mean(theta))
}, .id = "prior") %>%
tidyr::separate("prior", into = c("a", "b"), sep = ":") %>%
dplyr::mutate(a = as.numeric(gsub("._", "", a)),
b = as.numeric(gsub("._", "", b))) %>%
ggplot2::ggplot() +
geom_density(aes(x = value)) +
geom_vline(aes(xintercept = post_mean)) +
facet_grid(a~factor(b, levels = rev(c(1, 10, 25, 100)))) +
scale_colour_brewer(palette = "Set1") +
labs(x = expression(theta), y = "Density")
In the plot above, increasing values of the parameter \(a\) are displayed moving from top to bottom along the vertical direction. Decreasing values of the parameter \(b\) are displayed moving from left to right along the horizontal direction. We can see that all of the posterior distributions look roughly normal with roughly equal variance. They are all fairly concentrated on values of \(\theta\) within the range \([0.1, 0.2]\).
Here is a further observation that is specific to the concept of sensitivity analysis: for fixed \(a\), as \(b\) increases the posterior mean shifts slightly towards lower values of \(\theta\). But for fixed \(b\), as \(a\) increases the posterior mean shifts more dramatically towards higher values of \(\theta\). We might say that the posterior mean of \(\theta\) is more sensitive to our prior beliefs about \(a\) than it is to our prior beliefs about \(b\). Why might this be the case?
We can look at the formula for the posterior mean above to find an explicit answer. For that, recall that if \(Y \sim Binom(N, \theta)\) and \(\theta \sim Beta(a,b)\), then: \[E[\theta | Y] = \frac{a + Y}{a + b + N} = \frac{a+b}{a+b+N}\left(\frac{a}{a+b}\right) + \frac{N}{a+b+N}\left(\frac{Y}{N}\right)\]
So the posterior mean is a weighted average of the prior mean and the MLE.
What might be a more intuitive explanation for the posterior’s high sensitivity to the parameter \(a\)?
Returning to the initial exploration where we considered a single \(\theta\) versus allowing \(\theta_i\) to vary across groups: You should have noticed that if we allow the groups to have different success probabilities, then our estimates \(\hat{\theta}_i\) vary from 0.05 to 0.30. However when we assumed a single probability of success, we obtained \(\hat{\theta}\approx\) 0.15. In this first approach and assuming the 71 groups are independent, we essentially have one large binomial trial: \(Y^* = \sum y_i \sim Binom(\sum N_i, \theta)\). Applying ideas from the sensitivity analysis:
10 points: 1 point each for the exercises, 3 points for attempting the lab.
This lab was created by Jordan Bryan and Becky Tang.