Due: 11:59pm, July 10, 2020
You all should have R and RStudio installed on your computers by now. If you do not, first install the latest version of R here: https://cran.rstudio.com (remember to select the right installer for your operating system). Next, install the latest version of RStudio here: https://www.rstudio.com/products/rstudio/download/. Scroll down to the “Installers for Supported Platforms” section and find the right installer for your operating system.
You are required to use R Markdown to type up this lab report. If you do not already know how to use R markdown, here is a very basic R Markdown template: https://sta-360-602l-su20.github.io/Course-Website/labs/resources/LabReport.Rmd. Refer to the resources tab of the course website (here: https://sta-360-602l-su20.github.io/Course-Website/resources/ ) for links to help you learn how to use R markdown.
You MUST submit both your .Rmd and .pdf files to the course site on Gradescope here: https://www.gradescope.com/courses/141314//assignments. Make sure to knit to pdf and not html; ask the TA about knitting to pdf if you cannot figure it out. Be sure to submit under the right assignment entry.
You will need the following R packages. If you do not already have them installed, please do so first using the install.packages
function.
For this lab, you will need a few different stan files. Download all of them here:
Download and make sure to save them in the same folder as the R script or R markdown file you are working from.
Suppose we have data \(Y\), which we take to be normally distributed: \(y_i \sim \mathcal{N}(\mu_i, \sigma^2)\) for \(i=1,\ldots,n\). Suppose further that we want to model \(\mu_i\) as a linear function of a covariate/predictor \(x_i\), using an intercept \(\alpha\) and coefficient \(\beta\). That is, \[ \mu_i = \alpha + \beta x_i. \]
Let’s simulate data (with very small sample size) under this model.
create_df <- function(post_draws, prior_draws){
post_draws <- data.frame(post_draws)
post_draws$distribution <- "posterior"
prior_draws <- data.frame(prior_draws)
colnames(prior_draws) <- "alpha"
prior_draws$distribution <- "prior"
dat <- rbind(post_draws, prior_draws)
return(dat)
}
set.seed(689934)
alpha <- 1
beta <- -0.25
sigma <- 1
N <- 5
x <- array(runif(N, 0, 2), dim=N)
y <- array(rnorm(N, beta * x + alpha, sigma), dim=N)
Note the true values of \(\alpha\) and \(\beta\). Once we observe the data \(\{x_i,y_i\}\), both \(\alpha\) and \(\beta\) are unknown, so we would like to perform inference on them. However, we only have 5 data points. Given so few data points, we can be quite sure that the resulting posteriors will be sensitive to the choice of priors.
One possible choice of prior is a flat prior for \(\alpha\) and \(\beta\). That is, \(\pi(\alpha) \propto 1\) and \(\pi(\beta) \propto 1\). Let’s look at how our posterior beliefs about \(\alpha\) and \(\beta\) (especially \(\alpha\)) act under these priors.
stan_dat <- list(y = y, x=x, N=N)
fit.flat <- stan(file = "lab-04-flat_prior.stan", data = stan_dat, chains = 1, refresh = 0, iter = 2000, warmup = 500, seed=48)
alpha.flat <- as.matrix(fit.flat, pars = "alpha")
beta.flat <- as.matrix(fit.flat, pars = "beta")
## Inference for Stan model: lab-04-flat_prior.
## 1 chains, each with iter=2000; warmup=500; thin=1;
## post-warmup draws per chain=1500, total post-warmup draws=1500.
##
## mean se_mean sd 2.5% 25% 50% 75% 98% n_eff Rhat
## alpha 0.73 0.09 1.6 -2.5 -0.04 0.74 1.5 4 299 1
##
## Samples were drawn using NUTS(diag_e) at Sat Feb 8 21:02:33 2020.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
Notice how the posterior for \(\alpha\) is quite diffuse – there is much uncertainty about what \(\alpha\) is. While the true value for \(\alpha\) is 1, what is the posterior mean? What is the 95% credible interval?
By doing inference with a flat/diffuse prior, we might have thought we were using the least prior information possible. However, flat priors may actually bias our estimates for a parameter by allowing the posterior to be pulled towards extreme and unlikely values.
Diving a bit deeper, consider another flat prior: \(\alpha \sim Unif(a, b)\). Under this prior, we are saying that we believe \(a \leq \alpha \leq b\). We exhibit this in the following code, with the prior \(\alpha \sim Unif(-10, 10)\)
stan_dat <- list(y = y, x=x, N=N, lb = -10, ub = 10)
fit.unif <- stan(file = "lab-04-unif_prior.stan", data = stan_dat, chains = 1, refresh = 0, iter = 2000, warmup = 500, seed=48)
alpha.unif <- as.matrix(fit.unif, pars = c("alpha"))
beta.unif <- as.matrix(fit.unif, pars = c("beta"))
## Inference for Stan model: lab-04-unif_prior.
## 1 chains, each with iter=2000; warmup=500; thin=1;
## post-warmup draws per chain=1500, total post-warmup draws=1500.
##
## mean se_mean sd 2.5% 25% 50% 75% 98% n_eff Rhat
## alpha 0.79 0.08 1.4 -2 0.13 0.75 1.4 3.7 348 1
##
## Samples were drawn using NUTS(diag_e) at Sat Feb 8 21:03:04 2020.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
While the posterior mean under this uniform prior is closer to the true value, the posterior is still very spread out. So we have seen that a diffuse or flat prior is necessarily non-informative, and in these cases is actually extremely informative! Diffuse priors inherently spread probability mass across large regions of parameter space. We often assume that the data will overwhelm the prior, so a diffuse prior will let the data dominate posterior inference. However as showcased here, having only a small amount of observed data may allow the diffuse prior to become informative. Therefore, it would be wise to make the conscious choice to have an informative prior. However, we might specify just how informative we would like our prior beliefs to be.
We often must consider the scale of the parameters we wish to estimate. In applied problems where we know how to interpret the parameters, the scale is easier to identify. We may consider a weakly informative prior to be such that if there is a reasonably large amount of data, the likelihood will dominate the posterior and the prior is not important. This sort of prior ought to rule out unreasonable parameter values, but is not so strong as to rule out possible values which might make sense. As a general rule, it is wise to not use hard constraints unless the bounds represent true constraints (ex. bounding a prior for a variance parameter below by 0). As an example, we might think that \(\alpha\) could be between 0 and 1. Instead of setting the prrior \(\alpha \sim Unif(0,1)\), it would be wise to use a Normal(0.5, 1) prior instead. In the following, we consider some weakly informative priors for the parameters. The data we have are on unit scale, so we consider priors also on the unit scale.
In this section, we consider a \(N(0,1)\) prior.
stan_dat <- list(y = y, x=x, N=N)
fit.norm <- stan(file = "lab-04-normal_prior.stan", data = stan_dat, chains = 1, refresh = 0, iter = 2000, warmup = 500, seed=49)
alpha.norm<- as.matrix(fit.norm, pars = c("alpha"))
## Inference for Stan model: lab-04-normal_prior.
## 1 chains, each with iter=2000; warmup=500; thin=1;
## post-warmup draws per chain=1500, total post-warmup draws=1500.
##
## mean se_mean sd 2.5% 25% 50% 75% 98% n_eff Rhat
## alpha 0.57 0.02 0.54 -0.57 0.25 0.59 0.91 1.6 656 1
##
## Samples were drawn using NUTS(diag_e) at Sat Feb 8 21:03:36 2020.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
The Cauchy distribution (the Cauchy(0,1) distribution is just the t-distribution with df=1) has heavier/fatter tails than the Normal distribution.
stan_dat <- list(y = y, x=x, N=N)
fit.cauchy <- stan(file = "lab-04-cauchy_prior.stan",data = stan_dat, chains = 1, refresh = 0, iter = 2000, warmup = 500, seed=55)
alpha.cauchy<- as.matrix(fit.cauchy, pars = c("alpha"))
## Inference for Stan model: lab-04-cauchy_prior.
## 1 chains, each with iter=2000; warmup=500; thin=1;
## post-warmup draws per chain=1500, total post-warmup draws=1500.
##
## mean se_mean sd 2.5% 25% 50% 75% 98% n_eff Rhat
## alpha 0.56 0.02 0.55 -0.48 0.2 0.55 0.9 1.7 683 1
##
## Samples were drawn using NUTS(diag_e) at Sat Feb 8 21:04:08 2020.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
The following plot displays the posteriors for \(\alpha\) under these two priors
plot_dat <- create_df(alpha.norm, alpha.cauchy) %>%
mutate(distribution = if_else(distribution == "posterior", "Normal","Cauchy"))
ggplot(plot_dat, aes(alpha, fill = distribution)) +
geom_histogram(binwidth = 0.25, alpha = 0.7, position = "identity")+
geom_vline(xintercept = alpha) +
scale_fill_brewer()
The Cauchy prior allocates higher probabiity mass to extreme values as compared to the Normal prior, while still concentrating most of the posterior mass for \(\alpha\) within a desired scale.
In the previous example, the Normal and the Cauchy priors for \(\alpha\) performed relatively similarly. The true value of \(\alpha\) was 1 and the priors we used were weakly centered around 1, so we happened to choose a good scale for the parameter. Let’s examine what happens when this is not the case. We will simulate new data now with \(\alpha=10\) instead of 1. We will also double the number of data points to 10.
Considering first the same Normal prior as above: \(\alpha \sim N(0,1)\).
stan_dat <- list(y = y, x=x, N=N)
fit.norm <- stan(file = "lab-04-normal_prior.stan", data = stan_dat, chains = 1, refresh = 0, iter = 2000, warmup = 500, seed=49)
alpha.norm<- as.matrix(fit.norm, pars = c("alpha"))
prior_draws <- rnorm(1000, 0, 1)
plot_dat <- create_df(alpha.norm, prior_draws)
ggplot(plot_dat, aes(alpha, fill = distribution)) +
geom_histogram(binwidth = 0.25, alpha = 0.7, position = "identity")+
geom_vline(xintercept = alpha) +
scale_fill_brewer()
Here the prior is once again weakly-informative, but notice how the prior is dominating the likelihood. The posterior is extremely sensitive to the choice of our prior, so much so that the we do not observe posterior values close to the true \(\alpha\) at all. Instead, the posterior is concentrated around the upper extremes of the prior.
What if we use a heavier-tailed distribution like the Cauchy?
stan_dat <- list(y = y, x=x, N=N)
fit.cauchy <- stan(file = "lab-04-cauchy_prior.stan",data = stan_dat, chains = 1, refresh = 0, iter = 2000, warmup = 500, seed=55)
alpha.cauchy<- as.matrix(fit.cauchy, pars = c("alpha"))
prior_draws <- rcauchy(1000, 0, 1)
prior_draws <- prior_draws[abs(prior_draws) < 25]
plot_dat <- create_df(alpha.cauchy, prior_draws)
ggplot(plot_dat, aes(alpha, fill = distribution)) +
geom_histogram(binwidth = .5, alpha = 0.7, position = "identity")+
geom_vline(xintercept = alpha) +
scale_fill_brewer()
Notice how under this Cauchy(0,1) prior, the posterior is able to concentrate around the true \(\alpha = 10\). The heavy tails of the Cauchy allow the posterior to move beyond the scale occupied by the prior. From this histogram, it is much clearer that the prior we chose was probably inappropriate and conflicts with the data.
Pierre-Simon, marquis de Laplace was interested in modeling the rate of female births in Paris in the 18th century: \(\theta \in [0,1]\). Laplace had access to birth records in Paris between 1745 and 1770 with a total of \(n\) entries, and so a reasonable model for \(X\)= number of female entries could be \(X \sim Binomial(n,\theta)\), with \(n\) known and \(\theta \in [0,1]\). Laplace had no reason to believe that a certain value of \(\theta\) was more likely than another, so he placed a Uniform(0,1) (or Beta(1,1)) prior on \(\theta\). We may think this prior \(\theta\sim Unif(0,1)\) to be uninformative–after all, it gives equal probability to every value in the range 0 and 1:
However, what if we consider the log-odds ratio of rate of female birth: \(\eta = \log \dfrac{\theta}{1-\theta}\). By the same logic as above, Laplace should not have any reason to believe one value for the log-odds \(\eta_1\) to be any more probable than another value \(\eta_2\). Is this the case?
logit <- function(x){
ret <- log(x/(1-x))# finish function for log odds
return(ret)
}
eta <- logit(theta)
hist(eta)
Now we see that the distribution of the log-odds of \(\theta\) is most certainly not uniform. We can map directly between values \(\theta\) and \(\eta\), but the prior distributions do not reflect the same type of uncertainty.
Let us pretend that the true birth rate of females in Paris was 0.3, and let us simulate some data with \(N=10\) observations.
We know that the MLE for \(\theta\) is the proportion of successes:
In this simple Beta-Binomial model, the parameter of interest is \(\theta\) and we want to generate samples from the posterior \(f(\theta | y)\). Take a look into the prob.stan
file and see that we explicitly declare \(\theta\) to be the parameter of interest with a Uniform(0,1) (or Beta(1,1)) prior. The following chunk of code runs performs posterior inference with this data model and prior.
stan_dat <- list(y = y,N=N)
fit.bayes.prob <- stan(file = "lab-04-prob.stan", data = stan_dat, refresh = 0, iter = 2000)
print(fit.bayes.prob, pars = c("theta", "eta"))
## Inference for Stan model: lab-04-prob.
## 4 chains, each with iter=2000; warmup=1000; thin=1;
## post-warmup draws per chain=1000, total post-warmup draws=4000.
##
## mean se_mean sd 2.5% 25% 50% 75% 98% n_eff Rhat
## theta 0.42 0.00 0.14 0.17 0.32 0.41 0.51 0.68 1503 1
## eta -0.37 0.02 0.61 -1.62 -0.77 -0.34 0.04 0.77 1496 1
##
## Samples were drawn using NUTS(diag_e) at Sat Feb 8 21:04:43 2020.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
So we get a posterior mean \(E[\theta|y]\) of 0.42 (verify this is what we would expect!).
What about a uniform prior belief on the log odds \(\eta\)? First, recognize that the support of \(\eta\) is the entire real line, so a uniform prior belief on \(\eta\) corresponds to an infinite-support flat prior (as detailed above).
If we transformed variables back to probabilities, we would see that the induced prior on \(\theta\) would be a \(\text{Beta}(0, 0)\).
Let’s try to sample from this model:
fit.logodds <- stan(file = "lab-04-log_odds.stan", data = stan_dat, refresh = 0, iter = 2000)
print(fit.logodds, pars = c("theta", "eta"))
## Inference for Stan model: lab-04-log_odds.
## 4 chains, each with iter=2000; warmup=1000; thin=1;
## post-warmup draws per chain=1000, total post-warmup draws=4000.
##
## mean se_mean sd 2.5% 25% 50% 75% 98% n_eff Rhat
## theta 0.40 0.00 0.14 0.14 0.29 0.39 0.50 0.69 1557 1
## eta -0.45 0.02 0.67 -1.78 -0.87 -0.44 0.02 0.79 1501 1
##
## Samples were drawn using NUTS(diag_e) at Sat Feb 8 21:34:15 2020.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
Consider yet another parameterization of the binomial sampling model in terms of the (un-logged) odds \(\pi = \frac{\theta}{1 - \theta}\).
From our work above, we realize that a uniform prior belief about \(\pi\) does not necessarily translate into uniform priors for transformations of \(\pi\).
Jeffreys (1961) suggested a default procedure for specifying a prior distribution for any parameter in the sampling model, which he designed to be invariant under transformations:
\[ \pi(\theta) \propto \sqrt{I(\theta)} \]
where \(I(\theta)\) is the Fisher Information of the sampling model \(p(x | \theta)\):
\[ I(\theta) = - \text{E}\left[ \frac{\partial^2 \log \left( p(x | \theta) \right)}{\partial\theta^2} \right] \]
and the expectation is taken with respect to the sampling distribution \(p(x | \theta)\). It is important to recognize that because the Fisher Information is determined by the sampling model, the Jeffreys prior is also determined by the sampling model. Specifying a Jeffreys prior for this model would thus be useful.
This is meant as a quick introduction to Jeffreys prior; we will dive deeper into this in the main lectures.
10 points: 2 points for exercise 9, 1 point for the remaining exercises.
This lab was created by Jordan Bryan and Becky Tang. A part of it was adapted from this tutorial.