
STA 360/602L: Module 5.1

Hierarchical normal models with constant variance: two groups

Dr. Olanrewaju Michael Akande

1 / 16

Motivation

  • Sometimes, we may have a natural grouping in our data, for example

    • students within schools,
    • patients within hospitals,
    • voters within counties or states,
    • biology data, where animals are followed within natural populations organized geographically and, in some cases, socially.
  • For such grouped data, we may want to do inference across all the groups, for example, comparison of the group means.

  • Ideally, we should do so in a way that takes advantage of the relationship between observations in the same group, but we should also look to borrow information across groups when possible.

  • Hierarchical modeling provides a principled way to do so.

2 / 16

Bayes estimators and bias

  • Recall the normal model:

    y_i | \mu, \sigma^2 \overset{iid}{\sim} \mathcal{N}(\mu, \sigma^2).

  • The MLE for the population mean \mu is just the sample mean \bar{y}.

  • \bar{y} is unbiased for \mu. That is, for any data y_i \overset{iid}{\sim} \mathcal{N}(\mu, \sigma^2), E[\bar{y}] = \mu.

  • However, recall that in the conjugate normal model with known variance, for example, the posterior expectation is a weighted average of the prior mean and the sample mean (see the worked example below).

  • That is, the posterior mean is actually biased.
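
  • For instance (a hypothetical numerical illustration, using the weighted-average form of the posterior mean recalled later in this module): if \sigma^2 = 1 is known, the prior is \mu \sim \mathcal{N}(\mu_0 = 0, \gamma_0^2 = 1), and we observe n = 10 points with \bar{y} = 2, then

    E[\mu | Y] = \dfrac{\dfrac{n}{\sigma^2}\bar{y} + \dfrac{1}{\gamma_0^2}\mu_0}{\dfrac{n}{\sigma^2} + \dfrac{1}{\gamma_0^2}} = \dfrac{10 \times 2 + 1 \times 0}{10 + 1} \approx 1.82,

    which is pulled away from \bar{y} = 2 toward the prior mean 0, so that E\big[E[\mu|Y]\big] \neq \mu in general.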

3 / 16

Shrinkage

  • Usually, through the weighting of the sample data and the prior, Bayes procedures have a tendency to pull the estimate of μ toward the prior mean.

  • Of course, the magnitude of the pull depends on the sample size.

  • This "pulling" phenomenon is referred to as shrinkage.

  • Why would we ever want to do this? Why not just stick with the MLE?
  • Well, in part, because shrinkage estimators are often "more accurate" in prediction problems -- that is, they tend to do a better job of predicting a future outcome or of recovering the actual parameter values (see the small simulation sketch below). Remember the bias-variance trade-off!

  • The fact that a biased estimator would do a better job in many prediction problems can be proven rigorously, and is referred to as Stein's paradox.
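
  • A minimal simulation sketch of this idea, in Python; the values (\mu = 0.5, \sigma^2 = 1, n = 10, prior \mu \sim \mathcal{N}(0, 1)) are hypothetical and chosen only for illustration:

    import numpy as np

    rng = np.random.default_rng(seed=360)

    # Hypothetical setup: true mean near the prior mean, known variance
    mu_true, sigma2, n = 0.5, 1.0, 10           # made-up truth and sample size
    mu0, gamma02 = 0.0, 1.0                     # prior: mu ~ N(mu0, gamma02)
    n_sims = 100_000

    w = (n / sigma2) / (n / sigma2 + 1 / gamma02)   # weight on the sample mean

    mle_se, bayes_se = [], []
    for _ in range(n_sims):
        y = rng.normal(mu_true, np.sqrt(sigma2), size=n)
        ybar = y.mean()                              # MLE (unbiased)
        post_mean = w * ybar + (1 - w) * mu0         # shrinkage (biased) estimator
        mle_se.append((ybar - mu_true) ** 2)
        bayes_se.append((post_mean - mu_true) ** 2)

    # When the prior mean is not too far from the truth, the biased
    # posterior mean typically has smaller mean squared error than ybar.
    print("MSE of sample mean:    ", np.mean(mle_se))
    print("MSE of posterior mean: ", np.mean(bayes_se))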

4 / 16

Modern relevance

  • Stein's result implies, in particular, that the sample mean is an inadmissible estimator of the mean of a multivariate normal distribution in more than two dimensions -- i.e. there are other estimators that will come closer to the true value in expectation.

  • In fact, these are Bayes point estimators (the posterior expectation of the parameter μ).

  • Most of what we do now in high-dimensional statistics is develop biased estimators that perform better than unbiased ones.

  • Examples: lasso regression, ridge regression, various kinds of hierarchical Bayesian models, etc.

  • So, here we will get a very basic introduction to Bayesian hierarchical models, which provide a formal and coherent framework for constructing shrinkage estimators.
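
  • A minimal James-Stein-style illustration of this inadmissibility result, in Python; the dimension p = 10 and the true mean vector are hypothetical choices, and this is a sketch for intuition only:

    import numpy as np

    rng = np.random.default_rng(seed=602)

    # Hypothetical setup: a single observation X ~ N_p(theta, I), with p >= 3
    p = 10
    theta = np.ones(p)            # made-up true mean vector
    n_sims = 100_000

    mle_risk, js_risk = [], []
    for _ in range(n_sims):
        x = rng.normal(theta, 1.0)                    # X plays the role of the sample mean
        shrink = 1 - (p - 2) / np.sum(x ** 2)         # James-Stein shrinkage factor
        x_js = shrink * x                             # shrink X toward the origin
        mle_risk.append(np.sum((x - theta) ** 2))
        js_risk.append(np.sum((x_js - theta) ** 2))

    # In 3+ dimensions the shrunken estimator has smaller total squared-error
    # risk than X itself, illustrating Stein's inadmissibility result.
    print("Risk of X (MLE):     ", np.mean(mle_risk))
    print("Risk of James-Stein: ", np.mean(js_risk))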

5 / 16

Why hierarchical models?

  • "Bayesian hierarchical models" is a sort of catch-all phrase for a large class of models whose priors are built from several levels of conditional distributions.

  • Like simpler one-level priors, they also accomplish shrinkage. However, they are much more flexible.

  • Why use them? Several reasons:

    • We may want to exploit more complex dependence structures.
    • We may have many parameters relative to the amount of data that we have, and want to borrow information in estimating them.
    • We may want to shrink toward something other than a simple prior mean/hyper-parameter.
6 / 16

Comparing two groups

  • Suppose we want to do inference on mean body mass index (BMI) for two groups (male or female).

  • BMI is often reasonably well approximated by a normal distribution, so let's assume that here.

  • We should expect some relationship between the mean BMI for the two groups.

  • We may also think the shapes of the two distributions are roughly the same (at least as a simplifying assumption for now).

  • Thus, a reasonable model might be

    \begin{split} y_{i,male} & \overset{iid}{\sim} \mathcal{N} \left(\theta_m, \sigma^2\right); \ \ i = 1, \ldots, n_m;\\ y_{i,female} & \overset{iid}{\sim} \mathcal{N} \left(\theta_f, \sigma^2\right); \ \ i = 1, \ldots, n_f,\\ \end{split}

    but with some relationship between \theta_m and \theta_f.
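
  • As a concrete illustration of this data structure, here is a small Python sketch that simulates BMI-like data from the two-group model with a common variance; all parameter values are made up for illustration:

    import numpy as np

    rng = np.random.default_rng(seed=5)

    # Hypothetical "true" parameters, for illustration only
    theta_m, theta_f = 26.0, 24.5     # group means (male, female)
    sigma2 = 9.0                      # common variance
    n_m, n_f = 80, 120                # group sample sizes

    y_male = rng.normal(theta_m, np.sqrt(sigma2), size=n_m)
    y_female = rng.normal(theta_f, np.sqrt(sigma2), size=n_f)

    print("Sample means (male, female):", y_male.mean(), y_female.mean())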

7 / 16

Bayesian inference

  • One parameterization that can reflect some relationship between \theta_m and \theta_f is

    \begin{split} y_{i,male} & \overset{iid}{\sim} \mathcal{N} \left(\mu + \delta, \sigma^2\right); \ \ i = 1, \ldots, n_m;\\ y_{i,female} & \overset{iid}{\sim} \mathcal{N} \left(\mu - \delta, \sigma^2\right); \ \ i = 1, \ldots, n_f,\\ \end{split}

    where

    • \theta_m = \mu + \delta and \theta_f = \mu - \delta,

    • \mu = \dfrac{\theta_m + \theta_f}{2} is the average of the population means, and

    • 2\delta = \theta_m - \theta_f is the difference in population means.

8 / 16

Bayesian inference

  • Convenient prior:

    • \pi(\mu, \delta, \sigma^2) = \pi(\mu) \, \pi(\delta) \, \pi(\sigma^2), where

      • \pi(\mu) = \mathcal{N}(\mu_0, \gamma_0^2),

      • \pi(\delta) = \mathcal{N}(\delta_0, \tau_0^2), and

      • \pi(\sigma^2) = \mathcal{IG}\left(\dfrac{\nu_0}{2}, \dfrac{\nu_0\sigma_0^2}{2}\right).

9 / 16

Bayesian inference

  • Note that we can rewrite

    \begin{split} y_{i,male} & \overset{iid}{\sim} \mathcal{N} \left(\mu + \delta, \sigma^2\right); \ \ i = 1, \ldots, n_m;\\ y_{i,female} & \overset{iid}{\sim} \mathcal{N} \left(\mu - \delta, \sigma^2\right); \ \ i = 1, \ldots, n_f\\ \end{split}

    as

    \begin{split} (y_{i,male} - \delta) & \overset{iid}{\sim} \mathcal{N} \left(\mu, \sigma^2\right); \ \ i = 1, \ldots, n_m;\\ (y_{i,female} + \delta) & \overset{iid}{\sim} \mathcal{N} \left(\mu, \sigma^2\right); \ \ i = 1, \ldots, n_f\\ \end{split}

    or

    \begin{split} (y_{i,male} - \mu) & \overset{iid}{\sim} \mathcal{N} \left(\delta, \sigma^2\right); \ \ i = 1, \ldots, n_m;\\ (-1)(y_{i,female} - \mu) & \overset{iid}{\sim} \mathcal{N} \left(\delta, \sigma^2\right); \ \ i = 1, \ldots, n_f,\\ \end{split}

    as needed, so we can leverage past results for the full conditionals.

10 / 16

Full conditionals

  • To derive the full conditionals here, we will take advantage of previous results from the regular univariate normal model.

  • Recall that if we assume

    y_i \sim \mathcal{N}(\mu, \sigma^2), \ \ i = 1, \ldots, n,

    and set our priors to be

    \begin{split} \pi(\mu) & = \mathcal{N}(\mu_0, \gamma_0^2);\\ \pi(\sigma^2) & = \mathcal{IG}\left(\dfrac{\nu_0}{2}, \dfrac{\nu_0\sigma_0^2}{2}\right), \end{split}

    then we have

    \begin{split} \pi(\mu, \sigma^2 | Y) & \boldsymbol{\propto} \left\{ \prod_{i=1}^{n} p(y_{i} | \mu, \sigma^2 ) \right\} \cdot \pi(\mu) \cdot \pi(\sigma^2) \\ \end{split}

11 / 16

Full conditionals

  • We have

    \begin{split} \pi(\mu | \sigma^2, Y) & = \mathcal{N}\left(\mu_n, \gamma_n^2\right).\\ \end{split}

    where

    \begin{split} \gamma^2_n = \dfrac{1}{ \dfrac{n}{\sigma^2} + \dfrac{1}{\gamma_0^2} }; \ \ \ \ \ \ \ \ \mu_n & = \gamma^2_n \left[ \dfrac{n}{\sigma^2} \bar{y} + \dfrac{1}{\gamma_0^2} \mu_0 \right], \end{split}

  • and

    \begin{split} \pi(\sigma^2 | \mu,Y) \boldsymbol{=} \mathcal{IG}\left(\dfrac{\nu_n}{2}, \dfrac{\nu_n\sigma_n^2}{2}\right), \end{split}

    where

    \begin{split} \nu_n & = \nu_0 + n; \ \ \ \ \ \ \ \sigma_n^2 = \dfrac{1}{\nu_n} \left[ \nu_0 \sigma_0^2 + \sum^n_{i=1} (y_i - \mu)^2 \right].\\ \end{split}
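
  • A minimal Python sketch of these two univariate full conditionals; the function names, toy data, and hyperparameter values below are my own, for illustration only:

    import numpy as np

    rng = np.random.default_rng(seed=1)

    def sample_mu(y, sigma2, mu0, gamma02):
        """Draw mu from its full conditional N(mu_n, gamma_n^2)."""
        n = len(y)
        gamma_n2 = 1 / (n / sigma2 + 1 / gamma02)
        mu_n = gamma_n2 * ((n / sigma2) * np.mean(y) + mu0 / gamma02)
        return rng.normal(mu_n, np.sqrt(gamma_n2))

    def sample_sigma2(y, mu, nu0, sigma02):
        """Draw sigma^2 from its full conditional IG(nu_n/2, nu_n*sigma_n^2/2)."""
        nu_n = nu0 + len(y)
        rate = (nu0 * sigma02 + np.sum((y - mu) ** 2)) / 2   # = nu_n * sigma_n^2 / 2
        # If X ~ Gamma(shape, rate), then 1/X ~ Inverse-Gamma(shape, rate)
        return 1 / rng.gamma(nu_n / 2, scale=1 / rate)

    # Toy usage with made-up data and hyperparameters
    y = rng.normal(1.0, 1.0, size=50)
    print(sample_mu(y, sigma2=1.0, mu0=0.0, gamma02=100.0))
    print(sample_sigma2(y, mu=1.0, nu0=1.0, sigma02=1.0))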

12 / 16

Full conditionals

  • With \pi(\mu) = \mathcal{N}(\mu_0, \gamma_0^2), and

    \begin{split} (y_{i,male} - \delta) & \overset{iid}{\sim} \mathcal{N} \left(\mu, \sigma^2\right); \ \ i = 1, \ldots, n_m;\\ (y_{i,female} + \delta) & \overset{iid}{\sim} \mathcal{N} \left(\mu, \sigma^2\right); \ \ i = 1, \ldots, n_f,\\ \end{split}

    we have

    \begin{split} \mu | Y, \delta, \sigma^2 & \sim \mathcal{N}(\mu_n, \gamma_n^2), \ \ \ \text{where}\\ \\ \gamma_n^2 & = \dfrac{1}{\dfrac{1}{\gamma_0^2} + \dfrac{n_m + n_f}{\sigma^2} }\\ \\ \mu_n & = \gamma_n^2 \left[\dfrac{\mu_0}{\gamma_0^2} + \dfrac{ \sum\limits_{i=1}^{n_m} (y_{i,male} - \delta) + \sum\limits_{i=1}^{n_f} (y_{i,female} + \delta) }{\sigma^2} \right].\\ \end{split}
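
  • A small Python sketch of this two-group update for \mu; the function name is my own, for illustration:

    import numpy as np

    rng = np.random.default_rng(seed=2)

    def sample_mu(y_male, y_female, delta, sigma2, mu0, gamma02):
        """Draw mu from the full conditional N(mu_n, gamma_n^2) above."""
        n_m, n_f = len(y_male), len(y_female)
        gamma_n2 = 1 / (1 / gamma02 + (n_m + n_f) / sigma2)
        s = np.sum(y_male - delta) + np.sum(y_female + delta)
        mu_n = gamma_n2 * (mu0 / gamma02 + s / sigma2)
        return rng.normal(mu_n, np.sqrt(gamma_n2))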

13 / 16

Full conditionals

  • With \pi(\delta) = \mathcal{N}(\delta_0, \tau_0^2), and

    \begin{split} (y_{i,male} - \mu) & \overset{iid}{\sim} \mathcal{N} \left(\delta, \sigma^2\right); \ \ i = 1, \ldots, n_m;\\ (-1) (y_{i,female} - \mu) & \overset{iid}{\sim} \mathcal{N} \left(\delta, \sigma^2\right); \ \ i = 1, \ldots, n_f,\\ \end{split}

    we have

    \begin{split} \delta | Y, \mu, \sigma^2 & \sim \mathcal{N}(\delta_n, \tau_n^2), \ \ \ \text{where}\\ \\ \tau_n^2 & = \dfrac{1}{\dfrac{1}{\tau_0^2} + \dfrac{n_m + n_f}{\sigma^2} }\\ \\ \delta_n & = \tau_n^2 \left[\dfrac{\delta_0}{\tau_0^2} + \dfrac{\sum\limits_{i=1}^{n_m} (y_{i,male} - \mu) + (-1) \sum\limits_{i=1}^{n_f} (y_{i,female} - \mu) }{\sigma^2} \right].\\ \end{split}
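
  • A small Python sketch of this update for \delta; again, the function name is my own, for illustration:

    import numpy as np

    rng = np.random.default_rng(seed=3)

    def sample_delta(y_male, y_female, mu, sigma2, delta0, tau02):
        """Draw delta from the full conditional N(delta_n, tau_n^2) above."""
        n_m, n_f = len(y_male), len(y_female)
        tau_n2 = 1 / (1 / tau02 + (n_m + n_f) / sigma2)
        s = np.sum(y_male - mu) - np.sum(y_female - mu)
        delta_n = tau_n2 * (delta0 / tau02 + s / sigma2)
        return rng.normal(delta_n, np.sqrt(tau_n2))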

14 / 16

Full conditionals

  • With \pi(\sigma^2) = \mathcal{IG}(\dfrac{\nu_0}{2}, \dfrac{\nu_0 \sigma_0^2}{2}), and

    \begin{split} y_{i,male} & \overset{iid}{\sim} \mathcal{N} \left(\mu + \delta, \sigma^2\right); \ \ i = 1, \ldots, n_m;\\ y_{i,female} & \overset{iid}{\sim} \mathcal{N} \left(\mu - \delta, \sigma^2\right); \ \ i = 1, \ldots, n_f\\ \end{split}

    we have

    \begin{split} \sigma^2 | Y, \mu, \delta & \sim \mathcal{IG}(\dfrac{\nu_n}{2}, \dfrac{\nu_n \sigma_n^2}{2}), \ \ \ \text{where}\\ \\ \nu_n & = \nu_0 + n_m + n_f\\ \\ \sigma_n^2 & = \dfrac{1}{\nu_n} \left[\nu_0\sigma_0^2 + \sum\limits_{i=1}^{n_m} (y_{i,male} - [\mu + \delta])^2 + \sum\limits_{i=1}^{n_f} (y_{i,female} - [\mu - \delta])^2 \right].\\ \end{split}

  • We will write a Gibbs sampler for this model and fit it to real data in the next module; a rough sketch of such a sampler is given below.
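
  • A minimal, self-contained Python sketch of that Gibbs sampler, cycling through the three full conditionals above; the simulated data and all hyperparameter values are hypothetical placeholders, not the real data used in the next module:

    import numpy as np

    rng = np.random.default_rng(seed=360)

    # --- Hypothetical data (a stand-in for the real data used next module) ---
    n_m, n_f = 80, 120
    y_male = rng.normal(26.0, 3.0, size=n_m)
    y_female = rng.normal(24.5, 3.0, size=n_f)

    # --- Hypothetical hyperparameters ---
    mu0, gamma02 = 25.0, 100.0      # prior on mu
    delta0, tau02 = 0.0, 100.0      # prior on delta
    nu0, sigma02 = 1.0, 10.0        # prior on sigma^2

    # --- Gibbs sampler ---
    S = 5000
    mu = (y_male.mean() + y_female.mean()) / 2
    delta = (y_male.mean() - y_female.mean()) / 2
    sigma2 = np.var(np.concatenate([y_male, y_female]))
    samples = np.empty((S, 3))

    for s in range(S):
        # mu | Y, delta, sigma^2
        gamma_n2 = 1 / (1 / gamma02 + (n_m + n_f) / sigma2)
        mu_n = gamma_n2 * (mu0 / gamma02
                           + (np.sum(y_male - delta) + np.sum(y_female + delta)) / sigma2)
        mu = rng.normal(mu_n, np.sqrt(gamma_n2))

        # delta | Y, mu, sigma^2
        tau_n2 = 1 / (1 / tau02 + (n_m + n_f) / sigma2)
        delta_n = tau_n2 * (delta0 / tau02
                            + (np.sum(y_male - mu) - np.sum(y_female - mu)) / sigma2)
        delta = rng.normal(delta_n, np.sqrt(tau_n2))

        # sigma^2 | Y, mu, delta
        nu_n = nu0 + n_m + n_f
        ss = (nu0 * sigma02
              + np.sum((y_male - (mu + delta)) ** 2)
              + np.sum((y_female - (mu - delta)) ** 2))
        sigma2 = 1 / rng.gamma(nu_n / 2, scale=2 / ss)   # Inverse-Gamma(nu_n/2, ss/2)

        samples[s] = mu, delta, sigma2

    # Posterior summaries after discarding a burn-in period
    print("Posterior means (mu, delta, sigma2):", samples[1000:].mean(axis=0))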

15 / 16

What's next?

Move on to the readings for the next module!

16 / 16
