class: center, middle, inverse, title-slide

# STA 360/602L: Module 3.6

## Noninformative and improper priors

### Dr. Olanrewaju Michael Akande

---

## Noninformative and improper priors

- Generally, we must specify both `\(\mu_0\)` and `\(\tau_0\)` to do inference.

--

- When prior distributions have no population basis, that is, when there is no justification of the prior as "prior data", they can be difficult to construct.

--

- To that end, there is often the desire to construct .hlight[noninformative priors], with the rationale being _"to let the data speak for themselves"_.

--

- For example, we could instead assume a uniform prior on `\(\mu\)` that is constant over the real line, i.e., `\(\pi(\mu) \propto 1\)` `\(\Rightarrow\)` all values on the real line are equally likely a priori.

--

- Clearly, this is not a valid pdf since it does not integrate to 1 (or to any finite value) over the real line. Such priors are known as .hlight[improper priors].

--

- An improper prior can still be very useful; we just need to ensure it results in a .hlight[proper posterior].

---

## Jeffreys' prior

- Question: is there a prior pdf (for a given model) that would be universally accepted as a noninformative prior?

--

- Laplace proposed the uniform distribution. However, this proposal lacks invariance under monotone transformations of the parameter.

--

- For example, a uniform prior on the binomial proportion parameter `\(\theta\)` is not the same as a uniform prior on the odds parameter `\(\phi = \dfrac{\theta}{1-\theta}\)`.

--

- A more acceptable approach was introduced by Jeffreys. For single-parameter models, the .hlight[Jeffreys' prior] defines a noninformative prior density of a parameter `\(\theta\)` as
.block[
.small[
`$$\pi(\theta) \propto \sqrt{\mathcal{I}(\theta)}$$`
]
]

  where `\(\mathcal{I}(\theta)\)` is the .hlight[Fisher information] for `\(\theta\)`.

---

## Jeffreys' prior

- The Fisher information gives a way to measure the amount of information a random variable `\(Y\)` carries about an unknown parameter `\(\theta\)` of a distribution that describes `\(Y\)`.

--

- Formally, `\(\mathcal{I}(\theta)\)` is defined as
.block[
.small[
`$$\mathcal{I}(\theta) = \mathbb{E} \left[ \left( \dfrac{\partial}{\partial \theta} \textrm{log } p(y | \theta) \right)^2 \bigg| \theta \right] = \int_\mathcal{Y} \left( \dfrac{\partial}{\partial \theta} \textrm{log } p(y | \theta) \right)^2 p(y | \theta) dy.$$`
]
]

--

- Alternatively,
.block[
.small[
`$$\mathcal{I}(\theta) = - \mathbb{E} \left[ \dfrac{\partial^2}{\partial \theta^2} \textrm{log } p(y | \theta) \bigg| \theta \right] = - \int_\mathcal{Y} \left( \dfrac{\partial^2}{\partial \theta^2} \textrm{log } p(y | \theta) \right) p(y | \theta) dy.$$`
]
]

--

- It turns out that the Jeffreys' prior for `\(\mu\)` under the normal model, when `\(\sigma^2\)` is known, is
.block[
.small[
`$$\pi(\mu) \propto 1,$$`
]
]

  the uniform prior over the real line. Let's derive this on the board.

---

## Inference for mean, conditional on variance using Jeffreys' prior

- Recall that for `\(\sigma^2\)` known, the normal likelihood simplifies to
.block[
.small[
`$$\propto \ \textrm{exp}\left\{-\dfrac{1}{2} \tau n(\mu - \bar{y})^2 \right\},$$`
]
]

  ignoring everything else that does not depend on `\(\mu\)`.

- With the Jeffreys' prior `\(\pi(\mu) \propto 1\)`, can we derive the posterior distribution?
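---

## Aside: a quick numerical check

Before doing the algebra, we can sanity-check this numerically. The sketch below uses simulated data, with the sample size, true mean, and known `\(\sigma\)` all chosen arbitrarily for illustration; normalizing the likelihood kernel (times the flat prior) over a grid of `\(\mu\)` values should reproduce a `\(\mathcal{N}\left(\bar{y}, \dfrac{\sigma^2}{n}\right)\)` density.

```r
# simulated data; n, the true mean, and sigma are arbitrary choices for illustration
set.seed(360)
n <- 20; sigma <- 2
y <- rnorm(n, mean = 10, sd = sigma)
ybar <- mean(y); tau <- 1/sigma^2

# unnormalized posterior on a grid: likelihood kernel x flat prior
mu_grid <- seq(ybar - 3, ybar + 3, length.out = 1000)
post_unnorm <- exp(-0.5 * tau * n * (mu_grid - ybar)^2)
post_grid <- post_unnorm / (sum(post_unnorm) * (mu_grid[2] - mu_grid[1]))

# should be essentially zero: the grid posterior matches the N(ybar, sigma^2/n) density
max(abs(post_grid - dnorm(mu_grid, ybar, sigma/sqrt(n))))
```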
---

## Inference for mean, conditional on variance using Jeffreys' prior

- Posterior:
.block[
.small[
$$
`\begin{split}
\pi(\mu|Y,\tau) \ & \propto \ \textrm{exp}\left\{-\dfrac{1}{2} \tau n(\mu - \bar{y})^2 \right\} \pi(\mu)\\
& \propto \ \textrm{exp}\left\{-\dfrac{1}{2} \tau n(\mu - \bar{y})^2 \right\}.\\
\end{split}`
$$
]
]

--

- This is the kernel of a normal distribution with
  + mean `\(\bar{y}\)`, and
  + precision `\(n\tau\)`, or variance `\(\dfrac{1}{n\tau} = \dfrac{\sigma^2}{n}\)`.

--

- Written differently, we have `\(\mu|Y,\sigma^2 \sim \mathcal{N}\left(\bar{y},\dfrac{\sigma^2}{n}\right)\)`.

--

- <div class="question">
This should look familiar to you. Does it?
</div>

---

## Improper prior

- Let's be very objective with the prior selection. In fact, let's be extreme!

--

  + If we let the normal variance `\(\rightarrow \infty\)`, then our prior on `\(\mu\)` is `\(\propto 1\)` (recall the Jeffreys' prior on `\(\mu\)` for known `\(\sigma^2\)`).

--

  + If we let the gamma variance get very large (e.g., `\(a,b \rightarrow 0\)`), then the prior on `\(\sigma^2\)` is `\(\propto \dfrac{1}{\sigma^2}\)`.

--

- `\(\pi(\mu,\sigma^2) \propto \dfrac{1}{\sigma^2}\)` is improper (it does not integrate to 1) but does lead to a proper posterior distribution that yields inferences similar to frequentist ones.

--

- For that choice, we have
.block[
.small[
$$
`\begin{split}
\mu|Y,\tau & \sim \mathcal{N}\left(\bar{y},\dfrac{1}{n \tau}\right)\\
\tau | Y & \sim \textrm{Gamma}\left(\dfrac{n-1}{2}, \dfrac{(n-1)s^2}{2}\right)\\
\end{split}`
$$
]
]

---

## Analysis with noninformative priors

- Recall the Pygmalion data:
  + Accelerated group (A): 20, 10, 19, 15, 9, 18.
  + No growth group (N): 3, 2, 6, 10, 11, 5.

--

- Summary statistics:
  + `\(\bar{y}_A = 15.2\)`; `\(s_A = 4.71\)`.
  + `\(\bar{y}_N = 6.2\)`; `\(s_N = 3.65\)`.

--

- So our joint posterior is
.block[
.small[
$$
`\begin{split}
\mu_A | Y_A, \tau_A & \sim \ \mathcal{N}\left(\bar{y}_A,\dfrac{1}{n_A \tau_A}\right) = \mathcal{N}\left(15.2, \dfrac{1}{6\tau_A} \right)\\
\tau_A | Y_A & \sim \textrm{Gamma}\left(\dfrac{n_A-1}{2}, \dfrac{(n_A-1)s^2_A}{2}\right) = \textrm{Gamma}\left(\dfrac{6-1}{2}, \dfrac{(6-1)(22.17)}{2}\right)\\
\mu_N | Y_N, \tau_N & \sim \ \mathcal{N}\left(\bar{y}_N,\dfrac{1}{n_N \tau_N}\right) = \mathcal{N}\left(6.2, \dfrac{1}{6\tau_N} \right)\\
\tau_N | Y_N & \sim \textrm{Gamma}\left(\dfrac{n_N-1}{2}, \dfrac{(n_N-1)s^2_N}{2}\right) = \textrm{Gamma}\left(\dfrac{6-1}{2}, \dfrac{(6-1)(13.37)}{2}\right)\\
\end{split}`
$$
]
]

---

## Monte Carlo sampling

It is easy to sample from these posteriors:

```r
# gamma posterior parameters for tau_A and tau_N
aA <- (6-1)/2
aN <- (6-1)/2
bA <- (6-1)*22.17/2
bN <- (6-1)*13.37/2
# sample means, which are also the posterior means of mu_A and mu_N
muA <- 15.2
muN <- 6.2
# sample tau first, then mu given tau
tauA_postsample_impr <- rgamma(10000, aA, bA)
thetaA_postsample_impr <- rnorm(10000, muA, sqrt(1/(6*tauA_postsample_impr)))
tauN_postsample_impr <- rgamma(10000, aN, bN)
thetaN_postsample_impr <- rnorm(10000, muN, sqrt(1/(6*tauN_postsample_impr)))
sigma2A_postsample_impr <- 1/tauA_postsample_impr
sigma2N_postsample_impr <- 1/tauN_postsample_impr
```

---

## Monte Carlo sampling

- Is the average improvement for the accelerated group larger than that for the no growth group?
  + What is `\(\Pr[\mu_A > \mu_N | Y_A, Y_N]\)`?

```r
mean(thetaA_postsample_impr > thetaN_postsample_impr)
```

```
## [1] 0.9955
```

--

- Is the variance of improvement scores for the accelerated group larger than that for the no growth group?
  + What is `\(\Pr[\sigma^2_A > \sigma^2_N | Y_A, Y_N]\)`?

```r
mean(sigma2A_postsample_impr > sigma2N_postsample_impr)
```

```
## [1] 0.7007
```

--

- <div class="question">
How does the new choice of prior affect our conclusions?
</div>
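---

## Posterior of the difference in means

The same Monte Carlo samples can also summarize `\(\mu_A - \mu_N\)` directly. A minimal sketch: the object `diff_postsample_impr` below is a new name introduced here, and the 95% level is just a conventional choice.

```r
# posterior samples of the difference in means (new object, defined here)
diff_postsample_impr <- thetaA_postsample_impr - thetaN_postsample_impr

# posterior mean and an equal-tailed 95% credible interval for mu_A - mu_N
mean(diff_postsample_impr)
quantile(diff_postsample_impr, c(0.025, 0.975))
```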
---

## Recall the job training data

- Data:
  + No training group (N): sample size `\(n_N = 429\)`.
  + Training group (T): sample size `\(n_T = 185\)`.

--

- Summary statistics for change in annual earnings:
  + `\(\bar{y}_N = 1364.93\)`; `\(s_N = 7460.05\)`.
  + `\(\bar{y}_T = 4253.57\)`; `\(s_T = 8926.99\)`.

--

- <div class="question">
Using the same approach we used for the Pygmalion data, answer the questions of interest. (One possible setup is sketched in the appendix slide at the end.)
</div>

---

class: center, middle

# What's next?

### Move on to the readings for the next module!
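---

## Appendix: one possible setup for the job training data

As a reference for the exercise above, here is a sketch of the same Monte Carlo approach applied to the job training summaries; the object names are new, and the posterior parameters come directly from the summary statistics on the previous slides.

```r
# summary statistics from the slides
nN <- 429; ybarN <- 1364.93; sN <- 7460.05
nT <- 185; ybarT <- 4253.57; sT <- 8926.99

# under pi(mu, sigma^2) propto 1/sigma^2:
# tau | Y ~ Gamma((n-1)/2, (n-1)s^2/2) and mu | Y, tau ~ N(ybar, 1/(n*tau))
tauN_post <- rgamma(10000, (nN-1)/2, (nN-1)*sN^2/2)
muN_post  <- rnorm(10000, ybarN, sqrt(1/(nN*tauN_post)))
tauT_post <- rgamma(10000, (nT-1)/2, (nT-1)*sT^2/2)
muT_post  <- rnorm(10000, ybarT, sqrt(1/(nT*tauT_post)))

# for example, Pr(mu_T > mu_N | Y_T, Y_N)
mean(muT_post > muN_post)
```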