
STA 360/602L: Module 5.1

Hierarchical normal models with constant variance: two groups

Dr. Olanrewaju Michael Akande

1 / 16

Motivation

  • Sometimes, we may have a natural grouping in our data, for example

    • students within schools,
    • patients within hospitals,
    • voters within counties or states,
    • biology data, where animals are followed within natural populations organized geographically and, in some cases, socially.
  • For such grouped data, we may want to do inference across all the groups, for example, comparison of the group means.

  • Ideally, we should do so in a way that takes advantage of the relationship between observations in the same group, but we should also look to borrow information across groups when possible.

  • Hierarchical modeling provides a principled way to do so.

2 / 16

Bayes estimators and bias

  • Recall the normal model:

    y_i | \mu, \sigma^2 \overset{iid}{\sim} \mathcal{N}(\mu, \sigma^2).

  • The MLE for the population mean \mu is just the sample mean \bar{y}.

  • \bar{y} is unbiased for \mu. That is, for any data y_i \overset{iid}{\sim} \mathcal{N}(\mu, \sigma^2), E[\bar{y}] = \mu.

  • However, recall that in the conjugate normal model with known variance, for example, the posterior expectation is a weighted average of the prior mean and the sample mean (see the worked example below).

  • That is, the posterior mean is actually biased.
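
  • For instance (a hypothetical numerical illustration, using the weighted-average form of the posterior mean recalled later in this module): if \sigma^2 = 1 is known, the prior is \mu \sim \mathcal{N}(\mu_0 = 0, \gamma_0^2 = 1), and we observe n = 10 points with \bar{y} = 2, then

    E[\mu | Y] = \dfrac{\dfrac{n}{\sigma^2}\bar{y} + \dfrac{1}{\gamma_0^2}\mu_0}{\dfrac{n}{\sigma^2} + \dfrac{1}{\gamma_0^2}} = \dfrac{10 \times 2 + 1 \times 0}{10 + 1} \approx 1.82,

    which is pulled away from \bar{y} = 2 toward the prior mean 0, so that E\big[E[\mu|Y]\big] \neq \mu in general.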

3 / 16

Shrinkage

  • Usually, through the weighting of the sample data and the prior, Bayes procedures have a tendency to pull the estimate of μ toward the prior mean.

  • Of course, the magnitude of the pull depends on the sample size.

  • This "pulling" phenomenon is referred to as shrinkage.

  • Why would we ever want to do this? Why not just stick with the MLE?
  • Well, in part, because shrinkage estimators are often "more accurate" in prediction problems -- that is, they tend to do a better job of predicting a future outcome or of recovering the actual parameter values (see the small simulation sketch below). Remember the bias-variance trade-off!

  • The fact that a biased estimator would do a better job in many prediction problems can be proven rigorously, and is referred to as Stein's paradox.
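
  • A minimal simulation sketch of this idea, in Python; the values (\mu = 0.5, \sigma^2 = 1, n = 10, prior \mu \sim \mathcal{N}(0, 1)) are hypothetical and chosen only for illustration:

    import numpy as np

    rng = np.random.default_rng(seed=360)

    # Hypothetical setup: true mean near the prior mean, known variance
    mu_true, sigma2, n = 0.5, 1.0, 10           # made-up truth and sample size
    mu0, gamma02 = 0.0, 1.0                     # prior: mu ~ N(mu0, gamma02)
    n_sims = 100_000

    w = (n / sigma2) / (n / sigma2 + 1 / gamma02)   # weight on the sample mean

    mle_se, bayes_se = [], []
    for _ in range(n_sims):
        y = rng.normal(mu_true, np.sqrt(sigma2), size=n)
        ybar = y.mean()                              # MLE (unbiased)
        post_mean = w * ybar + (1 - w) * mu0         # shrinkage (biased) estimator
        mle_se.append((ybar - mu_true) ** 2)
        bayes_se.append((post_mean - mu_true) ** 2)

    # When the prior mean is not too far from the truth, the biased
    # posterior mean typically has smaller mean squared error than ybar.
    print("MSE of sample mean:    ", np.mean(mle_se))
    print("MSE of posterior mean: ", np.mean(bayes_se))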

4 / 16

Modern relevance

  • Stein's result implies, in particular, that the sample mean is an inadmissible estimator of the mean of a multivariate normal distribution in more than two dimensions -- i.e. there are other estimators that will come closer to the true value in expectation.

  • In fact, these are Bayes point estimators (the posterior expectation of the parameter μ).

  • Most of what we do now in high-dimensional statistics is develop biased estimators that perform better than unbiased ones.

  • Examples: lasso regression, ridge regression, various kinds of hierarchical Bayesian models, etc.

  • So, here we will get a very basic introduction to Bayesian hierarchical models, which provide a formal and coherent framework for constructing shrinkage estimators.
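
  • A minimal James-Stein-style illustration of this inadmissibility result, in Python; the dimension p = 10 and the true mean vector are hypothetical choices, and this is a sketch for intuition only:

    import numpy as np

    rng = np.random.default_rng(seed=602)

    # Hypothetical setup: a single observation X ~ N_p(theta, I), with p >= 3
    p = 10
    theta = np.ones(p)            # made-up true mean vector
    n_sims = 100_000

    mle_risk, js_risk = [], []
    for _ in range(n_sims):
        x = rng.normal(theta, 1.0)                    # X plays the role of the sample mean
        shrink = 1 - (p - 2) / np.sum(x ** 2)         # James-Stein shrinkage factor
        x_js = shrink * x                             # shrink X toward the origin
        mle_risk.append(np.sum((x - theta) ** 2))
        js_risk.append(np.sum((x_js - theta) ** 2))

    # In 3+ dimensions the shrunken estimator has smaller total squared-error
    # risk than X itself, illustrating Stein's inadmissibility result.
    print("Risk of X (MLE):     ", np.mean(mle_risk))
    print("Risk of James-Stein: ", np.mean(js_risk))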

5 / 16

Why hierarchical models?

  • "Bayesian hierarchical models" is a sort of catch-all phrase for a large class of models whose priors are built from several levels of conditional distributions.

  • Like simpler one-level priors, they also accomplish shrinkage. However, they are much more flexible.

  • Why use them? Several reasons:

    • We may want to exploit more complex dependence structures.
    • We may have many parameters relative to the amount of data that we have, and want to borrow information in estimating them.
    • We may want to shrink toward something other than a simple prior mean/hyper-parameter.
6 / 16

Comparing two groups

  • Suppose we want to do inference on mean body mass index (BMI) for two groups (male or female).

  • BMI is often reasonably well approximated by a normal distribution, so let's assume that here.

  • We should expect some relationship between the mean BMI for the two groups.

  • We may also think the shapes of the two distributions are roughly the same (at least as a simplifying assumption for now).

  • Thus, a reasonable model might be

    \begin{split} y_{i,male} & \overset{iid}{\sim} \mathcal{N} \left(\theta_m, \sigma^2\right); \ \ i = 1, \ldots, n_m;\\ y_{i,female} & \overset{iid}{\sim} \mathcal{N} \left(\theta_f, \sigma^2\right); \ \ i = 1, \ldots, n_f,\\ \end{split}

    but with some relationship between \theta_m and \theta_f.
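
  • As a concrete illustration of this data structure, here is a small Python sketch that simulates BMI-like data from the two-group model with a common variance; all parameter values are made up for illustration:

    import numpy as np

    rng = np.random.default_rng(seed=5)

    # Hypothetical "true" parameters, for illustration only
    theta_m, theta_f = 26.0, 24.5     # group means (male, female)
    sigma2 = 9.0                      # common variance
    n_m, n_f = 80, 120                # group sample sizes

    y_male = rng.normal(theta_m, np.sqrt(sigma2), size=n_m)
    y_female = rng.normal(theta_f, np.sqrt(sigma2), size=n_f)

    print("Sample means (male, female):", y_male.mean(), y_female.mean())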

7 / 16

Bayesian inference

  • One parameterization that can reflect some relationship between \theta_m and \theta_f is

    \begin{split} y_{i,male} & \overset{iid}{\sim} \mathcal{N} \left(\mu + \delta, \sigma^2\right); \ \ i = 1, \ldots, n_m;\\ y_{i,female} & \overset{iid}{\sim} \mathcal{N} \left(\mu - \delta, \sigma^2\right); \ \ i = 1, \ldots, n_f,\\ \end{split}

    where

    • \theta_m = \mu + \delta and \theta_f = \mu - \delta,

    • \mu = \dfrac{\theta_m + \theta_f}{2} is the average of the population means, and

    • 2\delta = \theta_m - \theta_f is the difference in population means.

8 / 16

Bayesian inference

  • Convenient prior:

    • \pi(\mu, \delta, \sigma^2) = \pi(\mu) \, \pi(\delta) \, \pi(\sigma^2), where

      • \pi(\mu) = \mathcal{N}(\mu_0, \gamma_0^2),

      • \pi(\delta) = \mathcal{N}(\delta_0, \tau_0^2), and

      • \pi(\sigma^2) = \mathcal{IG}\left(\dfrac{\nu_0}{2}, \dfrac{\nu_0\sigma_0^2}{2}\right).

9 / 16

Bayesian inference

  • Note that we can rewrite

    \begin{split} y_{i,male} & \overset{iid}{\sim} \mathcal{N} \left(\mu + \delta, \sigma^2\right); \ \ i = 1, \ldots, n_m;\\ y_{i,female} & \overset{iid}{\sim} \mathcal{N} \left(\mu - \delta, \sigma^2\right); \ \ i = 1, \ldots, n_f\\ \end{split}

    as

    \begin{split} (y_{i,male} - \delta) & \overset{iid}{\sim} \mathcal{N} \left(\mu, \sigma^2\right); \ \ i = 1, \ldots, n_m;\\ (y_{i,female} + \delta) & \overset{iid}{\sim} \mathcal{N} \left(\mu, \sigma^2\right); \ \ i = 1, \ldots, n_f\\ \end{split}

    or

    \begin{split} (y_{i,male} - \mu) & \overset{iid}{\sim} \mathcal{N} \left(\delta, \sigma^2\right); \ \ i = 1, \ldots, n_m;\\ (-1)(y_{i,female} - \mu) & \overset{iid}{\sim} \mathcal{N} \left(\delta, \sigma^2\right); \ \ i = 1, \ldots, n_f,\\ \end{split}

    as needed, so we can leverage past results for the full conditionals.

10 / 16

Full conditionals

  • To derive the full conditionals here, we will take advantage of previous results from the regular univariate normal model.

  • Recall that if we assume

    y_i \sim \mathcal{N}(\mu, \sigma^2), \ \ i = 1, \ldots, n,

    and set our priors to be

    \begin{split} \pi(\mu) & = \mathcal{N}(\mu_0, \gamma_0^2);\\ \pi(\sigma^2) & = \mathcal{IG}\left(\dfrac{\nu_0}{2}, \dfrac{\nu_0\sigma_0^2}{2}\right), \end{split}

    then we have

    \begin{split} \pi(\mu, \sigma^2 | Y) & \boldsymbol{\propto} \left\{ \prod_{i=1}^{n} p(y_{i} | \mu, \sigma^2 ) \right\} \cdot \pi(\mu) \cdot \pi(\sigma^2) \\ \end{split}

11 / 16

Full conditionals

  • We have

    \begin{split} \pi(\mu | \sigma^2, Y) & = \mathcal{N}\left(\mu_n, \gamma_n^2\right).\\ \end{split}

    where

    \begin{split} \gamma^2_n = \dfrac{1}{ \dfrac{n}{\sigma^2} + \dfrac{1}{\gamma_0^2} }; \ \ \ \ \ \ \ \ \mu_n & = \gamma^2_n \left[ \dfrac{n}{\sigma^2} \bar{y} + \dfrac{1}{\gamma_0^2} \mu_0 \right], \end{split}

  • and

    \begin{split} \pi(\sigma^2 | \mu,Y) \boldsymbol{=} \mathcal{IG}\left(\dfrac{\nu_n}{2}, \dfrac{\nu_n\sigma_n^2}{2}\right), \end{split}

    where

    \begin{split} \nu_n & = \nu_0 + n; \ \ \ \ \ \ \ \sigma_n^2 = \dfrac{1}{\nu_n} \left[ \nu_0 \sigma_0^2 + \sum^n_{i=1} (y_i - \mu)^2 \right].\\ \end{split}
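
  • A minimal Python sketch of these two univariate full conditionals; the function names, toy data, and hyperparameter values below are my own, for illustration only:

    import numpy as np

    rng = np.random.default_rng(seed=1)

    def sample_mu(y, sigma2, mu0, gamma02):
        """Draw mu from its full conditional N(mu_n, gamma_n^2)."""
        n = len(y)
        gamma_n2 = 1 / (n / sigma2 + 1 / gamma02)
        mu_n = gamma_n2 * ((n / sigma2) * np.mean(y) + mu0 / gamma02)
        return rng.normal(mu_n, np.sqrt(gamma_n2))

    def sample_sigma2(y, mu, nu0, sigma02):
        """Draw sigma^2 from its full conditional IG(nu_n/2, nu_n*sigma_n^2/2)."""
        nu_n = nu0 + len(y)
        rate = (nu0 * sigma02 + np.sum((y - mu) ** 2)) / 2   # = nu_n * sigma_n^2 / 2
        # If X ~ Gamma(shape, rate), then 1/X ~ Inverse-Gamma(shape, rate)
        return 1 / rng.gamma(nu_n / 2, scale=1 / rate)

    # Toy usage with made-up data and hyperparameters
    y = rng.normal(1.0, 1.0, size=50)
    print(sample_mu(y, sigma2=1.0, mu0=0.0, gamma02=100.0))
    print(sample_sigma2(y, mu=1.0, nu0=1.0, sigma02=1.0))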

12 / 16

Full conditionals

  • With \pi(\mu) = \mathcal{N}(\mu_0, \gamma_0^2), and

    \begin{split} (y_{i,male} - \delta) & \overset{iid}{\sim} \mathcal{N} \left(\mu, \sigma^2\right); \ \ i = 1, \ldots, n_m;\\ (y_{i,female} + \delta) & \overset{iid}{\sim} \mathcal{N} \left(\mu, \sigma^2\right); \ \ i = 1, \ldots, n_f,\\ \end{split}

    we have

    \begin{split} \mu | Y, \delta, \sigma^2 & \sim \mathcal{N}(\mu_n, \gamma_n^2), \ \ \ \text{where}\\ \\ \gamma_n^2 & = \dfrac{1}{\dfrac{1}{\gamma_0^2} + \dfrac{n_m + n_f}{\sigma^2} }\\ \\ \mu_n & = \gamma_n^2 \left[\dfrac{\mu_0}{\gamma_0^2} + \dfrac{ \sum\limits_{i=1}^{n_m} (y_{i,male} - \delta) + \sum\limits_{i=1}^{n_f} (y_{i,female} + \delta) }{\sigma^2} \right].\\ \end{split}
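
  • A small Python sketch of this two-group update for \mu; the function name is my own, for illustration:

    import numpy as np

    rng = np.random.default_rng(seed=2)

    def sample_mu(y_male, y_female, delta, sigma2, mu0, gamma02):
        """Draw mu from the full conditional N(mu_n, gamma_n^2) above."""
        n_m, n_f = len(y_male), len(y_female)
        gamma_n2 = 1 / (1 / gamma02 + (n_m + n_f) / sigma2)
        s = np.sum(y_male - delta) + np.sum(y_female + delta)
        mu_n = gamma_n2 * (mu0 / gamma02 + s / sigma2)
        return rng.normal(mu_n, np.sqrt(gamma_n2))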

13 / 16

Full conditionals

  • With \pi(\delta) = \mathcal{N}(\delta_0, \tau_0^2), and

    \begin{split} (y_{i,male} - \mu) & \overset{iid}{\sim} \mathcal{N} \left(\delta, \sigma^2\right); \ \ i = 1, \ldots, n_m;\\ (-1) (y_{i,female} - \mu) & \overset{iid}{\sim} \mathcal{N} \left(\delta, \sigma^2\right); \ \ i = 1, \ldots, n_f,\\ \end{split}

    we have

    \begin{split} \delta | Y, \mu, \sigma^2 & \sim \mathcal{N}(\delta_n, \tau_n^2), \ \ \ \text{where}\\ \\ \tau_n^2 & = \dfrac{1}{\dfrac{1}{\tau_0^2} + \dfrac{n_m + n_f}{\sigma^2} }\\ \\ \delta_n & = \tau_n^2 \left[\dfrac{\delta_0}{\tau_0^2} + \dfrac{\sum\limits_{i=1}^{n_m} (y_{i,male} - \mu) + (-1) \sum\limits_{i=1}^{n_f} (y_{i,female} - \mu) }{\sigma^2} \right].\\ \end{split}
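
  • A small Python sketch of this update for \delta; again, the function name is my own, for illustration:

    import numpy as np

    rng = np.random.default_rng(seed=3)

    def sample_delta(y_male, y_female, mu, sigma2, delta0, tau02):
        """Draw delta from the full conditional N(delta_n, tau_n^2) above."""
        n_m, n_f = len(y_male), len(y_female)
        tau_n2 = 1 / (1 / tau02 + (n_m + n_f) / sigma2)
        s = np.sum(y_male - mu) - np.sum(y_female - mu)
        delta_n = tau_n2 * (delta0 / tau02 + s / sigma2)
        return rng.normal(delta_n, np.sqrt(tau_n2))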

14 / 16

Full conditionals

  • With \pi(\sigma^2) = \mathcal{IG}(\dfrac{\nu_0}{2}, \dfrac{\nu_0 \sigma_0^2}{2}), and

    \begin{split} y_{i,male} & \overset{iid}{\sim} \mathcal{N} \left(\mu + \delta, \sigma^2\right); \ \ i = 1, \ldots, n_m;\\ y_{i,female} & \overset{iid}{\sim} \mathcal{N} \left(\mu - \delta, \sigma^2\right); \ \ i = 1, \ldots, n_f\\ \end{split}

    we have

    \begin{split} \sigma^2 | Y, \mu, \delta & \sim \mathcal{IG}(\dfrac{\nu_n}{2}, \dfrac{\nu_n \sigma_n^2}{2}), \ \ \ \text{where}\\ \\ \nu_n & = \nu_0 + n_m + n_f\\ \\ \sigma_n^2 & = \dfrac{1}{\nu_n} \left[\nu_0\sigma_0^2 + \sum\limits_{i=1}^{n_m} (y_{i,male} - [\mu + \delta])^2 + \sum\limits_{i=1}^{n_f} (y_{i,female} - [\mu - \delta])^2 \right].\\ \end{split}

  • We will write a Gibbs sampler for this model and fit it to real data in the next module; a rough sketch of such a sampler is given below.
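
  • A minimal, self-contained Python sketch of that Gibbs sampler, cycling through the three full conditionals above; the simulated data and all hyperparameter values are hypothetical placeholders, not the real data used in the next module:

    import numpy as np

    rng = np.random.default_rng(seed=360)

    # --- Hypothetical data (a stand-in for the real data used next module) ---
    n_m, n_f = 80, 120
    y_male = rng.normal(26.0, 3.0, size=n_m)
    y_female = rng.normal(24.5, 3.0, size=n_f)

    # --- Hypothetical hyperparameters ---
    mu0, gamma02 = 25.0, 100.0      # prior on mu
    delta0, tau02 = 0.0, 100.0      # prior on delta
    nu0, sigma02 = 1.0, 10.0        # prior on sigma^2

    # --- Gibbs sampler ---
    S = 5000
    mu = (y_male.mean() + y_female.mean()) / 2
    delta = (y_male.mean() - y_female.mean()) / 2
    sigma2 = np.var(np.concatenate([y_male, y_female]))
    samples = np.empty((S, 3))

    for s in range(S):
        # mu | Y, delta, sigma^2
        gamma_n2 = 1 / (1 / gamma02 + (n_m + n_f) / sigma2)
        mu_n = gamma_n2 * (mu0 / gamma02
                           + (np.sum(y_male - delta) + np.sum(y_female + delta)) / sigma2)
        mu = rng.normal(mu_n, np.sqrt(gamma_n2))

        # delta | Y, mu, sigma^2
        tau_n2 = 1 / (1 / tau02 + (n_m + n_f) / sigma2)
        delta_n = tau_n2 * (delta0 / tau02
                            + (np.sum(y_male - mu) - np.sum(y_female - mu)) / sigma2)
        delta = rng.normal(delta_n, np.sqrt(tau_n2))

        # sigma^2 | Y, mu, delta
        nu_n = nu0 + n_m + n_f
        ss = (nu0 * sigma02
              + np.sum((y_male - (mu + delta)) ** 2)
              + np.sum((y_female - (mu - delta)) ** 2))
        sigma2 = 1 / rng.gamma(nu_n / 2, scale=2 / ss)   # Inverse-Gamma(nu_n/2, ss/2)

        samples[s] = mu, delta, sigma2

    # Posterior summaries after discarding a burn-in period
    print("Posterior means (mu, delta, sigma2):", samples[1000:].mean(axis=0))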

15 / 16

What's next?

Move on to the readings for the next module!

16 / 16
