
STA 360/602L: Module 5.3

Hierarchical normal models with constant variance: multiple groups

Dr. Olanrewaju Michael Akande


Comparing multiple groups

  • Suppose we wish to investigate the mean (and distribution) of test scores for students at $J$ different high schools.

  • In each school $j$, where $j = 1, \ldots, J$, suppose we test a random sample of $n_j$ students.

  • Let $y_{ij}$ be the test score for the $i$th student in school $j$, with $i = 1, \ldots, n_j$, and

    $$y_{ij} | \theta_j, \sigma_j^2 \sim \mathcal{N}(\theta_j, \sigma_j^2),$$

    where for each school $j$, $\theta_j$ is the school-wide average test score, and $\sigma_j^2$ is the school-wide variance of individual test scores.

  • This is what we did for the Pygmalion study and job training data.


School testing example

  • Option I: Classical inference for each school can be based on a large-sample 95% CI: $\bar{y}_j \pm 1.96 \sqrt{s_j^2 / n_j}$, where $\bar{y}_j$ is the sample average in school $j$, and $s_j^2$ is the sample variance in school $j$.

  • Clearly, we can overfit the data within schools. For example, what if we only have 4 students from one of the schools? $\bar{y}_j$ can be a good estimate if $n_j$ is large, but it may be poor if $n_j$ is small.

  • Option II: Alternatively, we might believe that $\theta_j = \mu$ for all $j$; that is, all schools have the same mean. This is the assumption (null hypothesis) in ANOVA models, for example. We can also set $\sigma_j^2 = \sigma^2$ for all $j$.

  • Option I ignores that the $\theta_j$'s should be reasonably similar, whereas Option II ignores any differences between them.

  • It would be nice to find a compromise! Borrowing information across schools and shrinking our estimates towards a grand mean could be very useful here.
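
To make the two extremes concrete, here is a minimal Python sketch (standard library only) that computes the Option I interval for each school and the single pooled mean of Option II. The scores and the function names `school_ci` and `pooled_mean` are hypothetical, chosen only for illustration.

```python
import math
import statistics

def school_ci(scores, z=1.96):
    """Option I: large-sample CI for one school, ybar +/- z * sqrt(s^2 / n)."""
    n = len(scores)
    ybar = statistics.mean(scores)
    s2 = statistics.variance(scores)  # sample variance (divisor n - 1)
    half_width = z * math.sqrt(s2 / n)
    return ybar - half_width, ybar + half_width

def pooled_mean(schools):
    """Option II: one common mean, pooling every student from every school."""
    all_scores = [y for scores in schools for y in scores]
    return statistics.mean(all_scores)

# Hypothetical scores: a school with 4 sampled students and one with 12.
schools = [
    [52.0, 61.0, 47.0, 58.0],
    [50.0, 55.0, 49.0, 53.0, 51.0, 56.0, 48.0, 54.0, 52.0, 50.0, 57.0, 49.0],
]
intervals = [school_ci(scores) for scores in schools]
grand = pooled_mean(schools)
```

With these made-up numbers, the interval for the 4-student school is much wider than the one for the 12-student school, while the pooled mean (52.625) ignores which school a student attends.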


School testing example

  • For the Pygmalion study and job training data, we focused on using priors that are independent between the groups.

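  • For example, in the conjugate case, we would have

    $$\pi(\theta_j, \sigma_j^2) = \pi(\theta_j | \sigma_j^2) \cdot \pi(\sigma_j^2) = \mathcal{N}\left(\mu_0, \dfrac{\sigma_j^2}{\kappa_0}\right) \cdot \mathcal{IG}\left(\dfrac{\nu_0}{2}, \dfrac{\nu_0 \sigma_0^2}{2}\right)$$

    for some hyperparameters (constants) $\mu_0$, $\kappa_0$, $\nu_0$, and $\sigma_0^2$.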

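  • In the semi-conjugate case,

    $$\pi(\theta_j, \sigma_j^2) = \pi(\theta_j) \cdot \pi(\sigma_j^2) = \mathcal{N}\left(\mu_0, \sigma_0^2\right) \cdot \mathcal{IG}\left(\dfrac{\nu_0}{2}, \dfrac{\nu_0 \gamma_0^2}{2}\right)$$

    for some hyperparameters (constants) $\mu_0$, $\sigma_0^2$, $\nu_0$, and $\gamma_0^2$.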


Hierarchical normal model

  • Instead, we can think of the schools themselves as a random sample from all possible schools, and assume that the $\theta_j$'s are drawn from a common distribution.

  • For now, assume the variance is constant across schools. The hierarchical normal model assumes normal sampling models both within and between groups:

    $$\begin{split} y_{ij} | \theta_j, \sigma^2 & \sim \mathcal{N}(\theta_j, \sigma^2); \ \ \ i = 1, \ldots, n_j \\ \theta_j | \mu, \tau^2 & \sim \mathcal{N}(\mu, \tau^2); \ \ \ j = 1, \ldots, J, \end{split}$$

    which gives us an extra level in the prior on the means, and leads to sharing of information across the groups in estimating the group-specific means.

  • We have an extra variance parameter $\tau^2$. Comparing $\tau^2$ to $\sigma^2$ tells us how much of the variation in $Y$ is due to between-group versus within-group variation.
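
One way to get a feel for the two levels is to simulate from the model. The sketch below (standard library only) draws school means from the between-group level and then scores from the within-group level. The function names, group sizes, and parameter values are hypothetical. The helper `between_share` reports $\tau^2 / (\tau^2 + \sigma^2)$, the share of the marginal variance of a score attributable to differences between schools, which is one common way to compare the two variances.

```python
import random

def simulate_schools(group_sizes, mu, tau2, sigma2, seed=0):
    """Simulate theta_j ~ N(mu, tau2), then y_ij | theta_j ~ N(theta_j, sigma2)."""
    rng = random.Random(seed)
    # Between-group level: one mean per school.
    thetas = [rng.gauss(mu, tau2 ** 0.5) for _ in group_sizes]
    # Within-group level: n_j scores around each school's mean.
    data = [[rng.gauss(theta, sigma2 ** 0.5) for _ in range(n_j)]
            for theta, n_j in zip(thetas, group_sizes)]
    return thetas, data

def between_share(tau2, sigma2):
    """Fraction of Var(y_ij) = tau2 + sigma2 that comes from between-group variation."""
    return tau2 / (tau2 + sigma2)

# Hypothetical values: three schools of very different sizes.
thetas, data = simulate_schools([5, 20, 40], mu=50.0, tau2=25.0, sigma2=100.0)
share = between_share(25.0, 100.0)
```

With $\tau^2 = 25$ and $\sigma^2 = 100$, the between-school share is 0.2, so most of the variation in scores is among students within the same school.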


Hierarchical normal model

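  • Standard semi-conjugate priors are given by

    $$\begin{split} \pi(\mu) & = \mathcal{N}\left(\mu_0, \gamma_0^2\right) \\ \pi(\sigma^2) & = \mathcal{IG}\left(\dfrac{\nu_0}{2}, \dfrac{\nu_0 \sigma_0^2}{2}\right) \\ \pi(\tau^2) & = \mathcal{IG}\left(\dfrac{\eta_0}{2}, \dfrac{\eta_0 \tau_0^2}{2}\right) \end{split}$$

    • $\mu_0$: best guess of average of school averages
    • $\gamma_0^2$: set based on plausible ranges of values of $\mu$
    • $\tau_0^2$: best guess of variance of school averages
    • $\eta_0$: set based on how tight prior for $\tau^2$ is around $\tau_0^2$
    • $\sigma_0^2$: best guess of variance of individual test scores around respective school means
    • $\nu_0$: set based on how tight prior for $\sigma^2$ is around $\sigma_0^2$.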
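
To see what "how tight" means in practice, the sketch below draws from an inverse-gamma prior with shape $\nu_0 / 2$ and rate $\nu_0 \sigma_0^2 / 2$ for a small and a large $\nu_0$. This assumes the shape and rate parameterization of the inverse-gamma that matches the full conditionals later in the module. The function names and the values $\nu_0 = 2$, $\nu_0 = 1000$, $\sigma_0^2 = 100$ are hypothetical.

```python
import random

def draw_inv_gamma(shape, rate, rng):
    """Draw from IG(shape, rate) as the reciprocal of a Gamma(shape, rate) draw.

    random.gammavariate takes (shape, scale), so the scale is 1 / rate.
    """
    return 1.0 / rng.gammavariate(shape, 1.0 / rate)

def prior_draws_variance(df0, guess0, size, seed=0):
    """Draws from IG(df0 / 2, df0 * guess0 / 2), the prior on a variance."""
    rng = random.Random(seed)
    return [draw_inv_gamma(df0 / 2.0, df0 * guess0 / 2.0, rng) for _ in range(size)]

def interquartile_range(draws):
    """Rough IQR from sorted draws (no interpolation)."""
    ordered = sorted(draws)
    return ordered[(3 * len(ordered)) // 4] - ordered[len(ordered) // 4]

# Hypothetical prior guess sigma_0^2 = 100 with weak and strong prior "sample sizes".
loose = prior_draws_variance(df0=2.0, guess0=100.0, size=2000)
tight = prior_draws_variance(df0=1000.0, guess0=100.0, size=2000)
```

Comparing `interquartile_range(loose)` with `interquartile_range(tight)` shows the prior concentrating around the guess as $\nu_0$ grows. The same check applies to $\eta_0$ and $\tau_0^2$.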


  • This model relies heavily on exchangeability across units at each level.

  • For example, we assume the schools are a random sample from the population of all schools, and the students within schools are a random sample of all the students in each school.

  • These assumptions do not always hold exactly.

  • Note: we can allow the variance to vary across schools if desired (and in fact, we will soon).



  • It turns out that conditional exchangeability is enough, provided we control for the relevant variables in our modeling.

  • For example, the schools in Chapel Hill/Carrboro are not entirely exchangeable.

  • In particular, Phoenix Academy is for students on long-term out-of-school suspension or who need to make up work due to extended absences (e.g., pregnancy), and Memorial Hospital School is for children battling serious illnesses.

  • However, if we condition on school type (public, charter, private, special services, home), the schools may then be exchangeable.


Posterior inference

  • Recall the model is

    $$\begin{split} y_{ij} | \theta_j, \sigma^2 & \sim \mathcal{N}(\theta_j, \sigma^2); \ \ \ i = 1, \ldots, n_j \\ \theta_j | \mu, \tau^2 & \sim \mathcal{N}(\mu, \tau^2); \ \ \ j = 1, \ldots, J. \end{split}$$

  • Under our prior specification, we can factor the posterior as follows:

    $$\begin{split} \pi(\theta_1, \ldots, \theta_J, \mu, \sigma^2, \tau^2 | Y) & \propto p(y | \theta_1, \ldots, \theta_J, \mu, \sigma^2, \tau^2) \\ & \ \ \ \ \times p(\theta_1, \ldots, \theta_J | \mu, \sigma^2, \tau^2) \\ & \ \ \ \ \times \pi(\mu, \sigma^2, \tau^2) \\ \\ & = p(y | \theta_1, \ldots, \theta_J, \sigma^2) \\ & \ \ \ \ \times p(\theta_1, \ldots, \theta_J | \mu, \tau^2) \\ & \ \ \ \ \times \pi(\mu) \cdot \pi(\sigma^2) \cdot \pi(\tau^2) \\ \\ & = \left\{ \prod_{j=1}^{J} \prod_{i=1}^{n_j} p(y_{ij} | \theta_j, \sigma^2) \right\} \\ & \ \ \ \ \times \left\{ \prod_{j=1}^{J} p(\theta_j | \mu, \tau^2) \right\} \\ & \ \ \ \ \times \pi(\mu) \cdot \pi(\sigma^2) \cdot \pi(\tau^2) \end{split}$$
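
The factorization translates directly into a sum of log densities. The following sketch evaluates the unnormalized log posterior, assuming the priors $\mu \sim \mathcal{N}(\mu_0, \gamma_0^2)$, $\sigma^2 \sim \mathcal{IG}(\nu_0/2, \nu_0\sigma_0^2/2)$ and $\tau^2 \sim \mathcal{IG}(\eta_0/2, \eta_0\tau_0^2/2)$, which is the prior form consistent with the full conditionals in this module. The function names, the `hyper` dictionary keys, and the toy data are hypothetical.

```python
import math

def log_normal(x, mean, var):
    """Log density of N(mean, var) at x."""
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)

def log_inv_gamma(x, a, b):
    """Log density of IG(a, b): b^a / Gamma(a) * x^(-a-1) * exp(-b / x)."""
    return a * math.log(b) - math.lgamma(a) - (a + 1.0) * math.log(x) - b / x

def log_posterior(data, theta, mu, sigma2, tau2, hyper):
    """Unnormalized log posterior, one term per factor in the factorization."""
    lp = 0.0
    for scores, theta_j in zip(data, theta):
        lp += sum(log_normal(y, theta_j, sigma2) for y in scores)  # p(y_ij | theta_j, sigma2)
        lp += log_normal(theta_j, mu, tau2)                        # p(theta_j | mu, tau2)
    lp += log_normal(mu, hyper["mu0"], hyper["gamma02"])           # pi(mu)
    lp += log_inv_gamma(sigma2, hyper["nu0"] / 2.0,
                        hyper["nu0"] * hyper["sigma02"] / 2.0)     # pi(sigma2)
    lp += log_inv_gamma(tau2, hyper["eta0"] / 2.0,
                        hyper["eta0"] * hyper["tau02"] / 2.0)      # pi(tau2)
    return lp

# Hypothetical hyperparameters and a toy data set with two schools.
hyper = {"mu0": 50.0, "gamma02": 25.0, "nu0": 2.0, "sigma02": 100.0,
         "eta0": 2.0, "tau02": 25.0}
data = [[60.0, 62.0], [40.0, 38.0]]
near = log_posterior(data, [61.0, 39.0], 50.0, 4.0, 100.0, hyper)
far = log_posterior(data, [39.0, 61.0], 50.0, 4.0, 100.0, hyper)
```

Here `near` exceeds `far`, since the first set of group means sits on top of each school's data and the second has them swapped.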


Full conditional for grand mean

  • The full conditional distribution of $\mu$ is proportional to the part of the joint posterior $\pi(\theta_1, \ldots, \theta_J, \mu, \sigma^2, \tau^2 | Y)$ that involves $\mu$.

  • That is,

    $$\begin{split} \pi(\mu | \theta_1, \ldots, \theta_J, \sigma^2,\tau^2, Y) & \boldsymbol{\propto} \left\{ \prod_{j=1}^{J} p(\theta_j | \mu,\tau^2) \right\} \cdot \pi(\mu). \end{split}$$

  • This looks like the full conditional distribution from the one-sample normal case, so you can show that

    $$\begin{split} \pi(\mu | \theta_1, \ldots, \theta_J, \sigma^2,\tau^2, Y) & = \mathcal{N}\left(\mu_n, \gamma^2_n \right) \ \ \ \ \textrm{where}\\ \\ \gamma^2_n = \dfrac{1}{ \dfrac{J}{\tau^2} + \dfrac{1}{\gamma_0^2} } ; \ \ \ \ \ \ \ \ \mu_n = \gamma^2_n \left[ \dfrac{J}{\tau^2} \bar{\theta} + \dfrac{1}{\gamma_0^2} \mu_0 \right] \end{split}$$

    and $\bar{\theta} = \frac{1}{J} \sum\limits^J_{j=1} \theta_j$.
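
As a rough illustration of this update, the sketch below computes $\mu_n$ and $\gamma_n^2$ from the current $\theta_j$'s and then draws $\mu$. The function names `mu_conditional` and `draw_mu` and the numeric inputs are hypothetical.

```python
import random

def mu_conditional(thetas, tau2, mu0, gamma02):
    """Return (mu_n, gamma2_n), the mean and variance of the full conditional of mu."""
    J = len(thetas)
    theta_bar = sum(thetas) / J
    gamma2_n = 1.0 / (J / tau2 + 1.0 / gamma02)
    mu_n = gamma2_n * (J / tau2 * theta_bar + mu0 / gamma02)
    return mu_n, gamma2_n

def draw_mu(thetas, tau2, mu0, gamma02, rng=random):
    """One draw of mu from N(mu_n, gamma2_n); rng needs a gauss(mean, sd) method."""
    mu_n, gamma2_n = mu_conditional(thetas, tau2, mu0, gamma02)
    return rng.gauss(mu_n, gamma2_n ** 0.5)

# Hypothetical inputs: J / tau2 = 1 and 1 / gamma02 = 1, so data and prior get equal weight.
mu_n, gamma2_n = mu_conditional([50.0, 52.0, 54.0], tau2=3.0, mu0=50.0, gamma02=1.0)
mu_draw = draw_mu([50.0, 52.0, 54.0], tau2=3.0, mu0=50.0, gamma02=1.0,
                  rng=random.Random(0))
```

In this example $\bar{\theta} = 52$ and $\mu_0 = 50$ carry equal precision, so $\mu_n = 51$ and $\gamma_n^2 = 0.5$.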


Full conditionals for group means

  • Similarly, the full conditional distribution of each $\theta_j$ is proportional to the part of the joint posterior $\pi(\theta_1, \ldots, \theta_J, \mu, \sigma^2, \tau^2 | Y)$ that involves $\theta_j$.

  • That is,

    $$\begin{split} \pi(\theta_j | \mu, \sigma^2,\tau^2, Y) & \boldsymbol{\propto} \left\{ \prod_{i=1}^{n_j} p(y_{ij} | \theta_j, \sigma^2 ) \right\} \cdot p(\theta_j | \mu,\tau^2) \\ \end{split}$$

  • Those terms include a normal for $\theta_j$ multiplied by a product of normals in which $\theta_j$ is the mean, again mirroring the one-sample case, so you can show that

    $$\begin{split} \pi(\theta_j | \mu, \sigma^2,\tau^2, Y) & = \mathcal{N}\left(\theta_j^\star, \nu_j^\star \right) \ \ \ \ \textrm{where}\\ \\ \nu_j^\star & = \dfrac{1}{ \dfrac{n_j}{\sigma^2} + \dfrac{1}{\tau^2} } ; \ \ \ \ \ \ \ \theta_j^\star = \nu_j^\star \left[ \dfrac{n_j}{\sigma^2} \bar{y}_j + \dfrac{1}{\tau^2} \mu \right] \end{split}$$
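
A minimal sketch of this step, assuming the data are stored as one list of scores per school: `theta_conditional` returns $(\theta_j^\star, \nu_j^\star)$ for a single school and `draw_thetas` draws every $\theta_j$ in turn. Both names and the example values are hypothetical.

```python
import random

def theta_conditional(scores, mu, sigma2, tau2):
    """Return (theta_star, nu_star) for one group's full conditional."""
    n_j = len(scores)
    ybar_j = sum(scores) / n_j
    nu_star = 1.0 / (n_j / sigma2 + 1.0 / tau2)
    theta_star = nu_star * (n_j / sigma2 * ybar_j + mu / tau2)
    return theta_star, nu_star

def draw_thetas(data, mu, sigma2, tau2, rng=random):
    """Draw each theta_j from N(theta_star, nu_star), one group at a time.

    Given mu, sigma2 and tau2, the theta_j's are conditionally independent,
    so the order of the draws does not matter.
    """
    draws = []
    for scores in data:
        theta_star, nu_star = theta_conditional(scores, mu, sigma2, tau2)
        draws.append(rng.gauss(theta_star, nu_star ** 0.5))
    return draws

# Hypothetical school: n_j / sigma2 = 1 and 1 / tau2 = 1, so ybar_j and mu get equal weight.
theta_star, nu_star = theta_conditional([60.0, 60.0, 60.0, 60.0],
                                        mu=50.0, sigma2=4.0, tau2=1.0)
```

Here $\bar{y}_j = 60$ and $\mu = 50$ have equal precision, giving $\theta_j^\star = 55$ and $\nu_j^\star = 0.5$.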


Full conditionals for group means

  • Our estimate for each $\theta_j$ is a weighted average of $\bar{y}_j$ and $\mu$, ensuring that we borrow information across all levels through $\mu$ and $\tau^2$.

  • The weights in the weighted average are determined by the relative precisions from the data and from the second-level model.

  • For fixed $\sigma^2$ and $\tau^2$, groups with smaller $n_j$ have $\theta_j^\star$ pulled more strongly towards $\mu$ than groups with larger $n_j$.

  • Thus, the degree of shrinkage of $\theta_j$ depends on $n_j$ and on the ratio of the within-group variance $\sigma^2$ to the between-group variance $\tau^2$.
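
Rearranging the conditional mean gives $\theta_j^\star = w_j \bar{y}_j + (1 - w_j) \mu$ with $w_j = \dfrac{n_j / \sigma^2}{n_j / \sigma^2 + 1 / \tau^2} = \dfrac{n_j}{n_j + \sigma^2 / \tau^2}$. The short sketch below tabulates $w_j$ for a few group sizes. The function names and the values $\sigma^2 = 100$, $\tau^2 = 25$ are hypothetical.

```python
def shrinkage_weight(n_j, sigma2, tau2):
    """Weight on the group's sample mean: n_j / (n_j + sigma2 / tau2)."""
    return n_j / (n_j + sigma2 / tau2)

def conditional_mean(ybar_j, n_j, mu, sigma2, tau2):
    """theta_star written as a weighted average of ybar_j and mu."""
    w = shrinkage_weight(n_j, sigma2, tau2)
    return w * ybar_j + (1.0 - w) * mu

# Hypothetical variances with sigma2 / tau2 = 4, for small, medium and large schools.
weights = {n_j: shrinkage_weight(n_j, sigma2=100.0, tau2=25.0) for n_j in (4, 36, 396)}
small_school = conditional_mean(ybar_j=60.0, n_j=4, mu=50.0, sigma2=100.0, tau2=25.0)
```

The weights are 0.5, 0.9 and 0.99, so a school with 4 students and $\bar{y}_j = 60$ is pulled halfway to $\mu = 50$ (giving 55), while a school with 396 students is hardly moved.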


Full conditionals for across-group variance

  • The full conditional distribution of $\tau^2$ is proportional to the part of the joint posterior $\pi(\theta_1, \ldots, \theta_J, \mu, \sigma^2, \tau^2 | Y)$ that involves $\tau^2$.

  • That is,

    $$\begin{split} \pi(\tau^2 | \theta_1, \ldots, \theta_J, \mu, \sigma^2, Y) & \boldsymbol{\propto} \left\{ \prod_{j=1}^{J} p(\theta_j | \mu,\tau^2) \right\} \cdot \pi(\tau^2)\\ \end{split}$$

  • As in the case for $\mu$, this looks like the one-sample normal problem, and our full conditional posterior is

    $$\begin{split} \pi(\tau^2 | \theta_1, \ldots, \theta_J, \mu, \sigma^2, Y) & = \mathcal{IG} \left(\dfrac{\eta_n}{2}, \dfrac{\eta_n\tau_n^2}{2}\right) \ \ \ \ \textrm{where}\\ \\ \eta_n = \eta_0 + J ; \ \ \ \ \ \ \ \tau_n^2 & = \dfrac{1}{\eta_n} \left[\eta_0\tau_0^2 + \sum\limits_{j=1}^{J} (\theta_j - \mu)^2 \right].\\ \end{split}$$
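
A small sketch of this draw, assuming the inverse-gamma is parameterized by shape and rate so that it can be sampled as the reciprocal of a gamma variate. The function names and the example numbers are hypothetical.

```python
import random

def tau2_conditional(thetas, mu, eta0, tau02):
    """Return (eta_n, tau2_n) for the full conditional of tau2."""
    J = len(thetas)
    eta_n = eta0 + J
    ss_between = sum((theta_j - mu) ** 2 for theta_j in thetas)
    tau2_n = (eta0 * tau02 + ss_between) / eta_n
    return eta_n, tau2_n

def draw_tau2(thetas, mu, eta0, tau02, rng=random):
    """Draw tau2 ~ IG(eta_n / 2, eta_n * tau2_n / 2) as 1 / Gamma(shape, rate).

    random.gammavariate takes (shape, scale), so the scale is 1 / rate.
    """
    eta_n, tau2_n = tau2_conditional(thetas, mu, eta0, tau02)
    rate = eta_n * tau2_n / 2.0
    return 1.0 / rng.gammavariate(eta_n / 2.0, 1.0 / rate)

# Hypothetical inputs: two group means, each 2 units from mu.
eta_n, tau2_n = tau2_conditional([48.0, 52.0], mu=50.0, eta0=2.0, tau02=4.0)
tau2_draw = draw_tau2([48.0, 52.0], mu=50.0, eta0=2.0, tau02=4.0, rng=random.Random(0))
```

In this example $\eta_n = 4$ and $\tau_n^2 = (2 \cdot 4 + 8) / 4 = 4$: the prior guess and the observed spread of the $\theta_j$'s agree, so the update leaves the center unchanged and only tightens the distribution.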


Full conditionals for within-group variance

  • Finally, the full conditional distribution of $\sigma^2$ is proportional to the part of the joint posterior $\pi(\theta_1, \ldots, \theta_J, \mu, \sigma^2, \tau^2 | Y)$ that involves $\sigma^2$.

  • That is,

    $$\begin{split} \pi(\sigma^2 | \theta_1, \ldots, \theta_J, \mu, \tau^2, Y) & \boldsymbol{\propto} \left\{ \prod_{j=1}^{J} \prod_{i=1}^{n_j} p(y_{ij} | \theta_j, \sigma^2 ) \right\} \cdot \pi(\sigma^2)\\ \end{split}$$

  • We can again take advantage of the one-sample normal problem, so that our full conditional posterior is

    $$\begin{split} \pi(\sigma^2 | \theta_1, \ldots, \theta_J, \mu, \tau^2, Y) & = \mathcal{IG} \left(\dfrac{\nu_n}{2}, \dfrac{\nu_n\sigma_n^2}{2}\right) \ \ \ \ \textrm{where}\\ \\ \nu_n = \nu_0 + \sum\limits_{j=1}^{J} n_j ; \ \ \ \ \ \ \ \sigma_n^2 & = \dfrac{1}{\nu_n} \left[\nu_0\sigma_0^2 + \sum\limits_{j=1}^{J}\sum\limits_{i=1}^{n_j} (y_{ij} - \theta_j)^2 \right].\\ \end{split}$$
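
With all four full conditionals in hand, they can be cycled through in a Gibbs sampler. The following is a minimal sketch under stated assumptions, not a tuned implementation: it uses only the Python standard library, starts the chain at the sample means, and runs on data simulated with hypothetical values ($\mu = 50$, $\tau = 5$, $\sigma = 10$, 8 schools of 20 students). The names `gibbs_hierarchical` and `hyper`, the hyperparameter values, and the burn-in of 500 are all illustrative choices.

```python
import random

def draw_inv_gamma(shape, rate, rng):
    """IG(shape, rate) draw as 1 / Gamma; gammavariate takes (shape, scale = 1 / rate)."""
    return 1.0 / rng.gammavariate(shape, 1.0 / rate)

def gibbs_hierarchical(data, hyper, n_iter=2000, seed=1):
    """Gibbs sampler for the constant-variance hierarchical normal model."""
    rng = random.Random(seed)
    J = len(data)
    n = [len(scores) for scores in data]
    ybar = [sum(scores) / len(scores) for scores in data]
    # Starting values: sample means, their average, and the prior guesses.
    theta = list(ybar)
    mu = sum(theta) / J
    sigma2, tau2 = hyper["sigma02"], hyper["tau02"]
    draws = []
    for _ in range(n_iter):
        # theta_j | rest ~ N(theta_star, nu_star)
        for j in range(J):
            v = 1.0 / (n[j] / sigma2 + 1.0 / tau2)
            m = v * (n[j] / sigma2 * ybar[j] + mu / tau2)
            theta[j] = rng.gauss(m, v ** 0.5)
        # mu | rest ~ N(mu_n, gamma2_n); note (J / tau2) * theta_bar = sum(theta) / tau2
        g2 = 1.0 / (J / tau2 + 1.0 / hyper["gamma02"])
        m = g2 * (sum(theta) / tau2 + hyper["mu0"] / hyper["gamma02"])
        mu = rng.gauss(m, g2 ** 0.5)
        # tau2 | rest ~ IG(eta_n / 2, eta_n * tau2_n / 2)
        ss_between = sum((t - mu) ** 2 for t in theta)
        tau2 = draw_inv_gamma((hyper["eta0"] + J) / 2.0,
                              (hyper["eta0"] * hyper["tau02"] + ss_between) / 2.0, rng)
        # sigma2 | rest ~ IG(nu_n / 2, nu_n * sigma2_n / 2)
        ss_within = sum((y - theta[j]) ** 2 for j in range(J) for y in data[j])
        sigma2 = draw_inv_gamma((hyper["nu0"] + sum(n)) / 2.0,
                                (hyper["nu0"] * hyper["sigma02"] + ss_within) / 2.0, rng)
        draws.append({"theta": list(theta), "mu": mu, "sigma2": sigma2, "tau2": tau2})
    return draws

# Simulated data under hypothetical true values.
sim = random.Random(0)
true_theta = [sim.gauss(50.0, 5.0) for _ in range(8)]
data = [[sim.gauss(t, 10.0) for _ in range(20)] for t in true_theta]
hyper = {"mu0": 50.0, "gamma02": 25.0, "eta0": 2.0, "tau02": 25.0,
         "nu0": 2.0, "sigma02": 100.0}
draws = gibbs_hierarchical(data, hyper)
kept = draws[500:]  # discard an arbitrary burn-in
post_mu = sum(d["mu"] for d in kept) / len(kept)
```

Posterior summaries for any parameter can be computed from `kept` in the same way as `post_mu`. In practice one would also inspect trace plots and autocorrelation before trusting the draws.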


What's next?

Move on to the readings for the next module!
