Suppose we wish to investigate the mean (and distribution) of test scores for students at J different high schools.
In each school j, where j = 1, \ldots, J, suppose we test a random sample of n_j students.
Let y_{ij} be the test score for the ith student in school j, with i = 1, \ldots, n_j, and assume
y_{ij} | \theta_j, \sigma^2_j \sim \mathcal{N}\left(\theta_j, \sigma^2_j\right),
where for each school j, \theta_j is the school-wide average test score, and \sigma^2_j is the school-wide variance of individual test scores.
This is what we did for the Pygmalion study and job training data.
Option I: Classical inference for each school can be based on the large-sample 95% CI: \bar{y}_j \pm 1.96 \sqrt{s^2_j / n_j}, where \bar{y}_j is the sample average in school j, and s^2_j is the sample variance in school j.
Clearly, we can overfit the data within schools: for example, what if we only have 4 students from one of the schools? \bar{y}_j can be a good estimate if n_j is large, but it may be poor if n_j is small.
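To make Option I concrete, here is a minimal Python sketch (the school names, sample sizes, and scores are simulated placeholders, not real data) that computes \bar{y}_j, s^2_j, and the large-sample 95% CI for each school:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Hypothetical test scores for three schools with very different sample sizes.
schools = {
    "School A": rng.normal(50, 10, size=120),  # n_j = 120
    "School B": rng.normal(48, 10, size=35),   # n_j = 35
    "School C": rng.normal(52, 10, size=4),    # n_j = 4: the CI will be very wide
}

for name, y in schools.items():
    n_j = len(y)
    ybar_j = y.mean()                      # sample average in school j
    s2_j = y.var(ddof=1)                   # sample variance in school j
    half = 1.96 * np.sqrt(s2_j / n_j)      # large-sample 95% half-width
    print(f"{name}: n_j={n_j:3d}, ybar_j={ybar_j:5.1f}, "
          f"95% CI=({ybar_j - half:5.1f}, {ybar_j + half:5.1f})")
```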
Option II: Alternatively, we might believe that \theta_j = \mu for all j; that is, all schools have the same mean. This is the assumption (null hypothesis) in ANOVA models, for example. We can also set \sigma^2_j = \sigma^2 for all j.
Option I ignores that the \theta_j's should be reasonably similar, whereas Option II ignores any differences between them.
It would be nice to find a compromise! Borrowing information across schools, and shrinking our estimates towards a grand mean, could be very useful here.
For the Pygmalion study and job training data, we focused on using priors that are independent between the groups.
For example, in the conjugate case, we would have
\begin{split} \pi(\theta_j | \sigma^2_j) & = \mathcal{N}\left(\mu_0, \dfrac{\sigma^2_j}{\kappa_0}\right)\\ \pi(\sigma^2_j) & = \mathcal{IG}\left(\dfrac{\nu_0}{2}, \dfrac{\nu_0\sigma_0^2}{2}\right) \end{split}
for some hyperparameters (constants), \mu_0, \kappa_0, \nu_0, and \sigma^2_0.
In the semi-conjugate case,
\begin{split} \pi(\theta_j) & = \mathcal{N}\left(\mu_0, \sigma^2_0\right)\\ \pi(\sigma^2_j) & = \mathcal{IG}\left(\dfrac{\nu_0}{2}, \dfrac{\nu_0\gamma_0^2}{2}\right) \end{split}
for some hyperparameters (constants), \mu_0, \sigma^2_0, \nu_0, and \gamma^2_0.
Instead, we can assume that the \theta_j's are themselves drawn from a distribution, based on the following idea: conceive of the schools themselves as a random sample from the population of all possible schools.
For now, assume the variance is constant across schools. The hierarchical normal model assumes normal sampling models both within and between groups:
\begin{split} y_{ij} | \theta_j, \sigma^2 & \sim \mathcal{N}\left(\theta_j, \sigma^2\right); \ \ \ i = 1, \ldots, n_j\\ \theta_j | \mu, \tau^2 & \sim \mathcal{N}\left(\mu, \tau^2\right); \ \ \ j = 1, \ldots, J, \end{split}
which gives us an extra level in the prior on the means, and leads to sharing of information across the groups in estimating the group-specific means.
We have an extra variance parameter \tau^2. Comparing \tau^2 to \sigma^2 tells us how much of the variation in Y is due to within-group versus between-group variation.
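To see the roles of \sigma^2 and \tau^2, here is a small simulation sketch from the hierarchical normal model; the parameter values and group sizes are hypothetical choices made only for illustration:

```python
import numpy as np

rng = np.random.default_rng(seed=2)

J, n_j = 20, 30                          # hypothetical: 20 schools, 30 students each
mu, tau2, sigma2 = 50.0, 25.0, 100.0     # hypothetical true values

theta = rng.normal(mu, np.sqrt(tau2), size=J)                   # theta_j | mu, tau^2
y = rng.normal(theta[:, None], np.sqrt(sigma2), size=(J, n_j))  # y_ij | theta_j, sigma^2

# Between-group spread of the school sample means (roughly tau^2 + sigma^2 / n_j)
print("variance of school means:", y.mean(axis=1).var(ddof=1))
# Within-group spread (roughly sigma^2)
print("average within-school variance:", y.var(axis=1, ddof=1).mean())
```

With these hypothetical values, the school means vary with spread roughly \tau^2 + \sigma^2/n_j, while scores within a school vary with spread roughly \sigma^2.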
Standard semi-conjugate priors are given by
\begin{split} \pi(\mu) & = \mathcal{N}\left(\mu_0, \gamma^2_0\right)\\ \pi(\sigma^2) & = \mathcal{IG}\left(\dfrac{\nu_0}{2}, \dfrac{\nu_0\sigma_0^2}{2}\right)\\ \pi(\tau^2) & = \mathcal{IG}\left(\dfrac{\eta_0}{2}, \dfrac{\eta_0\tau_0^2}{2}\right), \end{split}
with hyperparameters (constants) \mu_0, \gamma^2_0, \nu_0, \sigma^2_0, \eta_0, and \tau^2_0.
This model relies heavily on exchangeability across units at each level.
For example, we assume the schools are a random sample from the population of all schools, and the students within schools are a random sample of all the students in each school.
This is not always completely true.
Note: we can allow the variance to vary across schools if desired (and in fact, we will do so soon).
It turns out that conditional exchangeability would be enough if we control for the relevant variables in our modeling.
For example, the schools in Chapel Hill/Carrboro are not entirely exchangeable.
For example, Phoenix Academy is for students on long-term out-of-school suspension or who need to make up work due to extended absences (e.g., pregnancy), and Memorial Hospital School is for children battling serious illnesses.
However, if we condition on school type (public, charter, private, special services, home), the schools may then be exchangeable.
Recall the model is
\begin{split} y_{ij} | \theta_j, \sigma^2 & \sim \mathcal{N}\left(\theta_j, \sigma^2\right); \ \ \ i = 1, \ldots, n_j\\ \theta_j | \mu, \tau^2 & \sim \mathcal{N}\left(\mu, \tau^2\right); \ \ \ j = 1, \ldots, J, \end{split}
Under our prior specification, we can factor the posterior as follows:
\begin{split} \pi(\theta_1, \ldots, \theta_J, \mu, \sigma^2,\tau^2 | Y) & \boldsymbol{\propto} p(y | \theta_1, \ldots, \theta_J, \mu, \sigma^2,\tau^2)\\ & \ \ \ \ \times p(\theta_1, \ldots, \theta_J | \mu, \sigma^2,\tau^2)\\ & \ \ \ \ \times \pi(\mu, \sigma^2,\tau^2)\\ \\ & \boldsymbol{=} p(y | \theta_1, \ldots, \theta_J, \sigma^2 )\\ & \ \ \ \ \times p(\theta_1, \ldots, \theta_J | \mu,\tau^2)\\ & \ \ \ \ \times \pi(\mu) \cdot \pi(\sigma^2) \cdot \pi(\tau^2)\\ \\ & \boldsymbol{=} \left\{ \prod_{j=1}^{J} \prod_{i=1}^{n_j} p(y_{ij} | \theta_j, \sigma^2 ) \right\}\\ & \ \ \ \ \times \left\{ \prod_{j=1}^{J} p(\theta_j | \mu,\tau^2) \right\}\\ & \ \ \ \ \times\pi(\mu) \cdot \pi(\sigma^2) \cdot \pi(\tau^2)\\ \end{split}
The full conditional distribution of \mu is proportional to the part of the joint posterior \pi(\theta_1, \ldots, \theta_J, \mu, \sigma^2,\tau^2 | Y) that involves \mu.
That is,
\begin{split} \pi(\mu | \theta_1, \ldots, \theta_J, \sigma^2,\tau^2, Y) & \boldsymbol{\propto} \left\{ \prod_{j=1}^{J} p(\theta_j | \mu,\tau^2) \right\} \cdot \pi(\mu). \end{split}
This looks like the full conditional distribution from the one-sample normal case, so you can show that
\begin{split} \pi(\mu | \theta_1, \ldots, \theta_J, \sigma^2,\tau^2, Y) & = \mathcal{N}\left(\mu_n, \gamma^2_n \right) \ \ \ \ \textrm{where}\\ \\ \gamma^2_n = \dfrac{1}{ \dfrac{J}{\tau^2} + \dfrac{1}{\gamma_0^2} } ; \ \ \ \ \ \ \ \ \mu_n = \gamma^2_n \left[ \dfrac{J}{\tau^2} \bar{\theta} + \dfrac{1}{\gamma_0^2} \mu_0 \right] \end{split}
and \bar{\theta} = \frac{1}{J} \sum\limits^J_{j=1} \theta_j.
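In a Gibbs sampler, this full conditional becomes a single update step. A minimal Python sketch (the function and argument names are my own, not from any course code):

```python
import numpy as np

def sample_mu(theta, tau2, mu0, gamma2_0, rng):
    """Draw mu from its full conditional N(mu_n, gamma2_n)."""
    J = len(theta)
    gamma2_n = 1.0 / (J / tau2 + 1.0 / gamma2_0)
    mu_n = gamma2_n * ((J / tau2) * np.mean(theta) + mu0 / gamma2_0)
    return rng.normal(mu_n, np.sqrt(gamma2_n))
```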
Similarly, the full conditional distribution of each \theta_j is proportional to the part of the joint posterior \pi(\theta_1, \ldots, \theta_J, \mu, \sigma^2,\tau^2 | Y) that involves \theta_j.
That is,
\begin{split} \pi(\theta_j | \mu, \sigma^2,\tau^2, Y) & \boldsymbol{\propto} \left\{ \prod_{i=1}^{n_j} p(y_{ij} | \theta_j, \sigma^2 ) \right\} \cdot p(\theta_j | \mu,\tau^2) \\ \end{split}
Those terms include a normal for \theta_j multiplied by a product of normals in which \theta_j is the mean, again mirroring the one-sample case, so you can show that
\begin{split} \pi(\theta_j | \mu, \sigma^2,\tau^2, Y) & = \mathcal{N}\left(\theta_j^\star, \nu_j^\star \right) \ \ \ \ \textrm{where}\\ \\ \nu_j^\star & = \dfrac{1}{ \dfrac{n_j}{\sigma^2} + \dfrac{1}{\tau^2} } ; \ \ \ \ \ \ \ \theta_j^\star = \nu_j^\star \left[ \dfrac{n_j}{\sigma^2} \bar{y}_j + \dfrac{1}{\tau^2} \mu \right] \end{split}
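The corresponding Gibbs update for the group means, vectorized over j, might look like the sketch below (again with hypothetical names, assuming the group sample means and sizes are stored as NumPy arrays):

```python
import numpy as np

def sample_theta(y_bar, n, sigma2, mu, tau2, rng):
    """Draw each theta_j from its full conditional N(theta_j_star, nu_j_star).

    y_bar : length-J array of group sample means
    n     : length-J array of group sample sizes
    """
    nu_star = 1.0 / (n / sigma2 + 1.0 / tau2)
    theta_star = nu_star * ((n / sigma2) * y_bar + mu / tau2)
    return rng.normal(theta_star, np.sqrt(nu_star))
```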
Our estimate for each \theta_j is a weighted average of \bar{y}_j and \mu, ensuring that we are borrowing information across all levels through \mu and \tau^2.
The weights in the weighted average are determined by the relative precisions from the data and from the second-level model.
Groups with smaller n_j have estimated \theta_j^\star closer to \mu than groups with larger n_j.
Thus, the degree of shrinkage of \theta_j depends on the ratio of the within-group variance to the between-group variance.
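Numerically, the weight that \theta_j^\star places on \bar{y}_j is (n_j/\sigma^2) \big/ (n_j/\sigma^2 + 1/\tau^2). A quick sketch with hypothetical values of \sigma^2 and \tau^2 shows the shrinkage toward \mu for small groups:

```python
sigma2, tau2 = 100.0, 25.0     # hypothetical within- and between-group variances
for n_j in [4, 35, 120]:
    w = (n_j / sigma2) / (n_j / sigma2 + 1.0 / tau2)
    print(f"n_j = {n_j:3d}: weight on ybar_j = {w:.2f}, weight on mu = {1 - w:.2f}")
```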
The full conditional distribution of \tau^2 is proportional to the part of the joint posterior \pi(\theta_1, \ldots, \theta_J, \mu, \sigma^2,\tau^2 | Y) that involves \tau^2.
That is,
\begin{split} \pi(\tau^2 | \theta_1, \ldots, \theta_J, \mu, \sigma^2, Y) & \boldsymbol{\propto} \left\{ \prod_{j=1}^{J} p(\theta_j | \mu,\tau^2) \right\} \cdot \pi(\tau^2)\\ \end{split}
As in the case for \mu, this looks like the one-sample normal problem, and our full conditional posterior is
\begin{split} \pi(\tau^2 | \theta_1, \ldots, \theta_J, \mu, \sigma^2, Y) & = \mathcal{IG} \left(\dfrac{\eta_n}{2}, \dfrac{\eta_n\tau_n^2}{2}\right) \ \ \ \ \textrm{where}\\ \\ \eta_n = \eta_0 + J ; \ \ \ \ \ \ \ \tau_n^2 & = \dfrac{1}{\eta_n} \left[\eta_0\tau_0^2 + \sum\limits_{j=1}^{J} (\theta_j - \mu)^2 \right].\\ \end{split}
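A sketch of the corresponding Gibbs update, drawing \tau^2 by sampling its precision 1/\tau^2 from the equivalent gamma distribution (the names are my own):

```python
import numpy as np

def sample_tau2(theta, mu, eta0, tau2_0, rng):
    """Draw tau^2 from IG(eta_n/2, eta_n * tau_n^2 / 2) via its gamma-distributed precision."""
    J = len(theta)
    eta_n = eta0 + J
    ss = eta0 * tau2_0 + np.sum((theta - mu) ** 2)            # equals eta_n * tau_n^2
    precision = rng.gamma(shape=eta_n / 2.0, scale=2.0 / ss)  # 1/tau^2 ~ Gamma(eta_n/2, rate=ss/2)
    return 1.0 / precision
```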
Finally, the full conditional distribution of \sigma^2 is proportional to the part of the joint posterior \pi(\theta_1, \ldots, \theta_J, \mu, \sigma^2,\tau^2 | Y) that involves \sigma^2.
That is,
\begin{split} \pi(\sigma^2 | \theta_1, \ldots, \theta_J, \mu, \tau^2, Y) & \boldsymbol{\propto} \left\{ \prod_{j=1}^{J} \prod_{i=1}^{n_j} p(y_{ij} | \theta_j, \sigma^2 ) \right\} \cdot \pi(\sigma^2)\\ \end{split}
We can again take advantage of the one-sample normal problem, so that our full conditional posterior is
\begin{split} \pi(\sigma^2 | \theta_1, \ldots, \theta_J, \mu, \tau^2, Y) & = \mathcal{IG} \left(\dfrac{\nu_n}{2}, \dfrac{\nu_n\sigma_n^2}{2}\right) \ \ \ \ \textrm{where}\\ \\ \nu_n = \nu_0 + \sum\limits_{j=1}^{J} n_j ; \ \ \ \ \ \ \ \sigma_n^2 & = \dfrac{1}{\nu_n} \left[\nu_0\sigma_0^2 + \sum\limits_{j=1}^{J}\sum\limits_{i=1}^{n_j} (y_{ij} - \theta_j)^2 \right].\\ \end{split}
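Putting the four full conditionals together gives a Gibbs sampler for the whole model. The self-contained sketch below is my own minimal implementation, assuming the data arrive as a list of per-school score arrays; the hyperparameter defaults are placeholder values, not recommendations:

```python
import numpy as np

def gibbs_hierarchical_normal(y_groups, n_iter=5000,
                              mu0=50.0, gamma2_0=25.0,   # prior on mu (placeholder values)
                              nu0=1.0, sigma2_0=100.0,   # prior on sigma^2
                              eta0=1.0, tau2_0=100.0,    # prior on tau^2
                              seed=3):
    """Gibbs sampler for the hierarchical normal model with a common sigma^2.

    y_groups : list of 1-d arrays, one array of test scores per school.
    Returns a dict of posterior draws for theta, mu, sigma^2, and tau^2.
    """
    rng = np.random.default_rng(seed)
    J = len(y_groups)
    n = np.array([len(y) for y in y_groups])
    y_bar = np.array([y.mean() for y in y_groups])

    # Initialize at sample-based values.
    theta = y_bar.copy()
    mu = y_bar.mean()
    sigma2 = np.mean([y.var(ddof=1) for y in y_groups])
    tau2 = y_bar.var(ddof=1)

    draws = {"theta": [], "mu": [], "sigma2": [], "tau2": []}
    for _ in range(n_iter):
        # theta_j | rest ~ N(theta_j_star, nu_j_star)
        nu_star = 1.0 / (n / sigma2 + 1.0 / tau2)
        theta = rng.normal(nu_star * ((n / sigma2) * y_bar + mu / tau2), np.sqrt(nu_star))

        # mu | rest ~ N(mu_n, gamma2_n)
        gamma2_n = 1.0 / (J / tau2 + 1.0 / gamma2_0)
        mu_n = gamma2_n * ((J / tau2) * theta.mean() + mu0 / gamma2_0)
        mu = rng.normal(mu_n, np.sqrt(gamma2_n))

        # sigma^2 | rest ~ IG(nu_n/2, nu_n * sigma_n^2 / 2), drawn via its precision
        nu_n = nu0 + n.sum()
        ss_within = nu0 * sigma2_0 + sum(np.sum((y - t) ** 2) for y, t in zip(y_groups, theta))
        sigma2 = 1.0 / rng.gamma(nu_n / 2.0, 2.0 / ss_within)

        # tau^2 | rest ~ IG(eta_n/2, eta_n * tau_n^2 / 2)
        eta_n = eta0 + J
        ss_between = eta0 * tau2_0 + np.sum((theta - mu) ** 2)
        tau2 = 1.0 / rng.gamma(eta_n / 2.0, 2.0 / ss_between)

        draws["theta"].append(theta.copy())
        draws["mu"].append(mu)
        draws["sigma2"].append(sigma2)
        draws["tau2"].append(tau2)

    return {k: np.array(v) for k, v in draws.items()}
```

For example, calling out = gibbs_hierarchical_normal(list_of_score_arrays) (where list_of_score_arrays is a placeholder for your data) and then taking out["theta"].mean(axis=0) and column-wise quantiles gives posterior means and intervals for the school-level means, with the small-n_j schools shrunk toward the draws of \mu.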