Sometimes, we have a natural grouping in our data.
For such grouped data, we may want to do inference across all the groups, for example, to compare the group means.
Ideally, we should do so in a way that takes advantage of the relationship between observations in the same group, but we should also look to borrow information across groups when possible.
Hierarchical modeling provides a principled way to do so.
Recall the normal model:

y_i \mid \mu, \sigma^2 \overset{iid}{\sim} \mathcal{N}(\mu, \sigma^2).

The MLE for the population mean \mu is just the sample mean \bar{y}.
\bar{y} is unbiased for \mu. That is, for any data y_i \overset{iid}{\sim} \mathcal{N}(\mu, \sigma^2), \textrm{E}[\bar{y}] = \mu.
However, recall that in the conjugate normal model with known variance, for example, the posterior expectation is a weighted average of the prior mean and the sample mean.
That is, the posterior mean is actually biased.
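To make the weighting explicit: with prior \pi(\mu) = \mathcal{N}(\mu_0, \gamma_0^2) and known \sigma^2, the posterior mean is

\textrm{E}[\mu \mid Y] = \dfrac{\dfrac{n}{\sigma^2}}{\dfrac{n}{\sigma^2} + \dfrac{1}{\gamma_0^2}} \, \bar{y} + \dfrac{\dfrac{1}{\gamma_0^2}}{\dfrac{n}{\sigma^2} + \dfrac{1}{\gamma_0^2}} \, \mu_0,

the same \mu_n that reappears in the full conditionals below; its expectation equals \mu only when \mu_0 = \mu.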
Usually, through the weighting of the sample data and the prior, Bayes procedures have a tendency to pull the estimate of \mu toward the prior mean.
Of course, the magnitude of the pull depends on the sample size.
This "pulling" phenomenon is referred to as shrinkage.
Why would we ever want to do this? Well, in part, because shrinkage estimators are often "more accurate" in prediction problems -- i.e., they tend to do a better job of predicting a future outcome or of recovering the actual parameter values. Remember the bias-variance trade-off!
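A quick numerical illustration of how the pull weakens as the sample size grows; the variance values below are made up for illustration:

```python
# Posterior weight on the sample mean: w = (n/sigma^2) / (n/sigma^2 + 1/gamma0^2).
# As n grows, w -> 1 and the posterior mean is pulled less toward the prior mean mu0.
sigma2, gamma0_sq = 4.0, 1.0  # hypothetical known data variance and prior variance
for n in [1, 5, 25, 100, 1000]:
    w = (n / sigma2) / (n / sigma2 + 1.0 / gamma0_sq)
    print(f"n = {n:4d}: weight on ybar = {w:.3f}, weight on mu0 = {1 - w:.3f}")
```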
The fact that a biased estimator can do a better job in many prediction problems can be proven rigorously, and is referred to as Stein's paradox.
Stein's result implies, in particular, that the sample mean is an inadmissible estimator of the mean of a multivariate normal distribution in more than two dimensions -- i.e., there are other estimators that come closer to the true value in expectation.
In fact, such estimators include Bayes point estimators (the posterior expectation of the parameter \mu).
Much of what we do now in high-dimensional statistics is to develop biased estimators that perform better than unbiased ones.
Examples: lasso regression, ridge regression, various kinds of hierarchical Bayesian models, etc.
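To see Stein's result in action, here is a small simulation (not from the notes) comparing the raw observation with the classical James-Stein shrinkage estimator in d = 10 dimensions; the true mean vector is made up:

```python
import numpy as np

# In d >= 3 dimensions, shrinking the observation toward 0 via James-Stein
# lowers the expected squared error relative to using the observation itself.
rng = np.random.default_rng(0)
d, n_sims = 10, 20_000
theta = rng.normal(0.0, 1.0, size=d)           # arbitrary "true" mean vector
y = rng.normal(theta, 1.0, size=(n_sims, d))   # one N_d(theta, I) draw per replication

norm_sq = np.sum(y ** 2, axis=1, keepdims=True)
js = (1.0 - (d - 2) / norm_sq) * y             # James-Stein estimator (sigma^2 = 1)

risk_mle = np.mean(np.sum((y - theta) ** 2, axis=1))
risk_js = np.mean(np.sum((js - theta) ** 2, axis=1))
print(f"risk of y (unbiased): {risk_mle:.2f}   risk of James-Stein: {risk_js:.2f}")
```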
So, here we will get a very basic introduction to Bayesian hierarchical models, which provide a formal and coherent framework for constructing shrinkage estimators.
"Bayesian hierarchical models" is a sort of catch-all phrase for a large class of models that have several levels of conditional distributions making up the prior.
Like simpler one-level priors, they also accomplish shrinkage; however, they are much more flexible.
Why use them? Let's start with a motivating example.
Suppose we want to do inference on mean body mass index (BMI) for two groups (male and female).
BMI is known to often follow a normal distribution, so let's assume the same here.
We should expect some relationship between the mean BMI for the two groups.
We may also think the shapes of the two distributions are relatively similar (at least as a simplifying assumption for now).
Thus, a reasonable model might be

\begin{split} y_{i,male} & \overset{iid}{\sim} \mathcal{N}\left(\theta_m, \sigma^2\right); \ \ i = 1, \ldots, n_m;\\ y_{i,female} & \overset{iid}{\sim} \mathcal{N}\left(\theta_f, \sigma^2\right); \ \ i = 1, \ldots, n_f,\\ \end{split}

but with some relationship between \theta_m and \theta_f.
One parameterization that can reflect some relationship between \theta_m and \theta_f is

\begin{split} y_{i,male} & \overset{iid}{\sim} \mathcal{N}\left(\mu + \delta, \sigma^2\right); \ \ i = 1, \ldots, n_m;\\ y_{i,female} & \overset{iid}{\sim} \mathcal{N}\left(\mu - \delta, \sigma^2\right); \ \ i = 1, \ldots, n_f,\\ \end{split}

where

\theta_m = \mu + \delta \ \text{ and } \ \theta_f = \mu - \delta,

\mu = \dfrac{\theta_m + \theta_f}{2} is the average of the population means, and

2\delta = \theta_m - \theta_f is the difference in population means.
Convenient prior:

\pi(\mu, \delta, \sigma^2) = \pi(\mu) \cdot \pi(\delta) \cdot \pi(\sigma^2), where

\pi(\mu) = \mathcal{N}(\mu_0, \gamma_0^2),

\pi(\delta) = \mathcal{N}(\delta_0, \tau_0^2), and

\pi(\sigma^2) = \mathcal{IG}\left(\dfrac{\nu_0}{2}, \dfrac{\nu_0\sigma_0^2}{2}\right).
Note that we can rewrite

\begin{split} y_{i,male} & \overset{iid}{\sim} \mathcal{N}\left(\mu + \delta, \sigma^2\right); \ \ i = 1, \ldots, n_m;\\ y_{i,female} & \overset{iid}{\sim} \mathcal{N}\left(\mu - \delta, \sigma^2\right); \ \ i = 1, \ldots, n_f\\ \end{split}

as

\begin{split} (y_{i,male} - \delta) & \overset{iid}{\sim} \mathcal{N}\left(\mu, \sigma^2\right); \ \ i = 1, \ldots, n_m;\\ (y_{i,female} + \delta) & \overset{iid}{\sim} \mathcal{N}\left(\mu, \sigma^2\right); \ \ i = 1, \ldots, n_f\\ \end{split}

or

\begin{split} (y_{i,male} - \mu) & \overset{iid}{\sim} \mathcal{N}\left(\delta, \sigma^2\right); \ \ i = 1, \ldots, n_m;\\ (-1)(y_{i,female} - \mu) & \overset{iid}{\sim} \mathcal{N}\left(\delta, \sigma^2\right); \ \ i = 1, \ldots, n_f,\\ \end{split}

as needed, so we can leverage past results for the full conditionals.
For the full conditionals we will derive here, we will take advantage of previous results from the regular univariate normal model.
Recall that if we assume

y_i \sim \mathcal{N}(\mu, \sigma^2), \ \ i = 1, \ldots, n,

and set our priors to be

\pi(\mu) = \mathcal{N}(\mu_0, \gamma_0^2); \ \ \ \ \pi(\sigma^2) = \mathcal{IG}\left(\dfrac{\nu_0}{2}, \dfrac{\nu_0\sigma_0^2}{2}\right),

then we have
\begin{split} \pi(\mu, \sigma^2 | Y) & \boldsymbol{\propto} \left\{ \prod_{i=1}^{n} p(y_{i} | \mu, \sigma^2 ) \right\} \cdot \pi(\mu) \cdot \pi(\sigma^2) \\ \end{split}
We have
\begin{split} \pi(\mu | \sigma^2, Y) & = \mathcal{N}\left(\mu_n, \gamma_n^2\right),\\ \end{split}
where
\begin{split} \gamma^2_n = \dfrac{1}{ \dfrac{n}{\sigma^2} + \dfrac{1}{\gamma_0^2} }; \ \ \ \ \ \ \ \ \mu_n & = \gamma^2_n \left[ \dfrac{n}{\sigma^2} \bar{y} + \dfrac{1}{\gamma_0^2} \mu_0 \right], \end{split}
and
\begin{split} \pi(\sigma^2 | \mu,Y) \boldsymbol{=} \mathcal{IG}\left(\dfrac{\nu_n}{2}, \dfrac{\nu_n\sigma_n^2}{2}\right), \end{split}
where
\begin{split} \nu_n & = \nu_0 + n; \ \ \ \ \ \ \ \sigma_n^2 = \dfrac{1}{\nu_n} \left[ \nu_0 \sigma_0^2 + \sum^n_{i=1} (y_i - \mu)^2 \right].\\ \end{split}
With \pi(\mu) = \mathcal{N}(\mu_0, \gamma_0^2), and
\begin{split} (y_{i,male} - \delta) & \overset{iid}{\sim} \mathcal{N} \left(\mu, \sigma^2\right); \ \ i = 1, \ldots, n_m;\\ (y_{i,female} + \delta) & \overset{iid}{\sim} \mathcal{N} \left(\mu, \sigma^2\right); \ \ i = 1, \ldots, n_f,\\ \end{split}
we have
\begin{split} \mu | Y, \delta, \sigma^2 & \sim \mathcal{N}(\mu_n, \gamma_n^2), \ \ \ \text{where}\\ \\ \gamma_n^2 & = \dfrac{1}{\dfrac{1}{\gamma_0^2} + \dfrac{n_m + n_f}{\sigma^2} }\\ \\ \mu_n & = \gamma_n^2 \left[\dfrac{\mu_0}{\gamma_0^2} + \dfrac{ \sum\limits_{i=1}^{n_m} (y_{i,male} - \delta) + \sum\limits_{i=1}^{n_f} (y_{i,female} + \delta) }{\sigma^2} \right].\\ \end{split}
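To make this update concrete, here is a minimal sketch (Python/NumPy; the function and argument names such as sample_mu and gamma0_sq are my own, not from the notes) of a single draw from this full conditional:

```python
import numpy as np

def sample_mu(y_m, y_f, delta, sigma2, mu0, gamma0_sq, rng):
    """Draw mu from its full conditional N(mu_n, gamma_n^2)."""
    n = len(y_m) + len(y_f)
    # The shifted observations (y_male - delta) and (y_female + delta) all have mean mu.
    shifted_sum = np.sum(y_m - delta) + np.sum(y_f + delta)
    gamma_n_sq = 1.0 / (1.0 / gamma0_sq + n / sigma2)
    mu_n = gamma_n_sq * (mu0 / gamma0_sq + shifted_sum / sigma2)
    return rng.normal(mu_n, np.sqrt(gamma_n_sq))
```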
With \pi(\delta) = \mathcal{N}(\delta_0, \tau_0^2), and
\begin{split} (y_{i,male} - \mu) & \overset{iid}{\sim} \mathcal{N} \left(\delta, \sigma^2\right); \ \ i = 1, \ldots, n_m;\\ (-1) (y_{i,female} - \mu) & \overset{iid}{\sim} \mathcal{N} \left(\delta, \sigma^2\right); \ \ i = 1, \ldots, n_f,\\ \end{split}
we have
\begin{split} \delta | Y, \mu, \sigma^2 & \sim \mathcal{N}(\delta_n, \tau_n^2), \ \ \ \text{where}\\ \\ \tau_n^2 & = \dfrac{1}{\dfrac{1}{\tau_0^2} + \dfrac{n_m + n_f}{\sigma^2} }\\ \\ \delta_n & = \tau_n^2 \left[\dfrac{\delta_0}{\tau_0^2} + \dfrac{\sum\limits_{i=1}^{n_m} (y_{i,male} - \mu) + (-1) \sum\limits_{i=1}^{n_f} (y_{i,female} - \mu) }{\sigma^2} \right].\\ \end{split}
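A matching sketch for the full conditional of \delta (again with illustrative names):

```python
import numpy as np

def sample_delta(y_m, y_f, mu, sigma2, delta0, tau0_sq, rng):
    """Draw delta from its full conditional N(delta_n, tau_n^2)."""
    n = len(y_m) + len(y_f)
    # The shifted observations (y_male - mu) and -(y_female - mu) all have mean delta.
    shifted_sum = np.sum(y_m - mu) - np.sum(y_f - mu)
    tau_n_sq = 1.0 / (1.0 / tau0_sq + n / sigma2)
    delta_n = tau_n_sq * (delta0 / tau0_sq + shifted_sum / sigma2)
    return rng.normal(delta_n, np.sqrt(tau_n_sq))
```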
With \pi(\sigma^2) = \mathcal{IG}(\dfrac{\nu_0}{2}, \dfrac{\nu_0 \sigma_0^2}{2}), and
\begin{split} y_{i,male} & \overset{iid}{\sim} \mathcal{N} \left(\mu + \delta, \sigma^2\right); \ \ i = 1, \ldots, n_m;\\ y_{i,female} & \overset{iid}{\sim} \mathcal{N} \left(\mu - \delta, \sigma^2\right); \ \ i = 1, \ldots, n_f\\ \end{split}
we have
\begin{split} \sigma^2 | Y, \mu, \delta & \sim \mathcal{IG}(\dfrac{\nu_n}{2}, \dfrac{\nu_n \sigma_n^2}{2}), \ \ \ \text{where}\\ \\ \nu_n & = \nu_0 + n_m + n_f\\ \\ \sigma_n^2 & = \dfrac{1}{\nu_n} \left[\nu_0\sigma_0^2 + \sum\limits_{i=1}^{n_m} (y_{i,male} - [\mu + \delta])^2 + \sum\limits_{i=1}^{n_f} (y_{i,female} - [\mu - \delta])^2 \right].\\ \end{split}
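And a sketch for the full conditional of \sigma^2, using the fact that an inverse-gamma draw can be obtained by inverting a gamma draw (the name sample_sigma2 is again my own):

```python
import numpy as np

def sample_sigma2(y_m, y_f, mu, delta, nu0, sigma0_sq, rng):
    """Draw sigma^2 from its full conditional IG(nu_n/2, nu_n*sigma_n^2/2)."""
    n = len(y_m) + len(y_f)
    nu_n = nu0 + n
    ss = np.sum((y_m - (mu + delta)) ** 2) + np.sum((y_f - (mu - delta)) ** 2)
    nu_n_sigma_n_sq = nu0 * sigma0_sq + ss  # equals nu_n * sigma_n^2
    # If X ~ Gamma(shape = a, rate = b), then 1/X ~ InverseGamma(a, b).
    precision = rng.gamma(shape=nu_n / 2.0, scale=2.0 / nu_n_sigma_n_sq)
    return 1.0 / precision
```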
We will write a Gibbs sampler for this model and fit it to real data in the next module.
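Ahead of that, here is one minimal sketch of what such a Gibbs sampler could look like, reusing the helper functions sketched above; the hyperparameter defaults are placeholders, not values from the notes:

```python
import numpy as np

def gibbs_two_group(y_m, y_f, n_iter=5000, seed=42,
                    mu0=0.0, gamma0_sq=100.0,    # prior on mu (placeholder values)
                    delta0=0.0, tau0_sq=100.0,   # prior on delta
                    nu0=1.0, sigma0_sq=1.0):     # prior on sigma^2
    """Cycle through the three full conditionals derived above."""
    rng = np.random.default_rng(seed)
    # Initialize at simple data-based values.
    mu = (np.mean(y_m) + np.mean(y_f)) / 2.0
    delta = (np.mean(y_m) - np.mean(y_f)) / 2.0
    sigma2 = np.var(np.concatenate([y_m, y_f]))
    draws = np.empty((n_iter, 3))
    for t in range(n_iter):
        mu = sample_mu(y_m, y_f, delta, sigma2, mu0, gamma0_sq, rng)
        delta = sample_delta(y_m, y_f, mu, sigma2, delta0, tau0_sq, rng)
        sigma2 = sample_sigma2(y_m, y_f, mu, delta, nu0, sigma0_sq, rng)
        draws[t] = (mu, delta, sigma2)
    return draws  # columns: mu, delta, sigma^2
```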