STA 360/602L: Module 4.2

Multivariate normal model II

Dr. Olanrewaju Michael Akande

1 / 15

Multivariate normal likelihood recap

  • For data \boldsymbol{Y}_i = (Y_{i1},\ldots,Y_{ip})^T \overset{iid}{\sim} \mathcal{N}_p(\boldsymbol{\theta}, \Sigma), \ i = 1, \ldots, n, the likelihood is

    \begin{split} p(\boldsymbol{Y} | \boldsymbol{\theta}, \Sigma) & \propto \left|\Sigma\right|^{-\frac{n}{2}} \ \textrm{exp} \left\{-\dfrac{1}{2} \sum^n_{i=1} (\boldsymbol{y}_i - \boldsymbol{\theta})^T \Sigma^{-1} (\boldsymbol{y}_i - \boldsymbol{\theta})\right\}. \end{split}

  • For \boldsymbol{\theta}, it is convenient to write p(\boldsymbol{Y} | \boldsymbol{\theta}, \Sigma) as

    \begin{split} p(\boldsymbol{Y} | \boldsymbol{\theta}, \Sigma) & \propto \textrm{exp} \left\{-\dfrac{1}{2} \boldsymbol{\theta}^T(n\Sigma^{-1})\boldsymbol{\theta} + \boldsymbol{\theta}^T (n\Sigma^{-1} \bar{\boldsymbol{y}}) \right\},\\ \end{split}

    where \bar{\boldsymbol{y}} = (\bar{y}_1,\ldots,\bar{y}_p)^T.

  • For \Sigma, it is convenient to write p(\boldsymbol{Y} | \boldsymbol{\theta}, \Sigma) as

    \begin{split} p(\boldsymbol{Y} | \boldsymbol{\theta}, \Sigma) & \propto \left|\Sigma\right|^{-\frac{n}{2}} \ \textrm{exp} \left\{-\dfrac{1}{2}\text{tr}\left[\boldsymbol{S}_\theta \Sigma^{-1} \right] \right\},\\ \end{split}

    where \boldsymbol{S}_\theta = \sum^n_{i=1}(\boldsymbol{y}_i - \boldsymbol{\theta})(\boldsymbol{y}_i - \boldsymbol{\theta})^T is the residual sum of squares matrix.
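  • A quick numerical check of the trace identity above: the Python sketch below (with arbitrary example values for n, p, \boldsymbol{\theta}, and \Sigma) simulates data and confirms that the quadratic-form and trace versions of the exponent agree.

    import numpy as np

    rng = np.random.default_rng(360)
    n, p = 10, 3
    theta = np.zeros(p)
    Sigma = np.array([[2.0, 0.5, 0.0],
                      [0.5, 1.0, 0.3],
                      [0.0, 0.3, 1.5]])
    Y = rng.multivariate_normal(theta, Sigma, size=n)   # rows are the y_i

    Sigma_inv = np.linalg.inv(Sigma)
    resid = Y - theta
    quad_sum = np.sum((resid @ Sigma_inv) * resid)      # sum_i (y_i - theta)^T Sigma^{-1} (y_i - theta)
    S_theta = resid.T @ resid                           # residual sum of squares matrix
    trace_form = np.trace(S_theta @ Sigma_inv)          # tr(S_theta Sigma^{-1})
    print(np.isclose(quad_sum, trace_form))             # True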

2 / 15

Prior for the mean

  • A convenient specification of the joint prior is \pi(\boldsymbol{\theta}, \Sigma) = \pi(\boldsymbol{\theta}) \pi(\Sigma).

  • As in the univariate case, a convenient prior distribution for \boldsymbol{\theta} is also normal (multivariate in this case).

  • Assume that \pi(\boldsymbol{\theta}) = \mathcal{N}_p(\boldsymbol{\mu}_0, \Lambda_0).

  • The pdf will be easier to work with if we write it as

    \begin{split} \pi(\boldsymbol{\theta}) & = (2\pi)^{-\frac{p}{2}} \left|\Lambda_0\right|^{-\frac{1}{2}} \ \textrm{exp} \left\{-\dfrac{1}{2} (\boldsymbol{\theta} - \boldsymbol{\mu}_0)^T \Lambda_0^{-1} (\boldsymbol{\theta} - \boldsymbol{\mu}_0)\right\}\\ & \propto \textrm{exp} \left\{-\dfrac{1}{2} (\boldsymbol{\theta} - \boldsymbol{\mu}_0)^T \Lambda_0^{-1} (\boldsymbol{\theta} - \boldsymbol{\mu}_0)\right\}\\ & = \textrm{exp} \left\{-\dfrac{1}{2} \left[\boldsymbol{\theta}^T\Lambda_0^{-1}\boldsymbol{\theta} - \underbrace{\boldsymbol{\theta}^T\Lambda_0^{-1}\boldsymbol{\mu}_0 - \boldsymbol{\mu}_0^T\Lambda_0^{-1}\boldsymbol{\theta}}_{\textrm{same term}} + \underbrace{\boldsymbol{\mu}_0^T\Lambda_0^{-1}\boldsymbol{\mu}_0}_{\text{does not involve } \boldsymbol{\theta}} \right] \right\}\\ & \propto \textrm{exp} \left\{-\dfrac{1}{2} \left[\boldsymbol{\theta}^T\Lambda_0^{-1}\boldsymbol{\theta} - 2\boldsymbol{\theta}^T\Lambda_0^{-1}\boldsymbol{\mu}_0 \right] \right\}\\ & = \textrm{exp} \left\{-\dfrac{1}{2} \boldsymbol{\theta}^T\Lambda_0^{-1}\boldsymbol{\theta} + \boldsymbol{\theta}^T\Lambda_0^{-1}\boldsymbol{\mu}_0 \right\}\\ \end{split}

3 / 15

Prior for the mean

  • So we have

    \begin{split} \pi(\boldsymbol{\theta}) & \propto \textrm{exp} \left\{-\dfrac{1}{2} \boldsymbol{\theta}^T\Lambda_0^{-1}\boldsymbol{\theta} + \boldsymbol{\theta}^T\Lambda_0^{-1}\boldsymbol{\mu}_0 \right\}.\\ \end{split}

  • Key trick for combining with the likelihood: When the normal density is written in this form, note the following details in the exponent.

    • In the first part, the inverse of the covariance matrix \Lambda_0^{-1} is "sandwiched" between \boldsymbol{\theta}^T and \boldsymbol{\theta}.

    • In the second part, the \boldsymbol{\theta} in the first part is replaced (sort of) with the mean \boldsymbol{\mu}_0, with \Lambda_0^{-1} keeping its place.

  • The two points above will help us identify updated means and updated covariance matrices relatively quickly.

4 / 15

Conditional posterior for the mean

  • Our conditional posterior (full conditional) for \boldsymbol{\theta} | \Sigma, \boldsymbol{Y} is then

    \begin{split} \pi(\boldsymbol{\theta} | \Sigma, \boldsymbol{Y}) & \propto p(\boldsymbol{Y} | \boldsymbol{\theta}, \Sigma) \cdot \pi(\boldsymbol{\theta}) \\ \\ & \propto \underbrace{\textrm{exp} \left\{-\dfrac{1}{2} \boldsymbol{\theta}^T(n\Sigma^{-1})\boldsymbol{\theta} + \boldsymbol{\theta}^T (n\Sigma^{-1} \bar{\boldsymbol{y}}) \right\}}_{p(\boldsymbol{Y} | \boldsymbol{\theta}, \Sigma)} \cdot \underbrace{\textrm{exp} \left\{-\dfrac{1}{2} \boldsymbol{\theta}^T\Lambda_0^{-1}\boldsymbol{\theta} + \boldsymbol{\theta}^T\Lambda_0^{-1}\boldsymbol{\mu}_0 \right\}}_{\pi(\boldsymbol{\theta})} \\ \\ & = \textrm{exp} \left\{\underbrace{-\dfrac{1}{2} \boldsymbol{\theta}^T(n\Sigma^{-1})\boldsymbol{\theta} -\dfrac{1}{2} \boldsymbol{\theta}^T\Lambda_0^{-1}\boldsymbol{\theta}}_{\textrm{First parts from } p(\boldsymbol{Y} | \boldsymbol{\theta}, \Sigma) \textrm{ and } \pi(\boldsymbol{\theta})} + \underbrace{\boldsymbol{\theta}^T (n\Sigma^{-1} \bar{\boldsymbol{y}}) + \boldsymbol{\theta}^T\Lambda_0^{-1}\boldsymbol{\mu}_0}_{\textrm{Second parts from } p(\boldsymbol{Y} | \boldsymbol{\theta}, \Sigma) \textrm{ and } \pi(\boldsymbol{\theta})} \right\}\\ \\ & = \textrm{exp} \left\{-\dfrac{1}{2} \boldsymbol{\theta}^T \left[n\Sigma^{-1} + \Lambda_0^{-1}\right] \boldsymbol{\theta} + \boldsymbol{\theta}^T \left[ n\Sigma^{-1} \bar{\boldsymbol{y}} + \Lambda_0^{-1}\boldsymbol{\mu}_0 \right] \right\}, \end{split}

    which is just another multivariate normal distribution.

5 / 15

Conditional posterior for the mean

  • To confirm the normal density and its parameters, compare to the prior kernel

    \begin{split} \pi(\boldsymbol{\theta}) & \propto \textrm{exp} \left\{-\dfrac{1}{2} \boldsymbol{\theta}^T\Lambda_0^{-1}\boldsymbol{\theta} + \boldsymbol{\theta}^T\Lambda_0^{-1}\boldsymbol{\mu}_0 \right\}\\ \end{split}

    and the posterior kernel we just derived, that is,

    \begin{split} \pi(\boldsymbol{\theta} | \Sigma, \boldsymbol{Y}) & \propto \textrm{exp} \left\{-\dfrac{1}{2} \boldsymbol{\theta}^T \left[\Lambda_0^{-1} + n\Sigma^{-1}\right] \boldsymbol{\theta} + \boldsymbol{\theta}^T \left[\Lambda_0^{-1}\boldsymbol{\mu}_0 + n\Sigma^{-1} \bar{\boldsymbol{y}} \right] \right\}. \end{split}

  • It is (relatively) easy to see that \boldsymbol{\theta} | \Sigma, \boldsymbol{Y} \sim \mathcal{N}_p(\boldsymbol{\mu}_n, \Lambda_n), with

    \Lambda_n = \left[\Lambda_0^{-1} + n\Sigma^{-1}\right]^{-1}

    and

    \boldsymbol{\mu}_n = \Lambda_n \left[\Lambda_0^{-1}\boldsymbol{\mu}_0 + n\Sigma^{-1} \bar{\boldsymbol{y}} \right]
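  • Given \Sigma, sampling from this full conditional is direct. Here is a minimal Python/NumPy sketch (the function name and arguments are illustrative, not from the slides):

    import numpy as np

    def draw_theta(ybar, n, Sigma, mu0, Lambda0, rng):
        """One draw from theta | Sigma, Y ~ N_p(mu_n, Lambda_n)."""
        Lambda0_inv = np.linalg.inv(Lambda0)
        Sigma_inv = np.linalg.inv(Sigma)
        Lambda_n = np.linalg.inv(Lambda0_inv + n * Sigma_inv)          # posterior covariance
        mu_n = Lambda_n @ (Lambda0_inv @ mu0 + n * Sigma_inv @ ybar)   # posterior mean
        return rng.multivariate_normal(mu_n, Lambda_n)

    # Example usage with arbitrary values:
    # rng = np.random.default_rng(1)
    # draw_theta(ybar=np.array([1.0, 2.0]), n=50, Sigma=np.eye(2),
    #            mu0=np.zeros(2), Lambda0=10 * np.eye(2), rng=rng)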

6 / 15

Bayesian inference

  • As in the univariate case, we once again have that

    • Posterior precision is sum of prior precision and data precision:

      \Lambda_n^{-1} = \Lambda_0^{-1} + n\Sigma^{-1}

    • Posterior expectation is weighted average of prior expectation and the sample mean:

      \begin{split} \boldsymbol{\mu}_n & = \Lambda_n \left[\Lambda_0^{-1}\boldsymbol{\mu}_0 + n\Sigma^{-1} \bar{\boldsymbol{y}} \right]\\ \\ & = \overbrace{\left[ \Lambda_n \Lambda_0^{-1} \right]}^{\textrm{weight on prior mean}} \underbrace{\boldsymbol{\mu}_0}_{\textrm{prior mean}} + \overbrace{\left[ \Lambda_n (n\Sigma^{-1}) \right]}^{\textrm{weight on sample mean}} \underbrace{ \bar{\boldsymbol{y}}}_{\textrm{sample mean}} \end{split}

  • Compare these to the results from the univariate case to gain more intuition.
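  • Since \Lambda_n \left[\Lambda_0^{-1} + n\Sigma^{-1}\right] = I, the two weight matrices above sum to the identity, the matrix analogue of the univariate weights summing to one. A short Python check with arbitrary example matrices:

    import numpy as np

    p, n = 2, 25
    Lambda0 = np.array([[1.0, 0.3], [0.3, 2.0]])     # example prior covariance
    Sigma = np.array([[1.5, 0.4], [0.4, 1.0]])       # example data covariance

    Lambda_n = np.linalg.inv(np.linalg.inv(Lambda0) + n * np.linalg.inv(Sigma))
    W_prior = Lambda_n @ np.linalg.inv(Lambda0)      # weight on the prior mean
    W_data = Lambda_n @ (n * np.linalg.inv(Sigma))   # weight on the sample mean
    print(np.allclose(W_prior + W_data, np.eye(p)))  # True: the weights sum to the identity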

7 / 15

What about the covariance matrix?

  • In the univariate case with y_i \sim \mathcal{N}(\mu, \sigma^2), the common choice for the prior is an inverse-gamma distribution for the variance \sigma^2.

  • As we have seen, we can rewrite the model as y_i \sim \mathcal{N}(\mu, \tau^{-1}), so that we have a gamma prior for the precision \tau.

  • In the multivariate normal case, we have a covariance matrix \Sigma instead of a scalar.

  • It would be appealing to have a matrix-valued extension of the inverse-gamma (and gamma) that would be conjugate.

  • One complication is that the covariance matrix \Sigma must be positive definite and symmetric.

8 / 15

Positive definite and symmetric

  • "Positive definite" means that for all x \in \mathcal{R}^p, x^T \Sigma x > 0.

  • This ensures that the diagonal elements of \Sigma (corresponding to the marginal variances) are positive.

  • It also ensures that the correlation coefficients for each pair of variables are between -1 and 1.

  • Our prior for \Sigma should thus assign probability one to the set of positive definite matrices.

  • Analogous to the univariate case, the inverse-Wishart distribution is the corresponding conditionally conjugate prior for \Sigma (multivariate generalization of the inverse-gamma).

  • The textbook covers the construction of Wishart and inverse-Wishart random variables. We will skip the actual development in class but will write code to sample random variates.

9 / 15

Inverse-Wishart distribution

  • A random variable \Sigma \sim \textrm{IW}_p(\nu_0, \boldsymbol{S}_0), where \Sigma is positive definite and p \times p, has pdf

    \begin{split} p(\Sigma) \ \propto \ \left|\Sigma\right|^{\frac{-(\nu_0 + p + 1)}{2}} \textrm{exp} \left\{-\dfrac{1}{2} \text{tr}(\boldsymbol{S}_0\Sigma^{-1}) \right\}, \end{split}

    where

    • \nu_0 > p - 1 is the "degrees of freedom", and
    • \boldsymbol{S}_0 is a p \times p positive definite matrix.
  • For this distribution, \mathbb{E}[\Sigma] = \dfrac{1}{\nu_0 - p - 1} \boldsymbol{S}_0, for \nu_0 > p + 1.

  • Hence, \boldsymbol{S}_0 is proportional to the mean of the \textrm{IW}_p(\nu_0, \boldsymbol{S}_0) distribution.
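  • As a quick sanity check of this mean formula, one can draw from the inverse-Wishart with scipy and compare the Monte Carlo average to \dfrac{1}{\nu_0 - p - 1}\boldsymbol{S}_0; the sketch below uses arbitrary example values.

    import numpy as np
    from scipy.stats import invwishart

    p, nu0 = 3, 7                              # nu0 > p + 1, so the mean exists
    S0 = np.diag([2.0, 1.0, 0.5])
    draws = invwishart(df=nu0, scale=S0).rvs(size=20000, random_state=1)
    print(draws.mean(axis=0))                  # Monte Carlo mean of the draws
    print(S0 / (nu0 - p - 1))                  # theoretical mean S0 / (nu0 - p - 1)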

10 / 15

Inverse-Wishart distribution

  • If we are very confident in a prior guess \Sigma_0 for \Sigma, then we might set

    • \nu_0, the degrees of freedom, to be very large, and
    • \boldsymbol{S}_0 = (\nu_0 - p - 1)\Sigma_0.

    In this case, \mathbb{E}[\Sigma] = \dfrac{1}{\nu_0 - p - 1} \boldsymbol{S}_0 = \dfrac{1}{\nu_0 - p - 1}(\nu_0 - p - 1)\Sigma_0 = \Sigma_0, and \Sigma is tightly (depending on the value of \nu_0) centered around \Sigma_0.

  • If we are not at all confident but we still have a prior guess \Sigma_0, we might set

    • \nu_0 = p + 2, so that \mathbb{E}[\Sigma] = \dfrac{1}{\nu_0 - p - 1} \boldsymbol{S}_0 is finite, and
    • \boldsymbol{S}_0 = \Sigma_0.

    Here, \mathbb{E}[\Sigma] = \Sigma_0 as before, but \Sigma is only loosely centered around \Sigma_0.

11 / 15

Wishart distribution

  • Just as we had with the gamma and inverse-gamma relationship in the univariate case, we can also work in terms of the Wishart distribution (multivariate generalization of the gamma) instead.

  • The Wishart distribution provides a conditionally-conjugate prior for the precision matrix \Sigma^{-1} in a multivariate normal model.

  • Specifically, if \Sigma \sim \textrm{IW}_p(\nu_0, \boldsymbol{S}_0), then \Phi = \Sigma^{-1} \sim \textrm{W}_p(\nu_0, \boldsymbol{S}_0^{-1}).

  • A random variable \Phi \sim \textrm{W}_p(\nu_0, \boldsymbol{S}_0^{-1}), where \Phi has dimension (p \times p), has pdf

    \begin{split} f(\Phi) \ \propto \ \left|\Phi\right|^{\frac{\nu_0 - p - 1}{2}} \textrm{exp} \left\{-\dfrac{1}{2} \text{tr}(\boldsymbol{S}_0\Phi) \right\}. \end{split}

  • Here, \mathbb{E}[\Phi] = \nu_0 \boldsymbol{S}_0^{-1}.

  • Note that the textbook writes the inverse-Wishart as \textrm{IW}_p(\nu_0, \boldsymbol{S}_0^{-1}). I prefer \textrm{IW}_p(\nu_0, \boldsymbol{S}_0) instead. Feel free to use either notation but try not to get confused.
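  • This relationship also gives one way to sample \Sigma in code: draw \Phi = \Sigma^{-1} from the Wishart and invert. A minimal Python sketch (scipy's wishart and invwishart, with arbitrary example values) confirms the two routes agree:

    import numpy as np
    from scipy.stats import wishart, invwishart

    p, nu0 = 2, 6
    S0 = np.array([[1.0, 0.3], [0.3, 2.0]])

    # Route 1: draw Phi = Sigma^{-1} ~ W_p(nu0, S0^{-1}), then invert each draw
    Phi = wishart(df=nu0, scale=np.linalg.inv(S0)).rvs(size=20000, random_state=2)
    Sigma_via_W = np.linalg.inv(Phi)           # inverts each p x p slice

    # Route 2: draw Sigma ~ IW_p(nu0, S0) directly
    Sigma_direct = invwishart(df=nu0, scale=S0).rvs(size=20000, random_state=3)

    print(Sigma_via_W.mean(axis=0))            # both Monte Carlo means should be close to
    print(Sigma_direct.mean(axis=0))
    print(S0 / (nu0 - p - 1))                  # the theoretical mean S0 / (nu0 - p - 1)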

12 / 15

Conditional posterior for covariance

  • Assuming \pi(\Sigma) = \textrm{IW}_p(\nu_0, \boldsymbol{S}_0), the conditional posterior (full conditional) for \Sigma | \boldsymbol{\theta}, \boldsymbol{Y} is then

    \begin{split} \pi(\Sigma| \boldsymbol{\theta}, \boldsymbol{Y}) & \propto p(\boldsymbol{Y} | \boldsymbol{\theta}, \Sigma) \cdot \pi(\Sigma) \\ \\ & \propto \underbrace{\left|\Sigma\right|^{-\frac{n}{2}} \ \textrm{exp} \left\{-\dfrac{1}{2}\text{tr}\left[\boldsymbol{S}_\theta \Sigma^{-1} \right] \right\}}_{p(\boldsymbol{Y} | \boldsymbol{\theta}, \Sigma)} \cdot \underbrace{\left|\Sigma\right|^{\frac{-(\nu_0 + p + 1)}{2}} \textrm{exp} \left\{-\dfrac{1}{2} \text{tr}(\boldsymbol{S}_0\Sigma^{-1}) \right\}}_{\pi(\Sigma)} \\ \\ & \propto \left|\Sigma\right|^{\frac{-(\nu_0 + n + p + 1)}{2}} \textrm{exp} \left\{-\dfrac{1}{2} \text{tr}\left[\boldsymbol{S}_0\Sigma^{-1} + \boldsymbol{S}_\theta \Sigma^{-1} \right] \right\} \\ \\ & = \left|\Sigma\right|^{\frac{-(\nu_0 + n + p + 1)}{2}} \textrm{exp} \left\{-\dfrac{1}{2} \text{tr}\left[ \left(\boldsymbol{S}_0 + \boldsymbol{S}_\theta \right) \Sigma^{-1} \right] \right\}, \end{split}

    which is \textrm{IW}_p(\nu_n, \boldsymbol{S}_n), or using the notation in the book, \textrm{IW}_p(\nu_n, \boldsymbol{S}_n^{-1} ), with

    • \nu_n = \nu_0 + n, and
    • \boldsymbol{S}_n = \left[\boldsymbol{S}_0 + \boldsymbol{S}_\theta \right]
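  • With scipy, one draw from this full conditional is then a single call. A minimal Python sketch (the function name and arguments are illustrative, not from the slides):

    import numpy as np
    from scipy.stats import invwishart

    def draw_Sigma(Y, theta, nu0, S0, seed=None):
        """One draw from Sigma | theta, Y ~ IW_p(nu0 + n, S0 + S_theta)."""
        n = Y.shape[0]
        resid = Y - theta
        S_theta = resid.T @ resid              # residual sum of squares matrix
        return invwishart(df=nu0 + n, scale=S0 + S_theta).rvs(random_state=seed)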
13 / 15

Conditional posterior for covariance

  • We once again see that the "posterior sample size" or "posterior degrees of freedom" \nu_n is the sum of the "prior degrees of freedom" \nu_0 and the data sample size n.

  • \boldsymbol{S}_n can be thought of as the "posterior sum of squares", which is the "prior sum of squares" plus the "sample sum of squares".

  • Recall that if \Sigma \sim \textrm{IW}_p(\nu_0, \boldsymbol{S}_0), then \mathbb{E}[\Sigma] = \dfrac{1}{\nu_0 - p - 1} \boldsymbol{S}_0.

  • \Rightarrow the conditional posterior expectation of the population covariance is

    \begin{split} \mathbb{E}[\Sigma | \boldsymbol{\theta}, \boldsymbol{Y}] & = \dfrac{1}{\nu_0 + n - p - 1} \left[\boldsymbol{S}_0 + \boldsymbol{S}_\theta \right]\\ \\ & = \underbrace{\dfrac{\nu_0 - p - 1}{\nu_0 + n - p - 1}}_{\text{weight on prior expectation}} \overbrace{\left[\dfrac{1}{\nu_0 - p - 1} \boldsymbol{S}_0\right]}^{\text{prior expectation}} + \underbrace{\dfrac{n}{\nu_0 + n - p - 1}}_{\text{weight on sample estimate}} \overbrace{\left[\dfrac{1}{n} \boldsymbol{S}_\theta \right]}^{\text{sample estimate}},\\ \end{split}

    which is a weighted average of prior expectation and sample estimate.
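  • The algebra behind this weighted-average form is easy to verify numerically; a short Python check with arbitrary example values:

    import numpy as np

    p, n, nu0 = 2, 30, 6
    S0 = np.array([[1.0, 0.2], [0.2, 1.5]])          # example prior sum of squares
    S_theta = np.array([[40.0, 5.0], [5.0, 55.0]])   # example residual sum of squares

    direct = (S0 + S_theta) / (nu0 + n - p - 1)
    w_prior = (nu0 - p - 1) / (nu0 + n - p - 1)
    w_data = n / (nu0 + n - p - 1)
    weighted = w_prior * (S0 / (nu0 - p - 1)) + w_data * (S_theta / n)
    print(np.allclose(direct, weighted))             # True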

14 / 15

What's next?

Move on to the readings for the next module!

15 / 15
