STA 360/602L: Module 3.7

MCMC and Gibbs sampling I

Dr. Olanrewaju Michael Akande

1 / 13

Bayesian inference (conjugacy recap)

  • As we've seen so far, Bayesian inference is based on posterior distributions, that is,

    \begin{split} \pi(\theta | y) & = \frac{\pi(\theta) \cdot p(y|\theta)}{\int_{\Theta}\pi(\tilde{\theta}) \cdot p(y| \tilde{\theta}) \textrm{d}\tilde{\theta}} = \frac{\pi(\theta) \cdot L(\theta | y)}{L(y)}, \end{split}

    where y = (y_1, \ldots, y_n).

  • Good news: we have the numerator in this expression.

  • Bad news: the denominator is typically not available in closed form (it may involve a high-dimensional integral)!

  • How have we been getting by? Conjugacy! For conjugate priors, the posterior distribution of θ is available analytically.

  • What if a conjugate prior does not represent our prior information well, or we have a more complex model, and our posterior is no longer in a convenient distributional form?

2 / 13

Some conjugate models

  • For example, the most common conjugate models are
    Prior          Likelihood           Posterior
    -------------  -------------------  -------------
    beta           binomial             beta
    gamma          Poisson              gamma
    gamma          exponential          gamma
    normal-gamma   normal               normal-gamma
    beta           negative-binomial    beta
    beta           geometric            beta
  • There are a few more we have not covered yet, for example, the Dirichlet-multinomial model.

  • However, clearly, we cannot restrict ourselves to conjugate models only.

3 / 13
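
  • To see concretely why conjugacy sidesteps the denominator, take the first row of the table above: a beta prior with a binomial likelihood. Writing the prior as \textrm{Beta}(a, b) (the hyperparameter names a and b are used here just for illustration) and the data as y successes in n trials, working with the numerator only gives

    \begin{split} \pi(\theta | y) & \propto \underbrace{\theta^{a-1}(1-\theta)^{b-1}}_{\propto \ \pi(\theta)} \times \underbrace{\theta^{y}(1-\theta)^{n-y}}_{\propto \ p(y|\theta)}\\ & = \theta^{(a+y)-1}(1-\theta)^{(b+n-y)-1}, \end{split}

    which is the kernel of a \textrm{Beta}(a+y, \ b+n-y) density, so the posterior is available without ever evaluating the normalizing integral.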

Back to the normal model

  • For example, for conjugacy in the normal model, we had

    \begin{split} \pi(\mu | \tau) & = \mathcal{N}\left(\mu_0, \dfrac{1}{\kappa_0 \tau}\right)\\ \pi(\tau) & = \textrm{Gamma}\left(\dfrac{\nu_0}{2}, \dfrac{\nu_0}{2\tau_0}\right) \end{split}

  • Suppose we instead wish to specify our uncertainty about μ as independent of τ, that is, we want π(μ,τ)=π(μ)π(τ). For example,

    \begin{split} \pi(\mu) & = \mathcal{N}\left(\mu_0, \sigma_0^2\right)\\ \pi(\tau) & = \textrm{Gamma}\left(\dfrac{\nu_0}{2}, \dfrac{\nu_0}{2\tau_0}\right). \end{split}

  • When \sigma_0^2 is not proportional to \dfrac{1}{\tau}, the marginal posterior density of \tau is not a gamma density (or a density we can easily sample from).

  • Side note: for conjugacy, the joint posterior should also be a product of two independent normal and gamma densities in \mu and \tau, respectively.

4 / 13

Non-conjugate priors

  • In general, conjugate priors are not available for generalized linear models (GLMs) other than the normal linear model.

  • One can potentially rely on an asymptotic normal approximation.

  • As n \rightarrow \infty, the posterior distribution is normal, centered on the MLE.

  • However, even for moderate sample sizes, asymptotic approximations may be inaccurate.

  • In logistic regression, for example, with rare outcomes or rare binary exposures, the posterior can be highly skewed.

  • It is appealing to avoid any reliance on large-sample assumptions and to base inference on the exact posterior.

5 / 13

Non-conjugate priors

  • Even though we may not be able to sample from the marginal posterior of a particular parameter when using a non-conjugate prior, we may sometimes still be able to sample from the conditional distribution of that parameter given all other parameters and the data.

  • These conditional distributions, known as full conditionals, will be very important for us.

  • In our normal example with

    \begin{split} \mu & \sim \mathcal{N}\left(\mu_0, \sigma_0^2\right)\\ \tau & \sim \textrm{Gamma}\left(\dfrac{\nu_0}{2}, \dfrac{\nu_0}{2\tau_0}\right), \end{split}

    it turns out we will not be able to sample easily from \tau | Y.

  • However, as you will see, we will be able to sample from \tau | \mu, Y. That is the full conditional for \tau.

  • By the way, note that we already know the full conditional for \mu, that is, \mu | \tau, Y, from previous modules.

6 / 13

Full conditional distributions

  • Goal: try to take advantage of those full conditional distributions (without sampling directly from the marginal posteriors) to obtain samples from the said marginal posteriors.

  • In our example, with \pi(\mu) = \mathcal{N}(\mu_0, \sigma_0^2), we have

    \mu | Y, \tau \sim \mathcal{N}\left(\mu_n, \tau_n^{-1}\right),

    where
    • \mu_n = \dfrac{\dfrac{\mu_0}{\sigma_0^2} + n\tau\bar{y}}{\dfrac{1}{\sigma_0^2} + n\tau}; and
    • \tau_n = \dfrac{1}{\sigma_0^2} + n\tau.
  • Review results from previous modules on the normal model if you are not sure why this holds.

  • Let's see if we can figure out the other full conditional, \tau | \mu, Y.

7 / 13
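
  • To make this concrete, here is a minimal sketch (in Python, not course code) of a single draw from \mu | Y, \tau. The data y, the hyperparameter values mu0 and sigma0_sq, the current value tau, and the seed are placeholders chosen only for illustration.

    import numpy as np

    rng = np.random.default_rng(seed=360)

    # Placeholder data and hyperparameters, for illustration only
    y = rng.normal(loc=2.0, scale=1.0, size=50)   # "observed" data
    mu0, sigma0_sq = 0.0, 100.0                   # prior mean and variance of mu
    tau = 1.0                                     # current value of the precision tau

    n, ybar = len(y), y.mean()

    # mu | Y, tau ~ N(mu_n, 1 / tau_n), with posterior precision tau_n
    tau_n = 1 / sigma0_sq + n * tau
    mu_n = (mu0 / sigma0_sq + n * tau * ybar) / tau_n

    mu_draw = rng.normal(loc=mu_n, scale=1 / np.sqrt(tau_n))
    print(mu_draw)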

Full conditional distributions

\begin{split} p(\tau| \mu,Y) & \boldsymbol{=} \dfrac{p(\tau,\mu,Y)}{p(\mu,Y)} = \dfrac{p(y|\mu,\tau)\pi(\mu,\tau)}{p(\mu,y)}\\ & \ \ \ \ \ \ \ \ \ \ \ \ \\ & \boldsymbol{=} \dfrac{p(y|\mu,\tau)\pi(\mu)\pi(\tau)}{p(\mu,y)}\\ & \ \ \ \ \ \ \ \ \ \ \ \ \\ & \boldsymbol{\propto} p(y|\mu,\tau)\pi(\tau)\\ & \ \ \ \ \ \ \ \ \ \ \ \ \\ & \boldsymbol{\propto} \underbrace{\tau^{\dfrac{n}{2}} \ \textrm{exp}\left\{-\dfrac{1}{2} \tau \sum^n_{i=1} (y_i - \mu)^2 \right\}}_{\propto \ p(y|\mu,\tau)} \times \underbrace{\tau^{\dfrac{\nu_0}{2}-1} \textrm{exp}\left\{-\dfrac{\tau\nu_0}{2\tau_0}\right\}}_{\propto \ \pi(\tau)}\\ & \ \ \ \ \ \ \ \ \ \ \ \ \\ & \boldsymbol{=} \underbrace{\tau^{\dfrac{\nu_0 + n}{2} - 1} \ \textrm{exp}\left\{-\dfrac{1}{2} \tau \left[ \dfrac{\nu_0}{\tau_0} + \sum^n_{i=1} (y_i - \mu)^2 \right] \right\}}_{\textrm{Gamma Kernel}}. \end{split}

8 / 13

Full conditional distributions

\begin{split} p(\tau| \mu,Y) & \boldsymbol{\propto} \underbrace{\tau^{\dfrac{\nu_0 + n}{2} - 1} \ \textrm{exp}\left\{-\dfrac{1}{2} \tau \left[ \dfrac{\nu_0}{\tau_0} + \sum^n_{i=1} (y_i - \mu)^2 \right] \right\}}_{\textrm{Gamma Kernel}}\\ & \boldsymbol{=} \textrm{Gamma}\left(\dfrac{\nu_n}{2}, \dfrac{\nu_n}{2 \tau_n(\mu)}\right) \ \ \ \textrm{OR} \ \ \ \textrm{Gamma}\left(\dfrac{\nu_n}{2}, \dfrac{\nu_n\sigma_n^2(\mu)}{2}\right), \end{split}

where

\begin{split} \nu_n & = \nu_0 + n\\ & \ \ \ \ \ \ \ \ \ \ \ \ \\ \sigma_n^2(\mu) & = \dfrac{1}{\nu_n} \left[ \dfrac{\nu_0}{\tau_0} + \sum^n_{i=1} (y_i - \mu)^2 \right] = \dfrac{1}{\nu_n}\left[ \dfrac{\nu_0}{\tau_0} + ns^2_n(\mu) \right]\\ & \ \ \ \ \ \ \ \ \ \ \ \ \\ \ \ \textrm{OR} \ \ \tau_n(\mu) & = \dfrac{\nu_n}{\left[ \dfrac{\nu_0}{\tau_0} + \sum^n_{i=1} (y_i - \mu)^2 \right]} = \dfrac{\nu_n}{\left[ \dfrac{\nu_0}{\tau_0} + ns^2_n(\mu) \right]};\\ & \ \ \ \ \ \ \ \ \ \ \ \ \\ \textrm{with} \ \ \ s^2_n(\mu) & = \dfrac{1}{n} \sum^n_{i=1} (y_i - \mu)^2.\\ \end{split}

9 / 13
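
  • Similarly, here is a minimal sketch (in Python, not course code) of a single draw from \tau | \mu, Y. Again, the data y, the hyperparameters nu0 and tau0, the current value mu, and the seed are placeholders for illustration. Note that the slides write the gamma in (shape, rate) form, while numpy's gamma takes a scale parameter, so scale = 1 / rate.

    import numpy as np

    rng = np.random.default_rng(seed=360)

    # Placeholder data, hyperparameters, and current value of mu (illustration only)
    y = rng.normal(loc=2.0, scale=1.0, size=50)
    nu0, tau0 = 1.0, 1.0    # prior hyperparameters for tau
    mu = 2.0                # current value of mu

    n = len(y)
    nu_n = nu0 + n
    ss = np.sum((y - mu) ** 2)      # sum of squared deviations about mu

    # tau | mu, Y ~ Gamma(shape = nu_n / 2, rate = (nu0 / tau0 + ss) / 2);
    # numpy's gamma uses a scale parameter, so scale = 1 / rate
    rate = (nu0 / tau0 + ss) / 2
    tau_draw = rng.gamma(shape=nu_n / 2, scale=1 / rate)
    print(tau_draw)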

Iterative scheme

  • Now we have two full conditional distributions, but what we really need is to sample from \pi(\tau | Y).

  • Actually, if we could sample from \pi(\mu, \tau| Y), we already know that the draws for \mu and \tau will be from the two marginal posterior distributions. So, we just need a scheme to sample from \pi(\mu, \tau| Y).

  • Suppose we had a single sample, say \tau^{(1)} from the marginal posterior distribution \pi(\tau | Y). Then we could sample

    \mu^{(1)} \sim p(\mu | \tau^{(1)}, Y).

  • This is what we did in the last class, so that the pair \{\mu^{(1)},\tau^{(1)}\} is a sample from the joint posterior \pi(\mu, \tau| Y).

  • \Rightarrow \ \mu^{(1)} can be considered a sample from the marginal posterior distribution of \mu, which again means we can use it to sample

    \tau^{(2)} \sim p(\tau | \mu^{(1)}, Y),

    and so forth.

10 / 13

Gibbs sampling

  • So, we can use two full conditional distributions to generate samples from the joint distribution, once we have a starting value \tau^{(1)}.

  • Formally, this sampling scheme is known as Gibbs sampling.

    • Purpose: Draw from a joint distribution, say p(\mu, \tau| Y).
    • Method: Iterative conditional sampling

      • Draw \tau^{(1)} \sim p(\tau | \mu^{(0)}, Y)
      • Draw \mu^{(1)} \sim p(\mu | \tau^{(1)}, Y)
    • Assumes: the full conditional distributions have known forms, and sampling from them is fairly easy.

  • More generally, we can use this method to generate samples of \theta = (\theta_1, \ldots, \theta_p), the vector of p parameters of interest, from the joint density.

11 / 13

Gibbs sampling

  • Procedure:

    • Start with initial value \theta^{(0)} = (\theta_1^{(0)}, \ldots, \theta_p^{(0)}).
    • For iterations s = 1, \ldots, S,

      1. Sample \theta_1^{(s)} from the conditional posterior distribution

        \pi(\theta_1 | \theta_2 = \theta_2^{(s-1)}, \ldots, \theta_p = \theta_p^{(s-1)}, Y)

      2. Sample \theta_2^{(s)} from the conditional posterior distribution

        \pi(\theta_2 | \theta_1 = \theta_1^{(s)}, \theta_3 = \theta_3^{(s-1)}, \ldots, \theta_p = \theta_p^{(s-1)}, Y)

      3. Similarly, sample \theta_3^{(s)}, \ldots, \theta_p^{(s)} from the conditional posterior distributions given current values of other parameters.

    • This generates a dependent sequence of parameter values.
  • In the next module, we will look into a simple example of how this works, before going back to the normal model.

12 / 13
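
  • To tie the pieces together, here is a minimal sketch (in Python, not course code) of this procedure for the normal model with \theta = (\mu, \tau), using the two full conditionals derived earlier. The data y, the hyperparameters, the number of iterations S, the starting values, and the seed are all placeholder choices for illustration.

    import numpy as np

    rng = np.random.default_rng(seed=360)

    # Placeholder data and hyperparameters (illustration only)
    y = rng.normal(loc=2.0, scale=1.0, size=50)
    mu0, sigma0_sq = 0.0, 100.0     # prior on mu:  N(mu0, sigma0_sq)
    nu0, tau0 = 1.0, 1.0            # prior on tau: Gamma(nu0/2, nu0/(2*tau0))

    n, ybar = len(y), y.mean()
    S = 5000                        # number of Gibbs iterations

    mu_samples = np.empty(S)
    tau_samples = np.empty(S)
    mu, tau = ybar, 1.0             # starting values theta^(0)

    for s in range(S):
        # 1. Sample tau^(s) from its full conditional given the current mu
        rate = (nu0 / tau0 + np.sum((y - mu) ** 2)) / 2
        tau = rng.gamma(shape=(nu0 + n) / 2, scale=1 / rate)

        # 2. Sample mu^(s) from its full conditional given the current tau
        tau_n = 1 / sigma0_sq + n * tau
        mu_n = (mu0 / sigma0_sq + n * tau * ybar) / tau_n
        mu = rng.normal(loc=mu_n, scale=1 / np.sqrt(tau_n))

        mu_samples[s], tau_samples[s] = mu, tau

    # The dependent sequences mu_samples and tau_samples are draws whose
    # distribution converges to the joint posterior pi(mu, tau | Y).

  • Posterior summaries for \mu and \tau (after discarding an initial stretch of iterations) can then be computed directly from mu_samples and tau_samples.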

What's next?

Move on to the readings for the next module!

13 / 13
