
STA 360/602L: Module 2.6

Loss functions and Bayes risk

Dr. Olanrewaju Michael Akande

1 / 16

Bayes estimate

  • As we've seen by now, having posterior distributions instead of one-number summaries is great for capturing uncertainty.

  • That said, it is still very appealing to have simple summaries, especially when dealing with clients or collaborators from other fields, who desire one.

  • Can we obtain a single estimate of θ based on the posterior? Sure!

  • The Bayes estimate is the value \hat{\theta} that minimizes the Bayes risk.

2 / 16

Bayes estimate

  • Bayes risk is defined as the expected loss averaged over the posterior distribution.

  • Put differently, a Bayes estimate \hat{\theta} has the lowest posterior expected loss.

  • That's fine, but what does expected loss mean?

  • Frequentist risk also exists but we won't go into that here.

3 / 16

Loss functions

  • A loss function L(θ,δ(y)) is a function of a parameter θ, where δ(y) is some decision about θ, based on just the data y.

  • For example, δ(y) = \bar{y} can be the decision to use the sample mean to estimate θ, the true population mean.

  • L(θ,δ(y)) determines the penalty for making the decision δ(y), if θ is the true parameter; L(θ,δ(y)) characterizes the price paid for errors.

4 / 16

Loss functions

  • A common choice for example, when dealing with point estimation, is the squared error loss, which has

    L(\theta,\delta(y)) = (\theta - \delta(y))^2.

  • Bayes risk is thus

    \rho(\theta,\delta) = \mathbb{E}\left[\ L(\theta,\delta(y)) \ | \ y\right] = \int L(\theta,\delta(y)) \ \pi(\theta | y) \ d\theta,

    and we proceed to find the value \hat{\theta}, that is, the decision \delta(y), that minimizes the Bayes risk.
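
  • As a concrete illustration, the posterior expected loss can be approximated by Monte Carlo: draw θ from the posterior and average the loss at a candidate δ(y). A minimal R sketch, assuming a hypothetical Beta(3, 5) posterior and squared error loss:

    # Monte Carlo approximation of the posterior expected (squared error) loss
    # for a hypothetical Beta(3, 5) posterior
    set.seed(360)
    theta_draws <- rbeta(1e5, 3, 5)                    # draws from the posterior
    post_risk <- function(delta) mean((theta_draws - delta)^2)
    post_risk(0.30)    # some candidate estimate
    post_risk(3 / 8)   # the posterior mean a/(a+b) = 0.375; gives the smaller risk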

5 / 16

Bayes estimator under squared error loss

  • Turns out that, under squared error loss, the decision δ(y) that minimizes the posterior risk is the posterior mean.

  • Proof: Let L(\theta,\delta(y)) = (\theta - \delta(y))^2. Then,

    \rho(\theta,\delta) = \int L(\theta,\delta(y)) \ \pi(\theta | y) \ d\theta = \int (\theta - \delta(y))^2 \ \pi(\theta | y) \ d\theta.

  • Expand, then take the partial derivative of ρ(θ,δ) with respect to δ(y).

  • To be continued on the board!
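
  • For reference, one way to finish the algebra (a sketch of the board work): expanding the square,

    \begin{split} \rho(\theta,\delta) & = \int \left(\theta^2 - 2\theta\,\delta(y) + \delta(y)^2\right) \pi(\theta | y) \ d\theta\\ \\ & = \mathbb{E}\left[\theta^2 | y\right] - 2\,\delta(y)\,\mathbb{E}\left[\theta | y\right] + \delta(y)^2. \end{split}

    Setting the derivative with respect to \delta(y) to zero gives -2\,\mathbb{E}\left[\theta | y\right] + 2\,\delta(y) = 0, that is, \delta(y) = \mathbb{E}\left[\theta | y\right]; the second derivative is 2 > 0, so this is indeed a minimum.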
6 / 16

Bayes estimator under squared error loss

  • \rho(\theta,\delta) = \int (\theta - \delta(y))^2 \ \pi(\theta | y) \ d\theta.

  • It is then easy to see that \delta(y) = \mathbb{E}\left[\theta | y\right] is the minimizer.

  • Well that's great! The posterior mean is very easy to calculate in most cases.

  • In the beta-binomial case for example, the Bayes estimate under squared error loss is just

    \hat{\theta} = \dfrac{a+y}{a+b+n},

    the posterior mean.

7 / 16

What about other loss functions?

  • Clearly, squared error is only one possible loss function. An alternative is absolute loss, which has

    L(\theta,\delta(y)) = |\theta - \delta(y)|.

  • Absolute loss places less of a penalty on large deviations & the resulting Bayes estimate is the posterior median.

  • The median is actually relatively easy to compute.
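
  • A quick Monte Carlo check of this claim, as a minimal R sketch assuming a hypothetical Beta(3, 9) posterior:

    # Posterior expected absolute loss, approximated from posterior draws,
    # for a hypothetical Beta(3, 9) posterior
    set.seed(602)
    theta_draws <- rbeta(1e5, 3, 9)
    abs_risk <- function(delta) mean(abs(theta_draws - delta))
    abs_risk(mean(theta_draws))    # posterior mean
    abs_risk(qbeta(0.5, 3, 9))     # posterior median; gives the smaller expected absolute loss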

8 / 16

What about other loss functions?

  • Recall that for a continuous random variable Y with cdf F, the median of the distribution is the value z, which satisfies

    F(z) = \Pr(Y\leq z) = \dfrac{1}{2}= \Pr(Y\geq z) = 1-F(z).

  • As long as we know how to evaluate the CDF of the distribution we have, we can solve for z.

  • Think R!

9 / 16

What about other loss functions?

  • For the beta-binomial model, the CDF of the beta posterior can be written as

    F(z) = \Pr(\theta\leq z | y) = \int^z_0 \textrm{Beta}(\theta| a+y, b+n-y) d\theta.

  • Then, if \hat{\theta} is the median, we have that F(\hat{\theta}) = 0.5.

  • To solve for \hat{\theta}, apply the inverse CDF \hat{\theta} = F^{-1}(0.5).

  • In R, that's simply

    qbeta(0.5, a+y, b+n-y)
  • For other popular distributions, switch out the beta.
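
  • For instance, with a hypothetical Beta(1, 1) prior and made-up data of y = 12 successes in n = 20 trials, the two Bayes estimates compare as follows:

    # Hypothetical prior and data, for illustration only
    a <- 1; b <- 1                   # Beta(1, 1) prior
    y <- 12; n <- 20                 # observed successes and trials
    qbeta(0.5, a + y, b + n - y)     # posterior median: Bayes estimate under absolute loss
    (a + y) / (a + b + n)            # posterior mean: Bayes estimate under squared error loss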
10 / 16

Loss functions and decisions

  • Loss functions are not specific to estimation problems but are a critical part of decision making.

  • For example, suppose you are deciding how much money to bet ($A) on Duke in the next UNC-Duke men's basketball game.

  • Suppose, if Duke

    • loses (y = 0), you lose the amount you bet ($A)
    • wins (y = 1), you gain B per $1 bet
  • What is a good sampling distribution for y here?
  • Then, the loss function can be characterized as

    L(A,y) = A(1-y) - y(BA),

    with your action being the amount bet A.

  • When will your bet be "rational"?
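
  • As a sanity check, the loss function is easy to code up (a minimal R sketch; A, B, and y are as defined above):

    # Loss for betting $A on Duke, where a Duke win pays $B per $1 bet
    bet_loss <- function(A, y, B) A * (1 - y) - y * B * A
    bet_loss(A = 10, y = 0, B = 1.5)   # Duke loses: loss of $10 (the amount bet)
    bet_loss(A = 10, y = 1, B = 1.5)   # Duke wins: loss of -$15, i.e. a $15 gain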
11 / 16

How much to bet on Duke?

  • y is an unknown state, but we can think of it as a new prediction y_{n+1} given that we have data from win-loss records (y_{1:n}) that can be converted into a Bayesian posterior,

    \theta \sim \textrm{beta}(a_n,b_n),

    with this posterior concentrated slightly to the left of 0.5, if we only use data on UNC-Duke games (UNC men lead Duke 139-112 all time).
  • Actually, it might make more sense to focus on more recent head-to-head data and not the all time record.

  • In fact, we might want to build a model that predicts the outcome of the game using historical data & predictors (current team rankings, injuries, etc).

  • However, to keep it simple for this illustration, go with the posterior above.

12 / 16

How much to bet on Duke?

  • The Bayes risk for action A is then the expectation of the loss function,

    \rho(A) = \mathbb{E}\left[\ L(A,y) | \ y_{1:n}\right].

  • To calculate this as a function of A and find the optimal A, we need to marginalize over the posterior predictive distribution for y.

  • Why are we using the posterior predictive distribution here instead of the posterior distribution?
  • As an aside, recall from Module 2.3 that

    p(y_{n+1}|y_{1:n}) = \dfrac{a_n^{y_{n+1}} b_n^{1-y_{n+1}}}{a_n + b_n}; \ \ \ y_{n+1}=0,1.

  • Specifically, that the posterior predictive distribution here is \textrm{Bernoulli}(\hat{\theta}), with

    \hat{\theta} = \dfrac{a_n}{a_n + b_n}

  • By the way, what do a_n and b_n represent?

13 / 16

How much to bet on Duke?

  • With the loss function L(A,y) = A(1-y) - y(BA), and using the notation y_{n+1} instead of y (to make it obvious the game has not been played), the Bayes risk (expected loss) for bet A is

    \begin{split} \rho(A) & = \mathbb{E}\left[\ L(A,y_{n+1}) \ | y_{1:n}\right]\\ \\ & = \mathbb{E}\left[A(1-y_{n+1}) - y_{n+1}(BA) \ | y_{1:n}\right]\\ \\ & = A \ \mathbb{E}\left[1-y_{n+1} | \ y_{1:n}\right] - (BA) \ \mathbb{E}\left[y_{n+1} | y_{1:n}\right]\\ \\ & = A \ \left(1 - \mathbb{E}\left[y_{n+1} | y_{1:n}\right] \right) - (BA) \ \mathbb{E}\left[y_{n+1} | y_{1:n}\right]\\ \\ & = A \ \left(1 -\mathbb{E}\left[y_{n+1} | y_{1:n}\right] \ (1+B) \right). \end{split}
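
  • Numerically, this is a one-liner. A minimal R sketch, with made-up values of a_n, b_n, and B for illustration:

    # Bayes risk of betting $A, using the posterior predictive mean
    # theta_hat = a_n / (a_n + b_n); all values below are made up
    a_n <- 50; b_n <- 55               # hypothetical posterior parameters
    B <- 1.2                           # hypothetical payout per $1 bet
    theta_hat <- a_n / (a_n + b_n)
    rho <- function(A) A * (1 - theta_hat * (1 + B))
    rho(10)                  # expected loss of a $10 bet (negative means an expected gain)
    theta_hat * (1 + B) > 1  # TRUE when betting is "rational", matching the condition on the next slide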

14 / 16

How much to bet on Duke?

  • Hence, your bet is rational as long as

    \begin{split} \mathbb{E}\left[y_{n+1} | \ y_{1:n}\right](1+B) > 1\\ \\ \dfrac{a_n (1+B)}{a_n + b_n} > 1. \end{split}

  • Clearly, there is no limit to the amount you should bet if this condition is satisfied (this loss function is far too simple).

  • The loss function needs to be chosen carefully to lead to a good decision: finite resources, diminishing returns, limits on donations, etc.

  • Want more on loss functions, expected loss/utility, or decision problems in general? Consider taking a course on decision theory (STA623?).

15 / 16

What's next?

Move on to the readings for the next module!

16 / 16
