STA 360/602L: Module 2.6

class: center, middle, inverse, title-slide

# STA 360/602L: Module 2.6
## Loss functions and Bayes risk
### Dr. Olanrewaju Michael Akande

---

## Bayes estimate

- As we've seen by now, having posterior distributions instead of one-number summaries is great for capturing uncertainty.

- That said, it is still very appealing to have simple summaries, especially when dealing with clients or collaborators from other fields, who desire one.

- Can we obtain a single estimate of `$\theta$` based on the posterior? Sure!

- .hlight[Bayes estimate] is the value `$\hat{\theta}$`, that minimizes the Bayes risk.

---
## Bayes estimate

- .hlight[Bayes risk] is defined as the expected loss averaged over the posterior distribution.

- Put differently, a .hlight[Bayes estimate] `$\hat{\theta}$` has the lowest posterior expected loss.

- That's fine, but what does expected loss mean?

- .hlight[Frequentist risk] also exists but we won't go into that here.

---
## Loss functions

- A .hlight[loss function] `$L(\theta,\delta(y))$` is a function of a parameter `$\theta$`, where `$\delta(y)$` is some .hlight[decision] about `$\theta$`, based on just the data `$y$`.

- For example, `$\delta(y) = \bar{y}$` can be the decision to use the sample mean to estimate `$\theta$`, the true population mean.

- `$L(\theta,\delta(y))$` determines the penalty for making the decision `$\delta(y)$`, if `$\theta$` is the true parameter; `$L(\theta,\delta(y))$` characterizes the price paid for errors.

---
## Loss functions

- A common choice for example, when dealing with point estimation, is the .hlight[squared error loss], which has
.block[
`$$L(\theta,\delta(y)) = (\theta - \delta(y))^2.$$`
]

- Bayes risk is thus
.block[
`$$\rho(\theta,\delta) = \mathbb{E}\left[\ L(\theta,\delta(y)) \ | y\right] = \int L(\theta,\delta(y)) \cdot \pi(\theta|y) \ d\theta,$$`
]

and we proceed to find the value `$\hat{\theta}$`, that is, the decision `$\delta(y)$`, that minimizes the Bayes risk.

---
## Bayes estimator under squared error loss

- Turns out that, under squared error loss, the decision `$\delta(y)$` that minimizes the posterior risk is the posterior mean.

- Proof: Let `$L(\theta,\delta(y)) = (\theta - \delta(y))^2$`. Then,
.block[
$$
`\begin{aligned}
\rho(\theta,\delta) & = \int L(\theta,\delta(y)) \cdot \pi(\theta|y) \ d\theta.\\
& = \int (\theta - \delta(y))^2 \cdot \pi(\theta|y) \ d\theta.\\
\end{aligned}`
$$
]

- Expand, then take the partial derivative of `$\rho(\theta,\delta)$` with respect to `$\delta(y)$`.

- <div class="question">
To be continued on the board!
</div>

---
## Bayes estimator under squared error loss

- .block[
$$
`\begin{aligned}
\rho(\theta,\delta) & \int (\theta - \delta(y))^2 \cdot \pi(\theta|y) \ d\theta.\\
\end{aligned}`
$$
]

- Easy to see then that `$\delta(y) = \mathbb{E}[\theta | x]$` is the minimizer.

- Well that's great! The posterior mean is often very easy to calculate in most cases.

- In the beta-binomial case for example, the Bayes estimate under squared error loss is just
.block[
`$$\hat{\theta} = \frac{a+y}{a+b+n},$$`
]

the posterior mean.

---
## What about other loss functions?

- Clearly, squared error is only one possible loss function. An alternative is .hlight[absolute loss], which has
.block[
`$$L(\theta,\delta(y)) = |\theta - \delta(y)|.$$`
]

- Absolute loss places less of a penalty on large deviations & the resulting Bayes estimate is **posterior median**.

- Median is actually relatively easy to estimate.

---
## What about other loss functions?

- Recall that for a continuous random variable `$Y$` with cdf `$F$`, the median of the distribution is the value `$z$`, which satisfies
.block[
`$$F(z) = \Pr(Y\leq z) = \dfrac{1}{2}= \Pr(Y\geq z) = 1-F(z).$$`
]

- As long as we know how to evaluate the CDF of the distribution we have, we can solve for `$z$`.

- Think R!

---
## What about other loss functions?

- For the beta-binomial model, the CDF of the beta posterior can be written as
.block[
`$$F(z) = \Pr(\theta\leq z | y) = \int^z_0 \textrm{Beta}(\theta| a+y, b+n-y) d\theta.$$`
]

- Then, if `$\hat{\theta}$` is the median, we have that `$F(\hat{\theta}) = 0.5$`.

- To solve for `$\hat{\theta}$`, apply the inverse CDF `$\hat{\theta} = F^{-1}(0.5)$`.

- In R, that's simply
    
    ```r
    qbeta(0.5,a+y,b+n-y)
    ```

- For other popular distributions, switch out the beta.

---
## Loss functions and decisions

- Loss functions are not specific to estimation problems but are a critical part of decision making.

- For example, suppose you are deciding how much money to bet ($A) on Duke in the next UNC-Duke men's basketball game.

- Suppose, if Duke
  + loses (y = 0), you lose the amount you bet ($A)
  + wins (y = 1), you gain B per $1 bet
  
--

- <div class="question">
What is a good sampling distribution for y here?
</div>
  
--

- Then, the loss function can be characterized as
.block[
.small[
`$$L(A,y) = A(1-y) - y(BA),$$`
]
]
  
  with your action being the amount bet A.
  
--

- <div class="question">
When will your bet be "rational"?
</div>

---
## How much to bet on Duke?

- `$y$` is an unknown state, but we can think of it as a new prediction `$y_{n+1}$` given that we have data from win-loss records `$(y_{1:n})$` that can be converted into a Bayesian posterior,
.block[
.small[
`$$\theta \sim \textrm{beta}(a_n,b_n),$$`
]
]

--
  with this posterior concentrated slightly to the left of 0.5, if we only use data on UNC-Duke games (UNC men lead Duke 139-112 all time).
  
--

- Actually, it might make more sense to focus on more recent head-to-head data and not the all time record.

- In fact, we might want to build a model that predicts the outcome of the game using historical data & predictors (current team rankings, injuries, etc).

- However, to keep it simple for this illustration, go with the posterior above.

---
## How much to bet on Duke?

- The Bayes risk for action A is then the expectation of the loss function,
.block[
.small[
`$$\rho(A) = \mathbb{E}\left[\ L(A,y) | \ y_{1:n}\right].$$`
]
]

- To calculate this as a function of `$A$` and find the optimal `$A$`, we need to marginalize over the **posterior predictive distribution** for `$y$`.

- <div class="question">
Why are we using the posterior predictive distribution here instead of the posterior distribution?
</div>

- As an aside, recall from Module 2.3 that
.block[
.small[
`$$p(y_{n+1}|y_{1:n}) = \dfrac{a_n^{y_{n+1}} b_n^{1-y_{n+1}}}{a_n + b_n}; \ \ \ y_{n+1}=0,1.$$`
]
]

- Specifically, that the posterior predictive distribution here is `$\textrm{Bernoulli}(\hat{\theta})$`, with
.block[
.small[
`$$\hat{\theta} = \dfrac{a_n}{a_n + b_n}$$`
]
]

- By the way, what do `$a_n$` and `$b_n$` represent?

---
## How much to bet on Duke?

- With the loss function `$L(A,y) = A(1-y) - y(BA)$`, and using the notation `$y_{n+1}$` instead of `$y$` (to make it obvious the game has not been played), the Bayes risk (expected loss) for bet `$A$` is
.block[
$$
`\begin{split}
\rho(A) & = \mathbb{E}\left[\ L(A,y_{n+1}) \ | y_{1:n}\right]\\
\\
& = \mathbb{E}\left[A(1-y_{n+1}) - y_{n+1}(BA) \ | y_{1:n}\right]\\
\\
& = A \ \mathbb{E}\left[1-y_{n+1} | \ y_{1:n}\right]  - (BA) \ \mathbb{E}\left[y_{n+1} | y_{1:n}\right]\\
\\
& = A \ \left(1 -  \mathbb{E}\left[y_{n+1} | y_{1:n}\right] \right)  - (BA) \ \mathbb{E}\left[y_{n+1} | y_{1:n}\right]\\
\\
& = A \ \left(1 -\mathbb{E}\left[y_{n+1} | y_{1:n}\right] \ (1+B) \right).
\end{split}`
$$
]

---
## How much to bet on Duke?

- Hence, your bet is rational as long as
.block[
$$
`\begin{split}
\mathbb{E}\left[y_{n+1} | \ y_{1:n}\right](1+B) > 1\\
\\
\dfrac{a_n (1+B)}{a_n + b_n} > 1.
\end{split}`
$$
]

- Clearly, there is no limit to the amount you should bet if this condition is satisfied (the loss function is clearly too simple).

- Loss function needs to be carefully chosen to lead to a good decision - finite resources, diminishing returns, limits on donations, etc.

- Want more on loss functions, expected loss/utility, or decision problems in general? Consider taking a course on decision theory (STA623?).

---

class: center, middle

# What's next?

### Move on to the readings for the next module!