As we've seen by now, having posterior distributions instead of one-number summaries is great for capturing uncertainty.
That said, simple summaries are still very appealing, especially when dealing with clients or collaborators from other fields who want a single number.
Can we obtain a single estimate of θ based on the posterior? Sure!
A Bayes estimate is the value \hat{\theta} that minimizes the Bayes risk.
Bayes risk is defined as the expected loss averaged over the posterior distribution.
Put differently, a Bayes estimate \hat{\theta} has the lowest posterior expected loss.
That's fine, but what does expected loss mean?
Frequentist risk also exists but we won't go into that here.
A loss function L(\theta,\delta(y)) is a function of a parameter \theta and of \delta(y), a decision about \theta based only on the data y.
For example, \delta(y) = \bar{y} can be the decision to use the sample mean to estimate \theta, the true population mean.
L(θ,δ(y)) determines the penalty for making the decision δ(y), if θ is the true parameter; L(θ,δ(y)) characterizes the price paid for errors.
A common choice, for example when dealing with point estimation, is the squared error loss, which has
L(\theta,\delta(y)) = (\theta-\delta(y))^2.
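Bayes risk is thus
\rho(\theta,\delta) = \mathbb{E}\left[\ L(\theta,\delta(y)) \ | \ y\right] = \int L(\theta,\delta(y)) \cdot \pi(\theta|y) \ d\theta,
and we proceed to find the value \hat{\theta}, that is, the decision \delta(y), that minimizes the Bayes risk.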
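It turns out that, under squared error loss, the decision \delta(y) that minimizes the posterior risk is the posterior mean.
Proof: Let L(\theta,\delta(y)) = (\theta-\delta(y))^2. Then,
\rho(\theta,\delta) = \int L(\theta,\delta(y)) \cdot \pi(\theta|y) \ d\theta = \int (\theta-\delta(y))^2 \cdot \pi(\theta|y) \ d\theta.
Expand, then take the partial derivative of \rho(\theta,\delta) with respect to \delta(y):
\dfrac{\partial}{\partial \delta(y)} \rho(\theta,\delta) = -2 \int (\theta-\delta(y)) \cdot \pi(\theta|y) \ d\theta = -2 \left( \mathbb{E}[\theta|y] - \delta(y) \right).
Setting this derivative to zero shows that \delta(y) = \mathbb{E}[\theta|y] is the minimizer, since the second derivative, 2, is positive.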
Well, that's great! The posterior mean is easy to calculate in most cases.
In the beta-binomial case for example, the Bayes estimate under squared error loss is just
\hat{\theta} = \dfrac{a+y}{a+b+n},
the posterior mean.
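As a quick sanity check, here is a minimal R sketch that compares the closed-form posterior mean with a numerical minimizer of the posterior expected squared error loss. The prior values a, b and the data y, n are hypothetical numbers chosen only for illustration, and risk_sq is a helper name made up for this sketch.

```r
# Hypothetical prior and data (assumed for illustration, not from the slides)
a <- 1; b <- 1     # Beta(a, b) prior
y <- 7; n <- 10    # y successes out of n trials

# Closed-form Bayes estimate under squared error loss: the posterior mean
post_mean <- (a + y) / (a + b + n)

# Posterior expected squared error loss for a candidate decision d
risk_sq <- function(d) {
  integrate(function(theta) (theta - d)^2 * dbeta(theta, a + y, b + n - y),
            lower = 0, upper = 1)$value
}

# Numerically minimize the risk over d in (0, 1)
num_min <- optimize(risk_sq, interval = c(0, 1))$minimum

post_mean  # closed form
num_min    # numerical minimizer, should agree up to optimizer tolerance
```

The two numbers should agree up to the tolerance of optimize, which is what the proof above predicts.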
Clearly, squared error is only one possible loss function. An alternative is absolute loss, which has
L(\theta,\delta(y)) = |\theta-\delta(y)|.
Absolute loss penalizes large deviations less heavily than squared error loss, and the resulting Bayes estimate is the posterior median.
The median is actually relatively easy to compute.
Recall that for a continuous random variable Y with CDF F, the median of the distribution is the value z that satisfies
F(z) = \Pr(Y\leq z) = \dfrac{1}{2}= \Pr(Y\geq z) = 1-F(z).
As long as we know how to evaluate the CDF of the distribution we have, we can solve for z.
Think R!
For the beta-binomial model, the CDF of the beta posterior can be written as
F(z) = \Pr(\theta\leq z | y) = \int^z_0 \textrm{Beta}(\theta| a+y, b+n-y) d\theta.
Then, if \hat{\theta} is the median, we have that F(\hat{\theta}) = 0.5.
To solve for \hat{\theta}, apply the inverse CDF \hat{\theta} = F^{-1}(0.5).
In R, that's simply
qbeta(0.5,a+y,b+n-y)
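To see the claim about absolute loss in action, the following R sketch computes the posterior median with qbeta and compares it with a numerical minimizer of the posterior expected absolute loss. The values of a, b, y and n are hypothetical, and risk_abs is a helper name made up for this sketch; the integral is split at d so that each piece has a smooth integrand.

```r
# Hypothetical prior and data (assumed for illustration, not from the slides)
a <- 1; b <- 1
y <- 7; n <- 10

# Posterior median via the inverse CDF of the Beta(a + y, b + n - y) posterior
post_median <- qbeta(0.5, a + y, b + n - y)

# Check: the posterior CDF evaluated at the median should be 0.5
cdf_at_median <- pbeta(post_median, a + y, b + n - y)

# Posterior expected absolute loss for a candidate decision d
risk_abs <- function(d) {
  dens <- function(theta) dbeta(theta, a + y, b + n - y)
  below <- integrate(function(theta) (d - theta) * dens(theta), 0, d)$value
  above <- integrate(function(theta) (theta - d) * dens(theta), d, 1)$value
  below + above
}

# Numerically minimize the risk over d in (0, 1)
num_min <- optimize(risk_abs, interval = c(0, 1))$minimum

post_median  # from qbeta
num_min      # numerical minimizer, should be close to the median
```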
Loss functions are not specific to estimation problems but are a critical part of decision making.
For example, suppose you are deciding how much money to bet ($A) on Duke in the next UNC-Duke men's basketball game.
Suppose that if Duke wins (y = 1), you win $BA, and if Duke loses (y = 0), you lose the $A you bet.
Then, the loss function can be characterized as
L(A,y) = A(1-y) - y(BA),
with your action being the amount bet A.
y is an unknown state, but we can think of it as a new prediction y_{n+1}, given that we have data from win-loss records (y_{1:n}) that can be converted into a Bayesian posterior,
\theta | y_{1:n} \sim \textrm{Beta}(a_n,b_n).
Actually, it might make more sense to focus on more recent head-to-head data rather than the all-time record.
In fact, we might want to build a model that predicts the outcome of the game using historical data and predictors (current team rankings, injuries, etc.).
However, to keep things simple for this illustration, we will go with the posterior above.
The Bayes risk for action A is then the expectation of the loss function,
\rho(A) = \mathbb{E}\left[\ L(A,y) | \ y_{1:n}\right].
To calculate this as a function of A and find the optimal A, we need to marginalize over the posterior predictive distribution for y.
As an aside, recall from Module 2.3 that
p(y_{n+1}|y_{1:n}) = \dfrac{a_n^{y_{n+1}} b_n^{1-y_{n+1}}}{a_n + b_n}; \ \ \ y_{n+1}=0,1.
Specifically, that the posterior predictive distribution here is \textrm{Bernoulli}(\hat{\theta}), with
\hat{\theta} = \dfrac{a_n}{a_n + b_n}
By the way, what do a_n and b_n represent?
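The Bernoulli form of the posterior predictive can be checked by simulation. The R sketch below draws \theta from the posterior and then y_{n+1} given \theta, and compares the simulated win frequency with a_n/(a_n + b_n). The values of a_n and b_n are hypothetical posterior parameters picked only for illustration.

```r
set.seed(1)

# Hypothetical posterior parameters (assumed for illustration)
a_n <- 12; b_n <- 8
n_sims <- 1e5

# Step 1: draw theta from the Beta(a_n, b_n) posterior
theta_draws <- rbeta(n_sims, a_n, b_n)

# Step 2: draw a new outcome y_{n+1} ~ Bernoulli(theta) for each draw
y_new <- rbinom(n_sims, size = 1, prob = theta_draws)

# Monte Carlo estimate of Pr(y_{n+1} = 1 | y_{1:n}) versus the closed form
mc_estimate <- mean(y_new)
closed_form <- a_n / (a_n + b_n)

mc_estimate
closed_form
```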
\begin{split} \rho(A) & = \mathbb{E}\left[\ L(A,y_{n+1}) \ | y_{1:n}\right]\\ \\ & = \mathbb{E}\left[A(1-y_{n+1}) - y_{n+1}(BA) \ | y_{1:n}\right]\\ \\ & = A \ \mathbb{E}\left[1-y_{n+1} | \ y_{1:n}\right] - (BA) \ \mathbb{E}\left[y_{n+1} | y_{1:n}\right]\\ \\ & = A \ \left(1 - \mathbb{E}\left[y_{n+1} | y_{1:n}\right] \right) - (BA) \ \mathbb{E}\left[y_{n+1} | y_{1:n}\right]\\ \\ & = A \ \left(1 -\mathbb{E}\left[y_{n+1} | y_{1:n}\right] \ (1+B) \right). \end{split}
Hence, your bet is rational as long as
\begin{split} \mathbb{E}\left[y_{n+1} | \ y_{1:n}\right](1+B) > 1\\ \\ \dfrac{a_n (1+B)}{a_n + b_n} > 1. \end{split}
There is no limit to the amount you should bet if this condition is satisfied, which shows that this loss function is too simple.
The loss function needs to be chosen carefully to lead to a good decision, accounting for finite resources, diminishing returns, limits on donations, etc.
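The betting example can be sketched in a few lines of R. The function names bayes_risk and bet_is_rational, and the values of a_n, b_n and B, are assumptions made for this sketch; B is read here as the payout per dollar bet, as implied by the loss function L(A,y) = A(1-y) - y(BA).

```r
# Hypothetical posterior parameters and payout multiplier (assumed for illustration)
a_n <- 12; b_n <- 8
B <- 1

# Posterior predictive probability that Duke wins
theta_hat <- a_n / (a_n + b_n)

# Bayes risk of betting A: rho(A) = A * (1 - E[y_{n+1} | y_{1:n}] * (1 + B))
bayes_risk <- function(A, B, theta_hat) {
  A * (1 - theta_hat * (1 + B))
}

# The bet is rational (negative expected loss) when theta_hat * (1 + B) > 1
bet_is_rational <- function(B, theta_hat) {
  theta_hat * (1 + B) > 1
}

bet_is_rational(B, theta_hat)

# The risk is linear in A, so it keeps decreasing as the bet grows
bayes_risk(c(10, 100, 1000), B, theta_hat)
```

Because the risk is linear in A, a favorable bet looks better and better as A grows, which is exactly why this loss function is too simple for real decisions.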
Want more on loss functions, expected loss/utility, or decision problems in general? Consider taking a course on decision theory (STA623?).