class: center, middle, inverse, title-slide

# STA 360/602L: Module 8.1
## The multinomial model
### Dr. Olanrewaju Michael Akande

---

## Categorical data (univariate)

- Suppose
  + `\(Y \in \{1,\ldots, D\}\)`;
  + `\(\Pr(Y = d) = \theta_d\)` for each `\(d = 1,\ldots, D\)`; and
  + `\(\boldsymbol{\theta} = (\theta_1,\ldots,\theta_D)\)`.

- Then the pmf of `\(Y\)` is
.block[
.small[
`$$\Pr[Y = y | \boldsymbol{\theta}] = \prod_{d=1}^D \theta_d^{\mathbb{1}[y = d]}.$$`
]
]

--

- We say `\(Y\)` has a .hlight[multinomial distribution] with sample size 1, or a .hlight[categorical distribution].

--

- Write as `\(Y | \boldsymbol{\theta} \sim \textrm{Multinomial}(1,\boldsymbol{\theta})\)` or `\(Y | \boldsymbol{\theta} \sim \textrm{Categorical}(\boldsymbol{\theta})\)`.

--

- Clearly, this is just an extension of the Bernoulli distribution.

---

## Dirichlet distribution

- Since the elements of the probability vector `\(\boldsymbol{\theta}\)` must always sum to one, the support is often called a .hlight[simplex].

--

- A conjugate prior for categorical/multinomial data is the .hlight[Dirichlet distribution].

--

- A random vector `\(\boldsymbol{\theta}\)` has a .hlight[Dirichlet distribution] with parameter `\(\boldsymbol{\alpha}\)` if
.block[
.small[
`$$p[\boldsymbol{\theta} | \boldsymbol{\alpha}] = \dfrac{\Gamma\left(\sum_{d=1}^D \alpha_d\right)}{\prod_{d=1}^D \Gamma(\alpha_d)} \prod_{d=1}^D \theta_d^{\alpha_d-1}, \ \ \ \alpha_d > 0 \ \textrm{ for all } \ d = 1,\ldots, D,$$`
]
]

  where `\(\boldsymbol{\alpha} = (\alpha_1,\ldots,\alpha_D)\)`, and
.block[
.small[
`$$\sum_{d=1}^D \theta_d = 1, \ \ \theta_d \geq 0 \ \textrm{ for all } \ d = 1,\ldots, D.$$`
]
]

--

- We write this as `\(\boldsymbol{\theta} \sim \textrm{Dirichlet}(\boldsymbol{\alpha}) = \textrm{Dirichlet}(\alpha_1,\ldots,\alpha_D)\)`.

--

- The Dirichlet distribution is a multivariate generalization of the .hlight[beta distribution].
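---

## Aside: simulating from a Dirichlet

- To build intuition for the definition, we can simulate draws ourselves. A minimal sketch in base R, using the standard Gamma-normalization construction (if `\(X_d \sim \textrm{Gamma}(\alpha_d, 1)\)` independently, then `\(\boldsymbol{X}/\sum_d X_d \sim \textrm{Dirichlet}(\boldsymbol{\alpha})\)`); the helper name `rdirichlet_one` and the seed are illustrative choices.

```r
# One Dirichlet(alpha) draw via normalized independent Gamma variables
rdirichlet_one <- function(alpha) {
  x <- rgamma(length(alpha), shape = alpha, rate = 1)
  x / sum(x)  # normalizing puts the draw on the simplex
}

set.seed(360)  # arbitrary seed for reproducibility
theta <- rdirichlet_one(c(1, 10, 1))
theta       # one point on the simplex; the second component tends to be largest
sum(theta)  # always exactly 1
```

- The `gtools` package's `rdirichlet` (used later in these slides) implements the same idea, vectorized over many draws.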
---

## Dirichlet distribution

- Write
.block[
.small[
`$$\alpha_0 = \sum_{d=1}^D \alpha_d \ \ \ \textrm{and} \ \ \ \alpha_d^\star = \dfrac{\alpha_d}{\alpha_0}.$$`
]
]

--

- Then we can re-write the pdf slightly as
.block[
.small[
`$$p[\boldsymbol{\theta} | \boldsymbol{\alpha}] = \dfrac{\Gamma\left(\alpha_0\right)}{\prod_{d=1}^D \Gamma(\alpha_d)} \prod_{d=1}^D \theta_d^{\alpha_d-1}, \ \ \ \alpha_d > 0 \ \textrm{ for all } \ d = 1,\ldots, D.$$`
]
]

--

- Properties:
  - .block[
.small[
`$$\mathbb{E}[\theta_d] = \alpha_d^\star;$$`
]
]

--

  - .block[
.small[
`$$\textrm{Mode}[\theta_d] = \dfrac{\alpha_d - 1}{\alpha_0 - D} \ \ (\textrm{when all } \alpha_d > 1);$$`
]
]

--

  - .block[
.small[
`$$\mathbb{Var}[\theta_d] = \dfrac{\alpha_d^\star(1-\alpha_d^\star)}{\alpha_0 + 1} = \dfrac{\mathbb{E}[\theta_d](1-\mathbb{E}[\theta_d])}{\alpha_0 + 1};$$`
]
]

--

  - .block[
.small[
`$$\mathbb{Cov}[\theta_d,\theta_k] = -\dfrac{\alpha_d^\star\alpha_k^\star}{\alpha_0 + 1} = -\dfrac{\mathbb{E}[\theta_d]\mathbb{E}[\theta_k]}{\alpha_0 + 1}, \ \ d \neq k.$$`
]
]

---

## Dirichlet examples

`\(\textrm{Dirichlet}(1,1,1)\)`

<img src="8-1-multinomial-model_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />

---

## Dirichlet examples

`\(\textrm{Dirichlet}(10,10,10)\)`

<img src="8-1-multinomial-model_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />

---

## Dirichlet examples

`\(\textrm{Dirichlet}(100,100,100)\)`

<img src="8-1-multinomial-model_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />

---

## Dirichlet examples

`\(\textrm{Dirichlet}(1,10,1)\)`

<img src="8-1-multinomial-model_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />

---

## Dirichlet examples

`\(\textrm{Dirichlet}(50,100,10)\)`

<img src="8-1-multinomial-model_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />

---

## Likelihood

- Let `\(Y_1, \ldots, Y_n | \boldsymbol{\theta} \sim \textrm{Categorical}(\boldsymbol{\theta})\)`.

--

- Recall
.block[
.small[
`$$\Pr[Y_i = y_i | \boldsymbol{\theta}] = \prod_{d=1}^D \theta_d^{\mathbb{1}[y_i = d]}.$$`
]
]

--

- Then,
.block[
.small[
$$
`\begin{split}
p[Y | \boldsymbol{\theta}] = p[y_1,\ldots,y_n | \boldsymbol{\theta}] = \prod_{i=1}^n \prod_{d=1}^D \theta_d^{\mathbb{1}[y_i = d]} = \prod_{d=1}^D \theta_d^{\sum_{i=1}^n \mathbb{1}[y_i = d]} = \prod_{d=1}^D \theta_d^{n_d},
\end{split}`
$$
]
]

  where `\(n_d\)` is just the number of individuals in category `\(d\)`.

--

- The maximum likelihood estimate of `\(\theta_d\)` is
.block[
.small[
`$$\hat{\theta}_d = \dfrac{n_d}{n}, \ \ d = 1,\ldots, D.$$`
]
]
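---

## Aside: checking the MLE by simulation

- As a quick sanity check of `\(\hat{\theta}_d = n_d/n\)`, we can simulate categorical data and compare the category proportions to the truth. A minimal sketch; `theta_true`, the sample size, and the seed are arbitrary illustrative choices.

```r
set.seed(602)  # arbitrary seed for reproducibility
theta_true <- c(0.5, 0.3, 0.2)  # a made-up probability vector

# n i.i.d. Categorical(theta_true) draws, i.e., Multinomial(1, theta) repeated
y <- sample(1:3, size = 1000, replace = TRUE, prob = theta_true)

# category counts n_d and the MLE n_d / n
n_d <- table(y)
n_d / length(y)  # should be close to (0.5, 0.3, 0.2)
```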
---

## Posterior

- Set `\(\pi(\boldsymbol{\theta}) = \textrm{Dirichlet}(\alpha_1,\ldots,\alpha_D)\)`. Then
.block[
.small[
$$
`\begin{split}
\pi(\boldsymbol{\theta} | Y) & \propto p[Y| \boldsymbol{\theta}] \cdot \pi[\boldsymbol{\theta}]\\
& \propto \prod_{d=1}^D \theta_d^{n_d} \prod_{d=1}^D \theta_d^{\alpha_d-1} \\
& \propto \prod_{d=1}^D \theta_d^{\alpha_d + n_d - 1}\\
& \Rightarrow \boldsymbol{\theta} | Y \sim \textrm{Dirichlet}(\alpha_1 + n_1,\ldots,\alpha_D + n_D).
\end{split}`
$$
]
]

--

- Posterior expectation:
.block[
.small[
`$$\mathbb{E}[\theta_d | Y] = \dfrac{\alpha_d + n_d}{\sum_{d^\star=1}^D (\alpha_{d^\star} + n_{d^\star})}.$$`
]
]

---

## Combining information

- For the prior, we have
.block[
.small[
`$$\mathbb{E}[\theta_d] = \dfrac{\alpha_d}{\sum_{d^\star=1}^D \alpha_{d^\star}}.$$`
]
]

--

- We can think of
  + `\(\theta_{0d} = \mathbb{E}[\theta_d]\)` as being our **"prior guess"** about `\(\theta_d\)`, and
  + `\(n_0 = \sum_{d^\star=1}^D \alpha_{d^\star}\)` as being our **"prior sample size"**.

--

- We can then rewrite the prior as `\(\pi(\boldsymbol{\theta}) = \textrm{Dirichlet}(n_0\theta_{01},\ldots,n_0\theta_{0D})\)`.

---

## Combining information

- Since the MLE is `\(\hat{\theta}_d = n_d/n\)`, we can write the posterior expectation as:
.block[
.small[
$$
`\begin{split}
\mathbb{E}[\theta_d | Y] & = \dfrac{\alpha_d + n_d}{\sum_{d^\star=1}^D (\alpha_{d^\star} + n_{d^\star})}\\
& = \dfrac{\alpha_d}{\sum_{d^\star=1}^D \alpha_{d^\star} + \sum_{d^\star=1}^D n_{d^\star}} + \dfrac{n_d}{\sum_{d^\star=1}^D \alpha_{d^\star} + \sum_{d^\star=1}^D n_{d^\star}}\\
& = \dfrac{n_0\theta_{0d}}{n_0 + n} + \dfrac{n \hat{\theta}_d}{n_0 + n}\\
& = \dfrac{n_0}{n_0 + n} \theta_{0d} + \dfrac{n}{n_0 + n} \hat{\theta}_d.
\end{split}`
$$
]
]

--

- Once again, we can express our posterior expectation as a weighted average of the prior expectation and the MLE.

--

- <div class="question">
We can also extend the Dirichlet-multinomial model to more variables (contingency tables).
</div>

---

## Example: pre-election polling

- Fox News Nov 3-6 pre-election survey of 1295 likely voters for the 2016 election.

--

- For those interested, [FiveThirtyEight](https://projects.fivethirtyeight.com/) is an interesting source for pre-election polls.

--

- Out of 1295 respondents, 622 indicated support for Clinton, 570 for Trump, and the remaining 103 for other candidates or no opinion.

--

- Drawing inference from pre-election polls is way more complicated and nuanced than this; we use the data here only for a simple illustration.

--

- Assuming no other information on the respondents, we can assume simple random sampling and use a multinomial distribution with parameter `\(\boldsymbol{\theta} = (\theta_1,\theta_2,\theta_3)\)`: the proportions, in the survey population, of Clinton supporters, Trump supporters, and supporters of other candidates or those with no opinion.

---

## Example: pre-election polling

- With a noninformative uniform prior, we have `\(\pi(\boldsymbol{\theta}) = \textrm{Dirichlet}(1,1,1)\)`.

--

- The resulting posterior is `\(\textrm{Dirichlet}(1 + n_1,1 + n_2,1 + n_3) = \textrm{Dirichlet}(623,571,104)\)`.

--

- If we wish to compare the proportions of people who would vote for Clinton versus Trump, we can examine the posterior distribution of `\(\theta_1-\theta_2\)`.

--

- We can even compute the probability `\(\Pr(\theta_1 > \theta_2 | Y)\)`.
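---

## Example: pre-election polling

- Before sampling from the posterior, note that the posterior means are available in closed form. A small sketch for the polling data (variable names here are just illustrative):

```r
# Uniform prior Dirichlet(1, 1, 1) and observed counts (Clinton, Trump, other)
alpha <- c(1, 1, 1)
n_d   <- c(622, 570, 103)

# Posterior is Dirichlet(alpha + n_d) = Dirichlet(623, 571, 104)
alpha_post <- alpha + n_d

# Posterior means E[theta_d | Y] = (alpha_d + n_d) / sum(alpha + n)
alpha_post / sum(alpha_post)

# Equivalently, the weighted average of the prior mean and the MLE
n0 <- sum(alpha); n <- sum(n_d)
(n0 / (n0 + n)) * (alpha / n0) + (n / (n0 + n)) * (n_d / n)
```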
---

## Example: pre-election polling

```r
library(gtools)  # provides rdirichlet

# 10,000 draws from the posterior Dirichlet(623, 571, 104)
PostSamples <- rdirichlet(10000, alpha = c(623, 571, 104))
#dim(PostSamples)

hist(PostSamples[, 1] - PostSamples[, 2], col = rainbow(20),
     xlab = expression(theta[1] - theta[2]),
     ylab = "Posterior Density", freq = F, breaks = 50,
     main = expression(paste("Posterior density of ", theta[1] - theta[2])))
```

<img src="8-1-multinomial-model_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />

---

## Example: pre-election polling

- Posterior probability that Clinton had more support than Trump in the survey population, that is, `\(\Pr(\theta_1 > \theta_2 | Y)\)`, is

```r
mean(PostSamples[, 1] > PostSamples[, 2])
```

```
## [1] 0.9314
```

--

- Once again, this is just a simple illustration with a very small subset of the 2016 pre-election polling data.

--

- Inference for pre-election polls is way more complex and nuanced than this.

---

class: center, middle

# What's next?

### Move on to the readings for the next module!