class: center, middle, inverse, title-slide

# STA 360/602L: Module 8.1

## The multinomial model

### Dr. Olanrewaju Michael Akande

---

## Categorical data (univariate)

- Suppose
  + `\(Y \in \{1,\ldots, D\}\)`;
  + `\(\Pr(Y = d) = \theta_d\)` for each `\(d = 1,\ldots, D\)`; and
  + `\(\boldsymbol{\theta} = (\theta_1,\ldots,\theta_D)\)`.

- Then the pmf of `\(Y\)` is
.block[
.small[
`$$\Pr[Y = y | \boldsymbol{\theta}] = \prod_{d=1}^D \theta_d^{\mathbb{1}[y = d]}.$$`
]
]

--

- We say `\(Y\)` has a .hlight[multinomial distribution] with sample size 1, or a .hlight[categorical distribution].

--

- Write as `\(Y | \boldsymbol{\theta} \sim \textrm{Multinomial}(1,\boldsymbol{\theta})\)` or `\(Y | \boldsymbol{\theta} \sim \textrm{Categorical}(\boldsymbol{\theta})\)`.

--

- Clearly, this is just an extension of the Bernoulli distribution.

---

## Dirichlet distribution

- Since the elements of the probability vector `\(\boldsymbol{\theta}\)` must always sum to one, the support of `\(\boldsymbol{\theta}\)` is often called a .hlight[simplex].

--

- A conjugate prior for categorical/multinomial data is the .hlight[Dirichlet distribution].

--

- A random vector `\(\boldsymbol{\theta}\)` has a .hlight[Dirichlet distribution] with parameter `\(\boldsymbol{\alpha}\)` if
.block[
.small[
`$$p[\boldsymbol{\theta} | \boldsymbol{\alpha}] = \dfrac{\Gamma\left(\sum_{d=1}^D \alpha_d\right)}{\prod_{d=1}^D \Gamma(\alpha_d)} \prod_{d=1}^D \theta_d^{\alpha_d-1}, \ \ \ \alpha_d > 0 \ \textrm{ for all } \ d = 1,\ldots, D,$$`
]
]
where `\(\boldsymbol{\alpha} = (\alpha_1,\ldots,\alpha_D)\)`, and
.block[
.small[
`$$\sum_{d=1}^D \theta_d = 1, \ \ \theta_d \geq 0 \ \textrm{ for all } \ d = 1,\ldots, D.$$`
]
]

--

- We write this as `\(\boldsymbol{\theta} \sim \textrm{Dirichlet}(\boldsymbol{\alpha}) = \textrm{Dirichlet}(\alpha_1,\ldots,\alpha_D)\)`.

--

- The Dirichlet distribution is a multivariate generalization of the .hlight[beta distribution].
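---

## Aside: the Dirichlet in R

- Before looking at properties and examples, here is a minimal R sketch of how to draw from, and evaluate, this density, assuming the `gtools` package (which we also use later in this module). The specific `\(\boldsymbol{\alpha}\)` below is purely illustrative.

```r
library(gtools)  # provides rdirichlet() and ddirichlet()

set.seed(360)                         # illustrative seed
alpha <- c(2, 3, 5)                   # a hypothetical alpha vector (D = 3)

Theta <- rdirichlet(5, alpha)         # 5 draws; a 5 x 3 matrix, one draw per row
rowSums(Theta)                        # each draw lies on the simplex: all equal 1

ddirichlet(c(0.2, 0.3, 0.5), alpha)   # density at one point of the simplex
```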
---

## Dirichlet distribution

- Write
.block[
.small[
`$$\alpha_0 = \sum_{d=1}^D \alpha_d \ \ \ \textrm{and} \ \ \ \alpha_d^\star = \dfrac{\alpha_d}{\alpha_0}.$$`
]
]

--

- Then we can re-write the pdf slightly as
.block[
.small[
`$$p[\boldsymbol{\theta} | \boldsymbol{\alpha}] = \dfrac{\Gamma\left(\alpha_0\right)}{\prod_{d=1}^D \Gamma(\alpha_d)} \prod_{d=1}^D \theta_d^{\alpha_d-1}, \ \ \ \alpha_d > 0 \ \textrm{ for all } \ d = 1,\ldots, D.$$`
]
]

--

- Properties:

  - .block[
.small[
`$$\mathbb{E}[\theta_d] = \alpha_d^\star;$$`
]
]

--

  - .block[
.small[
`$$\textrm{Mode}[\theta_d] = \dfrac{\alpha_d - 1}{\alpha_0 - D} \ \ \textrm{(when all } \alpha_d > 1 \textrm{)};$$`
]
]

--

  - .block[
.small[
`$$\mathbb{Var}[\theta_d] = \dfrac{\alpha_d^\star(1-\alpha_d^\star)}{\alpha_0 + 1} = \dfrac{\mathbb{E}[\theta_d](1-\mathbb{E}[\theta_d])}{\alpha_0 + 1};$$`
]
]

--

  - .block[
.small[
`$$\mathbb{Cov}[\theta_d,\theta_k] = -\dfrac{\alpha_d^\star\alpha_k^\star}{\alpha_0 + 1} = -\dfrac{\mathbb{E}[\theta_d]\mathbb{E}[\theta_k]}{\alpha_0 + 1} \ \ \textrm{for} \ d \neq k.$$`
]
]

---

## Dirichlet examples

`\(\textrm{Dirichlet}(1,1,1)\)`

<img src="8-1-multinomial-model_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />

---

## Dirichlet examples

`\(\textrm{Dirichlet}(10,10,10)\)`

<img src="8-1-multinomial-model_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />

---

## Dirichlet examples

`\(\textrm{Dirichlet}(100,100,100)\)`

<img src="8-1-multinomial-model_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />

---

## Dirichlet examples

`\(\textrm{Dirichlet}(1,10,1)\)`

<img src="8-1-multinomial-model_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />

---

## Dirichlet examples

`\(\textrm{Dirichlet}(50,100,10)\)`

<img src="8-1-multinomial-model_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />

---

## Likelihood

- Let `\(Y_1, \ldots, Y_n | \boldsymbol{\theta} \overset{iid}{\sim} \textrm{Categorical}(\boldsymbol{\theta})\)`.

--

- Recall
.block[
.small[
`$$\Pr[Y_i = y_i | \boldsymbol{\theta}] = \prod_{d=1}^D \theta_d^{\mathbb{1}[y_i = d]}.$$`
]
]

--

- Then,
.block[
.small[
$$
`\begin{split}
p[Y | \boldsymbol{\theta}] = p[y_1,\ldots,y_n | \boldsymbol{\theta}] = \prod_{i=1}^n \prod_{d=1}^D \theta_d^{\mathbb{1}[y_i = d]} = \prod_{d=1}^D \theta_d^{\sum_{i=1}^n \mathbb{1}[y_i = d]} = \prod_{d=1}^D \theta_d^{n_d}
\end{split}`
$$
]
]
where `\(n_d = \sum_{i=1}^n \mathbb{1}[y_i = d]\)` is just the number of individuals in category `\(d\)`.

--

- The maximum likelihood estimate of `\(\theta_d\)` is
.block[
.small[
`$$\hat{\theta}_d = \dfrac{n_d}{n}, \ \ d = 1,\ldots, D,$$`
]
]
as illustrated with a short sketch on the next slide.
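---

## Likelihood

- A quick numerical check of the counts `\(n_d\)` and the MLE, as a sketch in base R; the true `\(\boldsymbol{\theta}\)` and the simulated data below are purely illustrative.

```r
set.seed(602)                       # illustrative seed
theta <- c(0.5, 0.3, 0.2)           # hypothetical true theta, D = 3

# n = 1000 categorical draws from {1, 2, 3}
y <- sample(1:3, size = 1000, replace = TRUE, prob = theta)

n_d <- table(y)      # category counts n_1, ..., n_D
n_d / length(y)      # MLE n_d / n: close to (0.5, 0.3, 0.2)
```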
---

## Posterior

- Set `\(\pi(\boldsymbol{\theta}) = \textrm{Dirichlet}(\alpha_1,\ldots,\alpha_D)\)`. Then,
.block[
.small[
$$
`\begin{split}
\pi(\boldsymbol{\theta} | Y) & \propto p[Y| \boldsymbol{\theta}] \cdot \pi[\boldsymbol{\theta}]\\
& \propto \prod_{d=1}^D \theta_d^{n_d} \prod_{d=1}^D \theta_d^{\alpha_d-1} \\
& \propto \prod_{d=1}^D \theta_d^{\alpha_d + n_d - 1},
\end{split}`
$$
]
]
which is the kernel of a Dirichlet density, so that `\(\boldsymbol{\theta} | Y \sim \textrm{Dirichlet}(\alpha_1 + n_1,\ldots,\alpha_D + n_D)\)`.

--

- Posterior expectation:
.block[
.small[
`$$\mathbb{E}[\theta_d | Y] = \dfrac{\alpha_d + n_d}{\sum_{d^\star=1}^D (\alpha_{d^\star} + n_{d^\star})}.$$`
]
]

---

## Combining information

- For the prior, we have
.block[
.small[
`$$\mathbb{E}[\theta_d] = \dfrac{\alpha_d}{\sum_{d^\star=1}^D \alpha_{d^\star}}.$$`
]
]

--

- We can think of
  + `\(\theta_{0d} = \mathbb{E}[\theta_d]\)` as being our **"prior guess"** about `\(\theta_d\)`, and
  + `\(n_0 = \sum_{d^\star=1}^D \alpha_{d^\star}\)` as being our **"prior sample size"**.

--

- We can then rewrite the prior as `\(\pi(\boldsymbol{\theta}) = \textrm{Dirichlet}(n_0\theta_{01},\ldots,n_0\theta_{0D})\)`.

---

## Combining information

- Since the MLE is `\(\hat{\theta}_d = n_d/n\)`, we can write the posterior expectation as:
.block[
.small[
$$
`\begin{split}
\mathbb{E}[\theta_d | Y] & = \dfrac{\alpha_d + n_d}{\sum_{d^\star=1}^D (\alpha_{d^\star} + n_{d^\star})}\\
& = \dfrac{\alpha_d}{\sum_{d^\star=1}^D \alpha_{d^\star} + \sum_{d^\star=1}^D n_{d^\star}} + \dfrac{n_d}{\sum_{d^\star=1}^D \alpha_{d^\star} + \sum_{d^\star=1}^D n_{d^\star}}\\
& = \dfrac{n_0\theta_{0d}}{n_0 + n} + \dfrac{n \hat{\theta}_d}{n_0 + n}\\
& = \dfrac{n_0}{n_0 + n} \theta_{0d} + \dfrac{n}{n_0 + n} \hat{\theta}_d.
\end{split}`
$$
]
]

--

- Once again, we can express our posterior expectation as a weighted average of the prior expectation and the MLE.

--

- <div class="question">
We can also extend the Dirichlet-multinomial model to more variables (contingency tables).
</div>

---

## Example: pre-election polling

- Fox News Nov 3-6 pre-election survey of 1295 likely voters for the 2016 election.

--

- For those interested, [FiveThirtyEight](https://projects.fivethirtyeight.com/) is an interesting source for pre-election polls.

--

- Out of the 1295 respondents, 622 indicated support for Clinton, 570 for Trump, and the remaining 103 supported other candidates or had no opinion.

--

- Drawing inference from pre-election polls is far more complicated and nuanced than this; we use the data here only for a simple illustration.

--

- Assuming no other information on the respondents, we can assume simple random sampling and use a multinomial distribution with parameter `\(\boldsymbol{\theta} = (\theta_1,\theta_2,\theta_3)\)`: the proportions, in the survey population, of Clinton supporters, Trump supporters, and supporters of other candidates or those with no opinion.

---

## Example: pre-election polling

- With a noninformative uniform prior, we have `\(\pi(\boldsymbol{\theta}) = \textrm{Dirichlet}(1,1,1)\)`.

--

- The resulting posterior is `\(\textrm{Dirichlet}(1 + n_1, 1 + n_2, 1 + n_3) = \textrm{Dirichlet}(623,571,104)\)`.

--

- To compare the proportions of people who would vote for Clinton versus Trump, we can examine the posterior distribution of `\(\theta_1-\theta_2\)`.

--

- We can even compute the probability `\(\Pr(\theta_1 > \theta_2 | Y)\)`, as we do on the slides that follow.
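---

## Example: pre-election polling

- Before sampling from the posterior, here is a quick numerical check of the weighted-average identity from the "Combining information" slides, as a sketch using the poll counts and uniform prior above.

```r
alpha <- c(1, 1, 1)              # uniform Dirichlet(1,1,1) prior
nd <- c(622, 570, 103)           # Clinton, Trump, other/no opinion
n0 <- sum(alpha); n <- sum(nd)   # prior "sample size" n0 = 3, data size n = 1295

# posterior mean, computed directly from the Dirichlet(623, 571, 104) posterior
(alpha + nd) / sum(alpha + nd)

# the same posterior mean as a weighted average of prior guess and MLE
(n0 / (n0 + n)) * (alpha / n0) + (n / (n0 + n)) * (nd / n)
```

- Both lines give approximately `\((0.480, 0.440, 0.080)\)`: with such a weak prior, the data dominate.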
---

## Example: pre-election polling

```r
library(gtools)  # for rdirichlet()

# 10,000 draws from the Dirichlet(623, 571, 104) posterior
PostSamples <- rdirichlet(10000, alpha = c(623, 571, 104))

hist(PostSamples[, 1] - PostSamples[, 2], col = rainbow(20),
     xlab = expression(theta[1] - theta[2]), ylab = "Posterior Density",
     freq = F, breaks = 50,
     main = expression(paste("Posterior density of ", theta[1] - theta[2])))
```

<img src="8-1-multinomial-model_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />

---

## Example: pre-election polling

- The posterior probability that Clinton had more support than Trump in the survey population, that is, `\(\Pr(\theta_1 > \theta_2 | Y)\)`, is

```r
# proportion of posterior draws with theta_1 > theta_2
mean(PostSamples[, 1] > PostSamples[, 2])
```

```
## [1] 0.9341
```

--

- Once again, this is just a simple illustration with a very small subset of the 2016 pre-election polling data.

--

- Inference for pre-election polls is far more complex and nuanced than this.

---

class: center, middle

# What's next?

### Move on to the readings for the next module!