class: center, middle, inverse, title-slide

# STA 360/602L: Module 2.3

## Marginal likelihood and posterior prediction

### Dr. Olanrewaju Michael Akande

---

## Marginal likelihood

- Recall that the .hlight[marginal likelihood] is
.block[
.small[
`$$L(y) = f(y_1,\ldots,y_n) = \int_\Theta f(y_1,\ldots,y_n | \theta) \pi(\theta)\textrm{d}\theta = \int_\Theta L(\theta | y)\pi(\theta)\textrm{d}\theta.$$`
]
]

--

- For clarity, when dealing with a single `\(y\)` instead of `\(y_1,\ldots,y_n\)`, we can write
.block[
.small[
`$$L(y) = f(y) = \int_\Theta f(y | \theta) \pi(\theta)\textrm{d}\theta = \int_\Theta L(\theta | y)\pi(\theta)\textrm{d}\theta.$$`
]
]

--

- When this is the case, for example with the binomial distribution, I will often write
  + the marginal likelihood as `\(L(y)\)` or `\(f(y)\)`, and
  + the sampling (conditional) likelihood as `\(f(y | \theta)\)` or `\(L(\theta | y)\)`.

---

## Marginal likelihood

- What is the marginal likelihood for the beta-binomial?

--

- We have
.block[
.small[
$$
`\begin{aligned}
L(y) & = \int_\Theta p(y|\theta)\pi(\theta)\textrm{d}\theta \\
& = \int_0^1 {n \choose y} \theta^y(1-\theta)^{n-y} \frac{1}{B(a,b)}\theta^{a-1}(1-\theta)^{b-1} d\theta\\
& = {n \choose y} \frac{B(a + y,\, b + n-y)}{B(a,b)},
\end{aligned}`
$$
]
]

  by the integral definition of the Beta function (a quick numerical check of this formula appears a few slides ahead).

--

- The marginal likelihood for the beta-Bernoulli follows directly.

--

- Deriving the marginal likelihood for conjugate distributions is often relatively straightforward.

---

## Prior predictive distribution

- We may care about making predictions before we even see any data.

--

- This is often useful as a way to see if the sampling distribution we have chosen is appropriate, after integrating out all unknown parameters.

--

- The .hlight[prior predictive distribution] is
.block[
.small[
$$
`\begin{aligned}
p(y) &= \int_\Theta p(y,\theta)\,d\theta\\
& = \int_\Theta p(y|\theta) \cdot \pi(\theta)\,d\theta.
\end{aligned}`
$$
]
]

--

- In some sense, the .hlight[prior predictive distribution] marginalizes the sampling distribution (for a single `\(y\)`) over the prior.

--

- When dealing with a single `\(y\)` instead of `\(y_1,\ldots,y_n\)`, this is just the marginal likelihood of the data.

---

## Posterior predictive distribution

- We often care about making predictions for new data points, given the currently observed data.

--

- For example, suppose `\(y_1,\ldots,y_n \overset{iid}{\sim} \textrm{Bernoulli}(\theta)\)`.

--

- We may wish to predict a new data point `\(y_{n+1}\)`.

--

- We can do so using the .hlight[posterior predictive distribution] `\(p(y_{n+1}|y_{1:n})\)`.

--

- <div class="question">
Why are we not including the parameter in the posterior predictive distribution?
</div>

---

## Posterior predictive distribution

- Recall that we have conditional independence of the `\(y\)`'s given `\(\theta\)`.

--

- So,
.block[
.small[
$$
`\begin{aligned}
p(y_{n+1}|y_{1:n}) &= \int_\Theta p(y_{n+1},\theta|y_{1:n})\,d\theta\\
&= \int_\Theta p(y_{n+1}|\theta,y_{1:n}) \cdot \pi(\theta|y_{1:n})\,d\theta\\
& = \int_\Theta p(y_{n+1}|\theta) \cdot \pi(\theta|y_{1:n})\,d\theta.
\end{aligned}`
$$
]
]

--

- So, in some sense, the .hlight[posterior predictive distribution] marginalizes the sampling distribution over the posterior.

---
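
## Aside: checking the marginal likelihood in R

- A quick numerical sanity check, not part of the derivation itself: the closed form `\({n \choose y} B(a+y,\, b+n-y)/B(a,b)\)` should match direct numerical integration of `\(p(y|\theta)\pi(\theta)\)`. The values of `\(n\)`, `\(y\)`, `\(a\)`, and `\(b\)` below are arbitrary choices for illustration.

```{r}
n <- 10; y <- 3   # sample size and number of successes (illustrative values)
a <- 2;  b <- 5   # Beta(a, b) prior hyperparameters (illustrative values)

# closed-form beta-binomial marginal likelihood
closed_form <- choose(n, y) * beta(a + y, b + n - y) / beta(a, b)

# numerical integration of the binomial likelihood against the Beta prior
integrand <- function(theta) dbinom(y, n, theta) * dbeta(theta, a, b)
numerical <- integrate(integrand, lower = 0, upper = 1)$value

c(closed_form = closed_form, numerical = numerical)  # the two should agree
```

---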

## Posterior predictive distribution

- When we talk about the posterior predictive distribution for Bernoulli data, we are really referring to `\(p(y_{n+1} = 1|y_{1:n})\)` and `\(p(y_{n+1} = 0|y_{1:n})\)`.

--

- Then,
.block[
.small[
$$
`\begin{aligned}
p(y_{n+1}=1|y_{1:n}) &= \int_\Theta p(y_{n+1}= 1|\theta) \cdot \pi(\theta|y_{1:n})\,d\theta \\
&= ... \\
&= ...
\end{aligned}`
$$
]
]

<div class="question">
which simplifies to what? To be done on the board!
</div>

--

- What then is `\(p(y_{n+1} = 0|y_{1:n})\)`?

--

- The posterior predictive pmf therefore takes the form (a quick Monte Carlo check is sketched on the appendix slide at the end)
.block[
.small[
`$$p(y_{n+1}|y_{1:n}) = \dfrac{a_n^{y_{n+1}} b_n^{1-y_{n+1}}}{a_n + b_n}; \ \ \ y_{n+1}=0,1.$$`
]
]

- What are `\(a_n\)` and `\(b_n\)`?

---

## Going forward...

- From here on, we will frequently deal with multiple data points `\(y_1, \ldots, y_n\)`.

--

- To make that obvious, we will write Bayes' rule as one of the following
.block[
.small[
$$
`\begin{split}
\pi(\theta | y_1, \ldots, y_n) & = \frac{\pi(\theta) \cdot p(y_1, \ldots, y_n|\theta)}{\int_{\Theta}\pi(\tilde{\theta}) \cdot p(y_1, \ldots, y_n| \tilde{\theta}) \textrm{d}\tilde{\theta}}\\
\\
\pi(\theta | y_1, \ldots, y_n) & = \frac{\pi(\theta) \cdot p(y_1, \ldots, y_n|\theta)}{p(y_1, \ldots, y_n)}\\
\\
\pi(\theta | y) & = \frac{\pi(\theta) \cdot L(\theta | y)}{L(y)},
\end{split}`
$$
]
]

  where `\(y = (y_1, \ldots, y_n)\)`.

---

class: center, middle

# What's next?

### Move on to the readings for the next module!
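
---

## Appendix: posterior predictive check in R

- A minimal Monte Carlo sketch, not part of the original notes: given posterior parameters `\(a_n\)` and `\(b_n\)` (arbitrary illustrative values below), drawing `\(\theta\)` from the `\(\textrm{Beta}(a_n, b_n)\)` posterior and averaging `\(p(y_{n+1}|\theta)\)` should reproduce the closed-form pmf above.

```{r}
set.seed(360)        # for reproducibility
a_n <- 5; b_n <- 4   # posterior Beta(a_n, b_n) parameters (illustrative values)

# closed-form posterior predictive pmf from the slide
post_pred <- function(y_new) (a_n^y_new * b_n^(1 - y_new)) / (a_n + b_n)

# Monte Carlo: marginalize p(y_{n+1} | theta) over the Beta(a_n, b_n) posterior
theta_draws <- rbeta(1e5, a_n, b_n)
mc_p1 <- mean(dbinom(1, 1, theta_draws))  # estimate of p(y_{n+1} = 1 | y_{1:n})
mc_p0 <- mean(dbinom(0, 1, theta_draws))  # estimate of p(y_{n+1} = 0 | y_{1:n})

rbind(closed_form = c(p1 = post_pred(1), p0 = post_pred(0)),
      monte_carlo = c(p1 = mc_p1, p0 = mc_p0))  # rows should agree closely
```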