class: center, middle, inverse, title-slide

# STA 360/602L: Module 6.4

## Bayesian hypothesis testing

### Dr. Olanrewaju Michael Akande

---
## Bayesian hypothesis testing

- How do we do .hlight[Bayesian hypothesis testing] for a simple model?

--

- Suppose we have univariate data `\(y_i \overset{iid}{\sim} \mathcal{N}(\mu, 1)\)` and wish to test `\(\mathcal{H}_0: \mu = 0 \ \ \text{vs. } \mathcal{H}_1: \mu \neq 0\)` under the Bayesian paradigm.

--

- .hlight[Informal approach]:

  1. Put a prior on `\(\mu\)`, `\(\pi(\mu) = \mathcal{N}(\mu_0, \sigma_0^2)\)`.

--

  2. Compute the posterior `\(\mu | Y = (y_1, \ldots, y_n) \sim \mathcal{N}(\mu_n, \sigma_n^2)\)` for updated parameters `\(\mu_n\)` and `\(\sigma_n^2\)`.

--

  3. Compute a 95% credible interval based on the posterior.

--

  4. Reject `\(\mathcal{H}_0\)` if the interval does not contain zero.

---
## Bayesian hypothesis testing

- .hlight[Formal approach]:

  1. Put a prior on the actual hypotheses/models, that is, on `\(\pi(\mathcal{H}_0) = \Pr(\mathcal{H}_0 = \text{True})\)` and `\(\pi(\mathcal{H}_1) = \Pr(\mathcal{H}_1 = \text{True})\)`.

--

      For example, set `\(\pi(\mathcal{H}_0) = 0.5\)` and `\(\pi(\mathcal{H}_1) = 0.5\)` if, a priori, we believe the two hypotheses are equally likely.

--

  2. Put a prior on the parameters in each model. In our simple normal model, the only unknown parameter is `\(\mu\)`, so, for example, our prior can once again be `\(\pi(\mu) = \mathcal{N}(\mu_0, \sigma_0^2)\)`.

--

  3. Compute the marginal posterior probabilities of the hypotheses, that is, `\(\pi(\mathcal{H}_0 | Y)\)` and `\(\pi(\mathcal{H}_1 | Y)\)`. We can start from the joint posterior of each hypothesis and the parameter, then integrate out the parameter.

--

  4. Conclude based on the magnitude of `\(\pi(\mathcal{H}_1 | Y)\)` relative to `\(\pi(\mathcal{H}_0 | Y)\)`.

---
## Bayesian hypothesis testing

- Using Bayes' theorem,
.block[
.small[
$$
`\begin{split}
\pi(\mathcal{H}_1 | Y) = \frac{ p(Y | \mathcal{H}_1) \pi(\mathcal{H}_1) }{ p(Y | \mathcal{H}_0) \pi(\mathcal{H}_0) + p(Y | \mathcal{H}_1) \pi(\mathcal{H}_1)},
\end{split}`
$$
]
]

  where `\(p(Y | \mathcal{H}_0)\)` and `\(p(Y | \mathcal{H}_1)\)` are the marginal likelihoods of the data under the null and alternative hypotheses, respectively.

--

- If, for example, we set `\(\pi(\mathcal{H}_0) = 0.5\)` and `\(\pi(\mathcal{H}_1) = 0.5\)` a priori, then
.block[
.small[
$$
`\begin{split}
\pi(\mathcal{H}_1 | Y) & = \frac{ 0.5 p(Y | \mathcal{H}_1) }{ 0.5 p(Y | \mathcal{H}_0) + 0.5 p(Y | \mathcal{H}_1) } \\
\\
& = \frac{ p(Y | \mathcal{H}_1) }{ p(Y | \mathcal{H}_0) + p(Y | \mathcal{H}_1) } = \frac{ 1 }{ \frac{p(Y | \mathcal{H}_0)}{p(Y | \mathcal{H}_1)} + 1 }.\\
\end{split}`
$$
]
]

--

- The ratio `\(\frac{p(Y | \mathcal{H}_0)}{p(Y | \mathcal{H}_1)}\)` is known as the .hlight[Bayes factor] in favor of `\(\mathcal{H}_0\)`, often written as `\(\mathcal{BF}_{01}\)`. Similarly, we can compute `\(\mathcal{BF}_{10}\)`.

---
## Bayes factors

- .hlight[Bayes factor]: a ratio of marginal likelihoods; it quantifies the weight of evidence in the data in favor of one model over another.

--

- It is often used as an alternative to the frequentist p-value.

--

- **Rule of thumb**: `\(\mathcal{BF}_{01} > 10\)` is strong evidence for `\(\mathcal{H}_0\)`; `\(\mathcal{BF}_{01} > 100\)` is decisive evidence for `\(\mathcal{H}_0\)`.
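--

- To get a feel for these thresholds, below is a minimal R sketch assuming equal prior probabilities, so that `\(\pi(\mathcal{H}_0 | Y) = 1 - \pi(\mathcal{H}_1 | Y) = \frac{\mathcal{BF}_{01}}{\mathcal{BF}_{01} + 1}\)` as derived above; the values of `bf01` are hypothetical.

```r
# Translate hypothetical Bayes factors in favor of H0 into posterior
# probabilities, assuming pi(H0) = pi(H1) = 0.5 a priori.
bf01 <- c(1, 10, 100)
post_H1 <- 1 / (bf01 + 1)   # pi(H1 | Y) under equal prior probabilities
post_H0 <- 1 - post_H1      # pi(H0 | Y)
round(cbind(bf01, post_H0, post_H1), 3)
# Approximately: BF01 = 10 gives pi(H0 | Y) near 0.91; BF01 = 100 gives near 0.99.
```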
--

- Notice that for our example,
.block[
.small[
$$
`\begin{split}
\pi(\mathcal{H}_1 | Y) = \frac{ 1 }{ \frac{p(Y | \mathcal{H}_0)}{p(Y | \mathcal{H}_1)} + 1 } = \frac{ 1 }{ \mathcal{BF}_{01} + 1 }, \\
\end{split}`
$$
]
]

  so the higher the value of `\(\mathcal{BF}_{01}\)` (that is, the weight of evidence in the data in favor of `\(\mathcal{H}_0\)`), the lower the marginal posterior probability that `\(\mathcal{H}_1\)` is true.

--

- That is, here, as `\(\mathcal{BF}_{01} \uparrow\)`, `\(\pi(\mathcal{H}_1 | Y) \downarrow\)`.

---
## Bayes factors

- Let's look at another way to think of Bayes factors. First, recall that
.block[
.small[
$$
`\begin{split}
\pi(\mathcal{H}_1 | Y) = \frac{ p(Y | \mathcal{H}_1) \pi(\mathcal{H}_1) }{ p(Y | \mathcal{H}_0) \pi(\mathcal{H}_0) + p(Y | \mathcal{H}_1) \pi(\mathcal{H}_1)},
\end{split}`
$$
]
]

  so that
.block[
.small[
$$
`\begin{split}
\frac{\pi(\mathcal{H}_0 | Y)}{\pi(\mathcal{H}_1 | Y)} & = \frac{ p(Y | \mathcal{H}_0) \pi(\mathcal{H}_0) }{ p(Y | \mathcal{H}_0) \pi(\mathcal{H}_0) + p(Y | \mathcal{H}_1) \pi(\mathcal{H}_1)} \div \frac{ p(Y | \mathcal{H}_1) \pi(\mathcal{H}_1) }{ p(Y | \mathcal{H}_0) \pi(\mathcal{H}_0) + p(Y | \mathcal{H}_1) \pi(\mathcal{H}_1)}\\
\\
& = \frac{ p(Y | \mathcal{H}_0) \pi(\mathcal{H}_0) }{ p(Y | \mathcal{H}_0) \pi(\mathcal{H}_0) + p(Y | \mathcal{H}_1) \pi(\mathcal{H}_1)} \times \frac{ p(Y | \mathcal{H}_0) \pi(\mathcal{H}_0) + p(Y | \mathcal{H}_1) \pi(\mathcal{H}_1)}{ p(Y | \mathcal{H}_1) \pi(\mathcal{H}_1) }\\
\\
\therefore \underbrace{\frac{\pi(\mathcal{H}_0 | Y)}{\pi(\mathcal{H}_1 | Y)}}_{\text{posterior odds}} & = \underbrace{\frac{ \pi(\mathcal{H}_0) }{ \pi(\mathcal{H}_1) }}_{\text{prior odds}} \times \underbrace{\frac{ p(Y | \mathcal{H}_0) }{ p(Y | \mathcal{H}_1) }}_{\text{Bayes factor } \mathcal{BF}_{01}} \\
\end{split}`
$$
]
]

--

- Therefore, the Bayes factor can be thought of as the factor by which our prior odds change (towards the posterior odds) in light of the data.

--

- In linear regression, **BIC** approximates the `\(\mathcal{BF}\)` comparing a model to the null model.

---
## Bayes factors

- While Bayes factors can be appealing, calculating them can be computationally demanding.

--

- Why have we been "mildly obsessed" with MCMC sampling? To avoid computing any **marginal likelihoods**! Well, guess what? Bayes factors are ratios of marginal likelihoods, taking us back to the problem we always try to avoid.

--

- Of course, this isn't all *"doom and gloom"*; there are various ways (once again!) of getting around computing those marginal likelihoods analytically.

--

- Unfortunately, we will not have time to cover them in this course.

---
## Bayes factors

- As a teaser, one approach is to flip the relationship on the previous slide:
.block[
.small[
$$
`\begin{split}
\underbrace{\frac{ p(Y | \mathcal{H}_0) }{ p(Y | \mathcal{H}_1) }}_{\text{Bayes factor } \mathcal{BF}_{01}} & = \underbrace{\frac{\pi(\mathcal{H}_0 | Y)}{\pi(\mathcal{H}_1 | Y)}}_{\text{posterior odds}} \times \underbrace{\frac{ \pi(\mathcal{H}_1) }{ \pi(\mathcal{H}_0) }}_{1/\text{prior odds}}, \\
\end{split}`
$$
]
]

  which is easy to compute as long as we can use posterior samples to compute/approximate the posterior odds (see the sketch below).

--

- Bayes factors can work well when the underlying model is discrete but do not work well for models that are inherently continuous.
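--

- To make the teaser concrete, here is a minimal sketch of that idea; `model_draws` is a hypothetical stand-in (simulated here) for posterior draws of a model indicator from a sampler that moves between `\(\mathcal{H}_0\)` and `\(\mathcal{H}_1\)`, with `\(\pi(\mathcal{H}_0) = \pi(\mathcal{H}_1) = 0.5\)`.

```r
# Approximate BF01 from posterior draws of a model indicator (1 = H1, 0 = H0).
# The draws below are simulated stand-ins, not output from a real sampler.
set.seed(360)
model_draws <- rbinom(10000, 1, 0.4)
post_odds_01 <- mean(model_draws == 0) / mean(model_draws == 1)  # estimated posterior odds of H0
prior_odds_01 <- 0.5 / 0.5                                       # prior odds of H0
bf01_hat <- post_odds_01 / prior_odds_01                         # BF01 = posterior odds / prior odds
bf01_hat
```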
--

- For more discussion of these issues, see Chapter 7.4 of [Bayesian Data Analysis (Third Edition)](https://find.library.duke.edu/catalog/DUKE006588051?utm_campaign=bento&utm_content=bento_result_link&utm_source=library.duke.edu&utm_medium=referral).

--

- Even in the discrete case, Bayes factors are not perfect, as we see in the following example.

---
## Hypothesis testing example

- Suppose we have univariate data `\(y_1, \ldots, y_n | \theta \overset{iid}{\sim} \text{Bernoulli}(\theta)\)`.

--

- Also, suppose we wish to test `\(\mathcal{H}_0: \theta = 0.5 \ \ \text{vs. } \mathcal{H}_1: \theta \neq 0.5\)`, using the Bayes factor.

--

- First, we need to put priors on the two hypotheses. Again, if, a priori, we believe the two hypotheses are equally likely, then we can set
.block[
`$$\pi(\mathcal{H}_0) = \Pr(\mathcal{H}_0 = \text{True}) = 0.5; \ \ \pi(\mathcal{H}_1) = \Pr(\mathcal{H}_1 = \text{True}) = 0.5.$$`
]

--

- Next, we need to put priors on the parameters in each model.

  + When `\(\mathcal{H}_0\)` is true, we have that `\(\theta = 0.5\)`, so there is no need for a prior on `\(\theta\)`.

  + When `\(\mathcal{H}_1\)` is true, we can set a conjugate prior for `\(\theta\)`, that is, `\(\text{Beta}(a,b)\)`.

---
## Hypothesis testing example

- To compute the Bayes factor, we need to compute `\(p(Y | \mathcal{H}_0)\)` and `\(p(Y | \mathcal{H}_1)\)`.

--

- For each, we start with the joint distribution of the data and the parameter given the hypothesis, then integrate out the parameter.

--

- For `\(p(Y | \mathcal{H}_0)\)`, we have
.block[
.small[
$$
`\begin{split}
p(Y | \mathcal{H}_0) & = \int_0^1 p(Y, \theta | \mathcal{H}_0) \textrm{d}\theta \\
& = \int_0^1 p(Y | \mathcal{H}_0, \theta) \cdot \pi(\theta | \mathcal{H}_0) \textrm{d}\theta \\
& = \int_0^1 p(Y | \theta = 0.5) \cdot 1 \ \textrm{d}\theta \\
& = \int_0^1 0.5^{\sum_{i=1}^n y_i}(1-0.5)^{n-\sum_{i=1}^n y_i} \cdot 1 \ \textrm{d}\theta \\
& = 0.5^n \int_0^1 1 \ \textrm{d}\theta \\
& = 0.5^n
\end{split}`
$$
]
]

---
## Hypothesis testing example

- For `\(p(Y | \mathcal{H}_1)\)`, we have
.block[
.small[
$$
`\begin{split}
p(Y | \mathcal{H}_1) & = \int_0^1 p(Y | \mathcal{H}_1, \theta) \cdot \pi(\theta | \mathcal{H}_1) \textrm{d}\theta \\
& = \int_0^1 \theta^{\sum_{i=1}^n y_i}(1-\theta)^{n-\sum_{i=1}^n y_i} \cdot \frac{1}{B(a,b)} \theta^{a-1} (1-\theta)^{b-1} \textrm{d}\theta \\
& = \frac{1}{B(a,b)} \int_0^1 \theta^{a + \sum_{i=1}^n y_i - 1}(1-\theta)^{b + n-\sum_{i=1}^n y_i - 1} \textrm{d}\theta \\
& = \frac{B(a + \sum_{i=1}^n y_i,b+n-\sum_{i=1}^n y_i)}{B(a,b)}
\end{split}`
$$
]
]

--

- The Bayes factor in favor of `\(\mathcal{H}_0\)`, `\(\mathcal{BF}_{01}\)`, is therefore
.block[
.small[
$$
`\begin{split}
\mathcal{BF}_{01} & = \frac{p(Y | \mathcal{H}_0)}{p(Y | \mathcal{H}_1)} = \frac{0.5^n B(a,b)}{B(a + \sum_{i=1}^n y_i,b+n-\sum_{i=1}^n y_i)}.
\end{split}`
$$
]
]

--

- Also,
.block[
.small[
$$
`\begin{split}
\pi(\mathcal{H}_1 | Y) = \frac{ 1 }{ \mathcal{BF}_{01} + 1 } = \frac{ 1 }{ \frac{0.5^n B(a,b)}{B(a + \sum_{i=1}^n y_i,b+n-\sum_{i=1}^n y_i)} + 1 }.\\
\end{split}`
$$
]
]

---
## Hypothesis testing example

- Suppose the true value of `\(\theta\)` is `\(0.6\)`, and that in `\(n = 20\)` trials, we observe `\(13\)` successes, that is, `\(\sum_{i=1}^n y_i = 13\)`.

--

- If we assume a `\(\text{Beta}(a=1,b=1)\)` prior on `\(\theta\)`, then `\(\mathcal{BF}_{01}\)` is

```r
0.5^20*beta(1,1)/beta(1+13,1+7)
```

```
## [1] 1.552505
```

--

- On the other hand, `\(\mathcal{BF}_{10} \approx 0.64\)`.
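--

- As a quick check, the minimal sketch below recomputes `\(\mathcal{BF}_{10}\)` and the corresponding posterior probabilities under our equal prior probabilities; the values in the comments are approximate.

```r
# Recompute BF01, then BF10 and the posterior model probabilities
# pi(H0 | Y) and pi(H1 | Y), assuming pi(H0) = pi(H1) = 0.5.
bf01 <- 0.5^20 * beta(1, 1) / beta(1 + 13, 1 + 7)  # about 1.553
bf10 <- 1 / bf01                                   # about 0.644
post_H1 <- 1 / (bf01 + 1)                          # about 0.392
c(BF10 = bf10, post_H0 = 1 - post_H1, post_H1 = post_H1)
```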
- So, even though our estimate of `\(\theta\)` based on the data is `\(\hat{\theta} = \frac{13}{20}=0.65\)`, we still have stronger evidence in favor of `\(\mathcal{H}_0\)` than `\(\mathcal{H}_1\)`, which is interesting!

--

- There are a few contributing factors, including the sample size, our choice of prior, and how far `\(\hat{\theta}\)` is from the true `\(\theta\)`.

--

- You will explore this in more detail on the homework.

---
class: center, middle

# What's next?

### Move on to the readings for the next module!