class: center, middle, inverse, title-slide # STA 360/602L: Module 4.2 ## Multivariate normal model II ### Dr. Olanrewaju Michael Akande --- ## Multivariate normal likelihood recap - For data `\(\boldsymbol{Y}_i = (Y_{i1},\ldots,Y_{ip})^T \sim \mathcal{N}_p(\boldsymbol{\theta}, \Sigma)\)`, the likelihood is .block[ $$ `\begin{split} p(\boldsymbol{Y} | \boldsymbol{\theta}, \Sigma) & \propto \left|\Sigma\right|^{-\frac{n}{2}} \ \textrm{exp} \left\{-\dfrac{1}{2} \sum^n_{i=1} (\boldsymbol{y}_i - \boldsymbol{\theta})^T \Sigma^{-1} (\boldsymbol{y}_i - \boldsymbol{\theta})\right\}. \end{split}` $$ ] -- - For `\(\boldsymbol{\theta}\)`, it is convenient to write `\(p(\boldsymbol{Y} | \boldsymbol{\theta}, \Sigma)\)` as .block[ $$ `\begin{split} p(\boldsymbol{Y} | \boldsymbol{\theta}, \Sigma) & \propto \textrm{exp} \left\{-\dfrac{1}{2} \boldsymbol{\theta}^T(n\Sigma^{-1})\boldsymbol{\theta} + \boldsymbol{\theta}^T (n\Sigma^{-1} \bar{\boldsymbol{y}}) \right\},\\ \end{split}` $$ ] where `\(\bar{\boldsymbol{y}} = (\bar{y}_1,\ldots,\bar{y}_p)^T\)`. -- - For `\(\Sigma\)`, it is convenient to write `\(p(\boldsymbol{Y} | \boldsymbol{\theta}, \Sigma)\)` as .block[ $$ `\begin{split} p(\boldsymbol{Y} | \boldsymbol{\theta}, \Sigma) & \propto \left|\Sigma\right|^{-\frac{n}{2}} \ \textrm{exp} \left\{-\dfrac{1}{2}\text{tr}\left[\boldsymbol{S}_\theta \Sigma^{-1} \right] \right\},\\ \end{split}` $$ ] where `\(\boldsymbol{S}_\theta = \sum^n_{i=1}(\boldsymbol{y}_i - \boldsymbol{\theta})(\boldsymbol{y}_i - \boldsymbol{\theta})^T\)` is the residual sum of squares matrix. --- ## Prior for the mean - A convenient specification of the joint prior is `\(\pi(\boldsymbol{\theta}, \Sigma) = \pi(\boldsymbol{\theta}) \pi(\Sigma)\)`. -- - As in the univariate case, a convenient prior distribution for `\(\boldsymbol{\theta}\)` is also normal (multivariate in this case). -- - Assume that `\(\pi(\boldsymbol{\theta}) = \mathcal{N}_p(\boldsymbol{\mu}_0, \Lambda_0)\)`. 
-- - The pdf will be easier to work with if we write it as .block[ .small[ $$ `\begin{split} \pi(\boldsymbol{\theta}) & = (2\pi)^{-\frac{p}{2}} \left|\Lambda_0\right|^{-\frac{1}{2}} \ \textrm{exp} \left\{-\dfrac{1}{2} (\boldsymbol{\theta} - \boldsymbol{\mu}_0)^T \Lambda_0^{-1} (\boldsymbol{\theta} - \boldsymbol{\mu}_0)\right\}\\ & \propto \textrm{exp} \left\{-\dfrac{1}{2} (\boldsymbol{\theta} - \boldsymbol{\mu}_0)^T \Lambda_0^{-1} (\boldsymbol{\theta} - \boldsymbol{\mu}_0)\right\}\\ & = \textrm{exp} \left\{-\dfrac{1}{2} \left[\boldsymbol{\theta}^T\Lambda_0^{-1}\boldsymbol{\theta} - \underbrace{\boldsymbol{\theta}^T\Lambda_0^{-1}\boldsymbol{\mu}_0 - \boldsymbol{\mu}_0^T\Lambda_0^{-1}\boldsymbol{\theta}}_{\textrm{same term}} + \underbrace{\boldsymbol{\mu}_0^T\Lambda_0^{-1}\boldsymbol{\mu}_0}_{\text{does not involve } \boldsymbol{\theta}} \right] \right\}\\ & \propto \textrm{exp} \left\{-\dfrac{1}{2} \left[\boldsymbol{\theta}^T\Lambda_0^{-1}\boldsymbol{\theta} - 2\boldsymbol{\theta}^T\Lambda_0^{-1}\boldsymbol{\mu}_0 \right] \right\}\\ & = \textrm{exp} \left\{-\dfrac{1}{2} \boldsymbol{\theta}^T\Lambda_0^{-1}\boldsymbol{\theta} + \boldsymbol{\theta}^T\Lambda_0^{-1}\boldsymbol{\mu}_0 \right\}\\ \end{split}` $$ ] ] --- ## Prior for the mean - So we have .block[ .small[ $$ `\begin{split} \pi(\boldsymbol{\theta}) & \propto \textrm{exp} \left\{-\dfrac{1}{2} \boldsymbol{\theta}^T\Lambda_0^{-1}\boldsymbol{\theta} + \boldsymbol{\theta}^T\Lambda_0^{-1}\boldsymbol{\mu}_0 \right\}.\\ \end{split}` $$ ] ] -- - **Key trick for combining with likelihood:** When the normal density is written in this form, note the following details in the exponent. + In the first part, the inverse of the *covariance matrix* `\(\Lambda_0^{-1}\)` is "sandwiched" between `\(\boldsymbol{\theta}^T\)` and `\(\boldsymbol{\theta}\)`. + In the second part, the `\(\boldsymbol{\theta}\)` in the first part is replaced (sort of) with the *mean* `\(\boldsymbol{\mu}_0\)`, with `\(\Lambda_0^{-1}\)` keeping its place. -- - The two points above will help us identify **updated means** and **updated covariance matrices** relatively quickly. 
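---

## Prior and data summaries in R

- To make these pieces concrete, here is a minimal R sketch; the data matrix `Y` and the hyperparameter values are hypothetical, chosen only for illustration. These are exactly the quantities that appear in the kernels above and in the full conditional derived next.

```r
# hypothetical data for illustration: n = 100 observations on p = 2 variables
Y <- matrix(rnorm(200), nrow = 100, ncol = 2)

n    <- nrow(Y)
p    <- ncol(Y)
ybar <- colMeans(Y)          # sample mean vector (ybar_1, ..., ybar_p)

# hypothetical hyperparameters for the prior theta ~ N_p(mu0, Lambda0)
mu0     <- rep(0, p)         # prior mean
Lambda0 <- diag(100, p)      # diffuse prior covariance (100 on the diagonal)
```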
--- ## Conditional posterior for the mean - Our conditional posterior (full conditional) `\(\boldsymbol{\theta} | \Sigma , \boldsymbol{Y}\)`, is then .block[ .small[ $$ `\begin{split} \pi(\boldsymbol{\theta} | \Sigma, \boldsymbol{Y}) & \propto p(\boldsymbol{Y} | \boldsymbol{\theta}, \Sigma) \cdot \pi(\boldsymbol{\theta}) \\ \\ & \propto \underbrace{\textrm{exp} \left\{-\dfrac{1}{2} \boldsymbol{\theta}^T(n\Sigma^{-1})\boldsymbol{\theta} + \boldsymbol{\theta}^T (n\Sigma^{-1} \bar{\boldsymbol{y}}) \right\}}_{p(\boldsymbol{Y} | \boldsymbol{\theta}, \Sigma)} \cdot \underbrace{\textrm{exp} \left\{-\dfrac{1}{2} \boldsymbol{\theta}^T\Lambda_0^{-1}\boldsymbol{\theta} + \boldsymbol{\theta}^T\Lambda_0^{-1}\boldsymbol{\mu}_0 \right\}}_{\pi(\boldsymbol{\theta})} \\ \\ & = \textrm{exp} \left\{\underbrace{-\dfrac{1}{2} \boldsymbol{\theta}^T(n\Sigma^{-1})\boldsymbol{\theta} -\dfrac{1}{2} \boldsymbol{\theta}^T\Lambda_0^{-1}\boldsymbol{\theta}}_{\textrm{First parts from } p(\boldsymbol{Y} | \boldsymbol{\theta}, \Sigma) \textrm{ and } \pi(\boldsymbol{\theta})} + \underbrace{\boldsymbol{\theta}^T (n\Sigma^{-1} \bar{\boldsymbol{y}}) + \boldsymbol{\theta}^T\Lambda_0^{-1}\boldsymbol{\mu}_0}_{\textrm{Second parts from } p(\boldsymbol{Y} | \boldsymbol{\theta}, \Sigma) \textrm{ and } \pi(\boldsymbol{\theta})} \right\}\\ \\ & = \textrm{exp} \left\{-\dfrac{1}{2} \boldsymbol{\theta}^T \left[n\Sigma^{-1} + \Lambda_0^{-1}\right] \boldsymbol{\theta} + \boldsymbol{\theta}^T \left[ n\Sigma^{-1} \bar{\boldsymbol{y}} + \Lambda_0^{-1}\boldsymbol{\mu}_0 \right] \right\}, \end{split}` $$ ] ] which is just another multivariate normal distribution. --- ## Conditional posterior for the mean - To confirm the normal density and its parameters, compare to the prior kernel .block[ .small[ $$ `\begin{split} \pi(\boldsymbol{\theta}) & \propto \textrm{exp} \left\{-\dfrac{1}{2} \boldsymbol{\theta}^T\Lambda_0^{-1}\boldsymbol{\theta} + \boldsymbol{\theta}^T\Lambda_0^{-1}\boldsymbol{\mu}_0 \right\}\\ \end{split}` $$ ] ] and the posterior kernel we just derived, that is, .block[ .small[ $$ `\begin{split} \pi(\boldsymbol{\theta} | \Sigma, \boldsymbol{Y}) & \propto \textrm{exp} \left\{-\dfrac{1}{2} \boldsymbol{\theta}^T \left[\Lambda_0^{-1} + n\Sigma^{-1}\right] \boldsymbol{\theta} + \boldsymbol{\theta}^T \left[\Lambda_0^{-1}\boldsymbol{\mu}_0 + n\Sigma^{-1} \bar{\boldsymbol{y}} \right] \right\}. 
\end{split}` $$ ] ] -- - Easy to see (relatively) that `\(\boldsymbol{\theta} | \Sigma, \boldsymbol{Y} \sim \mathcal{N}_p(\boldsymbol{\mu}_n, \Lambda_n)\)`, with .block[ .small[ $$ \Lambda_n = \left[\Lambda_0^{-1} + n\Sigma^{-1}\right]^{-1} $$ ] ] and .block[ .small[ $$ \boldsymbol{\mu}_n = \Lambda_n \left[\Lambda_0^{-1}\boldsymbol{\mu}_0 + n\Sigma^{-1} \bar{\boldsymbol{y}} \right] $$ ] ] --- ## Bayesian inference - As in the univariate case, we once again have that + Posterior precision is sum of prior precision and data precision: .block[ .small[ $$ \Lambda_n^{-1} = \Lambda_0^{-1} + n\Sigma^{-1} $$ ] ] -- + Posterior expectation is weighted average of prior expectation and the sample mean: .block[ .small[ $$ `\begin{split} \boldsymbol{\mu}_n & = \Lambda_n \left[\Lambda_0^{-1}\boldsymbol{\mu}_0 + n\Sigma^{-1} \bar{\boldsymbol{y}} \right]\\ \\ & = \overbrace{\left[ \Lambda_n \Lambda_0^{-1} \right]}^{\textrm{weight on prior mean}} \underbrace{\boldsymbol{\mu}_0}_{\textrm{prior mean}} + \overbrace{\left[ \Lambda_n (n\Sigma^{-1}) \right]}^{\textrm{weight on sample mean}} \underbrace{ \bar{\boldsymbol{y}}}_{\textrm{sample mean}} \end{split}` $$ ] ] -- - Compare these to the results from the univariate case to gain more intuition. --- ## What about the covariance matrix? - In the univariate case with `\(y_i \sim \mathcal{N}(\mu, \sigma^2)\)`, the common choice for the prior is an inverse-gamma distribution for the variance `\(\sigma^2\)`. -- - As we have seen, we can rewrite as `\(y_i \sim \mathcal{N}(\mu, \tau^{-1})\)`, so that we have a gamma prior for the precision `\(\tau\)`. -- - In the multivariate normal case, we have a covariance matrix `\(\Sigma\)` instead of a scalar. -- - It would be appealing to have a matrix-valued extension of the inverse-gamma (and gamma) that is conjugate. -- - One complication is that the covariance matrix `\(\Sigma\)` must be **positive definite and symmetric**. --- ## Positive definite and symmetric - "Positive definite" means that for all nonzero `\(x \in \mathcal{R}^p\)`, `\(x^T \Sigma x > 0\)`. -- - Basically, this ensures that the diagonal elements of `\(\Sigma\)` (corresponding to the marginal variances) are positive. -- - It also ensures that the correlation coefficients for each pair of variables are between -1 and 1. -- - Our prior for `\(\Sigma\)` should thus assign probability one to the set of positive definite matrices. -- - Analogous to the univariate case, the .hlight[inverse-Wishart distribution] is the corresponding conditionally conjugate prior for `\(\Sigma\)` (multivariate generalization of the inverse-gamma). -- - The textbook covers the construction of Wishart and inverse-Wishart random variables. We will skip the actual development in class but will write code to sample random variates. --- ## Inverse-Wishart distribution - A random variable `\(\Sigma \sim \textrm{IW}_p(\nu_0, \boldsymbol{S}_0)\)`, where `\(\Sigma\)` is positive definite and `\(p \times p\)`, has pdf .block[ .small[ $$ `\begin{split} p(\Sigma) \ \propto \ \left|\Sigma\right|^{\frac{-(\nu_0 + p + 1)}{2}} \textrm{exp} \left\{-\dfrac{1}{2} \text{tr}(\boldsymbol{S}_0\Sigma^{-1}) \right\}, \end{split}` $$ ] ] where + `\(\nu_0 > p - 1\)` is the "degrees of freedom", and + `\(\boldsymbol{S}_0\)` is a `\(p \times p\)` positive definite matrix. -- - For this distribution, `\(\mathbb{E}[\Sigma] = \dfrac{1}{\nu_0 - p - 1} \boldsymbol{S}_0\)`, for `\(\nu_0 > p + 1\)`. -- - Hence, `\(\boldsymbol{S}_0\)` is the scaled mean of the `\(\textrm{IW}_p(\nu_0, \boldsymbol{S}_0)\)`.
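---

## Sampling from the inverse-Wishart in R

- As promised, we can sample `\(\Sigma \sim \textrm{IW}_p(\nu_0, \boldsymbol{S}_0)\)` without working through the construction: invert a draw from base R's `rWishart()` (the Wishart connection is made precise a couple of slides ahead). Below is a minimal sketch with hypothetical values of `\(p\)`, `\(\nu_0\)` and `\(\boldsymbol{S}_0\)`, including a Monte Carlo check of `\(\mathbb{E}[\Sigma] = \dfrac{1}{\nu_0 - p - 1} \boldsymbol{S}_0\)`.

```r
# hypothetical hyperparameters for a small p = 2 example
p   <- 2
nu0 <- p + 10                 # comfortably above p + 1, so the mean check is stable
S0  <- diag(1, p)

# one draw: rWishart() is parameterized by its scale matrix, here solve(S0);
# inverting the Wishart draw gives Sigma ~ IW_p(nu0, S0)
Sigma <- solve(rWishart(1, df = nu0, Sigma = solve(S0))[, , 1])

# Monte Carlo check of the mean: should be close to S0 / (nu0 - p - 1)
draws <- replicate(5000, solve(rWishart(1, nu0, solve(S0))[, , 1]))
apply(draws, c(1, 2), mean)
```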
--- ## Inverse-Wishart distribution - If we are very confident in a prior guess `\(\Sigma_0\)` for `\(\Sigma\)`, then we might set + `\(\nu_0\)`, the degrees of freedom, to be very large, and + `\(\boldsymbol{S}_0 = (\nu_0 - p - 1)\Sigma_0\)`. In this case, `\(\mathbb{E}[\Sigma] = \dfrac{1}{\nu_0 - p - 1} \boldsymbol{S}_0 = \dfrac{1}{\nu_0 - p - 1}(\nu_0 - p - 1)\Sigma_0 = \Sigma_0\)`, and `\(\Sigma\)` is tightly (depending on the value of `\(\nu_0\)`) centered around `\(\Sigma_0\)`. -- - If we are not at all confident but we still have a prior guess `\(\Sigma_0\)`, we might set + `\(\nu_0 = p + 2\)`, so that `\(\mathbb{E}[\Sigma] = \dfrac{1}{\nu_0 - p - 1} \boldsymbol{S}_0\)` is finite, and + `\(\boldsymbol{S}_0 = \Sigma_0\)`. Here, `\(\mathbb{E}[\Sigma] = \Sigma_0\)` as before, but `\(\Sigma\)` is only loosely centered around `\(\Sigma_0\)`. --- ## Wishart distribution - Just as we had with the gamma and inverse-gamma relationship in the univariate case, we can also work in terms of the .hlight[Wishart distribution] (multivariate generalization of the gamma) instead. -- - The .hlight[Wishart distribution] provides a conditionally conjugate prior for the precision matrix `\(\Sigma^{-1}\)` in a multivariate normal model. -- - Specifically, if `\(\Sigma \sim \textrm{IW}_p(\nu_0, \boldsymbol{S}_0)\)`, then `\(\Phi = \Sigma^{-1} \sim \textrm{W}_p(\nu_0, \boldsymbol{S}_0^{-1})\)`. -- - A random variable `\(\Phi \sim \textrm{W}_p(\nu_0, \boldsymbol{S}_0^{-1})\)`, where `\(\Phi\)` has dimension `\((p \times p)\)`, has pdf .block[ .small[ $$ `\begin{split} f(\Phi) \ \propto \ \left|\Phi\right|^{\frac{\nu_0 - p - 1}{2}} \textrm{exp} \left\{-\dfrac{1}{2} \text{tr}(\boldsymbol{S}_0\Phi) \right\}. \end{split}` $$ ] ] -- - Here, `\(\mathbb{E}[\Phi] = \nu_0 \boldsymbol{S}_0^{-1}\)`. -- - Note that the textbook writes the inverse-Wishart as `\(\textrm{IW}_p(\nu_0, \boldsymbol{S}_0^{-1})\)`. I prefer `\(\textrm{IW}_p(\nu_0, \boldsymbol{S}_0)\)` instead. Feel free to use either notation but try not to get confused.
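---

## Sampling the mean's full conditional in R

- Before deriving the full conditional for `\(\Sigma\)`, here is a minimal R sketch of a single draw from `\(\mathcal{N}_p(\boldsymbol{\mu}_n, \Lambda_n)\)`, the full conditional derived earlier; the function and argument names are hypothetical, and `MASS::mvrnorm()` does the multivariate normal sampling.

```r
library(MASS)   # for mvrnorm()

# one draw of theta from N_p(mu_n, Lambda_n), given the data summaries
# (ybar, n), the current value of Sigma, and the prior (mu0, Lambda0)
draw_theta <- function(ybar, n, Sigma, mu0, Lambda0) {
  iLambda0 <- solve(Lambda0)                  # prior precision
  iSigma   <- solve(Sigma)                    # precision of a single observation
  Lambda_n <- solve(iLambda0 + n * iSigma)    # posterior covariance
  mu_n     <- Lambda_n %*% (iLambda0 %*% mu0 + n * iSigma %*% ybar)
  mvrnorm(1, mu = as.vector(mu_n), Sigma = Lambda_n)
}

# usage (with objects from the earlier sketches):
# theta <- draw_theta(ybar, n, Sigma, mu0, Lambda0)
```

- The matching draw for `\(\Sigma\)` uses the full conditional derived on the next slides.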
--- ## Conditional posterior for covariance - Assuming `\(\pi(\Sigma) = \textrm{IW}_p(\nu_0, \boldsymbol{S}_0)\)`, the conditional posterior (full conditional) `\(\Sigma | \boldsymbol{\theta}, \boldsymbol{Y}\)`, is then .block[ .small[ $$ `\begin{split} \pi(\Sigma| \boldsymbol{\theta}, \boldsymbol{Y}) & \propto p(\boldsymbol{Y} | \boldsymbol{\theta}, \Sigma) \cdot \pi(\Sigma) \\ \\ & \propto \underbrace{\left|\Sigma\right|^{-\frac{n}{2}} \ \textrm{exp} \left\{-\dfrac{1}{2}\text{tr}\left[\boldsymbol{S}_\theta \Sigma^{-1} \right] \right\}}_{p(\boldsymbol{Y} | \boldsymbol{\theta}, \Sigma)} \cdot \underbrace{\left|\Sigma\right|^{\frac{-(\nu_0 + p + 1)}{2}} \textrm{exp} \left\{-\dfrac{1}{2} \text{tr}(\boldsymbol{S}_0\Sigma^{-1}) \right\}}_{\pi(\Sigma)} \\ \\ & \propto \left|\Sigma\right|^{\frac{-(\nu_0 + p + n + 1)}{2}} \textrm{exp} \left\{-\dfrac{1}{2} \text{tr}\left[\boldsymbol{S}_0\Sigma^{-1} + \boldsymbol{S}_\theta \Sigma^{-1} \right] \right\} ,\\ \\ & \propto \left|\Sigma\right|^{\frac{-(\nu_0 + n + p + 1)}{2}} \textrm{exp} \left\{-\dfrac{1}{2} \text{tr}\left[ \left(\boldsymbol{S}_0 + \boldsymbol{S}_\theta \right) \Sigma^{-1} \right] \right\} ,\\ \end{split}` $$ ] ] which is `\(\textrm{IW}_p(\nu_n, \boldsymbol{S}_n)\)`, or using the notation in the book, `\(\textrm{IW}_p(\nu_n, \boldsymbol{S}_n^{-1} )\)`, with + `\(\nu_n = \nu_0 + n\)`, and + `\(\boldsymbol{S}_n = \left[\boldsymbol{S}_0 + \boldsymbol{S}_\theta \right]\)` --- ## Conditional posterior for covariance - We once again see that the "posterior sample size" or "posterior degrees of freedom" `\(\nu_n\)` is the sum of the "prior degrees of freedom" `\(\nu_0\)` and the data sample size `\(n\)`. -- - `\(\boldsymbol{S}_n\)` can be thought of as the "posterior sum of squares", which is the sum of "prior sum of squares" plus "sample sum of squares". -- - Recall that if `\(\Sigma \sim \textrm{IW}_p(\nu_0, \boldsymbol{S}_0)\)`, then `\(\mathbb{E}[\Sigma] = \dfrac{1}{\nu_0 - p - 1} \boldsymbol{S}_0\)`. -- - `\(\Rightarrow\)` the conditional posterior expectation of the population covariance is .block[ .small[ $$ `\begin{split} \mathbb{E}[\Sigma | \boldsymbol{\theta}, \boldsymbol{Y}] & = \dfrac{1}{\nu_0 + n - p - 1} \left[\boldsymbol{S}_0 + \boldsymbol{S}_\theta \right]\\ \\ & = \underbrace{\dfrac{\nu_0 - p - 1}{\nu_0 + n - p - 1}}_{\text{weight on prior expectation}} \overbrace{\left[\dfrac{1}{\nu_0 - p - 1} \boldsymbol{S}_0\right]}^{\text{prior expectation}} + \underbrace{\dfrac{n}{\nu_0 + n - p - 1}}_{\text{weight on sample estimate}} \overbrace{\left[\dfrac{1}{n} \boldsymbol{S}_\theta \right]}^{\text{sample estimate}},\\ \end{split}` $$ ] ] which is a weighted average of prior expectation and sample estimate. --- class: center, middle # What's next? ### Move on to the readings for the next module!
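---

## Appendix: sampling the covariance's full conditional in R

- For completeness, a minimal R sketch of a single draw from `\(\textrm{IW}_p(\nu_0 + n, \boldsymbol{S}_0 + \boldsymbol{S}_\theta)\)`, the full conditional derived above, again via `rWishart()`; the function and argument names are hypothetical.

```r
# one draw of Sigma from IW_p(nu0 + n, S0 + S_theta), given the data matrix Y,
# the current value of theta, and the prior hyperparameters (nu0, S0)
draw_Sigma <- function(Y, theta, nu0, S0) {
  n       <- nrow(Y)
  resid   <- sweep(Y, 2, theta)        # subtracts theta from each row: y_i - theta
  S_theta <- t(resid) %*% resid        # residual sum of squares matrix
  nu_n    <- nu0 + n
  S_n     <- S0 + S_theta
  # Sigma^{-1} ~ W_p(nu_n, S_n^{-1})  <=>  Sigma ~ IW_p(nu_n, S_n)
  solve(rWishart(1, df = nu_n, Sigma = solve(S_n))[, , 1])
}

# usage: Sigma <- draw_Sigma(Y, theta, nu0, S0)
```

- Alternating this draw with `draw_theta()` from the earlier sketch uses both full conditionals derived in this module.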