class: center, middle, inverse, title-slide

# STA 360/602L: Module 4.1

## Multivariate normal model I

### Dr. Olanrewaju Michael Akande

---

## Multivariate data

- So far we have only considered basic models with scalar/univariate outcomes, `\(Y_1, \ldots, Y_n\)`.

--

- In practice, however, outcomes of interest are often multivariate, e.g.,

  + Repeated measures of weight over time in a weight loss study

  + Measures of multiple disease markers

  + Tumor counts at different locations along the intestine

--

- Longitudinal data is just a special case of multivariate data.

--

- Interest is then often in how the multiple outcomes are correlated, and in how that correlation may change across outcomes or time points.

---

## Multivariate normal distribution

- The most common model for multivariate outcomes is the .hlight[multivariate normal distribution].

--

- Let `\(\boldsymbol{Y} = (Y_1,\ldots,Y_p)^T\)`, where `\(p\)` represents the dimension of the multivariate outcome variable for a single unit of observation.

--

- For multiple observations, `\(\boldsymbol{Y}_i = (Y_{i1},\ldots,Y_{ip})^T\)` for `\(i = 1,\ldots, n\)`.

--

- `\(\boldsymbol{Y}\)` follows a multivariate normal distribution, that is, `\(\boldsymbol{Y} \sim \mathcal{N}_p(\boldsymbol{\mu}, \Sigma)\)`, if
.block[
`$$p(\boldsymbol{y} | \boldsymbol{\mu}, \Sigma) = (2\pi)^{-\frac{p}{2}} \left|\Sigma\right|^{-\frac{1}{2}} \ \textrm{exp} \left\{-\dfrac{1}{2} (\boldsymbol{y} - \boldsymbol{\mu})^T \Sigma^{-1} (\boldsymbol{y} - \boldsymbol{\mu})\right\},$$`
]
where `\(\left|\Sigma\right|\)` denotes the determinant of `\(\Sigma\)`.

---

## Multivariate normal distribution

If `\(\boldsymbol{Y} \sim \mathcal{N}_p(\boldsymbol{\mu}, \Sigma)\)`, then

--

+ `\(\boldsymbol{\mu}\)` is the `\(p \times 1\)` mean vector, that is,
`$$\boldsymbol{\mu} = \mathbb{E}[\boldsymbol{Y}] = \left(\mathbb{E}[Y_1],\ldots,\mathbb{E}[Y_p]\right)^T = (\mu_1, \ldots,\mu_p)^T.$$`

--

+ `\(\Sigma\)` is the `\(p \times p\)` **positive definite and symmetric** covariance matrix, that is, `\(\Sigma = \{\sigma_{jk}\}\)`, where `\(\sigma_{jk}\)` denotes the covariance between `\(Y_j\)` and `\(Y_k\)`.

--

+ `\(Y_1,\ldots,Y_p\)` may be dependent, depending on the structure of `\(\Sigma\)`, which characterizes the (linear) associations between them.

--

+ For each `\(j = 1, \ldots, p\)`, `\(Y_j \sim \mathcal{N}(\mu_j, \sigma_{jj})\)`.

---

## Bivariate normal distribution

- In the bivariate case, we have
.block[
.small[
`\begin{eqnarray*}
\boldsymbol{Y} = \begin{pmatrix}Y_1\\ Y_2 \end{pmatrix} & \sim & \mathcal{N}_2\left[\boldsymbol{\mu} = \left(\begin{array}{c} \mu_1\\ \mu_2 \end{array}\right),\Sigma = \left(\begin{array}{cc} \sigma_{11} = \sigma^2_1 & \sigma_{12} \\ \sigma_{21} & \sigma_{22} = \sigma^2_2 \end{array}\right)\right],\\
\end{eqnarray*}`
]
]
and `\(\sigma_{12} = \sigma_{21} = \mathbb{Cov}[Y_1,Y_2]\)`.

--

- The correlation between `\(Y_1\)` and `\(Y_2\)` is defined as
.block[
.small[
`$$\rho_{1,2} = \dfrac{\mathbb{Cov}[Y_1,Y_2]}{\sqrt{\mathbb{Var}[Y_1]}\sqrt{\mathbb{Var}[Y_2]}} = \dfrac{\sigma_{12}}{\sigma_1 \sigma_2}.$$`
]
]

--

- `\(-1 \leq \rho_{1,2} \leq 1\)`.

--

- The correlation coefficient is free of the measurement units of `\(Y_1\)` and `\(Y_2\)`.
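
---

## Bivariate normal distribution

- To make the density formula concrete, here is a minimal numerical check, a sketch in Python with made-up values for `\(\boldsymbol{\mu}\)` and `\(\Sigma\)`: we evaluate the density formula from earlier directly, and compare against `scipy.stats.multivariate_normal`.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Made-up bivariate example: sigma_1 = 1, sigma_2 = 2, rho = 0.5
mu = np.array([0.0, 1.0])
s1, s2, rho = 1.0, 2.0, 0.5
Sigma = np.array([[s1**2,         rho * s1 * s2],
                  [rho * s1 * s2, s2**2        ]])
y = np.array([0.5, 2.0])

# Density formula from the slide:
# (2 pi)^(-p/2) |Sigma|^(-1/2) exp{-(1/2) (y - mu)^T Sigma^(-1) (y - mu)}
p = len(mu)
diff = y - mu
quad = diff @ np.linalg.solve(Sigma, diff)
dens = ((2 * np.pi) ** (-p / 2)
        * np.linalg.det(Sigma) ** (-0.5)
        * np.exp(-0.5 * quad))

# The two values should agree
print(dens, multivariate_normal(mean=mu, cov=Sigma).pdf(y))
```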

---

## Back to the multivariate normal

- The multivariate normal has many special properties, which we will see as we continue to work with the distribution.

--

- First, the dependence between any `\(Y_j\)` and `\(Y_k\)` does not depend on the other `\(p-2\)` variables.

--

- Second, while in general **independence implies zero covariance**, for the normal family the converse is also true. That is, **zero covariance also implies independence**.

--

- Thus, the covariance matrix `\(\Sigma\)` carries a lot of information about marginal relationships, especially **marginal independence**.

--

- If `\(\boldsymbol{\epsilon} = (\epsilon_1, \ldots, \epsilon_p)^T \sim \mathcal{N}_p(\boldsymbol{0}, \boldsymbol{I}_p)\)`, that is, `\(\epsilon_1, \ldots, \epsilon_p \overset{iid}{\sim} \mathcal{N}(0,1)\)`, then
.block[
`$$\boldsymbol{Y} = \boldsymbol{\mu} + A\boldsymbol{\epsilon} \ \Rightarrow \ \boldsymbol{Y} \sim \mathcal{N}_p(\boldsymbol{\mu}, \Sigma)$$`
]
holds for any matrix square root `\(A\)` of `\(\Sigma\)`, that is, any `\(A\)` with `\(AA^T = \Sigma\)` (see Cholesky decomposition); we will use this construction in the simulation check after the conditional distributions below.

---

## Conditional distributions

- Partition `\(\boldsymbol{Y} = (Y_1,\ldots,Y_p)^T\)` as
.block[
.small[
`\begin{eqnarray*}
\boldsymbol{Y} = \begin{pmatrix}\boldsymbol{Y}_1\\ \boldsymbol{Y}_2 \end{pmatrix} & \sim & \mathcal{N}_p\left[\left(\begin{array}{c} \boldsymbol{\mu}_1\\ \boldsymbol{\mu}_2 \end{array}\right),\left(\begin{array}{cc} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{array}\right)\right],\\
\end{eqnarray*}`
]
]
where

  + `\(\boldsymbol{Y}_1\)` and `\(\boldsymbol{\mu}_1\)` are `\(q \times 1\)`,

  + `\(\boldsymbol{Y}_2\)` and `\(\boldsymbol{\mu}_2\)` are `\((p-q) \times 1\)`,

  + `\(\Sigma_{11}\)` is `\(q \times q\)`, and

  + `\(\Sigma_{22}\)` is `\((p-q) \times (p-q)\)`, with `\(\Sigma_{22} > 0\)` (positive definite).

--

- Then,
.block[
`$$\boldsymbol{Y}_1 | \boldsymbol{Y}_2 = \boldsymbol{y}_2 \sim \mathcal{N}_q\left(\boldsymbol{\mu}_1 + \Sigma_{12}\Sigma_{22}^{-1} (\boldsymbol{y}_2-\boldsymbol{\mu}_2), \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right).$$`
]

--

- The marginal distributions are once again normal, that is,
.block[
`$$\boldsymbol{Y}_1 \sim \mathcal{N}_q\left(\boldsymbol{\mu}_1, \Sigma_{11}\right); \ \ \ \boldsymbol{Y}_2 \sim \mathcal{N}_{p-q}\left(\boldsymbol{\mu}_2, \Sigma_{22}\right).$$`
]

---

## Conditional distributions

- In the bivariate normal case with
.block[
.small[
`\begin{eqnarray*}
\boldsymbol{Y} = \begin{pmatrix}Y_1\\ Y_2 \end{pmatrix} & \sim & \mathcal{N}_2\left[\boldsymbol{\mu} = \left(\begin{array}{c} \mu_1\\ \mu_2 \end{array}\right),\Sigma = \left(\begin{array}{cc} \sigma_{11} = \sigma^2_1 & \sigma_{12} \\ \sigma_{21} & \sigma_{22} = \sigma^2_2 \end{array}\right)\right],\\
\end{eqnarray*}`
]
]
we have
.block[
.small[
`$$Y_1 | Y_2 = y_2 \sim \mathcal{N}\left(\mu_1 + \dfrac{\sigma_{12}}{\sigma_2^2} (y_2-\mu_2), \ \sigma_1^2 - \dfrac{\sigma_{12}^2}{\sigma_2^2}\right),$$`
]
]

--

which can also be written as
.block[
.small[
`$$Y_1 | Y_2 = y_2 \sim \mathcal{N}\left(\mu_1 + \dfrac{\sigma_{1}}{\sigma_2} \rho (y_2-\mu_2), \ (1-\rho^2) \sigma_1^2 \right).$$`
]
]
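
---

## Conditional distributions

- We can check these formulas by simulation, a sketch in Python with made-up parameter values: draw from the joint distribution using the `\(\boldsymbol{Y} = \boldsymbol{\mu} + A\boldsymbol{\epsilon}\)` construction from earlier, with `\(A\)` the Cholesky factor of `\(\Sigma\)`, then compare draws with `\(Y_2 \approx y_2\)` to the theoretical conditional moments.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up bivariate parameters: sigma_1 = 1, sigma_2 = 2, sigma_12 = 0.6
mu = np.array([1.0, 2.0])
Sigma = np.array([[1.0, 0.6],
                  [0.6, 4.0]])

# Sample via Y = mu + A eps, where A A^T = Sigma (Cholesky factor)
A = np.linalg.cholesky(Sigma)
eps = rng.standard_normal((200_000, 2))
Y = mu + eps @ A.T

# Theoretical moments of Y_1 | Y_2 = y2 from the slide
y2 = 3.0
cond_mean = mu[0] + (Sigma[0, 1] / Sigma[1, 1]) * (y2 - mu[1])
cond_var = Sigma[0, 0] - Sigma[0, 1] ** 2 / Sigma[1, 1]

# Empirical check: keep only draws whose Y_2 lands near y2
near = np.abs(Y[:, 1] - y2) < 0.05
print(cond_mean, Y[near, 0].mean())   # should be close
print(cond_var, Y[near, 0].var())     # should be close
```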

---

## Multivariate normal likelihood

- Suppose `\(\boldsymbol{Y}_i = (Y_{i1},\ldots,Y_{ip})^T \sim \mathcal{N}_p(\boldsymbol{\theta}, \Sigma)\)`, `\(i = 1, \ldots, n\)`.

--

- Write `\(\boldsymbol{Y} = (\boldsymbol{y}_1,\ldots,\boldsymbol{y}_n)^T\)`. The resulting likelihood can then be written as
.block[
.small[
$$
`\begin{split}
p(\boldsymbol{Y} | \boldsymbol{\theta}, \Sigma) & = \prod^n_{i=1} (2\pi)^{-\frac{p}{2}} \left|\Sigma\right|^{-\frac{1}{2}} \ \textrm{exp} \left\{-\dfrac{1}{2} (\boldsymbol{y}_i - \boldsymbol{\theta})^T \Sigma^{-1} (\boldsymbol{y}_i - \boldsymbol{\theta})\right\}\\
\\
& \propto \left|\Sigma\right|^{-\frac{n}{2}} \ \textrm{exp} \left\{-\dfrac{1}{2} \sum^n_{i=1} (\boldsymbol{y}_i - \boldsymbol{\theta})^T \Sigma^{-1} (\boldsymbol{y}_i - \boldsymbol{\theta})\right\}.
\end{split}`
$$
]
]

--

- It will be super useful to be able to write this likelihood in two different formulations, depending on whether we care about the posterior of `\(\boldsymbol{\theta}\)` or of `\(\Sigma\)`.

---

## Multivariate normal likelihood

- For inference on `\(\boldsymbol{\theta}\)`, it is convenient to write `\(p(\boldsymbol{Y} | \boldsymbol{\theta}, \Sigma)\)` as
.block[
.small[
$$
`\begin{split}
p(\boldsymbol{Y} | \boldsymbol{\theta}, \Sigma) & \propto \underbrace{\left|\Sigma\right|^{-\frac{n}{2}}}_{\text{does not involve } \boldsymbol{\theta}} \ \ \cdot \ \ \textrm{exp} \left\{-\dfrac{1}{2} \sum^n_{i=1} (\boldsymbol{y}_i - \boldsymbol{\theta})^T \Sigma^{-1} (\boldsymbol{y}_i - \boldsymbol{\theta})\right\}\\
& \propto \textrm{exp} \left\{-\dfrac{1}{2} \sum^n_{i=1} (\boldsymbol{y}_i^T - \boldsymbol{\theta}^T) \Sigma^{-1} (\boldsymbol{y}_i - \boldsymbol{\theta})\right\}\\
& = \textrm{exp} \left\{-\dfrac{1}{2} \sum^n_{i=1} \left[\underbrace{\boldsymbol{y}_i^T\Sigma^{-1}\boldsymbol{y}_i}_{\text{does not involve } \boldsymbol{\theta}} - \ \ \ \underbrace{ \boldsymbol{y}_i^T\Sigma^{-1}\boldsymbol{\theta} - \boldsymbol{\theta}^T\Sigma^{-1}\boldsymbol{y}_i}_{\text{same term}} \ \ \ + \ \ \ \boldsymbol{\theta}^T\Sigma^{-1}\boldsymbol{\theta} \right] \right\}\\
& \propto \textrm{exp} \left\{-\dfrac{1}{2} \sum^n_{i=1} \left[\boldsymbol{\theta}^T\Sigma^{-1}\boldsymbol{\theta} -2\boldsymbol{\theta}^T\Sigma^{-1}\boldsymbol{y}_i \right] \right\}\\
& = \textrm{exp} \left\{-\dfrac{1}{2} \sum^n_{i=1} \boldsymbol{\theta}^T\Sigma^{-1}\boldsymbol{\theta} - \dfrac{1}{2}\sum^n_{i=1} (-2)\boldsymbol{\theta}^T\Sigma^{-1}\boldsymbol{y}_i \right\}\\
& = \textrm{exp} \left\{-\dfrac{1}{2} n \boldsymbol{\theta}^T\Sigma^{-1}\boldsymbol{\theta} + \boldsymbol{\theta}^T\Sigma^{-1} \sum^n_{i=1}\boldsymbol{y}_i \right\}\\
& = \textrm{exp} \left\{-\dfrac{1}{2} \boldsymbol{\theta}^T(n\Sigma^{-1})\boldsymbol{\theta} + \boldsymbol{\theta}^T (n\Sigma^{-1} \bar{\boldsymbol{y}}) \right\},\\
\end{split}`
$$
]
]
where `\(\bar{\boldsymbol{y}} = (\bar{y}_1,\ldots,\bar{y}_p)^T\)` is the vector of sample means.
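
---

## Multivariate normal likelihood

- We can verify this algebra numerically: as functions of `\(\boldsymbol{\theta}\)`, the first and last exponents above should differ only by a constant that does not involve `\(\boldsymbol{\theta}\)`. A sketch in Python follows, where the data and `\(\Sigma\)` are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
Y = rng.normal(size=(n, p))        # made-up data matrix
Sigma = np.cov(Y, rowvar=False)    # any positive definite Sigma will do
Sinv = np.linalg.inv(Sigma)
ybar = Y.mean(axis=0)

def exponent_full(theta):
    # -(1/2) sum_i (y_i - theta)^T Sigma^(-1) (y_i - theta)
    R = Y - theta
    return -0.5 * np.sum((R @ Sinv) * R)

def exponent_theta(theta):
    # -(1/2) theta^T (n Sigma^(-1)) theta + theta^T (n Sigma^(-1) ybar)
    return -0.5 * n * theta @ Sinv @ theta + theta @ (n * Sinv @ ybar)

# The difference should be the same constant for every theta
for theta in (np.zeros(p), np.ones(p), rng.normal(size=p)):
    print(exponent_full(theta) - exponent_theta(theta))
```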

---

## Multivariate normal likelihood

- For inference on `\(\Sigma\)`, we need to rewrite the likelihood a bit.

--

- First, a few results from matrix algebra:

  1. `\(\text{tr}(\boldsymbol{A}) = \sum^p_{j=1}a_{jj}\)`, where `\(a_{jj}\)` is the `\(j\)`th diagonal element of a square `\(p \times p\)` matrix `\(\boldsymbol{A}\)`, and `\(\text{tr}(\cdot)\)` is the **trace function** (the sum of the diagonal elements).

--

  2. Cyclic property:
  `$$\text{tr}(\boldsymbol{A}\boldsymbol{B}\boldsymbol{C}) = \text{tr}(\boldsymbol{B}\boldsymbol{C}\boldsymbol{A}) = \text{tr}(\boldsymbol{C}\boldsymbol{A}\boldsymbol{B}),$$`
  given that the product `\(\boldsymbol{A}\boldsymbol{B}\boldsymbol{C}\)` is a square matrix.

--

  3. If `\(\boldsymbol{A}\)` is a `\(p \times p\)` matrix, then for a `\(p \times 1\)` vector `\(\boldsymbol{x}\)`,
  `$$\boldsymbol{x}^T \boldsymbol{A} \boldsymbol{x} = \text{tr}(\boldsymbol{x}^T \boldsymbol{A} \boldsymbol{x})$$`
  holds by (1), since `\(\boldsymbol{x}^T \boldsymbol{A} \boldsymbol{x}\)` is a scalar.

--

  4. `\(\text{tr}(\boldsymbol{A} + \boldsymbol{B}) = \text{tr}(\boldsymbol{A}) + \text{tr}(\boldsymbol{B})\)`.

---

## Multivariate normal likelihood

- Using these results, it is convenient to rewrite `\(p(\boldsymbol{Y} | \boldsymbol{\theta}, \Sigma)\)` as
.block[
.small[
$$
`\begin{split}
p(\boldsymbol{Y} | \boldsymbol{\theta}, \Sigma) & \propto \underbrace{\left|\Sigma\right|^{-\frac{n}{2}} \ \textrm{exp} \left\{-\dfrac{1}{2} \sum^n_{i=1} (\boldsymbol{y}_i - \boldsymbol{\theta})^T \Sigma^{-1} (\boldsymbol{y}_i - \boldsymbol{\theta})\right\}}_{\text{no algebra/change yet}}\\
& = \left|\Sigma\right|^{-\frac{n}{2}} \ \textrm{exp} \left\{-\dfrac{1}{2} \sum^n_{i=1} \underbrace{\text{tr}\left[(\boldsymbol{y}_i - \boldsymbol{\theta})^T \Sigma^{-1} (\boldsymbol{y}_i - \boldsymbol{\theta}) \right]}_{\text{by result 3}} \right\}\\
& = \left|\Sigma\right|^{-\frac{n}{2}} \ \textrm{exp} \left\{-\dfrac{1}{2} \sum^n_{i=1} \underbrace{\text{tr}\left[(\boldsymbol{y}_i - \boldsymbol{\theta})(\boldsymbol{y}_i - \boldsymbol{\theta})^T \Sigma^{-1} \right]}_{\text{by the cyclic property}} \right\}\\
& = \left|\Sigma\right|^{-\frac{n}{2}} \ \textrm{exp} \left\{-\dfrac{1}{2} \underbrace{\text{tr}\left[\sum^n_{i=1} (\boldsymbol{y}_i - \boldsymbol{\theta})(\boldsymbol{y}_i - \boldsymbol{\theta})^T \Sigma^{-1} \right]}_{\text{by result 4}} \right\}\\
& = \left|\Sigma\right|^{-\frac{n}{2}} \ \textrm{exp} \left\{-\dfrac{1}{2}\text{tr}\left[\boldsymbol{S}_\theta \Sigma^{-1} \right] \right\},\\
\end{split}`
$$
]
]
where `\(\boldsymbol{S}_\theta = \sum^n_{i=1}(\boldsymbol{y}_i - \boldsymbol{\theta})(\boldsymbol{y}_i - \boldsymbol{\theta})^T\)` is the residual sum of squares matrix; the appendix slide checks this trace identity numerically.

---

class: center, middle

# What's next?

### Move on to the readings for the next module!
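
---

## Appendix: checking the trace identity

- As a companion to the `\(\boldsymbol{S}_\theta\)` derivation, here is a minimal sketch in Python, with made-up data, confirming that `\(\sum^n_{i=1} (\boldsymbol{y}_i - \boldsymbol{\theta})^T \Sigma^{-1} (\boldsymbol{y}_i - \boldsymbol{\theta}) = \text{tr}\left[\boldsymbol{S}_\theta \Sigma^{-1}\right]\)`.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 30, 4
Y = rng.normal(size=(n, p))        # made-up data
theta = rng.normal(size=p)
Sigma = np.cov(Y, rowvar=False)
Sinv = np.linalg.inv(Sigma)

# Left-hand side: sum of quadratic forms over observations
R = Y - theta
lhs = np.sum((R @ Sinv) * R)

# Right-hand side: tr(S_theta Sigma^(-1)),
# with S_theta = sum_i r_i r_i^T = R^T R
S_theta = R.T @ R
rhs = np.trace(S_theta @ Sinv)

print(lhs, rhs)   # identical up to floating-point error
```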