class: center, middle, inverse, title-slide # STA 360/602L: Module 5.3 ## Hierarchical normal models with constant variance: multiple groups ### Dr. Olanrewaju Michael Akande --- ## Comparing multiple groups - Suppose we wish to investigate the mean (and distribution) of test scores for students at `\(J\)` different high schools. -- - In each school `\(j\)`, where `\(j = 1, \ldots, J\)`, suppose we test a random sample of `\(n_j\)` students. -- - Let `\(y_{ij}\)` be the test score for the `\(i\)`th student in school `\(j\)`, with `\(i = 1,\ldots, n_j\)`, and assume .block[ .small[ `$$y_{ij} | \theta_j, \sigma^2_j \sim \mathcal{N} \left( \theta_j, \sigma^2_j\right)$$` ] ] where for each school `\(j\)`, `\(\theta_j\)` is the school-wide average test score, and `\(\sigma^2_j\)` is the school-wide variance of individual test scores. -- - This is what we did for the Pygmalion study and job training data. --- ## School testing example - .hlight[Option I]: Classical inference for each school can be based on the large-sample 95% CI: `\(\bar{y}_j \pm 1.96 \sqrt{s^2_j/n_j}\)`, where `\(\bar{y}_j\)` is the sample average in school `\(j\)`, and `\(s^2_j\)` is the sample variance in school `\(j\)`. -- - Clearly, we can overfit the data within schools; for example, what if we only have 4 students from one of the schools? `\(\bar{y}_j\)` can be a good estimate if `\(n_j\)` is large, but it may be poor if `\(n_j\)` is small. -- - .hlight[Option II]: alternatively, we might believe that `\(\theta_j = \mu\)` for all `\(j\)`; that is, all schools have the same mean. This is the assumption (null hypothesis) in ANOVA models, for example. We can also set `\(\sigma^2_j = \sigma^2\)` for all `\(j\)`. -- - Option I ignores that the `\(\theta_j\)`'s should be reasonably similar, whereas option II ignores any differences between them. -- - It would be nice to find a compromise! Borrowing information across schools, and shrinking our estimates towards a **grand mean**, could be very useful here. --- ## School testing example - For the Pygmalion study and job training data, we focused on using priors that are independent between the groups. -- - For example, in the conjugate case, we would have .block[ .small[ $$ `\begin{split} \pi(\theta_j|\sigma^2_j) & = \mathcal{N}\left(\mu_0, \dfrac{\sigma^2_j}{\kappa_0}\right)\\ \pi(\sigma^2_j) & = \mathcal{IG} \left(\dfrac{\nu_0}{2}, \dfrac{\nu_0\sigma_0^2}{2}\right)\\ \end{split}` $$ ] ] for some hyperparameters (constants), `\(\mu_0\)`, `\(\kappa_0\)`, `\(\nu_0\)`, and `\(\sigma_0^2\)`. -- - In the semi-conjugate case, .block[ .small[ $$ `\begin{split} \pi(\theta_j) & = \mathcal{N}\left(\mu_0, \sigma^2_0\right)\\ \pi(\sigma^2_j) & = \mathcal{IG} \left(\dfrac{\nu_0}{2}, \dfrac{\nu_0\gamma_0^2}{2}\right)\\ \end{split}` $$ ] ] for some hyperparameters (constants), `\(\mu_0\)`, `\(\sigma^2_0\)`, `\(\nu_0\)`, and `\(\gamma_0^2\)`. --- ## Hierarchical normal model - Instead, we can assume that the `\(\theta_j\)`'s are themselves drawn from a common distribution; that is, we conceive of the schools as a random sample from all possible schools. -- - For now, assume the variance is constant across schools. 
The hierarchical normal model assumes normal sampling models both within and between groups: .block[ .small[ $$ `\begin{split} y_{ij} | \theta_j, \sigma^2 & \sim \mathcal{N} \left( \theta_j, \sigma^2\right); \ \ \ i = 1,\ldots, n_j\\ \theta_j | \mu, \tau^2 & \sim \mathcal{N} \left( \mu, \tau^2 \right); \ \ \ j = 1, \ldots, J, \end{split}` $$ ] ] which gives us an extra level in the prior on the means, and leads to sharing of information across the groups in estimating the group-specific means. -- - We have an extra variance parameter `\(\tau^2\)`. Comparing `\(\tau^2\)` to `\(\sigma^2\)` tells us how much of the variation in `\(Y\)` is due to within-group versus between-group variation. --- ## Hierarchical normal model - Standard semi-conjugate priors are given by .block[ .small[ $$ `\begin{split} \pi(\mu) & = \mathcal{N}\left(\mu_0, \gamma^2_0\right)\\ \pi(\sigma^2) & = \mathcal{IG} \left(\dfrac{\nu_0}{2}, \dfrac{\nu_0\sigma_0^2}{2}\right)\\ \pi(\tau^2) & = \mathcal{IG} \left(\dfrac{\eta_0}{2}, \dfrac{\eta_0\tau_0^2}{2}\right).\\ \end{split}` $$ ] ] -- with + `\(\mu_0\)`: best guess of average of school averages + `\(\gamma^2_0\)`: set based on plausible ranges of values of `\(\mu\)` -- + `\(\tau_0^2\)`: best guess of variance of school averages + `\(\eta_0\)`: set based on how tight the prior for `\(\tau^2\)` is around `\(\tau_0^2\)` -- + `\(\sigma_0^2\)`: best guess of variance of individual test scores around respective school means + `\(\nu_0\)`: set based on how tight the prior for `\(\sigma^2\)` is around `\(\sigma_0^2\)`. --- ## Exchangeability - This model relies heavily on exchangeability across units at each level. -- - For example, we assume the schools are a random sample from the population of all schools, and the students within schools are a random sample of all the students in each school. -- - This is not always completely true. -- - Note: we can allow the variance to vary across schools if desired (and in fact we will soon). --- ## Exchangeability - It turns out that **conditional exchangeability** would be enough if we control for relevant variables in our modeling. -- - For example, the schools in Chapel Hill/Carrboro are not entirely exchangeable. -- - For instance, Phoenix Academy is for students on long-term out-of-school suspension or who need to make up work due to extended absences (e.g., pregnancy), and Memorial Hospital School is for children battling serious illnesses. -- - However, if we condition on school type (public, charter, private, special services, home), the schools may then be exchangeable. 
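--- ## Aside: simulating from the model - Before turning to posterior inference, here is a quick R sketch that simulates data from the two-level model above, using made-up values (`\(J = 20\)` schools, `\(n_j = 25\)` students per school, `\(\mu = 50\)`, `\(\tau^2 = 25\)`, `\(\sigma^2 = 100\)`) just to see the within-group versus between-group split in variation.

```r
# simulate data from the hierarchical normal model with made-up parameter values
set.seed(360)
J      <- 20     # hypothetical number of schools
n_j    <- 25     # hypothetical number of students tested per school
mu     <- 50     # grand mean (made up)
tau2   <- 25     # between-school variance (made up)
sigma2 <- 100    # within-school variance (made up)

theta <- rnorm(J, mean = mu, sd = sqrt(tau2))                      # school means
y     <- lapply(theta, function(th) rnorm(n_j, th, sqrt(sigma2)))  # test scores

mean(sapply(y, var))  # average within-school variance: roughly sigma2
var(sapply(y, mean))  # variance of school averages: roughly tau2 + sigma2 / n_j
```

- The average within-school sample variance should be close to `\(\sigma^2\)`, while the variance of the school averages `\(\bar{y}_j\)` should be close to `\(\tau^2 + \sigma^2/n_j\)`; this is the sense in which comparing `\(\tau^2\)` to `\(\sigma^2\)` describes between-group versus within-group variation.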
--- ## Posterior inference - Recall the model is .block[ .small[ $$ `\begin{split} y_{ij} | \theta_j, \sigma^2 & \sim \mathcal{N} \left( \theta_j, \sigma^2\right); \ \ \ i = 1,\ldots, n_j\\ \theta_j | \mu, \tau^2 & \sim \mathcal{N} \left( \mu, \tau^2 \right); \ \ \ j = 1, \ldots, J, \end{split}` $$ ] ] -- - Under our prior specification, we can factor the posterior as follows: .block[ .small[ $$ `\begin{split} \pi(\theta_1, \ldots, \theta_J, \mu, \sigma^2,\tau^2 | Y) & \boldsymbol{\propto} p(y | \theta_1, \ldots, \theta_J, \mu, \sigma^2,\tau^2)\\ & \ \ \ \ \times p(\theta_1, \ldots, \theta_J | \mu, \sigma^2,\tau^2)\\ & \ \ \ \ \times \pi(\mu, \sigma^2,\tau^2)\\ \\ & \boldsymbol{=} p(y | \theta_1, \ldots, \theta_J, \sigma^2 )\\ & \ \ \ \ \times p(\theta_1, \ldots, \theta_J | \mu,\tau^2)\\ & \ \ \ \ \times \pi(\mu) \cdot \pi(\sigma^2) \cdot \pi(\tau^2)\\ \\ & \boldsymbol{=} \left\{ \prod_{j=1}^{J} \prod_{i=1}^{n_j} p(y_{ij} | \theta_j, \sigma^2 ) \right\}\\ & \ \ \ \ \times \left\{ \prod_{j=1}^{J} p(\theta_j | \mu,\tau^2) \right\}\\ & \ \ \ \ \times\pi(\mu) \cdot \pi(\sigma^2) \cdot \pi(\tau^2)\\ \end{split}` $$ ] ] --- ## Full conditional for grand mean - The full conditional distribution of `\(\mu\)` is proportional to the part of the joint posterior `\(\pi(\theta_1, \ldots, \theta_J, \mu, \sigma^2,\tau^2 | Y)\)` that involves `\(\mu\)`. -- - That is, .block[ .small[ $$ `\begin{split} \pi(\mu | \theta_1, \ldots, \theta_J, \sigma^2,\tau^2, Y) & \boldsymbol{\propto} \left\{ \prod_{j=1}^{J} p(\theta_j | \mu,\tau^2) \right\} \cdot \pi(\mu). \end{split}` $$ ] ] -- - This looks like the full conditional distribution from the one-sample normal case, so you can show that .block[ .small[ $$ `\begin{split} \pi(\mu | \theta_1, \ldots, \theta_J, \sigma^2,\tau^2, Y) & = \mathcal{N}\left(\mu_n, \gamma^2_n \right) \ \ \ \ \textrm{where}\\ \\ \gamma^2_n = \dfrac{1}{ \dfrac{J}{\tau^2} + \dfrac{1}{\gamma_0^2} } ; \ \ \ \ \ \ \ \ \mu_n = \gamma^2_n \left[ \dfrac{J}{\tau^2} \bar{\theta} + \dfrac{1}{\gamma_0^2} \mu_0 \right] \end{split}` $$ ] ] and `\(\bar{\theta} = \frac{1}{J} \sum\limits^J_{j=1} \theta_j\)`. --- ## Full conditionals for group means - Similarly, the full conditional distribution of each `\(\theta_j\)` is proportional to the part of the joint posterior `\(\pi(\theta_1, \ldots, \theta_J, \mu, \sigma^2,\tau^2 | Y)\)` that involves `\(\theta_j\)`. -- - That is, .block[ .small[ $$ `\begin{split} \pi(\theta_j | \mu, \sigma^2,\tau^2, Y) & \boldsymbol{\propto} \left\{ \prod_{i=1}^{n_j} p(y_{ij} | \theta_j, \sigma^2 ) \right\} \cdot p(\theta_j | \mu,\tau^2) \\ \end{split}` $$ ] ] -- - Those terms include a normal for `\(\theta_j\)` multiplied by a product of normals in which `\(\theta_j\)` is the mean, again mirroring the one-sample case, so you can show that .block[ .small[ $$ `\begin{split} \pi(\theta_j | \mu, \sigma^2,\tau^2, Y) & = \mathcal{N}\left(\theta_j^\star, \nu_j^\star \right) \ \ \ \ \textrm{where}\\ \\ \nu_j^\star & = \dfrac{1}{ \dfrac{n_j}{\sigma^2} + \dfrac{1}{\tau^2} } ; \ \ \ \ \ \ \ \theta_j^\star = \nu_j^\star \left[ \dfrac{n_j}{\sigma^2} \bar{y}_j + \dfrac{1}{\tau^2} \mu \right] \end{split}` $$ ] ] --- ## Full conditionals for group means - Our estimate for each `\(\theta_j\)` is a weighted average of `\(\bar{y}_j\)` and `\(\mu\)`, ensuring that we are borrowing information across all groups through `\(\mu\)` and `\(\tau^2\)`. -- - The weights in the weighted average are determined by the relative precisions from the data and from the second-level model. 
-- - Groups with smaller `\(n_j\)` have estimated `\(\theta_j^\star\)` closer to `\(\mu\)` than groups with larger `\(n_j\)`. -- - Thus, the degree of shrinkage of `\(\theta_j\)` depends on the ratio of the within-group variance to the between-group variance; specifically, the weight on `\(\mu\)` is `\(\dfrac{\sigma^2/n_j}{\sigma^2/n_j + \tau^2}\)`. --- ## Full conditional for across-group variance - The full conditional distribution of `\(\tau^2\)` is proportional to the part of the joint posterior `\(\pi(\theta_1, \ldots, \theta_J, \mu, \sigma^2,\tau^2 | Y)\)` that involves `\(\tau^2\)`. -- - That is, .block[ .small[ $$ `\begin{split} \pi(\tau^2 | \theta_1, \ldots, \theta_J, \mu, \sigma^2, Y) & \boldsymbol{\propto} \left\{ \prod_{j=1}^{J} p(\theta_j | \mu,\tau^2) \right\} \cdot \pi(\tau^2)\\ \end{split}` $$ ] ] -- - As in the case for `\(\mu\)`, this looks like the one-sample normal problem, and our full conditional posterior is .block[ .small[ $$ `\begin{split} \pi(\tau^2 | \theta_1, \ldots, \theta_J, \mu, \sigma^2, Y) & = \mathcal{IG} \left(\dfrac{\eta_n}{2}, \dfrac{\eta_n\tau_n^2}{2}\right) \ \ \ \ \textrm{where}\\ \\ \eta_n = \eta_0 + J ; \ \ \ \ \ \ \ \tau_n^2 & = \dfrac{1}{\eta_n} \left[\eta_0\tau_0^2 + \sum\limits_{j=1}^{J} (\theta_j - \mu)^2 \right].\\ \end{split}` $$ ] ] --- ## Full conditional for within-group variance - Finally, the full conditional distribution of `\(\sigma^2\)` is proportional to the part of the joint posterior `\(\pi(\theta_1, \ldots, \theta_J, \mu, \sigma^2,\tau^2 | Y)\)` that involves `\(\sigma^2\)`. -- - That is, .block[ .small[ $$ `\begin{split} \pi(\sigma^2 | \theta_1, \ldots, \theta_J, \mu, \tau^2, Y) & \boldsymbol{\propto} \left\{ \prod_{j=1}^{J} \prod_{i=1}^{n_j} p(y_{ij} | \theta_j, \sigma^2 ) \right\} \cdot \pi(\sigma^2)\\ \end{split}` $$ ] ] -- - We can again take advantage of the one-sample normal problem, so that our full conditional posterior is .block[ .small[ $$ `\begin{split} \pi(\sigma^2 | \theta_1, \ldots, \theta_J, \mu, \tau^2, Y) & = \mathcal{IG} \left(\dfrac{\nu_n}{2}, \dfrac{\nu_n\sigma_n^2}{2}\right) \ \ \ \ \textrm{where}\\ \\ \nu_n = \nu_0 + \sum\limits_{j=1}^{J} n_j ; \ \ \ \ \ \ \ \sigma_n^2 & = \dfrac{1}{\nu_n} \left[\nu_0\sigma_0^2 + \sum\limits_{j=1}^{J}\sum\limits_{i=1}^{n_j} (y_{ij} - \theta_j)^2 \right].\\ \end{split}` $$ ] ] -- - All four full conditionals (for the `\(\theta_j\)`'s, `\(\mu\)`, `\(\tau^2\)`, and `\(\sigma^2\)`) are standard distributions; a Gibbs sampler that cycles through them is sketched in the appendix slide at the end of this deck. --- class: center, middle # What's next? ### Move on to the readings for the next module!
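--- ## Appendix: Gibbs sampling sketch - Each full conditional derived in this module is a standard distribution, so one way to sample from the joint posterior is a Gibbs sampler that cycles through them. Below is a minimal R sketch; the function name `gibbs_hnm` and the data format (a list `y` with `y[[j]]` holding the scores for school `\(j\)`) are just illustrative choices.

```r
# Gibbs sampler sketch for the hierarchical normal model with constant variance.
# y: list of length J with y[[j]] the vector of scores for group j (illustrative format).
# Hyperparameters match the priors above: mu0, g20 = gamma_0^2, eta0, t20 = tau_0^2,
# nu0, s20 = sigma_0^2. S is the number of Gibbs iterations.
gibbs_hnm <- function(y, mu0, g20, eta0, t20, nu0, s20, S = 5000) {
  J    <- length(y)
  n    <- sapply(y, length)
  ybar <- sapply(y, mean)

  # starting values based on the data
  theta  <- ybar
  mu     <- mean(ybar)
  tau2   <- var(ybar)
  sigma2 <- mean(sapply(y, var))

  out <- matrix(NA, nrow = S, ncol = J + 3)
  colnames(out) <- c(paste0("theta", 1:J), "mu", "sigma2", "tau2")

  for (s in 1:S) {
    # theta_j | rest ~ N(theta_j_star, nu_j_star), vectorized over j
    nu_star    <- 1 / (n / sigma2 + 1 / tau2)
    theta_star <- nu_star * (n * ybar / sigma2 + mu / tau2)
    theta      <- rnorm(J, theta_star, sqrt(nu_star))

    # mu | rest ~ N(mu_n, gamma2_n)
    gamma2_n <- 1 / (J / tau2 + 1 / g20)
    mu_n     <- gamma2_n * (J * mean(theta) / tau2 + mu0 / g20)
    mu       <- rnorm(1, mu_n, sqrt(gamma2_n))

    # tau2 | rest ~ IG(eta_n / 2, (eta0 * t20 + sum_j (theta_j - mu)^2) / 2)
    eta_n <- eta0 + J
    tau2  <- 1 / rgamma(1, eta_n / 2, (eta0 * t20 + sum((theta - mu)^2)) / 2)

    # sigma2 | rest ~ IG(nu_n / 2, (nu0 * s20 + sum_j sum_i (y_ij - theta_j)^2) / 2)
    nu_n   <- nu0 + sum(n)
    ss     <- sum(sapply(1:J, function(j) sum((y[[j]] - theta[j])^2)))
    sigma2 <- 1 / rgamma(1, nu_n / 2, (nu0 * s20 + ss) / 2)

    out[s, ] <- c(theta, mu, sigma2, tau2)
  }
  out
}
```

- For example, `post <- gibbs_hnm(y, mu0 = 50, g20 = 25, eta0 = 1, t20 = 100, nu0 = 1, s20 = 100)` (the hyperparameter values here are placeholders) returns `\(S\)` draws of each `\(\theta_j\)`, `\(\mu\)`, `\(\sigma^2\)`, and `\(\tau^2\)`, which we can summarize with posterior means and quantiles.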