Due: 11:59pm, Monday, June 7
You all should have R and RStudio installed on your computers by now. If you do not, first install the latest version of R here: https://cran.rstudio.com (remember to select the right installer for your operating system). Next, install the latest version of RStudio here: https://www.rstudio.com/products/rstudio/download/. Scroll down to the “Installers for Supported Platforms” section and find the right installer for your operating system.
You are required to use R Markdown to type up this lab report. If you do not already know how to use R markdown, here is a very basic R Markdown template: https://sta-360-602l-su21.github.io/Course-Website/labs/resources/LabReport.Rmd. Refer to the resources tab of the course website (here: https://sta-360-602l-su21.github.io/Course-Website/resources/ ) for links to help you learn how to use R markdown.
You MUST submit both your .Rmd and .pdf files to the course site on Gradescope here: https://www.gradescope.com/courses/190490/assignments. Make sure to knit to pdf and not html; ask the TA about knitting to pdf if you cannot figure it out. Be sure to submit under the right assignment entry.
You will need the following R packages. If you do not already have them installed, please do so first using the install.packages() function.
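For example, you might run something like the following once in the console (not inside your .Rmd, since installation should not re-run every time you knit):

install.packages(c("tidyverse", "rstanarm", "magrittr", "ggplot2",
                   "mlmRev", "tidybayes", "ggstance", "dplyr", "modelr"))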
require(tidyverse)
require(rstanarm)
require(magrittr)
library(ggplot2)
library(mlmRev)
library(tidybayes)
library(ggstance)
library(dplyr)
library(modelr)
If you run into any problems running any of the functions in the rstanarm package, first uninstall both the rstanarm and rstan packages. To be on the safe side, it is sometimes necessary to remove any existing RStan via

remove.packages("rstan")
if (file.exists(".RData")) file.remove(".RData")

Next, restart R. Then, re-install a fresh copy of rstan first, then rstanarm afterwards. Make sure to install version \(\geq\) 2.19.3 for both packages.
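Concretely, after restarting R, the re-installation and version check might look like this:

install.packages("rstan")
install.packages("rstanarm")
packageVersion("rstan")      # both should be >= 2.19.3
packageVersion("rstanarm")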
We have data from a General Certificate of Secondary Education (GCSE) exam, which is an academic qualification exam taken in the UK. There are two components to the exam: a written paper and course work. Scores for both are included in the dataset, along with the school each student attended, the student's unique ID, and the gender of the student. We will work only with the course work variable as our response variable:
data(Gcsemv, package = "mlmRev")
dim(Gcsemv)
## [1] 1905 5
summary(Gcsemv)
## school student gender written course
## 68137 : 104 77 : 14 F:1128 Min. : 1 Min. : 9
## 68411 : 84 83 : 14 M: 777 1st Qu.:37 1st Qu.: 63
## 68107 : 79 53 : 13 Median :46 Median : 76
## 68809 : 73 66 : 13 Mean :46 Mean : 73
## 22520 : 65 27 : 12 3rd Qu.:55 3rd Qu.: 86
## 60457 : 54 110 : 12 Max. :90 Max. :100
## (Other):1446 (Other):1827 NA's :202 NA's :180
# Make Male the reference category and rename variable
Gcsemv$female <- relevel(Gcsemv$gender, "M")

# Use only total score on coursework paper
GCSE <- subset(x = Gcsemv,
               select = c(school, student, female, course))

# Count unique schools and students
m <- length(unique(GCSE$school))
N <- nrow(GCSE)
We will use a slightly different notation from what we used in class. Here, each individual \(i\), with \(i = 1,\ldots, n\), belongs to a certain school \(j\), with \(j = 1,\ldots, m\), and \(m =\) 73. We also have the gender variable, which we can use as a regressor/predictor/covariate/feature for the response. A natural first step may be to model course work with a linear regression model with the gender variable as a covariate. But first, let's start with a bit of exploratory analysis.
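For instance, a quick first look might compare course scores by gender; this is just a sketch using the GCSE data frame created above, not required output:

GCSE %>%
  drop_na(course) %>%                    # drop the missing course scores
  ggplot(aes(x = female, y = course)) +  # M is the reference level
  geom_boxplot()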
For each school, calculate the average of the course scores and plot the distribution of these averages in a histogram. What does this plot reveal?

As one extreme, suppose we did not want to take the different schools into account, and simply fit the linear model \[y_i \sim N(\alpha + x_i^T \beta, \sigma).\]
We call this model a pooled model because it ignores differences between the groups and uses a common set of coefficients. The pooled model estimates a single model, with inference performed on \(\alpha, \beta,\) and \(\sigma\).
The other extreme is an unpooled model where we fit a separate model for each group/school. So for school \(j\), we fit the model \[y_{ij} \sim N(\alpha_{j} + x_i^T \beta_j, \sigma_j).\]
In the unpooled framework, we estimate \(m\) models, but no information is shared between groups. That is, we estimate the coefficients for the first school independently of the coefficients for the second school. Both are fit here:
pooled <- stan_glm(course ~ 1 + female, data = GCSE, refresh = 0)
unpooled <- stan_glm(course ~ -1 + school + female, data = GCSE, refresh = 0)
However, it seems likely that we can improve our models if we can share information about the \(\alpha\) and \(\beta\) between groups. As discussed in class, this naturally leads to a hierarchical framework where we use a prior distribution to encode relationships across the schools. We will explore several multilevel models for this data.
If we let \(Y_{ij}\) be individual \(i\) in school \(j\)’s exam score, we can write the following model:
\[ \begin{align*} &Y_{ij} = \theta_j + \epsilon_{ij}, \quad \epsilon_{ij} \overset{iid} \sim N(0, \sigma^2) \\ &\theta_j = \mu_\theta + \omega_j, \quad \omega_j \overset{iid} \sim N(0, \tau^2) \end{align*} \]
We see that \(\theta_j\) is the school-specific intercept, and \(\mu_\theta\) is the overall mean across the \(m\) schools. We could introduce the covariates into this model as well, but we begin with a simple intercept-only regression:
mod1 <- stan_lmer(formula = course ~ 1 + (1 | school),
                  data = GCSE,
                  seed = 349,
                  refresh = 0)
The stan_lmer() function allows for easy multilevel/hierarchical modeling. Take a look at the help page for stan_lmer() to see the default options. Looking at the formula, we have
\[\text{formula} = \text{course}\sim 1 + (1 \ | \text{ school}).\]
Similar to the syntax for the usual lm(), the variable on the left of the tilde is our response variable. The 1 on the right side of the formula specifies that we would like an intercept, and the (1 | school) term specifies that we would like the intercept to vary by the variable school. We will estimate an overall intercept, and then for each school we will estimate a term that provides an adjustment or deviation from that intercept. We can think of the multilevel model as introducing an additional random effect or component of variation, because we are allowing the intercept to vary across schools.
In the function call, we did not set a prior, which means that we used the default priors for \(\mu_\theta\), \(\sigma^2\), and \(\tau^2\). We can see the priors that were used by running the following code:
prior_summary(object = mod1)
## Priors for model 'mod1'
## ------
## Intercept (after predictors centered)
## Specified prior:
## ~ normal(location = 73, scale = 2.5)
## Adjusted prior:
## ~ normal(location = 73, scale = 41)
##
## Auxiliary (sigma)
## Specified prior:
## ~ exponential(rate = 1)
## Adjusted prior:
## ~ exponential(rate = 0.061)
##
## Covariance
## ~ decov(reg. = 1, conc. = 1, shape = 1, scale = 1)
## ------
## See help('prior_summary.stanreg') for more details
Notice that the priors have been scaled (we could set autoscale = F in the function call to avoid this). Where did the adjusted scale values come from? Consider the observed standard deviation of the course variable:
sd(GCSE$course, na.rm = T)
## [1] 16
Basically, by setting autoscale = T, rstanarm attempts to make the default priors weakly informative by internally adjusting the scales of the priors. When setting your own informative priors, you should disable this option. For more information on this, see http://mc-stan.org/rstanarm/articles/priors.html.
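For example, if we wanted the prior on the intercept to be used exactly as specified, we might write something like the following (mod1_manual is a hypothetical name and the prior values are purely illustrative):

mod1_manual <- stan_lmer(course ~ 1 + (1 | school),
                         data = GCSE,
                         prior_intercept = normal(location = 73, scale = 20,
                                                  autoscale = FALSE),
                         seed = 349,
                         refresh = 0)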
Let’s look at the output of the model fitting. The output contains summaries for \(\mu_\theta\), \(\sigma^2\), and \(\tau^2\). Can you identify which is which? If you cannot, ask the TA.
print(mod1, digits = 3)
## stan_lmer
## family: gaussian [identity]
## formula: course ~ 1 + (1 | school)
## observations: 1725
## ------
## Median MAD_SD
## (Intercept) 73.780 1.124
##
## Auxiliary parameter(s):
## Median MAD_SD
## sigma 13.818 0.240
##
## Error terms:
## Groups Name Std.Dev.
## school (Intercept) 8.888
## Residual 13.821
## Num. levels: school 73
##
## ------
## * For help interpreting the printed output see ?print.stanreg
## * For info on the priors used see ?prior_summary.stanreg
Note: The standard deviations reported (labeled MAD_SD in the print output) are proportional to the median absolute deviation (mad) from the median. Compared to the raw posterior standard deviation, the MAD_SD will be more robust for long-tailed distributions.
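You can verify this from the posterior draws yourself; R's mad() function already rescales the raw median absolute deviation so it is comparable to a standard deviation under normality:

mad(as.matrix(mod1, pars = "(Intercept)"))  # close to the MAD_SD printed above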
We can obtain posterior summaries and credible intervals as follows:
summary(mod1,
pars = c("(Intercept)", "sigma", "Sigma[school:(Intercept),(Intercept)]"),
probs = c(0.025, 0.975),
digits = 3)
##
## Model Info:
## function: stan_lmer
## family: gaussian [identity]
## formula: course ~ 1 + (1 | school)
## algorithm: sampling
## sample: 4000 (posterior sample size)
## priors: see help('prior_summary')
## observations: 1725
## groups: school (73)
##
## Estimates:
## mean sd 2.5% 97.5%
## (Intercept) 73.777 1.136 71.447 75.941
## sigma 13.821 0.241 13.362 14.299
## Sigma[school:(Intercept),(Intercept)] 79.004 16.416 52.086 114.869
##
## MCMC diagnostics
## mcse Rhat n_eff
## (Intercept) 0.041 1.002 756
## sigma 0.003 1.002 5455
## Sigma[school:(Intercept),(Intercept)] 0.671 1.011 598
##
## For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
If we want to create plots, it is helpful to extract the posterior draws. We can extract the draws for each variable by specifying the variable name, as we've seen before. Notice that we use regex_pars, which means that we want to extract all variables whose names match the given regular expression.
mod1_sims <- as.matrix(mod1)
dim(mod1_sims)
## [1] 4000 76
par_names <- colnames(mod1_sims)
head(par_names)
## [1] "(Intercept)" "b[(Intercept) school:20920]"
## [3] "b[(Intercept) school:22520]" "b[(Intercept) school:22710]"
## [5] "b[(Intercept) school:22738]" "b[(Intercept) school:22908]"
tail(par_names)
## [1] "b[(Intercept) school:76631]"
## [2] "b[(Intercept) school:77207]"
## [3] "b[(Intercept) school:84707]"
## [4] "b[(Intercept) school:84772]"
## [5] "sigma"
## [6] "Sigma[school:(Intercept),(Intercept)]"
# obtain draws for mu_theta
mu_theta_sims <- as.matrix(mod1, pars = "(Intercept)")

# obtain draws for each school's contribution to intercept
theta_sims <- as.matrix(mod1,
                        regex_pars = "b\\[\\(Intercept\\) school\\:")

# obtain draws for sigma and tau^2
sig_sims <- as.matrix(mod1, pars = "sigma")
tau2_sims <- as.matrix(mod1,
                       pars = "Sigma[school:(Intercept),(Intercept)]")
The intercept variable is the same for all of the 73 schools (it corresponds to the 1 in the regression formula). The (1 | school) term in the formula is each school's specific difference from the overall intercept. These differences have coefficients named b[(Intercept) school: <school number>]. The following lines of code compute the 73 varying intercepts and store the posterior means and 95% credible intervals for each intercept:
int_sims <- as.numeric(mu_theta_sims) + theta_sims

# posterior mean
int_mean <- apply(int_sims, MARGIN = 2, FUN = mean)

# credible interval
int_ci <- apply(int_sims, MARGIN = 2, FUN = quantile, probs = c(0.025, 0.975))
int_ci <- data.frame(t(int_ci))

# combine into a single df
int_df <- data.frame(int_mean, int_ci)
names(int_df) <- c("post_mean", "Q2.5", "Q97.5")

# sort DF according to posterior mean
int_df <- int_df[order(int_df$post_mean), ]

# create variable "index" to represent order
int_df <- int_df %>% mutate(index = row_number())
# plot posterior means of school-varying intercepts, along with 95% CIs
ggplot(data = int_df, aes(x = index, y = post_mean))+
geom_pointrange(aes(ymin = Q2.5, ymax = Q97.5))+
scale_x_continuous("Index", breaks = seq(0,m, 5)) +
scale_y_continuous(expression(paste("varying intercept ", theta[j])))
Now that we have sampled all the parameters and obtained posterior estimates of the varying intercepts, we may be interested in comparing the intercepts across schools.
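For example, using two of the school IDs we saw in par_names, one possible comparison (a sketch) works directly with the columns of int_sims:

# posterior draws of the difference between schools 20920 and 22520
school_diff <- int_sims[, "b[(Intercept) school:20920]"] -
  int_sims[, "b[(Intercept) school:22520]"]
mean(school_diff)                       # posterior mean difference
quantile(school_diff, c(0.025, 0.975))  # 95% credible interval
mean(school_diff > 0)                   # posterior probability that school 20920 has the higher intercept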
We can add a level of complexity to the model by taking advantage of the covariates provided to us. Let \(x_{ij}\) be the value of the covariate for individual \(i\) in school \(j\). Then the only modification to Model 1 is a change to the sampling model:
\[Y_{ij} \sim N(\theta_j + \beta X_{ij}, \sigma^2),\]
where \(\theta_j\) takes the same form as before. If we allow \(X_{ij}\) to represent whether or not individual \(i\) is female, how would we code this? (The coefficient for the female covariate \(\beta\) will be the same for all schools). This would be reasonable if we think that the effect of gender on our response variable is the same across schools.
See the code below for how to fit this model. Notice in the code that we specify a prior for the \(\mu_\theta\) hyperparameter and for the \(\beta\) coefficient, and that we chose not to autoscale the priors. How informative is our prior for \(\beta\)?
mod2 <- stan_lmer(formula = course ~ 1 + female + (1 | school),
                  data = GCSE,
                  prior = normal(location = 0,
                                 scale = 100,
                                 autoscale = F),
                  prior_intercept = normal(location = 0,
                                           scale = 100,
                                           autoscale = F),
                  seed = 349,
                  refresh = 0)
# plot varying intercepts
mod2.sims <- as.matrix(mod2)
group_int <- mean(mod2.sims[, 1])
mp <- mean(mod2.sims[, 2])
bp <- apply(mod2.sims[, 3:75], 2, mean)
xvals <- seq(0, 1, .01)
plot(x = xvals, y = rep(0, length(xvals)),
     ylim = c(50, 90), xlim = c(-0.1, 1.1), xaxt = "n",
     xlab = "female", ylab = "course")
axis(side = 1, at = c(0, 1))
for (bi in bp){
  lines(xvals, (group_int + bi) + xvals * mp)
}
Now, we allow the coefficient for female to vary across the 73 schools: \[Y_{ij} \sim N(\theta_j + \beta_j X_{ij}, \sigma^2).\]
This would be reasonable if we think that the effect of gender on our response variable might actually differ across schools. When we do not allow group-specific intercepts and slopes, it is common to model the coefficients independently (that is, specify an independent prior for each one). However, if we allow the intercept and slope to vary across the schools, we can model them as dependent in the prior:
\[ \begin{bmatrix}\theta_j \\ \beta_j \end{bmatrix} \sim N\left( \begin{bmatrix} \mu_\theta \\ \mu_\beta \end{bmatrix}, \begin{bmatrix} \sigma^2_\theta & Cov(\theta_j, \beta_j)\\ Cov(\beta_j, \theta_j) & \sigma^2_\beta \end{bmatrix}\right) \]
We will now have to specify a prior for the covariance matrix, which we will call \(\Sigma\). Setting priors for covariance parameters is always a tricky task, and here we will explain the default priors used in stan_lmer(). The method used in stan_lmer() decomposes a covariance matrix into a correlation matrix \(R\) and a matrix of variances \(V\):
\[ \begin{align*} \Sigma &= \begin{bmatrix} \sigma^2_\theta & Cov(\theta_j, \beta_j)\\ Cov(\beta_j, \theta_j) & \sigma^2_\beta \end{bmatrix} \\ &= \begin{bmatrix} \sigma^2_\theta & \rho \sigma_\theta \sigma_\beta \\ \rho \sigma_\theta \sigma_\beta & \sigma^2_\beta \end{bmatrix} \\ &= \sigma^2 \begin{bmatrix} \sigma^2_\theta/\sigma^2 & \rho \sigma_\theta \sigma_\beta/\sigma^2 \\ \rho \sigma_\theta \sigma_\beta/\sigma^2 & \sigma^2_\beta/\sigma^2 \end{bmatrix} \\ &= \sigma^2 \begin{bmatrix} \sigma_\theta/\sigma & 0 \\ 0 & \sigma_\beta/\sigma \end{bmatrix} \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}\begin{bmatrix} \sigma_\theta/\sigma & 0 \\ 0 & \sigma_\beta/\sigma \end{bmatrix}\\ &= \sigma^2 VRV \end{align*} \]
After decomposing the covariance matrix into correlation and variance matrices, the variances are further decomposed into the product of a simplex vector (its elements sum to \(1\)) and the trace of the matrix. An LKJ prior is placed on the correlation matrix, the default being jointly uniform over all correlation matrices of the same dimension as \(R\). A symmetric Dirichlet prior is used on the simplex vector, the default being uniform over the space of simplex vectors of the same size. This all sounds like a lot; don't worry, for now let's just focus on fitting the model. See the priors help page for rstanarm for more information.
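For reference, the default covariance prior shown earlier by prior_summary() could be written out explicitly. This sketch just reproduces the default (mod3_explicit is a hypothetical name; there is no need to run it):

mod3_explicit <- stan_lmer(course ~ 1 + female + (1 + female | school),
                           data = GCSE,
                           prior_covariance = decov(regularization = 1,
                                                    concentration = 1,
                                                    shape = 1,
                                                    scale = 1),
                           seed = 349,
                           refresh = 0)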
Now, to actually fit the model, we specify that we want both the intercept and the slope of the female covariate to vary across schools. The code will take much longer to run because of the extra sampling that goes into the random slopes:
mod3 <- stan_lmer(formula = course ~ 1 + female + (1 + female | school),
                  data = GCSE,
                  seed = 349,
                  refresh = 0)
mod3_sims <- as.matrix(mod3)

# obtain draws for mu_theta
mu_theta_sims <- as.matrix(mod3, pars = "(Intercept)")
fem_sims <- as.matrix(mod3, pars = "femaleF")

# obtain draws for each school's contribution to intercept and slope
theta_sims <- as.matrix(mod3,
                        regex_pars = "b\\[\\(Intercept\\) school\\:")
beta_sims <- as.matrix(mod3,
                       regex_pars = "b\\[femaleF school\\:")

int_sims <- as.numeric(mu_theta_sims) + theta_sims
slope_sims <- as.numeric(fem_sims) + beta_sims
# posterior mean
slope_mean <- apply(slope_sims, MARGIN = 2, FUN = mean)

# credible interval
slope_ci <- apply(slope_sims, MARGIN = 2, FUN = quantile, probs = c(0.025, 0.975))
slope_ci <- data.frame(t(slope_ci))

# combine into a single df
slope_df <- data.frame(slope_mean, slope_ci, levels(GCSE$school))
names(slope_df) <- c("post_mean", "Q2.5", "Q97.5", "school")

# sort DF according to posterior mean
slope_df <- slope_df[order(slope_df$post_mean), ]

# create variable "index" to represent order
slope_df <- slope_df %>% mutate(index = row_number())
# plot posterior means of school-varying slopes, along with 95% CIs
ggplot(data = slope_df, aes(x = index, y = post_mean))+
geom_pointrange(aes(ymin = Q2.5, ymax = Q97.5))+
scale_x_continuous("Index", breaks = seq(1,m, 1),
labels = slope_df$school) +
scale_y_continuous(expression(paste("varying slopes ", beta[j])))+
theme(axis.text.x = element_text(angle = 90))
Now that we’ve fit three different hierarchical models, we will compare them. We can use the compare_models()
as we did in the GLM lab. However, since we are comparing more than 2 models, instead of a difference in expected log predictive density, the functions returns a matrix arranged in descending order according to expected out-of-sample predictive accuracy.
loo1 <- loo(mod1)
loo2 <- loo(mod2)
loo3 <- loo(mod3)
loo_compare(loo1, loo2, loo3)
## elpd_diff se_diff
## mod3 0.0 0.0
## mod2 -29.6 9.9
## mod1 -79.4 15.1
Here, we plot the regression lines for some of the schools using the following models:

+ Pooled (red)
+ Unpooled (blue)
+ Varying intercept, fixed slope (green)
+ Varying intercept, varying slope (orange)
pooled.sim <- as.matrix(pooled)
unpooled.sim <- as.matrix(unpooled)
m1.sim <- as.matrix(mod1)
m2.sim <- as.matrix(mod2)
m3.sim <- as.matrix(mod3)
schools <- unique(GCSE$school)

alpha2 <- mean(m2.sim[, 1])
alpha3 <- mean(m3.sim[, 1])

partial.fem2 <- mean(m2.sim[, 2])
partial.fem3 <- mean(m3.sim[, 2])
unpooled.fem <- mean(unpooled.sim[, 74])
par(mfrow = c(2, 3), mar = c(1, 2, 2, 1))
for (i in 1:18){
  temp <- GCSE %>% filter(school == schools[i]) %>%
    na.omit()
  y <- temp$course
  x <- as.numeric(temp$female) - 1
  plot(x + rnorm(length(x)) * 0.001, y, ylim = c(35, 101),
       xlab = "female", main = schools[i], xaxt = "n", ylab = "course")
  axis(1, c(0, 1), cex.axis = 0.8)

  # no pooling
  b <- mean(unpooled.sim[, i])

  # plot lines and data
  xvals <- seq(-0.1, 1.1, 0.01)
  lines(xvals, xvals * mean(pooled.sim[, 2]) + mean(pooled.sim[, 1]), col = "red")      # pooled
  lines(xvals, xvals * unpooled.fem + b, col = "blue")                                  # unpooled
  lines(xvals, xvals * partial.fem2 + (alpha2 + mean(m2.sim[, i + 2])), col = "green")  # varying intercept
  lines(xvals, xvals * (partial.fem3 + mean(m3.sim[, 2 + i * 2])) +
          (alpha3 + mean(m3.sim[, 1 + i * 2])), col = "orange")  # varying intercept and slope
  legend("bottom", legend = paste("n =", length(y), " "))
}
Based on the loo_compare output above, which model would you recommend?

Now that you have seen a multilevel model implemented with some of the rstanarm functionality, it's your turn to specify one on your own.
We have an additional dataset on radon levels in houses in the state of Minnesota. Specifically, for each house we have the radon measurement on the log scale (log_radon), an indicator for whether the measurement was taken in the basement or on the first floor (floor, where 0 = basement, 1 = first floor), and which of 85 counties the house belongs to (county). We have an additional fourth variable which gives the county-level uranium level (log_uranium).
Download the data (here: https://sta-360-602l-su21.github.io/Course-Website/labs/radon.txt) and save it locally to the same directory as your R markdown file. Once you have downloaded the data file into the SAME folder as your R markdown file, load the data by using the following R code.
<- read.csv("radon.txt", header = T,sep="")
radon $county <- as.factor(radon$county) radon
Do you think a hierarchical model is warranted here? Do some EDA! Look for differences across counties.
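As a possible starting point (a sketch, not required output), you could look at sample sizes and average log radon by county:

# counties vary a lot in sample size and in average level
radon %>%
  group_by(county) %>%
  summarise(n = n(), avg_log_radon = mean(log_radon)) %>%
  arrange(desc(n)) %>%
  head()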
Begin by creating an unpooled model, i.e., a model where each county has a unique intercept and there are no other predictors. Call that model radon.unpooled. Then create a hierarchical/partially-pooled model where we model each county's intercept hierarchically, again with no other predictors. Call that model radon.mod1.
Once you have fit the two models, run the following code. You should see two plots which give slightly wider than 95% credible intervals for the county-level intercepts. What do you notice?
n_county <- as.numeric(table(radon$county))
create_df <- function(sim, model){
  mean <- apply(sim, 2, mean)
  sd <- apply(sim, 2, sd)
  df <- cbind(n_county, mean, sd) %>%
    as.data.frame() %>%
    mutate(se = sd / sqrt(n_county), model = model)
  return(df)
}

unpooled.sim <- as.matrix(radon.unpooled)
unpooled.df <- create_df(unpooled.sim[, 1:85], model = "unpooled")

mod1.sim <- as.matrix(radon.mod1)[, 1:86]
mod1.sim <- (mod1.sim[, 1] + mod1.sim)[, -1]
partial.df <- create_df(mod1.sim, model = "partial")
ggplot(rbind(unpooled.df, partial.df) %>%
         mutate(model = factor(model, levels = c("unpooled", "partial"))),
       aes(x = n_county, y = mean)) +
  # draws the means
  geom_jitter() +
  # draws the CI error bars
  geom_errorbar(aes(ymin = mean - 2*se, ymax = mean + 2*se), width = .1) +
  ylim(0, 3) +
  xlim(0, 60) +
  geom_hline(aes(yintercept = mean(coef(radon.unpooled)))) +
  facet_wrap(~model)
## Warning: Removed 6 rows containing missing values (geom_point).
Fit a varying-intercept model, but now add the variable floor as a fixed slope. Call the model radon.mod2. For another model, radon.mod3, fit a varying-intercept and varying-slope model, again with floor as the predictor.

Lastly, recall that we have a fourth variable which gives the county-level log uranium measurements. A really powerful aspect of hierarchical/multilevel modeling is the ability to incorporate data at different levels of coarseness. Fit a varying-intercept model, but include both floor and log_uranium in your model as well. So now we have both an individual/house-level covariate and a group/county-level covariate. Group-level predictors help reduce group-level variation, which induces stronger pooling effects. Call this model radon.mod4.
Once you have fit these five models, report on the differences in model performance.
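One way to do this mirrors the loo_compare() call from earlier in the lab (a sketch, assuming all five models have been fit with the names above):

loo_compare(loo(radon.unpooled), loo(radon.mod1), loo(radon.mod2),
            loo(radon.mod3), loo(radon.mod4))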
10 points: 3 points for question 8, 1 point for each remaining question.
This lab was adapted from this tutorial by Jordan Bryan and Becky Tang.