Due: 11:59pm, Friday, May 21
You all should have R and RStudio installed on your computers by now. If you do not, first install the latest version of R here: https://cran.rstudio.com (remember to select the right installer for your operating system). Next, install the latest version of RStudio here: https://www.rstudio.com/products/rstudio/download/. Scroll down to the “Installers for Supported Platforms” section and find the right installer for your operating system.
You are required to use R Markdown to type up this lab report. If you do not already know how to use R markdown, here is a very basic R Markdown template: https://sta-360-602l-su21.github.io/Course-Website/labs/resources/LabReport.Rmd. Refer to the resources tab of the course website (here: https://sta-360-602l-su21.github.io/Course-Website/resources/ ) for links to help you learn how to use R markdown.
You MUST submit both your .Rmd and .pdf files to the course site on Gradescope here: https://www.gradescope.com/courses/190490/assignments. Make sure to knit to pdf and not html; ask the TA about knitting to pdf if you cannot figure it out. Be sure to submit under the right assignment entry.
You will need the following R packages. If you do not already have them installed, please do so first using the install.packages
function.
require(tidyverse)
require(rstanarm)
require(magrittr)
require(rstan)
require(bayesplot)
require(loo)
require(readxl)
For this lab, you will need two stan files lab-03-poisson-simple.stan
and lab-03-poisson-simple.stan
, which you can download here:
Download both and make sure to save them in the same folder as the R script or R markdown file you are working from.
The data we have are counts of deaths per episode of Game of Thrones. Download the data (here: https://sta-360-602l-su21.github.io/Course-Website/labs/GoT_Deaths.xlsx) and save it locally to the same directory as your R markdown file. Once you have downloaded the data file into the SAME folder as your R markdown file, load the data by using the following R code.
<- read_xlsx("GoT_Deaths.xlsx", col_names = T)
GoT head(GoT)
## # A tibble: 6 x 3
## Season Episode Count
## <dbl> <dbl> <dbl>
## 1 1 1 4
## 2 1 2 3
## 3 1 3 0
## 4 1 4 1
## 5 1 5 5
## 6 1 6 4
Because we have count data, it is natural to model the data with a Poisson distribution with parameter \(\lambda\). If we take the empirical mean of the data as our estimate for \(\lambda\), what does the sampling distribution look like?
<- GoT$Count
y <- length(y)
n
# Finish: obtain empirical mean
<- mean(y)
ybar <- rpois(n, ybar)
sim_dat qplot(sim_dat, bins = 20, xlab = "Simulated number of deaths", fill = I("#9ecae1"))
# create data frame for side-by-side histograms
<- rbind(data.frame(y, "observed") %>% rename(count = 1, data = 2),
df data.frame(sim_dat, "simulated") %>% rename(count = 1, data = 2))
ggplot(df, aes(x = count, fill = data))+
geom_histogram(position = "dodge", bins = 20) +
scale_fill_brewer() +
labs(x = "Number of deaths", y = "Count")
Plotting the histogram of the data next to the histogram of the sampled data using \(\hat{\lambda} = \bar{y}\), we see that there are many more 0-valued observations in the observed data than there are in the simulated data. We might suspect that the Poisson model will not be a good fit for the data. For now, let’s proceed using the Poisson model and see how it behaves.
We have learned that the Gamma distribution is conjugate for Poisson data. Here, we have chosen the prior \(\lambda \sim Gamma(10,2)\) based on prior knowledge that Game of Thrones is a show filled with death. We run the model in Stan. We save the posterior sampled values of \(\lambda\) in the variable lambda_draws
.
<- list(y = y, N = n)
stan_dat <- stan("lab-03-poisson-simple.stan", data = stan_dat, refresh = 0, chains = 2)
fit <- as.matrix(fit, pars = "lambda") lambda_draws
mcmc_areas()
to plot the smoothed posterior distribution for \(\lambda\) with a 90% Highest Posterior Density region. Changing the prob
parameter in mcmc_areas()
will change the amount of probability mass that is highlighted. The highlighted region will begin in regions of highest density, and will move towards regions of lower density as prob
gets larger.mcmc_areas(lambda_draws, prob = 0.9)
print(fit, pars = "lambda")
The line in the plot represents the posterior mean, and the shaded area represents 90% of the posterior density. How does the posterior mean for \(\lambda\) compare to the sample mean?
length(lambda_draws)
by n
matrix called y_rep
.We now step through some graphical posterior predictive checks. Here we plot the histogram of the observed counts against several of the posterior predictions. What do you notice?
ppc_hist(y, y_rep[1:8, ], binwidth = 1)
We can compare the density estimate of \(y\) to the density estimates for several (60) of the \(y_rep\)s. What do you notice here?
ppc_dens_overlay(y, y_rep[1:60, ])
In our first histogram of the data, we noticed the high number of 0 counts. Let us calculate the proportion of zeros in \(y\), and compare that to the proportion of zeros in all of the posterior predictive samples.
<- function(x){
prop_zero mean(x == 0)
} prop_zero(y)
## [1] 0.16
ppc_stat(y, y_rep, stat = "prop_zero")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We can plot the means and variances for all of the simulated datasets, along with the observed statistics.
ppc_stat_2d(y, y_rep, stat = c("mean", "var"))
ppc_error_hist(y, y_rep[1:4, ], binwidth = 1) + xlim(-15, 15)
## Warning: Removed 8 rows containing missing values (geom_bar).
How should we account for the high frequency of zeros? As it turns out, there is a lot of literature on how to work with this sort of data. We can fit a zero-inflated Poisson model, where the data is modeled as a mixture of a Poisson distribution with a point mass at zero. Here, we fit a hurdle model: \(y_i = 0\) with probability \(\theta\), and \(y_i > 0\) with probability \(1-\theta\). Conditional on having observed \(y_i\) nonzero, we model \(y_i\) with a Poisson distribution which is truncated from below with a lower bound of 1. For the truncated Poisson, we use the same \(Gamma(10,2)\) prior as in the simple Poisson model above.
poisson-hurdle.stan
. Store the resulting object in a variable called fit2
.Here we compare the posterior densities of \(\lambda\) from the simple Poisson model and the Poisson hurdle model.
# Extract the sampled vlaues for lambda, and store them in a variable called lambda_draws2:
<- as.matrix(fit2, pars = "lambda")
lambda_draws2
# Compare
<- cbind(lambda_fit1 = lambda_draws[, 1],
lambdas lambda_fit2 = lambda_draws2[, 1])
# Shade 90% interval
mcmc_areas(lambdas, prob = 0.9)
Obtaining posterior samples from the hurdle model is more complicated, so the Stan file includes code to draw from the posterior predictive distribution for you. We extract them here, and store the predictive samples in y_rep2
.
<- as.matrix(fit2, pars = "y_rep") y_rep2
Here, we assess the predictive performance of both models with leave-one-out cross-validation (LOOCV). LOOCV holds out a single observation form the sample, fits the model on the remaining observations, uses the held out data point as validation, and calculates the prediction error. This process is repeated \(n\) times, such that each observation has been held out exactly once. We make use of the hand loo() function, and compare the two models:
<- extract_log_lik(fit, merge_chains = FALSE)
log_lik1 <- relative_eff(exp(log_lik1))
r_eff1 <- loo(log_lik1, r_eff = r_eff1)) (loo1
##
## Computed from 2000 by 73 log-likelihood matrix
##
## Estimate SE
## elpd_loo -203.1 14.5
## p_loo 2.4 0.5
## looic 406.2 29.0
## ------
## Monte Carlo SE of elpd_loo is 0.1.
##
## All Pareto k estimates are good (k < 0.5).
## See help('pareto-k-diagnostic') for details.
<- extract_log_lik(fit2, merge_chains = FALSE)
log_lik2 <- relative_eff(exp(log_lik2))
r_eff2 <- loo(log_lik2, r_eff = r_eff2)) (loo2
##
## Computed from 2000 by 73 log-likelihood matrix
##
## Estimate SE
## elpd_loo -183.2 10.8
## p_loo 2.7 0.4
## looic 366.5 21.7
## ------
## Monte Carlo SE of elpd_loo is 0.0.
##
## All Pareto k estimates are good (k < 0.5).
## See help('pareto-k-diagnostic') for details.
compare(loo1, loo2)
## Warning: 'compare' is deprecated.
## Use 'loo_compare' instead.
## See help("Deprecated")
## elpd_diff se
## 19.9 8.5
Some final questions to consider:
10 points: 1 point for each exercise.
This lab was adapted from this tutorial by Jordan Bryan and Becky Tang.