Lab 10: Metropolis-Hastings

STA 602L: Bayesian and Modern Statistics

Due: 11:59pm, Sunday, June 20

Housekeeping

R/RStudio

You all should have R and RStudio installed on your computers by now. If you do not, first install the latest version of R here: https://cran.rstudio.com (remember to select the right installer for your operating system). Next, install the latest version of RStudio here: https://www.rstudio.com/products/rstudio/download/. Scroll down to the “Installers for Supported Platforms” section and find the right installer for your operating system.

R Markdown

You are required to use R Markdown to type up this lab report. If you do not already know how to use R markdown, here is a very basic R Markdown template: https://sta-360-602l-su21.github.io/Course-Website/labs/resources/LabReport.Rmd. Refer to the resources tab of the course website (here: https://sta-360-602l-su21.github.io/Course-Website/resources/ ) for links to help you learn how to use R markdown.

Gradescope

You MUST submit both your .Rmd and .pdf files to the course site on Gradescope here: https://www.gradescope.com/courses/190490/assignments. Make sure to knit to pdf and not html; ask the TA about knitting to pdf if you cannot figure it out. Be sure to submit under the right assignment entry.

Getting started

You will need the following R packages. If you do not already have them installed, please do so first using the install.packages function.

require(tidyverse)
require(rstanarm)
require(magrittr)
require(rstan)
require(bayesplot)
require(loo)
require(readxl)
require(plyr)
require(ggrepel)
library(cowplot)
require(ggtern)
library(caret)
library(e1071)

Metropolis-Hastings (Recap)

Here is the general idea for Metropolis-Hastings: suppose we have a target density function \(p_0(x)\) and a proposal distribution \(g_x(x^* | x^{(s)})\), both defined on a random variable \(X\). To produce draws from \(p_0(x)\), the update steps of the Metropolis-Hastings algorithm can be written as:

  1. Sample a candidate \(x^* \sim g_x(x^* | x^{(s)})\).
  2. Compute the acceptance ratio \[ r = \frac{p_0(x^*) \, g_x(x^{(s)} | x^*)}{p_0(x^{(s)}) \, g_x(x^* | x^{(s)})}. \]
  3. With probability \(\min(1, r)\), set \(x^{(s+1)} = x^*\); otherwise set \(x^{(s+1)} = x^{(s)}\).

The Metropolis-Hastings algorithm gives us a way to sample from the posterior distribution of random variables. Its power for posterior inference is that we only need to be able to evaluate the likelihood and prior density functions; the normalizing constant of the posterior cancels in the acceptance ratio and never needs to be computed. Candidate samples are only ever generated from the proposal density \(g_x(x^* | x^{(s)})\), which often has a simple form.
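To make the recipe concrete, here is a minimal random-walk Metropolis sketch (an illustration only, not part of the lab code) that targets a standard normal density. The variable names are chosen just for this demo; because the normal proposal is symmetric, the \(g_x\) terms cancel in the acceptance ratio.

set.seed(1)
n_iter  <- 5000
x_draws <- numeric(n_iter)                             # chain starts at 0
for(s in 2:n_iter){
  x_star <- rnorm(1, mean = x_draws[s - 1], sd = 1)    # propose a candidate
  log_r  <- dnorm(x_star, log = TRUE) - dnorm(x_draws[s - 1], log = TRUE)
  x_draws[s] <- if(log(runif(1)) < log_r) x_star else x_draws[s - 1]
}
# hist(x_draws)  # the draws should resemble a standard normal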

Logistic regression

Logistic regression models a binary outcome as a Bernoulli random variable \(Y\) whose success probability is a transformation of a linear combination of predictors \(X\). Each observation is modeled as

\[ y_i \sim \text{Bernoulli} \left( \frac{e^{ \boldsymbol{x}_i^T \boldsymbol{\beta}}}{1 + e^{\boldsymbol{x}_i^T \boldsymbol{\beta}}} \right) \]

so that the log-odds of success are given by

\[ \log \left( \frac{\text{Pr}(y_i = 1)}{1 - \text{Pr}(y_i = 1)}\right) = \boldsymbol{x}_i^T\boldsymbol{\beta} \]

Alternative ways of writing the model are: \[ \begin{split} \Pr[y_i = 1 | \boldsymbol{x}_i] = \pi_i \ \ \textrm{and} \ \ \Pr[y_i = 0 | \boldsymbol{x}_i] = 1-\pi_i; \ \ \ & \textrm{log}\left(\dfrac{\pi_i}{1-\pi_i}\right) = \boldsymbol{x}_i^T\boldsymbol{\beta}, \\ \textrm{OR } \ \ \ y_i | \boldsymbol{x}_i \sim \textrm{Bernoulli}(\pi_i); \ \ \ & \textrm{log}\left(\dfrac{\pi_i}{1-\pi_i}\right) = \boldsymbol{x}_i^T\boldsymbol{\beta}. \end{split} \]
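As a small numerical illustration (hypothetical numbers, not from any dataset in this lab), the inverse-logit transformation maps a linear predictor \(\boldsymbol{x}_i^T\boldsymbol{\beta}\) to a probability in \((0, 1)\):

x_i      <- c(1, 2.0, -0.5)               # hypothetical predictors (first entry is an intercept)
beta_vec <- c(-0.3, 0.8, 1.1)             # hypothetical coefficients
eta      <- sum(x_i * beta_vec)           # log-odds, x_i^T beta = 0.75
pi_i     <- exp(eta) / (1 + exp(eta))     # success probability, approximately 0.68; same as plogis(eta)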

Though the logistic regression model is easy to write down, Bayesian inference for the regression coefficients \(\boldsymbol{\beta}\) is comparatively difficult: there is no closed form for the full conditional density function \(p(\boldsymbol{\beta} | Y)\).


  1. Suppose you specify a multivariate normal prior for \(\boldsymbol{\beta}\). Try to find the full conditional density function \(p(\boldsymbol{\beta} | Y)\). Can you sample from it directly? You don’t need to simplify completely.

Multiclass logistic regression

The logistic regression model can be extended to classification of \(K\) classes. Suppose that \(Y\) is a random variable that can take on \(K\) possible categorical values. Then the sampling model for one observation \(y_i\) may be written as

\[ y_i \sim \text{Multinomial} \left( \frac{e^{ \boldsymbol{x}_i^T \boldsymbol{\beta}_1 }}{\sum_{k=1}^K e^{\boldsymbol{x}_i^T \boldsymbol{\beta}_k}}, \dots, \frac{e^{ \boldsymbol{x}_i^T \boldsymbol{\beta}_K }}{\sum_{k=1}^K e^{\boldsymbol{x}_i^T \boldsymbol{\beta}_k}} \right) \]

Here, we have \(K\) coefficient vectors \(\boldsymbol{\beta}_1, \dots, \boldsymbol{\beta}_K\), each corresponding to the linear combination of predictor variables that best fits the \(k^\text{th}\) category. Since the probabilities \(e^{ \boldsymbol{x}_i^T \boldsymbol{\beta}_k } / \sum_{k=1}^K e^{\boldsymbol{x}_i^T \boldsymbol{\beta}_k}\) must sum to \(1\), we only need to fit \(K-1\) coefficient vectors. In practice we set one of the coefficient vectors, say \(\boldsymbol{\beta}_K\), to zero and define the other coefficient vectors as contrasts to this baseline.
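For example (hypothetical numbers again), with \(K = 3\) categories and \(\boldsymbol{\beta}_3\) fixed at zero as the baseline, the category probabilities for a single observation are the softmax of the three linear predictors:

x_i     <- c(1, 0.5, -1)                     # hypothetical predictor vector
beta_1  <- c(0.2, -0.3, 0.1)                 # hypothetical coefficients for category 1
beta_2  <- c(-0.1, 0.4, 0.2)                 # hypothetical coefficients for category 2
beta_3  <- rep(0, 3)                         # baseline category: coefficients set to zero
linpred <- c(sum(x_i*beta_1), sum(x_i*beta_2), sum(x_i*beta_3))
probs   <- exp(linpred) / sum(exp(linpred))  # softmax probabilities
sum(probs)                                   # equals 1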

What kinds of problems require a multiclass model? Consider the following dataset from base R, called iris. From the description in the R help pages:

This famous (Fisher’s or Anderson’s) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.

Our goal will be to model the flower species as a function of the predictors sepal length, sepal width, petal length, and petal width. To illustrate the effect of different shrinkage priors, we will also add \(25\) predictors that are independent of the flower species.

# Generate 25 noise predictors that are independent of flower species
n_noise_features <- 25
noise_features <- rnorm(n_noise_features*nrow(iris)) %>% 
                  matrix(nrow = nrow(iris)) %>% 
                  magrittr::set_colnames(paste0("V", 1:n_noise_features)) %>%
                  data.frame()
# Append the noise predictors to the iris data; X holds all predictors and
# y holds the species coded as integers 1 (setosa), 2 (versicolor), 3 (virginica)
modified_iris <- data.frame(noise_features, iris)
X <- modified_iris[, 1:(ncol(modified_iris)-1)] %>% as.matrix()
y <- as.integer(as.factor(modified_iris[, ncol(modified_iris)]))
#head(X)
#head(y)

Inference with Metropolis-Hastings

If we place independent normal priors on our coefficient vectors, the full model is

\[ \begin{aligned} y_i &\sim \text{Multinomial} \left( \frac{e^{ \boldsymbol{x}_i^T \boldsymbol{\beta}_1 }}{\sum_{k=1}^K e^{\boldsymbol{x}_i^T \boldsymbol{\beta}_k}}, \dots, \frac{e^{ \boldsymbol{x}_i^T \boldsymbol{\beta}_K }}{\sum_{k=1}^K e^{\boldsymbol{x}_i^T \boldsymbol{\beta}_k}} \right) \\ \boldsymbol{\beta}_k &\overset{iid}\sim N(0, 0.25 I) \end{aligned} \]


  2. What is the sampling density function \(p(y_i | \boldsymbol{\beta}_1, \dots, \boldsymbol{\beta}_K)\) for this model? Write it down.
  3. What is the likelihood function for \(y_1, \ldots, y_n\)? Write it down.

To run Metropolis-Hastings, we need to choose a proposal density from which to draw candidate posterior samples. Here, we will use the following density:

\[ \boldsymbol{\beta}_k^* | \boldsymbol{\beta}_k^{(s)} \sim N(\boldsymbol{\beta}_k^{(s)}, \delta^2 I), \] with \(\delta = 0.05\).

Below, we define functions that return the three ingredients of an M-H step: (1) the proposal density, (2) the prior density, and (3) the likelihood density. In order to avoid numerical issues, we will first work in terms of the logarithm of the probability densities and exponentiate them later.

proposal_sd <- 0.05
#
# Log density of the random-walk proposal, a normal centered at the current
# coefficients. Category 1 is the baseline (coefficients fixed at 0), so the
# loop starts at k = 2.
proposal_lpdf <- function(beta, beta0){
  K <- length(beta)
  lpdf <- 0
  for(k in 2:K){
    lpdf <- lpdf + sum(dnorm(beta[[k]], mean = beta0[[k]], sd = proposal_sd, log = T))
  }
  return(lpdf)
}
#
# Log density of the independent N(0, 0.25 I) prior (sd = 0.5) on each
# non-baseline coefficient vector
prior_lpdf <- function(beta){
  K <- length(beta)
  lpdf <- 0
  for(k in 2:K){
    lpdf <- lpdf + sum(dnorm(beta[[k]], sd = 0.5, log = T))
  }
  return(lpdf)
}
#
# Log likelihood of the multinomial logistic model: for each observation with
# y_i = k, add the log of the softmax probability of category k
likelihood_lpdf <- function(y, X, beta){
  K <- length(beta)
  lpdf <- 0
  for(k in 1:K){
    lpdf <- lpdf + sum(sapply(which(y == k), function(i){
      sum(X[i, ]*beta[[k]]) - log(sum(sapply(1:K, function(b){
          exp(sum(X[i, ]*beta[[b]]))
        })))
      })
    )
  }
  return(lpdf)
}

The code below runs a Markov chain whose draws approximate \(p(\boldsymbol{\beta}_k | Y)\). It discards the first \(5000\) iterations as burn-in and stores the last \(5000\) draws.

S <- 10000
burn <- 5000
# Initial state: category 1 ("a") is the baseline with coefficients fixed at 0
beta0 <- list(a = 0, b = rnorm(ncol(X)), c = rnorm(ncol(X)))
beta_list <- list(a = rep(0, S - burn), 
                  b = matrix(NA, nrow = S - burn, ncol = ncol(X)),
                  c = matrix(NA, nrow = S - burn, ncol = ncol(X)))
for(s in 1:S){
  # Propose candidate coefficients from the random-walk proposal
  beta_star <- list(a = 0,
                    b = rnorm(length(beta0[[2]]), mean = beta0[[2]], sd = proposal_sd),
                    c = rnorm(length(beta0[[3]]), mean = beta0[[3]], sd = proposal_sd))
  # Metropolis-Hastings acceptance ratio, computed on the log scale and then
  # exponentiated
  r <- exp((likelihood_lpdf(y, X, beta_star) + prior_lpdf(beta_star) + proposal_lpdf(beta0, beta_star)) -
           (likelihood_lpdf(y, X, beta0) + prior_lpdf(beta0) + proposal_lpdf(beta_star, beta0)))
  # Accept the candidate with probability min(1, r); store draws after burn-in
  if(runif(1) < r){
    if(s > burn){
      beta_list[["a"]][s - burn] <- beta_star[["a"]]
      beta_list[["b"]][s - burn, ] <- beta_star[["b"]]
      beta_list[["c"]][s - burn, ] <- beta_star[["c"]]
    }
    beta0 <- beta_star
  } else {
    if(s > burn){
      beta_list[["a"]][s - burn] <- beta0[["a"]]
      beta_list[["b"]][s - burn, ] <- beta0[["b"]]
      beta_list[["c"]][s - burn, ] <- beta0[["c"]]
    }
  }
}

Let’s look at some traceplots of our posterior samples to see how well the Markov chain is mixing:

par(mfrow = c(3, 2))
for(j in sample(ncol(beta_list$b), 6)){
  plot(beta_list$b[, j], type = 'l', ylab = paste("Beta", j), xlab = "Iter", col = "purple",
       main = paste("Acceptance ratio =", round(length(unique(beta_list$b[,j])) / nrow(beta_list$b), 3)))
  acf(beta_list$b[, j], ylab = paste("Beta", j), main = paste("Series Beta", j))
}


  4. Based on the traceplots, do you think the chain has converged?
  5. Our hope was to have independent samples from the posterior. Is that the case here?

Play around with a few values of \(\delta\). Pick a \(\delta\) value that gets the acceptance ratios somewhere between 0.40 and 0.45. Make new traceplots and autocorrelation plots for your new \(\delta\) value.
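One way to organize this experiment is sketched below (a sketch only, reusing the objects y, X, likelihood_lpdf, and prior_lpdf defined above; the proposal terms are omitted because the random-walk proposal is symmetric, so they cancel). It returns the overall acceptance rate for a given \(\delta\):

mh_acceptance_rate <- function(delta, n_iter = 2000){
  beta_cur <- list(a = 0, b = rnorm(ncol(X)), c = rnorm(ncol(X)))
  accepts  <- 0
  for(s in 1:n_iter){
    # Propose candidates using the supplied delta as the proposal sd
    beta_star <- list(a = 0,
                      b = rnorm(ncol(X), mean = beta_cur[[2]], sd = delta),
                      c = rnorm(ncol(X), mean = beta_cur[[3]], sd = delta))
    log_r <- (likelihood_lpdf(y, X, beta_star) + prior_lpdf(beta_star)) -
             (likelihood_lpdf(y, X, beta_cur) + prior_lpdf(beta_cur))
    if(log(runif(1)) < log_r){
      beta_cur <- beta_star
      accepts  <- accepts + 1
    }
  }
  accepts / n_iter
}
# For example: sapply(c(0.01, 0.05, 0.10), mh_acceptance_rate)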


  6. Based on the new traceplots, do you think the chain has converged?
  7. What can you say about independence?

Change the \(\delta\) value back to 0.05 and rerun the sampler. We can also try (1) changing the proposal density entirely, (2) running the chain for more iterations, and/or (3) subsampling or thinning our posterior samples. Let’s thin our chain by a factor of \(10\) and see whether the resulting draws show more independence.

thin_seq <- seq(1, nrow(beta_list$b), by = 10)
par(mfrow = c(3, 2))
for(j in sample(ncol(beta_list$b), 6)){
  plot(beta_list$b[thin_seq, j], type = 'l', ylab = paste("Beta", j), xlab = "Iter", col = "purple",
       main = paste("Acceptance ratio =", round(length(unique(beta_list$b[thin_seq,j])) / nrow(beta_list$b[thin_seq, ]), 3)))
  acf(beta_list$b[thin_seq, j], ylab = paste("Beta", j), main = paste("Series Beta", j))
}


  8. Based on the new traceplots, do you think the chain has converged?
  9. What can you say about independence here?

For now, we will work with the thinned samples from our original Metropolis-Hastings algorithm. Below we plot the thinned samples from the posterior distribution of each coefficient. Note that coefficients 26, 27, 28, and 29 correspond to the predictors in the iris dataset, and the other \(25\) coefficients correspond to the independently simulated predictors. How does our inference look?

# Plot one coefficient per species by re-centering the draws: subtract
# (beta_b + beta_c)/2 from each category's coefficients (the baseline
# category "a" has coefficients fixed at 0)
((-beta_list$b[thin_seq, ] - beta_list$c[thin_seq, ])/2) %>%
  reshape2::melt() %>%
  dplyr::mutate(cat = "1") %>%
  rbind(reshape2::melt(beta_list$b[thin_seq, ] + ((-beta_list$b[thin_seq, ] - beta_list$c[thin_seq, ])/2)) %>% dplyr::mutate(cat = "2")) %>%
  rbind(reshape2::melt(beta_list$c[thin_seq, ] + ((-beta_list$b[thin_seq, ] - beta_list$c[thin_seq, ])/2)) %>% dplyr::mutate(cat = "3")) %>%
  ggplot2::ggplot() +
  geom_boxplot(aes(x = as.factor(Var2), y = value, fill = cat))  +
  scale_fill_brewer(palette = "Set1", name = "Species", labels = c("Setosa", "Versicolor", "Virginica")) +
  labs(x = "Coefficient", y = expression(beta))

How well does our model fit the data? To see whether the logistic regression is able to separate the three flower species, we can compute each observation’s predicted class probabilities, averaged over our thinned posterior draws:

# Posterior mean predicted probability of each species for each observation
# (species 1, setosa, is the baseline category)
denom <- (1 + exp(X%*%t(beta_list$b[thin_seq, ])) + exp(X%*%t(beta_list$c[thin_seq, ])))
mod_preds <- data.frame(setosa = rowMeans(1 / denom),
                        versicolor = rowMeans(exp(X%*%t(beta_list$b[thin_seq, ])) / denom),
                        virginica = rowMeans(exp(X%*%t(beta_list$c[thin_seq, ])) / denom))

# Classify each flower as the species with the highest mean predicted
# probability, then cross-tabulate against the true species
predicted_species <- apply(mod_preds, 1, function(x) unique(modified_iris$Species)[which(x==max(x))])
Conf_mat <- confusionMatrix(predicted_species, modified_iris$Species)
Conf_mat$table
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         48          0         0
##   versicolor      2         31        15
##   virginica       0         19        35
Conf_mat$overall["Accuracy"];
## Accuracy 
##     0.76
Conf_mat$byClass[,c("Sensitivity","Specificity")]
##                   Sensitivity Specificity
## Class: setosa            0.96        1.00
## Class: versicolor        0.62        0.83
## Class: virginica         0.70        0.81

Laplace prior

Above, we used a normal prior, and we saw that the coefficients corresponding to the simulated noise predictors were generally small, as desired. What if we used a prior that placed higher density at zero? Would we see different shrinkage effects?

The Laplace (or double-exponential) density has probability density function

\[ p(x | \mu, \sigma) = \frac{1}{2\sigma}e^{-\frac{|x - \mu|}{\sigma}} \]

with shapes for varying values of \(\sigma\) given below

data.frame(sigma = rep(c(1, 2, 4), each = 1000)) %>%
  dplyr::group_by(sigma) %>%
  dplyr::mutate(x = seq(-10, 10, length.out = 1000),
                dens = (1/(2*sigma))*exp(-abs(seq(-10, 10, length.out = 1000))/sigma)) %>%
  dplyr::ungroup() %>%
  ggplot2::ggplot() +
  geom_line(aes(x = x, y = dens, colour = as.factor(sigma), group = as.factor(sigma))) +
  scale_colour_manual(values = c("#74a9cf", "#0570b0", "#023858"), name = expression(sigma))

Consider a modification to our original model in which we place a Laplace prior on our coefficient vectors:

\[ \begin{aligned} y_i &\sim \text{Multinomial} \left( \frac{e^{ \boldsymbol{x}_i^T \boldsymbol{\beta}_1 }}{\sum_{k=1}^K e^{\boldsymbol{x}_i^T \boldsymbol{\beta}_k}}, \dots, \frac{e^{ \boldsymbol{x}_i^T \boldsymbol{\beta}_K }}{\sum_{k=1}^K e^{\boldsymbol{x}_i^T \boldsymbol{\beta}_k}} \right) \\ \boldsymbol{\beta}_k &\overset{iid}\sim L\left(0, \tfrac{0.5}{\sqrt{2}} I\right) \end{aligned} \]

Note that our choice of scale parameter gives the Laplace prior the same variance as the Gaussian prior above. Changing the prior on our coefficients requires only one modification to the M-H code above: a change to the function prior_lpdf.
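To see why this scale matches, recall that a Laplace density with scale \(\sigma\) (as parameterized above) has variance \(2\sigma^2\), so

\[ 2\left(\frac{0.5}{\sqrt{2}}\right)^2 = 2 \cdot \frac{0.25}{2} = 0.25, \]

which equals the variance of the \(N(0, 0.25\, I)\) prior used earlier.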

Using the code chunk below, modify the function prior_lpdf to return the log density of the Laplace prior given above. Follow the syntax of the Gaussian prior_lpdf function.

# Log density of independent Laplace priors with scale b = 0.5/sqrt(2),
# chosen so that the prior variance (2*b^2 = 0.25) matches the Gaussian prior
prior_lpdf <- function(beta){
  K <- length(beta)
  lpdf <- 0
  b <- 0.5 / sqrt(2)
  for(k in 2:K){
    lpdf <- lpdf + sum(-abs(beta[[k]])/b - log(2*b))
  }
  return(lpdf)
}

We will use the same code as before to run Metropolis-Hastings, and the same plotting code as before to display samples from the posterior distribution of our coefficient vectors:

((-beta_list$b[thin_seq, ] - beta_list$c[thin_seq, ])/2) %>%
  reshape2::melt() %>%
  dplyr::mutate(cat = "1") %>%
  rbind(reshape2::melt(beta_list$b[thin_seq, ] + ((-beta_list$b[thin_seq, ] - beta_list$c[thin_seq, ])/2)) %>% dplyr::mutate(cat = "2")) %>%
  rbind(reshape2::melt(beta_list$c[thin_seq, ] + ((-beta_list$b[thin_seq, ] - beta_list$c[thin_seq, ])/2)) %>% dplyr::mutate(cat = "3")) %>%
  ggplot2::ggplot() +
  geom_boxplot(aes(x = as.factor(Var2), y = value, fill = cat))  +
  scale_fill_brewer(palette = "Set1", name = "Species", labels = c("Setosa", "Versicolor", "Virginica")) +
  labs(x = "Coefficient", y = expression(beta))


  10. Compute the accuracy for the new model. How does our inference under the Laplace prior compare to that under the Gaussian prior?

Grading

10 points: 1 point for each question.

Acknowledgement

This lab was created by Jordan Bryan and Becky Tang.