Logistic regression

Logistic regression models a binary outcome as a Bernoulli random variable \(Y\) with success probability given by a tranformation of a linear combination of predictors \(X\). Each observation is modeled as

\[ y_i \sim \text{Bernoulli} \left( \frac{e^{ \boldsymbol{x}_i^T \boldsymbol{\beta}}}{1 + e^{\boldsymbol{x}_i^T \boldsymbol{\beta}}} \right) \]

so that the log-odds of success are given by

\[ \log \left( \frac{\text{Pr}(y_i = 1)}{1 - \text{Pr}(y_i = 1)}\right) = \boldsymbol{x}_i^T\boldsymbol{\beta} \]

Alternative ways of writing the model are: \[ \begin{split} \Pr[y_i = 1 | \boldsymbol{x}_i] = \pi_i \ \ \textrm{and} \ \ \Pr[y_i = 0 | \boldsymbol{x}_i] = 1-\pi_i; \ \ \ & \textrm{log}\left(\dfrac{\pi_i}{1-\pi_i}\right) = \boldsymbol{x}_i^T\boldsymbol{\beta}, \\ \textrm{OR } \ \ \ y_i | \boldsymbol{x}_i \sim \textrm{Bernoulli}(\pi_i); \ \ \ & \textrm{log}\left(\dfrac{\pi_i}{1-\pi_i}\right) = \boldsymbol{x}_i^T\boldsymbol{\beta}. \end{split} \]

Though the logistic regression model is easy to write down, the task of doing Bayesian inference on the regression coefficients \(\boldsymbol{\beta}\) is comparatively difficult. The reason is that there is no closed-form for the full-conditional density function \(p(\boldsymbol{\beta} | Y)\).

Suppose you specify a multivariate normal prior for \(\boldsymbol{\beta}\). Try to find the full conditional density function \(p(\boldsymbol{\beta} | Y)\). Can you sample from it directly? You don’t need to simplify completely.

Multiclass logistic regression

The logistic regression model can be extended to classification of \(K\) classes. Suppose that \(Y\) is a random variable that can take on \(K\) possible categorical values. Then the sampling model for one observation \(y_i\) may be written as

\[ y_i \sim \text{Multinomial} \left( \frac{e^{ \boldsymbol{x}_i^T \boldsymbol{\beta}_1 }}{\sum_{k=1}^K e^{\boldsymbol{x}_i^T \boldsymbol{\beta}_k}}, \dots, \frac{e^{ \boldsymbol{x}_i^T \boldsymbol{\beta}_K }}{\sum_{k=1}^K e^{\boldsymbol{x}_i^T \boldsymbol{\beta}_k}} \right) \]

Here, we have \(K\) coefficient vectors \(\boldsymbol{\beta}_1, \dots, \boldsymbol{\beta}_K\), each corresponding to the linear combination of predictor variables that best fits the \(k^\text{th}\) category. Since the probabilities \(e^{ \boldsymbol{x}_i^T \boldsymbol{\beta}_k } / \sum_{k=1}^K e^{\boldsymbol{x}_i^T \boldsymbol{\beta}_k}\) must sum to \(1\), we only need to fit \(K-1\) coefficient vectors. In practice we set one of the coefficent vectors, say \(\boldsymbol{\beta}_K\) to zero, and define the other coefficient vectors as contrasts to this baseline.

What kinds of problems require a multiclass model? Consider the following dataset from base R, called iris. From the description in the R help pages:

This famous (Fisher’s or Anderson’s) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.

Our goal will be to model the flower species as a function of the predictors sepal length, sepal width, petal length, and petal width. To illustrate the effect of different shrinkage priors, we will also add \(25\) predictors that are independent of the flower species.

n_noise_features <- 25
noise_features <- rnorm(n_noise_features*nrow(iris)) %>% 
                  matrix(nrow = nrow(iris)) %>% 
                  magrittr::set_colnames(paste0("V", 1:n_noise_features)) %>%
                  data.frame()
modified_iris <- data.frame(noise_features, iris)
X <- modified_iris[, 1:(ncol(modified_iris)-1)] %>% as.matrix()
y <- as.integer(as.factor(modified_iris[, ncol(modified_iris)]))
#head(X)
#head(y)

Inference with Metropolis Hastings

If we place independent normal priors on our coefficient vectors, the full model is

\[ \begin{aligned} y_i &\sim \text{Multinomial} \left( \frac{e^{ \boldsymbol{x}_i^T \boldsymbol{\beta}_1 }}{\sum_{k=1}^K e^{\boldsymbol{x}_i^T \boldsymbol{\beta}_k}}, \dots, \frac{e^{ \boldsymbol{x}_i^T \boldsymbol{\beta}_K }}{\sum_{k=1}^K e^{\boldsymbol{x}_i^T \boldsymbol{\beta}_k}} \right) \\ \boldsymbol{\beta}_k &\overset{iid}\sim N(0, 0.25 I) \end{aligned} \]

What is the sampling density function \(p(y_i | \boldsymbol{\beta}_1, \dots, \boldsymbol{\beta}_K)\) for this model? Write it down.
What is the likelihood function for \(y_1, \ldots, y_n\)? Write it down.

To run Metropolis-Hastings, we need to choose a proposal density from which to draw candidate posterior samples. Here, we will use the following density:

\[ \boldsymbol{\beta}_k^* | \boldsymbol{\beta}_k^{(s)} \sim N(\boldsymbol{\beta}_k^{(s)}, \delta^2 I), \] with \(\delta = 0.05\).

Below, we define functions that return the three ingredients of an M-H step: (1) the proposal density, (2) the prior density, and (3) the likelihood density. In order to avoid numerical issues, we will first work in terms of the logarithm of the probability densities and exponentiate them later.

proposal_sd <- 0.05
#
proposal_lpdf <- function(beta, beta0){
  K <- length(beta)
  lpdf <- 0
  for(k in 2:K){
    lpdf <- lpdf + sum(dnorm(beta[[k]], mean = beta0[[k]], sd = proposal_sd, log = T))
  }
  return(lpdf)
}
#
prior_lpdf <- function(beta){
  K <- length(beta)
  lpdf <- 0
  for(k in 2:K){
    lpdf <- lpdf + sum(dnorm(beta[[k]], sd = 0.5, log = T))
  }
  return(lpdf)
}
#
likelihood_lpdf <- function(y, X, beta){
  K <- length(beta)
  lpdf <- 0
  for(k in 1:K){
    lpdf <- lpdf + sum(sapply(which(y == k), function(i){
      sum(X[i, ]*beta[[k]]) - log(sum(sapply(1:K, function(b){
          exp(sum(X[i, ]*beta[[b]]))
        })))
      })
    )
  }
  return(lpdf)
}

The code below produces a Markov Chain that produces draws from \(p(\boldsymbol{\beta}_k | Y)\). It allows \(5000\) iterations of the chain to pass before storing the last \(5000\) draws.

S <- 10000
burn <- 5000
beta0 <- list(a = 0, b = rnorm(ncol(X)), c = rnorm(ncol(X)))
beta_list <- list(a = rep(0, S - burn), 
                  b = matrix(NA, nrow = S - burn, ncol = ncol(X)),
                  c = matrix(NA, nrow = S - burn, ncol = ncol(X)))
for(s in 1:S){
  #
  beta_star <- list(a = 0,
                    b = rnorm(length(beta0[[2]]), mean = beta0[[2]], sd = proposal_sd),
                    c = rnorm(length(beta0[[3]]), mean = beta0[[3]], sd = proposal_sd))
  #
  r <- exp((likelihood_lpdf(y, X, beta_star) + prior_lpdf(beta_star) + proposal_lpdf(beta0, beta_star)) -
           (likelihood_lpdf(y, X, beta0) + prior_lpdf(beta0) + proposal_lpdf(beta_star, beta0)))
  #
  if(runif(1) < r){
    if(s > burn){
      beta_list[["a"]][s - burn] <- beta_star[["a"]]
      beta_list[["b"]][s - burn, ] <- beta_star[["b"]]
      beta_list[["c"]][s - burn, ] <- beta_star[["c"]]
    }
    beta0 <- beta_star
  } else {
    if(s > burn){
      beta_list[["a"]][s - burn] <- beta0[["a"]]
      beta_list[["b"]][s - burn, ] <- beta0[["b"]]
      beta_list[["c"]][s - burn, ] <- beta0[["c"]]
    }
  }
}

Let’s look at some traceplots of our posterior samples to see how well the Markov Chain is mixing:

par(mfrow = c(3, 2))
for(j in sample(ncol(beta_list$b), 6)){
  plot(beta_list$b[, j], type = 'l', ylab = paste("Beta", j), xlab = "Iter", col = "purple",
       main = paste("Acceptance ratio =", round(length(unique(beta_list$b[,j])) / nrow(beta_list$b), 3)))
  acf(beta_list$b[, j], ylab = paste("Beta", j), main = paste("Series Beta", j))
}

Based on the traceplots, do you think the chain has converged?
Our hope was to have independent samples from the posterior. Is that the case here?

Play around with a few values of \(\delta\). Pick a \(\delta\) value that gets the acceptance ratios somewhere between 0.40 and 0.45. Make new traceplots and autocorrelation plots for your new \(\delta\) value.

Based on the new traceplots, do you think the chain has converged?
What can you say about independence?

Change the \(\delta\) value back to 0.05 and rerun the sampler. We can also try (1) changing the proposal density entirely, (2) running the chain for more iterations, and/or (3) subsampling or thinning our posterior samples. Let’s thin our chain by a factor of \(10\) and see whether the resulting draws show more independence.

thin_seq <- seq(1, nrow(beta_list$b), by = 10)
par(mfrow = c(3, 2))
for(j in sample(ncol(beta_list$b), 6)){
  plot(beta_list$b[thin_seq, j], type = 'l', ylab = paste("Beta", j), xlab = "Iter", col = "purple",
       main = paste("Acceptance ratio =", round(length(unique(beta_list$b[thin_seq,j])) / nrow(beta_list$b[thin_seq, ]), 3)))
  acf(beta_list$b[thin_seq, j], ylab = paste("Beta", j), main = paste("Series Beta", j))
}

Based on the new traceplots, do you think the chain has converged?
What can you say about independence here?

For now, we will work with the thinned samples from our original Metropolis-Hastings algorithm. Below we plot the thinned samples from the posterior distribution of each coefficient. Note that coefficients 26, 27, 28, and 29 correspond to the predictors in the iris dataset, and the other \(25\) coefficients correspond to the independently simulated predictors. How does our inference look?

((-beta_list$b[thin_seq, ] - beta_list$c[thin_seq, ])/2) %>%
  reshape2::melt() %>%
  dplyr::mutate(cat = "1") %>%
  rbind(reshape2::melt(beta_list$b[thin_seq, ] + ((-beta_list$b[thin_seq, ] - beta_list$c[thin_seq, ])/2)) %>% dplyr::mutate(cat = "2")) %>%
  rbind(reshape2::melt(beta_list$c[thin_seq, ] + ((-beta_list$b[thin_seq, ] - beta_list$c[thin_seq, ])/2)) %>% dplyr::mutate(cat = "3")) %>%
  ggplot2::ggplot() +
  geom_boxplot(aes(x = as.factor(Var2), y = value, fill = cat))  +
  scale_fill_brewer(palette = "Set1", name = "Species", labels = c("Setosa", "Versicolor", "Virginica")) +
  labs(x = "Coefficient", y = expression(beta))

How well does our model fit the data? To see whether the logistic regression is able to separate the three flower species, we can evaluate our model’s predictions at the posterior mean of our coefficients:

denom <- (1 + exp(X%*%t(beta_list$b[thin_seq, ])) + exp(X%*%t(beta_list$c[thin_seq, ])))
mod_preds <- data.frame(setosa = rowMeans(1 / denom),
                        versicolor = rowMeans(exp(X%*%t(beta_list$b[thin_seq, ])) / denom),
                        virginica = rowMeans(exp(X%*%t(beta_list$c[thin_seq, ])) / denom))

predicted_species <- apply(mod_preds,1,function(x) unique(modified_iris$Species)[which(x==max(x))])
Conf_mat <- confusionMatrix(predicted_species,modified_iris$Species)
Conf_mat$table

##             Reference
## Prediction   setosa versicolor virginica
##   setosa         48          0         0
##   versicolor      2         31        15
##   virginica       0         19        35

Conf_mat$overall["Accuracy"];

## Accuracy 
##     0.76

Conf_mat$byClass[,c("Sensitivity","Specificity")]

##                   Sensitivity Specificity
## Class: setosa            0.96        1.00
## Class: versicolor        0.62        0.83
## Class: virginica         0.70        0.81

Laplace prior

Above, we used a normal prior. In general, we saw that the coefficients corresponding to the simulated predictors had values that were small, as desired. What if we used a prior that placed higher density at zero? Would we see different shrinkage effects?

The Laplace (or double-exponential) density has probability density function

\[ p(x | \mu, \sigma) = \frac{1}{2\sigma}e^{-\frac{|x - \mu|}{\sigma}} \]

with shapes for varying values of \(\sigma\) given below

data.frame(sigma = rep(c(1, 2, 4), each = 1000)) %>%
  dplyr::group_by(sigma) %>%
  dplyr::mutate(x = seq(-10, 10, length.out = 1000),
                dens = (1/(2*sigma))*exp(-abs(seq(-10, 10, length.out = 1000))/sigma)) %>%
  dplyr::ungroup() %>%
  ggplot2::ggplot() +
  geom_line(aes(x = x, y = dens, colour = as.factor(sigma), group = as.factor(sigma))) +
  scale_colour_manual(values = c("#74a9cf", "#0570b0", "#023858"), name = expression(sigma))

Consider a modification to our original model in which we place a Laplace prior on our coefficient vectors:

Note that our choice of scale parameter gives the Laplace prior the same variance as the Gaussian prior above. Changing the prior on our coefficients requires only one modification to the M-H code above. Specifically, it requires a change to the function prior_lpdf.

Using the code chunk below, modify the function prior_lpdf to return the log density of the Laplace prior given above. Follow the syntax of the Gaussian prior_lpdf function.

#
prior_lpdf <- function(beta){
  K <- length(beta)
  lpdf <- 0
  b <- 0.5 / sqrt(2)
  for(k in 2:K){
    lpdf <- lpdf + sum(-abs(beta[[k]])/b - log(2*b))
  }
  return(lpdf)
}

We will use the same code as before to run Metropolis-Hastings and the same code as before to plot samples from the posterior distribution of our coefficient vectors:

((-beta_list$b[thin_seq, ] - beta_list$c[thin_seq, ])/2) %>%
  reshape2::melt() %>%
  dplyr::mutate(cat = "1") %>%
  rbind(reshape2::melt(beta_list$b[thin_seq, ] + ((-beta_list$b[thin_seq, ] - beta_list$c[thin_seq, ])/2)) %>% dplyr::mutate(cat = "2")) %>%
  rbind(reshape2::melt(beta_list$c[thin_seq, ] + ((-beta_list$b[thin_seq, ] - beta_list$c[thin_seq, ])/2)) %>% dplyr::mutate(cat = "3")) %>%
  ggplot2::ggplot() +
  geom_boxplot(aes(x = as.factor(Var2), y = value, fill = cat))  +
  scale_fill_brewer(palette = "Set1", name = "Species", labels = c("Setosa", "Versicolor", "Virginica")) +
  labs(x = "Coefficient", y = expression(beta))

Compute the accuracy for the new model. How does our inference under the Laplace prior compare to that under the Gaussian prior?

Lab 10: Metropolis-Hastings

STA 602L: Bayesian and Modern Statistics

Housekeeping

R/RStudio

R Markdown

Gradescope

Getting started

Metropolis Hastings (Recap)

Logistic regression

Multiclass logistic regression

Inference with Metropolis Hastings

Laplace prior

Grading

Acknowledgement