class: center, middle, inverse, title-slide

# STA 360/602L: Module 6.3

## Bayesian linear regression: weakly informative priors

### Dr. Olanrewaju Michael Akande

---

## Bayesian linear regression recap

- Sampling model:
.block[
`$$\boldsymbol{Y} \sim \mathcal{N}_n(\boldsymbol{X}\boldsymbol{\beta}, \sigma^2\boldsymbol{I}_{n\times n}).$$`
]

--

- Semi-conjugate prior for `\(\boldsymbol{\beta}\)`:
.block[
`$$\pi(\boldsymbol{\beta}) = \mathcal{N}_p(\boldsymbol{\mu}_0, \Sigma_0).$$`
]

--

- Semi-conjugate prior for `\(\sigma^2\)`:
.block[
`$$\pi(\sigma^2) = \mathcal{IG} \left(\dfrac{\nu_0}{2}, \dfrac{\nu_0\sigma_0^2}{2}\right).$$`
]

---

## Full conditionals

.block[
$$
`\begin{split}
\pi(\boldsymbol{\beta} | \boldsymbol{y}, \boldsymbol{X}, \sigma^2) & = \mathcal{N}_p(\boldsymbol{\mu}_n, \Sigma_n),\\
\end{split}`
$$
]

--

where
.block[
$$
`\begin{split}
\Sigma_n & = \left[\Sigma_0^{-1} + \frac{1}{\sigma^2} \boldsymbol{X}^T \boldsymbol{X} \right]^{-1}\\
\boldsymbol{\mu}_n & = \Sigma_n \left[\Sigma_0^{-1}\boldsymbol{\mu}_0 + \frac{1}{\sigma^2} \boldsymbol{X}^T \boldsymbol{y} \right],
\end{split}`
$$
]

--

and
.block[
$$
`\begin{split}
\pi(\sigma^2 | \boldsymbol{y}, \boldsymbol{X}, \boldsymbol{\beta}) & = \mathcal{IG} \left(\dfrac{\nu_n}{2}, \dfrac{\nu_n\sigma_n^2}{2}\right),\\
\end{split}`
$$
]

--

where
.block[
$$
`\begin{split}
\nu_n & = \nu_0 + n\\
\sigma_n^2 & = \dfrac{1}{\nu_n} \left[ \nu_0 \sigma_0^2 + (\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\beta})^T (\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\beta}) \right] = \dfrac{1}{\nu_n} \left[ \nu_0 \sigma_0^2 + \text{SSR}(\boldsymbol{\beta}) \right].
\end{split}`
$$
]

---

## Weakly informative priors

- Specifying hyperparameters that represent actual prior information can be challenging, especially for `\(\boldsymbol{\beta}\)`.

--

- It can therefore be desirable to use weakly informative priors when possible. The Hoff book discusses a few different options, one of which is Zellner's g-prior (there are other options, but we will not cover them in this course).

--

- Note that we can also use the Jeffreys prior here to be completely non-informative.

--

- Zellner's g-prior is
.block[
.small[
$$
`\begin{split}
\pi(\boldsymbol{\beta} | \sigma^2) & = \mathcal{N}_p\left(\boldsymbol{\mu}_0 = \boldsymbol{0}, \Sigma_0 = g\sigma^2 \left[\boldsymbol{X}^T \boldsymbol{X} \right]^{-1} \right)\\
\pi(\sigma^2) & = \mathcal{IG} \left(\dfrac{\nu_0}{2}, \dfrac{\nu_0\sigma_0^2}{2}\right),\\
\end{split}`
$$
]
]

  for some positive value `\(g\)`, which is commonly set to the sample size `\(n\)`.

---

## Weakly informative priors

- Note that the g-prior uses a part of the data. As I have mentioned before, using your data to construct your prior is usually a no-no.

--

- However, the g-prior actually does not use the information in `\(\boldsymbol{y}\)`, the response variable of interest, just the information in `\(\boldsymbol{X}\)`.

--

- Observe that the prior specification actually looks like the conjugate prior we first used for the univariate normal model, that is, with
.block[
.small[
$$
`\begin{split}
\sigma^2 & \sim \mathcal{IG}\left(\dfrac{\nu_0}{2}, \dfrac{\nu_0\sigma_0^2}{2}\right)\\
\mu|\sigma^2 & \sim \mathcal{N}\left(\mu_0, \dfrac{\sigma^2}{\kappa_0}\right).\\
\end{split}`
$$
]
]

--

- It turns out that we also have conjugacy with the g-prior, so that we don't actually need Gibbs sampling to obtain posterior samples: `\(\pi(\boldsymbol{\beta} | \boldsymbol{y}, \boldsymbol{X}, \sigma^2)\)` takes the same form as before, but now we also have `\(\pi(\sigma^2 | \boldsymbol{y}, \boldsymbol{X})\)` in closed form. For contrast, the next slide sketches the Gibbs sampler we would otherwise need.
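---

## Aside: Gibbs sampling under the semi-conjugate prior

For contrast with the g-prior results that follow, here is a minimal sketch (not code from the Hoff book) of the Gibbs sampler implied by the full conditionals above. It assumes the design matrix `X`, response `Y`, and hyperparameters `mu_0`, `Sigma_0`, `nu_0`, and `sigma_0_sq` are already defined.

```r
library(mvtnorm) #for rmvnorm

gibbs_lm <- function(X, Y, mu_0, Sigma_0, nu_0, sigma_0_sq, S=10000){
  n <- nrow(X); p <- ncol(X)
  Sigma_0_inv <- solve(Sigma_0)
  beta <- matrix(nrow=S, ncol=p); sigma_sq <- rep(NA, S)
  beta_s <- rep(0, p); sigma_sq_s <- 1 #starting values
  for(s in 1:S){
    #update beta from its full conditional N(mu_n, Sigma_n)
    Sigma_n <- solve(Sigma_0_inv + (t(X)%*%X)/sigma_sq_s)
    mu_n <- Sigma_n %*% (Sigma_0_inv%*%mu_0 + (t(X)%*%Y)/sigma_sq_s)
    beta_s <- c(rmvnorm(1, mu_n, Sigma_n))
    #update sigma_sq from its full conditional IG(nu_n/2, nu_n*sigma_n_sq/2)
    SSR <- sum((Y - X%*%beta_s)^2)
    sigma_sq_s <- 1/rgamma(1, (nu_0 + n)/2, (nu_0*sigma_0_sq + SSR)/2)
    beta[s,] <- beta_s; sigma_sq[s] <- sigma_sq_s
  }
  list(beta=beta, sigma_sq=sigma_sq)
}
```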
---

## Weakly informative priors

- With the g-prior, we have
.block[
.small[
$$
`\begin{split}
\pi(\boldsymbol{\beta} | \boldsymbol{y}, \boldsymbol{X}, \sigma^2) & = \mathcal{N}_p(\boldsymbol{\mu}_n, \Sigma_n)\\
\pi(\sigma^2 | \boldsymbol{y}, \boldsymbol{X}) & = \mathcal{IG} \left(\dfrac{\nu_n}{2}, \dfrac{\nu_n\sigma_n^2}{2}\right),\\
\end{split}`
$$
]
]

  where
.block[
.small[
$$
`\begin{split}
\Sigma_n & = \left[\Sigma_0^{-1} + \frac{1}{\sigma^2} \boldsymbol{X}^T \boldsymbol{X} \right]^{-1} = \left[\dfrac{1}{g\sigma^2} \boldsymbol{X}^T \boldsymbol{X} + \frac{1}{\sigma^2} \boldsymbol{X}^T \boldsymbol{X} \right]^{-1} = \dfrac{g}{g+1} \sigma^2 \left[\boldsymbol{X}^T \boldsymbol{X} \right]^{-1}\\
\\
\boldsymbol{\mu}_n & = \Sigma_n \left[\Sigma_0^{-1}\boldsymbol{\mu}_0 + \frac{1}{\sigma^2} \boldsymbol{X}^T \boldsymbol{y} \right] = \dfrac{g}{g+1} \sigma^2 \left[\boldsymbol{X}^T \boldsymbol{X} \right]^{-1} \left[\frac{1}{\sigma^2} \boldsymbol{X}^T \boldsymbol{y} \right]\\
& = \dfrac{g}{g+1} \left[\boldsymbol{X}^T \boldsymbol{X} \right]^{-1} \boldsymbol{X}^T \boldsymbol{y} = \dfrac{g}{g+1} \hat{\boldsymbol{\beta}}_{\text{ols}}\\
\\
\nu_n & = \nu_0 + n; \ \ \ \ \ \ \ \sigma_n^2 = \dfrac{1}{\nu_n} \left[ \nu_0 \sigma_0^2 + \text{SSR}(g) \right],
\end{split}`
$$
]
]

  and `\(\text{SSR}(g) = \boldsymbol{y}^T \left(\boldsymbol{I} - \frac{g}{g+1} \boldsymbol{X} \left(\boldsymbol{X}^T \boldsymbol{X}\right)^{-1} \boldsymbol{X}^T\right) \boldsymbol{y}\)`. See the Hoff book for the proof, and see the homework for an illustration. A code sketch of this sampler appears just before the data analysis.

---

## Example

- Health plans use many tools to try to control the cost of prescription medicines.

--

- For older drugs, generic substitutes that are equivalent to name-brand drugs are available at considerable savings.

--

- Another tool that may lower costs is restricting the drugs that a physician may prescribe.

--

- For example, if three similar drugs for treating the same condition are available, a health plan may require the physician to prescribe only one of them, allowing the plan to negotiate discounts based on a higher volume of sales.

--

- We have data from 29 health plans that can be used to explore the effectiveness of these two strategies in controlling drug costs.

--

- The response is COST, the average cost of the prescriptions to the plan per day (in dollars).

---

## Example

- Explanatory variables are:
  + RXPM: Average number of prescriptions per member per year
  + GS: Percent generic substitute used by the plan
  + RI: Restrictiveness Index, from 0 (no restrictions) to 100 (total restrictions on the physician)
  + COPAY: Average member copay on prescriptions
  + AGE: Average member age
  + F: Percent female members
  + MM: Member months, a measure of the size of the plan
  + ID: An identifier for the name of the plan

--

- The data is in the file `costs.txt` on Sakai.

--

- For this illustration, we will restrict ourselves to GS and AGE. We will use the other variables later.
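---

## Monte Carlo sampling under the g-prior

Before turning to the data, here is a minimal sketch of the direct Monte Carlo sampler suggested by the closed forms above. The function name `gprior_mc` is made up for this illustration; the defaults mirror the choices we make in the analysis below.

```r
library(mvtnorm) #for rmvnorm

gprior_mc <- function(X, Y, g=nrow(X), nu_0=1, sigma_0_sq=1/100, S=10000){
  n <- nrow(X); p <- ncol(X)
  XtX_inv <- solve(t(X)%*%X)
  beta_ols <- XtX_inv%*%t(X)%*%Y
  #SSR(g) = y^T (I - (g/(g+1)) X (X^T X)^{-1} X^T) y
  Hg <- (g/(g+1)) * X%*%XtX_inv%*%t(X)
  SSRg <- c(t(Y)%*%(diag(n) - Hg)%*%Y)
  #sample sigma^2 directly from IG(nu_n/2, (nu_0*sigma_0_sq + SSRg)/2)
  sigma_sq <- 1/rgamma(S, (nu_0 + n)/2, (nu_0*sigma_0_sq + SSRg)/2)
  #sample beta | sigma^2 from N((g/(g+1))*beta_ols, (g/(g+1))*sigma^2*(X^T X)^{-1})
  mu_n <- (g/(g+1))*beta_ols
  beta <- t(sapply(sigma_sq, function(s2)
    rmvnorm(1, mu_n, (g/(g+1))*s2*XtX_inv)))
  list(beta=beta, sigma_sq=sigma_sq)
}
```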
---

## Data

```r
require(lattice) #for levelplot
library(mvtnorm) #for rmvnorm
#library(pls)
#library(calibrate)

###### Data
Data <- read.table("data/costs.txt",header=TRUE)[,-9]
head(Data)
```

```
##   COST RXPM GS   RI COPAY  AGE    F      MM
## 1 1.34  4.2 36 45.6 10.87 29.7 52.3 1158096
## 2 1.34  5.4 37 45.6  8.66 29.7 52.3 1049892
## 3 1.38  7.0 37 45.6  8.12 29.7 52.3   96168
## 4 1.22  7.1 40 23.6  5.89 28.7 53.4  407268
## 5 1.08  3.5 40 23.6  6.05 28.7 53.4   13224
## 6 1.16  7.2 46 22.3  5.05 29.1 52.2  303312
```

---

## Very basic EDA

<img src="6-3-linear-regression-III_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />

---

## Very basic EDA

```r
levelplot(cor(Data[,c("COST","GS","AGE")])) #Check correlation
```

<img src="6-3-linear-regression-III_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />

---

## Very basic EDA

Without outlier:

<img src="6-3-linear-regression-III_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />

---

## Very basic EDA

Without outlier:

```r
levelplot(cor(Data[-19,c("COST","GS","AGE")])) #Check correlation
```

<img src="6-3-linear-regression-III_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />

---

## Posterior computation

```r
###### g-prior: with g=n, using the full model

# Data summaries
X <- cbind(1,as.matrix(Data[-19,c("GS","AGE")])) #remove potential outlier
Y <- matrix(Data$COST[-19],ncol=1)
n <- length(Y)
p <- ncol(X)
g <- n

# OLS estimates
beta_ols <- solve(t(X)%*%X)%*%t(X)%*%Y
round(t(beta_ols),4)
```

```
##                  GS     AGE
## [1,] 2.7047  -0.02 -0.0231
```

```r
SSR_beta_ols <- (t(Y - (X%*%beta_ols)))%*%(Y - (X%*%beta_ols))
sigma_ols <- SSR_beta_ols/(n-p) #OLS estimate of sigma^2
sigma_ols
```

```
##             [,1]
## [1,] 0.005247074
```

```r
# Hyperparameters for the priors
#sigma_0_sq <- sigma_ols
sigma_0_sq <- 1/100
nu_0 <- 1

# Set number of iterations
S <- 10000
```

---

## Posterior computation

```r
set.seed(1234)

# Sample sigma_sq
nu_n <- nu_0 + n
Hg <- (g/(g+1)) * X%*%solve(t(X)%*%X)%*%t(X)
SSRg <- t(Y)%*%(diag(1,nrow=n) - Hg)%*%Y
nu_n_sigma_n_sq <- nu_0*sigma_0_sq + SSRg
sigma_sq <- 1/rgamma(S,(nu_n/2),(nu_n_sigma_n_sq/2))

# Sample beta
mu_n <- g*beta_ols/(g+1)
beta <- matrix(nrow=S,ncol=p)
for(s in 1:S){
  Sigma_n <- g*sigma_sq[s]*solve(t(X)%*%X)/(g+1)
  beta[s,] <- rmvnorm(1,mu_n,Sigma_n)
}

# Posterior summaries
colnames(beta) <- colnames(X)
mean_beta <- apply(beta,2,mean)
round(mean_beta,4)
```

```
##                  GS     AGE 
##  2.6057 -0.0193 -0.0221
```

```r
round(apply(beta,2,function(x) quantile(x,c(0.025,0.975))),4)
```

```
##                   GS     AGE
## 2.5%  0.4392 -0.0432 -0.0935
## 97.5% 4.7903  0.0044  0.0460
```
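---

## Posterior predictive check

Since sampling is direct, posterior predictive checks are cheap. Here is a minimal sketch (an extra illustration, not part of the original analysis) that reuses the `beta` and `sigma_sq` draws above to compare the replicated mean cost to the observed mean.

```r
#simulate one replicate dataset per posterior draw and record its mean
mean_rep <- rep(NA, S)
for(s in 1:S){
  y_rep <- X%*%beta[s,] + rnorm(n, 0, sqrt(sigma_sq[s]))
  mean_rep[s] <- mean(y_rep)
}

#posterior predictive p-value for the mean: values near 0 or 1 signal misfit
mean(mean_rep >= mean(Y))
```

---

class: center, middle

# What's next?

### Move on to the readings for the next module!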