class: center, middle, inverse, title-slide

# STA 360/602L: Module 6.5

## Bayesian model selection

### Dr. Olanrewaju Michael Akande

---

## Bayesian model selection

- Now that we have a general sense of how Bayesian hypothesis testing works, let's get into model selection, using some of the same ideas.

--

- .hlight[General setting]:

  1. Define a list of models. That is, let `\(\Gamma\)` be a "finite" set of different possible models.

--

  2. Let each `\(\gamma \in \Gamma\)` index one of the models, and assume the "true" model is in `\(\Gamma\)`. Also, let `\(\theta_\gamma\)` represent the parameters in model `\(\gamma\)`.

--

  3. Put a prior over the set `\(\Gamma\)`. Let `\(\Pi_\gamma = p[\gamma] = \Pr[\gamma \text{ is true}]\)`, for all `\(\gamma \in \Gamma\)`. The most common choice is the uniform prior, that is, `\(\Pi_\gamma = \frac{1}{\#\Gamma}\)`, for all `\(\gamma \in \Gamma\)`, where `\(\#\Gamma\)` is the total number of models in `\(\Gamma\)`.

--

  4. Put a prior on the parameters in each model, that is, each `\(\pi(\theta_\gamma)\)`.

--

  5. Compute the marginal posterior probabilities `\(\Pr[\gamma | Y]\)` for each model, and select a model based on those posterior probabilities.

---

## Bayesian model selection

- For each model `\(\gamma \in \Gamma\)`, we need to compute `\(\Pr[\gamma | Y]\)`.

--

- Let `\(p_\gamma(Y)\)` denote the marginal likelihood of the data under model `\(\gamma\)`, that is, `\(p[Y | \gamma]\)`. As before,
.block[
.small[
$$
`\begin{split}
\hat{\Pi}_\gamma = \Pr[\gamma | Y] & = \frac{p[Y | \gamma] \cdot p[\gamma]}{\sum_{\gamma^\star \in \Gamma} p[Y | \gamma^\star] \cdot p[\gamma^\star]} = \frac{ p_\gamma(Y) \Pi_\gamma }{ \sum_{\gamma^\star \in \Gamma} p_{\gamma^\star}(Y) \Pi_{\gamma^\star} }\\
\\
& = \frac{ \Pi_\gamma \cdot \left[ \int_{\Theta_\gamma} p_\gamma(Y | \theta_\gamma) \cdot \pi(\theta_\gamma) \text{d}\theta_\gamma \right] }{ \sum_{\gamma^\star \in \Gamma} \Pi_{\gamma^\star} \cdot \left[ \int_{\Theta_{\gamma^\star}} p_{\gamma^\star}(Y | \theta_{\gamma^\star}) \cdot \pi(\theta_{\gamma^\star}) \text{d}\theta_{\gamma^\star} \right] }.\\
\end{split}`
$$
]
]

--

- If we assume a uniform prior on `\(\Gamma\)`, that is, `\(\Pi_\gamma = \frac{1}{\#\Gamma}\)`, for all `\(\gamma \in \Gamma\)`, then
.block[
.small[
$$
`\begin{split}
\hat{\Pi}_\gamma & = \frac{ p_\gamma(Y) }{ \sum_{\gamma^\star \in \Gamma} p_{\gamma^\star}(Y) }\\
\\
& = \frac{ \left[ \int_{\Theta_\gamma} p_\gamma(Y | \theta_\gamma) \cdot \pi(\theta_\gamma) \text{d}\theta_\gamma \right] }{ \sum_{\gamma^\star \in \Gamma} \left[ \int_{\Theta_{\gamma^\star}} p_{\gamma^\star}(Y | \theta_{\gamma^\star}) \cdot \pi(\theta_{\gamma^\star}) \text{d}\theta_{\gamma^\star} \right] }.\\
\end{split}`
$$
]
]

---

## Bayesian model selection

- <div class="question">
How should we choose the Bayes optimal model?
</div>

--

- We can specify a loss function. The most natural is
.block[
.small[
$$
`\begin{split}
L(\hat{\gamma},\gamma) = \boldsymbol{1(\hat{\gamma} \neq \gamma)},
\end{split}`
$$
]
]

  that is,
  1. Loss equals zero if the correct model is chosen; and
  2. Loss equals one if an incorrect model is chosen.

--

- Next, select `\(\hat{\gamma}\)` to minimize the Bayes risk. Here, the Bayes risk (the posterior expected loss) is
.block[
.small[
$$
`\begin{split}
R(\hat{\gamma}) = \sum_{\gamma \in \Gamma} \boldsymbol{1(\hat{\gamma} \neq \gamma)} \cdot \hat{\Pi}_\gamma = 0 \cdot \hat{\Pi}_{\hat{\gamma}} + \sum_{\gamma \neq \hat{\gamma}} \hat{\Pi}_\gamma = 1 - \hat{\Pi}_{\hat{\gamma}}.
\end{split}`
$$
]
]

--

- To minimize `\(R(\hat{\gamma})\)`, choose `\(\hat{\gamma}\)` such that `\(\hat{\Pi}_{\hat{\gamma}}\)` is the largest! That is, select the model with the largest posterior probability; a small numerical sketch follows on the next slide.
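---

## Bayesian model selection

- As a quick illustration, here is a minimal numerical sketch of turning marginal likelihoods and prior model probabilities into posterior model probabilities, and then picking the model with the largest one. The marginal likelihood values below are made up purely for illustration.

```r
# Hypothetical log marginal likelihoods, log p_gamma(Y), for three candidate models
log_marg_lik <- c(model1 = -102.3, model2 = -100.1, model3 = -101.7)
prior_prob   <- rep(1 / 3, length(log_marg_lik))   # uniform prior over Gamma

# work on the log scale, then normalize, for numerical stability
log_post  <- log_marg_lik + log(prior_prob)
post_prob <- exp(log_post - max(log_post))
post_prob <- post_prob / sum(post_prob)            # hat{Pi}_gamma for each model

post_prob
which.max(post_prob)   # Bayes-optimal model under 0-1 loss
```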
---

## Inference vs prediction

- What if the goal is prediction? Then maybe we should care more about predictive accuracy than about selecting specific variables.

--

- For predictions, we care about the posterior predictive distribution, that is,
.block[
.small[
$$
`\begin{split}
p(y_{n+1}|Y = (y_1, \ldots, y_n)) & = \int_\Gamma \int_{\Theta_\gamma} p(y_{n+1}|\gamma, \theta_\gamma) \cdot \pi(\gamma, \theta_\gamma | Y) \ \text{d}\theta_\gamma \text{d}\gamma \\
& = \int_\Gamma \int_{\Theta_\gamma} p(y_{n+1}|\gamma, \theta_\gamma) \cdot \pi(\theta_\gamma | Y, \gamma) \cdot \Pr[\gamma | Y] \ \text{d}\theta_\gamma \text{d}\gamma \\
& = \sum_{\gamma \in \Gamma} \int_{\Theta_\gamma} p(y_{n+1}|\gamma, \theta_\gamma) \cdot \pi(\theta_\gamma | Y, \gamma) \cdot \hat{\Pi}_\gamma \ \text{d}\theta_\gamma \\
& = \sum_{\gamma \in \Gamma} \hat{\Pi}_\gamma \cdot \int_{\Theta_\gamma} p(y_{n+1}|\gamma, \theta_\gamma) \cdot \pi(\theta_\gamma | Y, \gamma) \ \text{d}\theta_\gamma \\
& = \sum_{\gamma \in \Gamma} \hat{\Pi}_\gamma \cdot p(y_{n+1} | Y, \gamma), \\
\end{split}`
$$
]
]

  which is just an average of the predictions from each model, over all possible models in `\(\Gamma\)`, weighted by the posterior probability of each model. This is known as .hlight[Bayesian model averaging (BMA)]; a small simulated sketch follows on the next slide.
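---

## Inference vs prediction

- Here is a minimal sketch of BMA in action. The posterior model probabilities and per-model predictive draws below are simulated stand-ins, not output from a fitted model; the point is how the per-model predictions get combined using `\(\hat{\Pi}_\gamma\)`.

```r
set.seed(360)
post_prob  <- c(0.55, 0.30, 0.15)          # made-up hat{Pi}_gamma for three models
pred_draws <- list(rnorm(5000, 1.0, 0.8),  # stand-in draws from p(y_{n+1} | Y, gamma_1)
                   rnorm(5000, 1.4, 0.9),  # stand-in draws from p(y_{n+1} | Y, gamma_2)
                   rnorm(5000, 0.7, 1.1))  # stand-in draws from p(y_{n+1} | Y, gamma_3)

# mixture sampling: pick a model with probability hat{Pi}_gamma,
# then draw a prediction from that model's posterior predictive draws
model_idx <- sample(1:3, 5000, replace = TRUE, prob = post_prob)
y_bma     <- vapply(model_idx, function(g) sample(pred_draws[[g]], 1), numeric(1))

mean(y_bma)   # approximates sum_gamma hat{Pi}_gamma * E[y_{n+1} | Y, gamma]
```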
---

## Back to Bayesian linear regression

- So what does this mean specifically in the context of linear regression?

--

- First, recall that for model `\(\gamma\)`, the posterior probability that it is the right model is
.block[
.small[
$$
`\begin{split}
\hat{\Pi}_\gamma & = \frac{ \Pi_\gamma p_\gamma(Y) }{ \sum_{\gamma^\star \in \Gamma} \Pi_{\gamma^\star} p_{\gamma^\star}(Y) }.\\
\end{split}`
$$
]
]

--

- *Practical issues*

  + We need to calculate marginal likelihoods for ALL models in `\(\Gamma\)`.

--

  + In general, we cannot calculate the marginal likelihoods unless we have proper or conjugate priors.

--

  + For linear regression, that means using priors like Zellner's g-prior, the horseshoe prior you were introduced to in the lab, and so on.

---

## Bayesian variable selection

- To explore Bayesian variable selection, rewrite each model `\(\gamma \in \Gamma\)` as
.block[
`$$\boldsymbol{Y} \sim \mathcal{N}_n(\boldsymbol{X}_{\gamma}\boldsymbol{\beta}_{\gamma}, \sigma^2\boldsymbol{I}_{n\times n}).$$`
]

--

- `\(\gamma\)` represents the set of predictors we want to include in our model.

--

- Using the same notation as before, each `\(\gamma = (\gamma_0, \gamma_1, \ldots, \gamma_{p-1}) \in \{0,1\}^p\)`, so that the cardinality of `\(\Gamma\)` is `\(2^p\)`, that is, the number of models in `\(\Gamma\)`.

--

- That is,

  + `\(\gamma_j = 1\)` means the `\(j\)`'th predictor is included in the model, but `\(\gamma_j = 0\)` means it is not;

  + `\(\boldsymbol{X}_{\gamma}\)` is the matrix of predictors with `\(\gamma_j = 1\)`;

  + `\(\boldsymbol{\beta}_{\gamma}\)` is the corresponding vector of coefficients for the predictors with `\(\gamma_j = 1\)`.

--

- Set `\(p_\gamma = \sum^{p-1}_{j=0} \gamma_j\)`, so that `\(p_\gamma\)` is the number of predictors included in model `\(\gamma\)`; then `\(\boldsymbol{X}_{\gamma}\)` is `\(n \times p_\gamma\)` and `\(\boldsymbol{\beta}_{\gamma}\)` is `\(p_\gamma \times 1\)`.

---

## Bayesian variable selection

- Recall that we can also write each model as
.block[
.small[
`$$Y_i = \boldsymbol{\beta}^T_{\gamma} \boldsymbol{x}_{i\gamma} + \epsilon_i; \ \ \ \ \epsilon_i \overset{iid}{\sim} \mathcal{N}(0, \sigma^2).$$`
]
]

--

- As an example, suppose we had data with 6 potential predictors, including the intercept, so that each `\(\boldsymbol{x}_i = (1, x_{i1}, x_{i2}, x_{i3},x_{i4},x_{i5})\)` and `\(\boldsymbol{\beta} = (\beta_0, \beta_1, \beta_2, \beta_3, \beta_4, \beta_5)\)`.

--

- Then for the model with `\(\gamma = (1,1,0,0,0,0)\)`, `\(Y_i = \boldsymbol{\beta}^T_{\gamma} \boldsymbol{x}_{i\gamma} + \epsilon_i\)`
.block[
.small[
`$$\implies Y_i = \beta_0 + \beta_1 x_{i1} + \epsilon_i; \ \ \ \ \epsilon_i \overset{iid}{\sim} \mathcal{N}(0, \sigma^2),$$`
]
]

  with `\(p_\gamma = 2\)`.

- Whereas for the model with `\(\gamma = (1,0,0,1,1,0)\)`, `\(Y_i = \boldsymbol{\beta}^T_{\gamma} \boldsymbol{x}_{i\gamma} + \epsilon_i\)`
.block[
.small[
`$$\implies Y_i = \beta_0 + \beta_3 x_{i3} + \beta_4 x_{i4} + \epsilon_i; \ \ \ \ \epsilon_i \overset{iid}{\sim} \mathcal{N}(0, \sigma^2),$$`
]
]

  with `\(p_\gamma = 3\)`.

---

## Bayesian variable selection

- The outline for variable selection would be as follows:

  1. Write down the likelihood under model `\(\gamma\)`. That is,
.block[
.small[
$$
`\begin{split}
p(\boldsymbol{y} | \boldsymbol{X}, \gamma, \boldsymbol{\beta}_{\gamma}, \sigma^2) & \propto (\sigma^2)^{-\frac{n}{2}} \ \textrm{exp} \left\{-\frac{1}{2\sigma^2} (\boldsymbol{y} - \boldsymbol{X}_{\gamma}\boldsymbol{\beta}_{\gamma})^T (\boldsymbol{y} - \boldsymbol{X}_{\gamma}\boldsymbol{\beta}_{\gamma})\right\}\\
\end{split}`
$$
]
]

--

  2. Define a prior for `\(\gamma\)`, `\(\Pi_\gamma = \Pr[\gamma]\)`. For example, (i) a uniform prior over all `\(2^p\)` possible models, or (ii) independent Bernoulli priors on each `\(\gamma_j\)` (since each `\(\gamma_j \in \{0,1\}\)`), possibly with a beta hyperprior on the inclusion probability.

--

  3. Put a prior on the parameters in each model. Using the g-prior, we have
.block[
.small[
$$
`\begin{split}
\pi(\boldsymbol{\beta}_{\gamma} | \sigma^2) & = \mathcal{N}_{p_\gamma}\left(\boldsymbol{\beta}_{0\gamma}= \boldsymbol{0}, \Sigma_{0\gamma} = g\sigma^2 \left[\boldsymbol{X}^T_{\gamma} \boldsymbol{X}_{\gamma} \right]^{-1} \right)\\
\pi(\sigma^2) & = \mathcal{IG} \left(\frac{\nu_0}{2}, \frac{\nu_0\sigma_0^2}{2}\right)\\
\end{split}`
$$
]
]

---

## Bayesian variable selection

- With those pieces, the conditional posteriors are straightforward.

--

- We can then compute the marginal posterior probabilities `\(\Pr[\gamma | Y]\)` for each model and select the model with the highest posterior probability.

--

- We can also compute the posterior `\(\Pr[\gamma_j = 1| Y]\)`, the posterior probability of including the `\(j\)`'th predictor, often called the .hlight[marginal inclusion probability (MIP)], which accounts for uncertainty in the other predictors.

--

- It is also straightforward to do model averaging once we have all the posterior samples.

--

- The Hoff book works through one example, and you can find the Gibbs sampler for doing inference there. I strongly recommend you go through it carefully!

--

- In this course, however, we will focus on using R packages to do the same; a hedged sketch of one option follows on the next slide.
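---

## Bayesian variable selection

- For instance, here is a minimal sketch assuming the `BAS` package, one of several R packages for Bayesian variable selection. The data frame `df` with response `y` is hypothetical, and the specific arguments shown are one reasonable choice, not the only one.

```r
library(BAS)

# g-prior on beta_gamma with g = n, and a uniform prior over all 2^p models
fit <- bas.lm(y ~ .,
              data = df,                 # hypothetical data frame of y and predictors
              prior = "g-prior",         # Zellner's g-prior
              alpha = nrow(df),          # g = n (unit-information choice)
              modelprior = uniform())    # Pi_gamma = 1 / 2^p

summary(fit)                             # top models and their posterior probabilities
fit$probne0                              # marginal inclusion probabilities Pr[gamma_j = 1 | Y]
pred <- predict(fit, estimator = "BMA")  # model-averaged (BMA) predictions
```

---

class: center, middle

# What's next? 

### Move on to the readings for the next module!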