class: center, middle, inverse, title-slide

# STA 360/602L: Module 2.1

## Conjugacy; beta-Bernoulli and beta-binomial models

### Dr. Olanrewaju Michael Akande

---

## Outline

- Conjugacy

- Kernels

- Bernoulli data

- Binomial data

---

## Bayesian inference

- Once again, given **data** `\(y\)` and an **unknown population parameter** `\(\theta\)`, estimate `\(\theta\)`.

--

- As a Bayesian, you update some prior information for `\(\theta\)` with the information in the data `\(y\)` to obtain the posterior density `\(p(\theta | y)\)`.

--

- Personally, I prefer being able to obtain posterior densities that describe my parameter, instead of estimated summaries (usually measures of central tendency) along with confidence intervals.

--

- Bayes' theorem - reminder:
.block[
.small[
`$$p(\theta | y) = \frac{p(\theta)p(y|\theta)}{\int_{\Theta}p(\tilde{\theta})p(y| \tilde{\theta}) \textrm{d}\tilde{\theta}} = \frac{p(\theta)p(y|\theta)}{p(y)}$$`
]
]

---

## Comments on the posterior density

- The posterior density is more concentrated than the prior & quantifies learning about `\(\theta\)`.

--

- In fact, this is the optimal way to learn from data - see the discussion in Hoff Chapter 1.

--

- As more & more data become available, the posterior density will converge to a normal (Gaussian) density centered on the MLE (the Bayesian central limit theorem).

--

- In finite samples with limited data, however, the posterior can be highly skewed & noticeably non-Gaussian.

---

## Conjugacy

- Starting with an arbitrary prior density `\(p(\theta)\)` & sampling density `\(p(y|\theta)\)`, we may encounter problems in calculating the posterior density `\(p(\theta | y)\)`.

--

- In particular, notice the denominator in Bayes' rule:
.block[
.small[
`$$p(y) = \int_{\Theta}p(\theta)p(y| \theta) \textrm{d}\theta.$$`
]
]

  This integral may not be analytically tractable!

--

- When the prior is .hlight[conjugate], however, the marginal likelihood `\(p(y)\)` can be calculated analytically.

--

- .hlight[Conjugacy] `\(\Rightarrow\)` the posterior density (or mass) function has the same form as the prior density (or mass) function.

--

- Conjugate priors make calculations easy but may not represent our prior information well.

---

## Kernels

- In Bayesian statistics, the .hlight[kernel] of a pdf or pmf omits any multiplicative factors that do not depend on the random variable or parameter we care about.

--

- For many distributions, the kernel has a simple form, but the normalizing constant complicates calculations.

--

- If we recognize the kernel as matching that of a known distribution, we can reinstate the normalizing constant. This is a MAJOR TRICK we will use to calculate posterior distributions.

--

- For example, the normal density is given by
.block[
.small[
`$$p(y|\mu,\sigma^2) = \dfrac{1}{\sqrt{2\pi\sigma^2}}e^{-\dfrac{(y-\mu)^2}{2\sigma^2}}$$`
]
]

--

  but the kernel is just
.block[
.small[
`$$p(y|\mu,\sigma^2) \propto e^{-\dfrac{(y-\mu)^2}{2\sigma^2}}.$$`
]
]

---

## Bernoulli data

- Back to our example: suppose `\(\theta \in (0,1)\)` is the population proportion of individuals with diabetes in the US.

--

- Suppose we take a sample of `\(n\)` individuals and record whether or not each one has diabetes (as binary: 0,1).

--

- Then we can use the Bernoulli distribution as the sampling distribution.

--

- Also, we already established that we can use a beta prior for `\(\theta\)` (a few candidate beta priors are sketched on the next slide).
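
---

## Aside: visualizing beta priors

- Purely as an illustration, here is a minimal R sketch comparing a few candidate beta priors for `\(\theta\)`; the specific choices of `\(a\)` and `\(b\)` below are hypothetical, not estimates from the diabetes example.

```r
# Compare a few candidate Beta(a, b) priors for theta on (0, 1).
# The (a, b) pairs below are hypothetical choices for illustration only.
theta <- seq(0, 1, length.out = 500)

plot(theta, dbeta(theta, 1, 1), type = "l", lwd = 2,
     xlab = expression(theta), ylab = "Prior density", ylim = c(0, 10))
lines(theta, dbeta(theta, 2, 2), lwd = 2, lty = 2)   # mildly informative, centered at 0.5
lines(theta, dbeta(theta, 1, 10), lwd = 2, lty = 3)  # prior belief that theta is small
legend("topright", lty = 1:3, lwd = 2,
       legend = c("Beta(1, 1)", "Beta(2, 2)", "Beta(1, 10)"))
```

- Different `\((a,b)\)` values encode very different prior beliefs about `\(\theta\)`, which is exactly what we will exploit when we update the prior with data.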
---

## Bernoulli data

- Generally, it turns out that if
  + `\(p(y_i| \theta): y_i \overset{iid}{\sim} \textrm{Bernoulli}(\theta)\)` for `\(i = 1,\ldots,n\)`, and
  + `\(\pi(\theta): \theta \sim \textrm{Beta}(a,b)\)`,

  then the posterior distribution is also a beta distribution.

--

- <div class="question">
Can we derive the posterior distribution and its parameters? Let's do some work on the board!
</div>

--

- Updating a beta prior with a Bernoulli likelihood leads to a beta posterior - we have conjugacy!

--

- Let `\(y = (y_1,\ldots,y_n)\)`. Specifically, we have
.block[
.small[
`$$p(\theta | y) = \textrm{Beta}\left(a+\sum_{i=1}^n y_i,b+n-\sum_{i=1}^n y_i\right).$$`
]
]

--

- This is the .hlight[beta-Bernoulli model]. More generally, this is actually the .hlight[beta-binomial model].

---

## Beta-binomial in more detail

- Suppose the sampling density of the data is
.block[
.small[
`$$p(y|\theta) = {n \choose y} \theta^y(1-\theta)^{n-y}.$$`
]
]

--

- Suppose also that we have a `\(\textrm{Beta}(a,b)\)` prior on the probability `\(\theta\)`.

--

- Then the posterior density has the beta form
.block[
.small[
`$$\pi(\theta | y) = \textrm{Beta}(a+y,b+n-y).$$`
]
]

--

- The posterior has expectation
.block[
.small[
`$$\mathbb{E}(\theta | y) = \dfrac{a+y}{a+b+n} = \dfrac{a+b}{a+b+n} \times \textrm{prior mean} + \dfrac{n}{a+b+n} \times \textrm{sample mean}.$$`
]
]

--

- For this specification, **sometimes `\(a\)` and `\(b\)` are interpreted as "prior data", with `\(a\)` interpreted as the prior number of 1's, `\(b\)` as the prior number of 0's, and `\(a + b\)` as the prior sample size.**

--

- As we get more and more data, the majority of our information about `\(\theta\)` comes from the data as opposed to the prior.

---

## Binomial data

- For example, suppose you want to find the Bayesian estimate of the probability `\(\theta\)` that a coin comes up heads.

--

- Before you see the data, you express your uncertainty about `\(\theta\)` through the prior `\(p(\theta) = \textrm{Beta}(2,2)\)`.

--

- Now suppose you observe 10 tosses, of which only 1 was heads.

--

- Then, the posterior density `\(p(\theta \,|\, y)\)` is `\(\mbox{Beta}(3, 11)\)`.

---

## Binomial data

- Recall that the mean of `\(\mbox{Beta}(a,b)\)` is `\(\frac{a}{a+b}\)`.

--

- So, before you saw the data, you thought the mean for `\(\theta\)` was `\(\frac{2}{2+2} = 0.50\)`.

--

- However, after seeing the data, you believe it is `\(\frac{3}{3+11} \approx 0.21\)`.

--

- The variance of `\(\mbox{Beta}(a,b)\)` is `\(\frac{ab}{(a+b)^2(a+b+1)}\)`.

--

- So, before you saw the data, your uncertainty about `\(\theta\)`, in terms of the standard deviation, was `\(\sqrt{\frac{4}{4^2 \times 5}} \approx 0.22\)`.

--

- However, after seeing 1 head in 10 tosses, your standard deviation gets updated to `\(\sqrt{\frac{33}{14^2 \times 15}} \approx 0.11\)`.

--

- Clearly, as the number of tosses goes to infinity, your uncertainty goes to zero (see the R sketch at the end of these slides for a quick numerical check).

---

class: center, middle

# What's next?

### Move on to the readings for the next module!
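
---

## Aside: checking the updates in R

- A minimal R sketch to verify the posterior parameters, mean, and standard deviation from the coin example; the helper function `beta_binomial_update` is just one illustrative way to organize the calculation, not code from the course materials.

```r
# Beta-binomial update: Beta(a, b) prior, y successes observed in n trials.
# Returns the posterior parameters along with the posterior mean and sd.
beta_binomial_update <- function(a, b, y, n) {
  a_post <- a + y
  b_post <- b + n - y
  c(a    = a_post,
    b    = b_post,
    mean = a_post / (a_post + b_post),
    sd   = sqrt(a_post * b_post /
                ((a_post + b_post)^2 * (a_post + b_post + 1))))
}

# Coin example from the slides: Beta(2, 2) prior, 1 head in 10 tosses.
beta_binomial_update(a = 2, b = 2, y = 1, n = 10)
# a = 3, b = 11, mean ~ 0.214, sd ~ 0.106

# With more tosses (same observed proportion of heads), the posterior sd shrinks.
beta_binomial_update(a = 2, b = 2, y = 10,  n = 100)
beta_binomial_update(a = 2, b = 2, y = 100, n = 1000)
```

- The first call reproduces the `\(\textrm{Beta}(3,11)\)` posterior with mean `\(\approx 0.21\)` and standard deviation `\(\approx 0.11\)`; the larger hypothetical samples show the uncertainty shrinking toward zero as the number of tosses grows.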