class: center, middle, inverse, title-slide

# STA 360/602L: Module 8.3

## Finite mixture models: univariate continuous data

### Dr. Olanrewaju Michael Akande

---

## Continuous data -- univariate case

- Suppose we have univariate continuous data `\(y_i \overset{iid}{\sim} f\)`, for `\(i = 1, \ldots, n\)`, where `\(f\)` is an unknown density.

--

- It turns out that we can approximate "almost" any `\(f\)` with a .hlight[mixture of normals]. The usual choices are

--

1. .hlight[Location mixture] (multimodal):
.block[
.small[
`$$f(y) = \sum_{k=1}^K \lambda_k \mathcal{N}\left( \mu_k, \sigma^2 \right)$$`
]
]

--

2. .hlight[Scale mixture] (unimodal and symmetric about the mean, but with fatter tails than a regular normal distribution):
.block[
.small[
`$$f(y) = \sum_{k=1}^K \lambda_k \mathcal{N}\left( \mu, \sigma^2_k \right)$$`
]
]

--

3. .hlight[Location-scale mixture] (multimodal with potentially fat tails):
.block[
.small[
`$$f(y) = \sum_{k=1}^K \lambda_k \mathcal{N}\left( \mu_k, \sigma^2_k \right)$$`
]
]

---

## Location mixture example

`$$f(y) = 0.55 \mathcal{N}\left(-10, 4 \right) + 0.30 \mathcal{N}\left(0, 4 \right) + 0.15 \mathcal{N}\left(10, 4 \right)$$`

<img src="8-3-finite-mixture-univariate-continuous_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />

---

## Scale mixture example

`$$f(y) = 0.55 \mathcal{N}\left(0, 1 \right) + 0.30 \mathcal{N}\left(0, 5 \right) + 0.15 \mathcal{N}\left(0, 10 \right)$$`

<img src="8-3-finite-mixture-univariate-continuous_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />

---

## Location-scale mixture example

`$$f(y) = 0.55 \mathcal{N}\left(-10, 1 \right) + 0.30 \mathcal{N}\left(0, 5 \right) + 0.15 \mathcal{N}\left(10, 10 \right)$$`

<img src="8-3-finite-mixture-univariate-continuous_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />
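
---

## Mixture densities in R: illustrative sketch

- The chunk below is not part of the original module; it is a minimal sketch of how the location mixture example above could be evaluated and simulated in R, using the weights, means, and common variance from that slide. All object names are illustrative.

```r
# Location mixture f(y) = 0.55 N(-10, 4) + 0.30 N(0, 4) + 0.15 N(10, 4)
lambda <- c(0.55, 0.30, 0.15)  # mixture weights
mu     <- c(-10, 0, 10)        # component means
sigma  <- 2                    # common sd, so sigma^2 = 4

# evaluate the mixture density on a grid of y values
y_grid <- seq(-20, 20, length.out = 500)
f_y    <- sapply(y_grid, function(y) sum(lambda * dnorm(y, mean = mu, sd = sigma)))

# simulate: first draw a component label, then draw from that normal component
n <- 1000
z <- sample(1:3, n, replace = TRUE, prob = lambda)
y <- rnorm(n, mean = mu[z], sd = sigma)

plot(y_grid, f_y, type = "l")  # compare with the location mixture figure
```

- Using a common mean with component-specific variances (or letting both vary) gives the scale and location-scale mixtures in the same way.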
---

## Location mixture of normals

- Consider the location mixture `\(f(y) = \sum_{k=1}^K \lambda_k \mathcal{N}\left( \mu_k, \sigma^2 \right)\)`. How can we do inference?

--

- Right now, we only have three unknowns: `\(\boldsymbol{\lambda} = (\lambda_1, \ldots, \lambda_K)\)`, `\(\boldsymbol{\mu} = (\mu_1, \ldots, \mu_K)\)`, and `\(\sigma^2\)`.

--

- For priors, the most obvious choices are
  + `\(\pi[\boldsymbol{\lambda}] = \textrm{Dirichlet}(\alpha_1,\ldots,\alpha_K)\)`,
  + `\(\mu_k \sim \mathcal{N}(\mu_0,\gamma^2_0)\)`, for each `\(k = 1, \ldots, K\)`, and
  + `\(\sigma^2 \sim \mathcal{IG}\left(\dfrac{\nu_0}{2}, \dfrac{\nu_0 \sigma_0^2}{2}\right)\)`.

--

- However, we do not want to work with the likelihood that contains the sum over mixture components. We prefer products!

---

## Data augmentation

- This brings us to the concept of .hlight[data augmentation], which we actually already used in the mixture of multinomials.

--

- Data augmentation is a commonly used technique for designing MCMC samplers using .hlight[auxiliary/latent/hidden variables]. Again, we have already seen this.

--

- **Idea**: introduce a variable `\(Z\)` that depends on the distribution of the existing variables in such a way that the resulting conditional distributions, with `\(Z\)` included, are easier to sample from and/or result in better mixing.

--

- The `\(Z\)`'s are just latent/hidden variables that are introduced for the purpose of simplifying/improving the sampler.

---

## Data augmentation

- For example, suppose we want to sample from `\(p(x,y)\)`, but `\(p(x|y)\)` and/or `\(p(y|x)\)` are complicated.

--

- Choose `\(p(z|x,y)\)` such that `\(p(x|y,z)\)`, `\(p(y|x,z)\)`, and `\(p(z|x,y)\)` are easy to sample from. Note that we have `\(p(x,y,z) = p(z|x,y)p(x,y)\)`.

--

- Alternatively, rewrite the model as `\(p(x,y | z)\)` and specify `\(p(z)\)` such that
`$$p(x,y) = \int p(x, y | z) p(z) \text{d}z,$$`
where the resulting `\(p(x|y,z)\)`, `\(p(y|x,z)\)`, and `\(p(z|x,y)\)` from the joint `\(p(x,y,z)\)` are again easy to sample from.

--

- Next, construct a Gibbs sampler to sample all three variables `\((X,Y,Z)\)` from `\(p(x,y,z)\)`.

--

- Finally, throw away the sampled `\(Z\)`'s; from what we know about Gibbs sampling, the retained samples `\((X,Y)\)` are draws from the desired `\(p(x,y)\)`.

---

## Location mixture of normals

- Back to the location mixture `\(f(y) = \sum_{k=1}^K \lambda_k \mathcal{N}\left( \mu_k, \sigma^2 \right)\)`.

--

- Introduce a latent variable `\(z_i \in \{1,\ldots, K\}\)` indicating the component observation `\(i\)` comes from.

--

- Then, we have
  + `\(y_i | z_i \sim \mathcal{N}\left( \mu_{z_i}, \sigma^2 \right)\)`, and
  + `\(\Pr(z_i = k) = \lambda_k\)`, that is, `\(p(z_i | \boldsymbol{\lambda}) = \prod\limits_{k=1}^K \lambda_k^{\mathbb{1}[z_i = k]}\)`.

--

- How does that help? Well, the likelihood, conditional on the `\(z_i\)`'s, is now
.block[
.small[
$$
`\begin{split}
p\left[Y = (y_1, \ldots, y_n) | Z = (z_1, \ldots, z_n), \boldsymbol{\lambda}, \boldsymbol{\mu}, \sigma^2 \right] & = \prod_{i=1}^n p\left(y_i | z_i, \mu_{z_i}, \sigma^2 \right)\\
\\
& = \prod_{i=1}^n \dfrac{1}{\sqrt{2\pi\sigma^2}} \ \textrm{exp}\left\{-\frac{1}{2\sigma^2} (y_i-\mu_{z_i})^2\right\}\\
\end{split}`
$$
]
]
which is much easier to work with.

---

## Posterior inference

- The joint posterior is
.block[
.small[
$$
`\begin{split}
\pi\left(Z, \boldsymbol{\mu}, \sigma^2, \boldsymbol{\lambda} | Y \right) & \propto \left[ \prod_{i=1}^n p\left(y_i | z_i, \mu_{z_i}, \sigma^2 \right) \right] \cdot \Pr(Z| \boldsymbol{\mu}, \sigma^2, \boldsymbol{\lambda}) \cdot \pi(\boldsymbol{\mu}, \sigma^2, \boldsymbol{\lambda}) \\
\\
& \propto \left[ \prod_{i=1}^n p\left(y_i | z_i, \mu_{z_i}, \sigma^2 \right) \right] \cdot \Pr(Z| \boldsymbol{\lambda}) \cdot \pi(\boldsymbol{\lambda}) \cdot \pi(\boldsymbol{\mu}) \cdot \pi(\sigma^2) \\
\\
& \propto \left[ \prod_{i=1}^n \dfrac{1}{\sqrt{2\pi\sigma^2}} \ \textrm{exp}\left\{-\frac{1}{2\sigma^2} (y_i-\mu_{z_i})^2\right\} \right] \\
& \ \ \ \ \ \times \left[ \prod_{i=1}^n \prod\limits_{k=1}^K \lambda_k^{\mathbb{1}[z_i = k]} \right] \\
& \ \ \ \ \ \times \left[ \prod\limits_{k=1}^K \lambda_k^{\alpha_k-1} \right] \\
& \ \ \ \ \ \times \left[ \prod\limits_{k=1}^K \mathcal{N}(\mu_k; \mu_0,\gamma^2_0) \right] \\
& \ \ \ \ \ \times \left[ \mathcal{IG}\left(\sigma^2; \dfrac{\nu_0}{2}, \dfrac{\nu_0 \sigma_0^2}{2}\right) \right]. \\
\end{split}`
$$
]
]

---

## Full conditionals

- For `\(i = 1, \ldots, n\)`, sample `\(z_i \in \{1,\ldots, K\}\)` from a categorical distribution (multinomial distribution with sample size one) with probabilities
.block[
.small[
$$
`\begin{split}
\Pr[z_i = k | \dots ] & = \dfrac{ \Pr[y_i, z_i = k | \mu_k, \sigma^2, \lambda_k] }{ \sum\limits^K_{l=1} \Pr[y_i, z_i = l | \mu_l, \sigma^2, \lambda_l] } \\
\\
& = \dfrac{ \Pr[y_i | z_i = k, \mu_k, \sigma^2] \cdot \Pr[z_i = k| \lambda_k] }{ \sum\limits^K_{l=1} \Pr[y_i | z_i = l, \mu_l, \sigma^2] \cdot \Pr[z_i = l| \lambda_l] } \\
\\
& = \dfrac{ \lambda_k \cdot \mathcal{N}\left(y_i; \mu_k, \sigma^2 \right) }{ \sum\limits^K_{l=1} \lambda_l \cdot \mathcal{N}\left(y_i; \mu_l, \sigma^2 \right) }. \\
\end{split}`
$$
]
]

--

- Note that `\(\mathcal{N}\left(y_i; \mu_k, \sigma^2 \right)\)` just means evaluating the density of `\(\mathcal{N}\left(\mu_k, \sigma^2 \right)\)` at the value `\(y_i\)`.
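
---

## Sampling the cluster indicators: illustrative R sketch

- As an aside (not part of the original module), here is a minimal R sketch of this update; the function name `sample_z` and its arguments are illustrative, with `lambda`, `mu`, and `sigma2` holding the current values of `\(\boldsymbol{\lambda}\)`, `\(\boldsymbol{\mu}\)`, and `\(\sigma^2\)`.

```r
# one Gibbs update of the latent indicators z_1, ..., z_n
sample_z <- function(y, lambda, mu, sigma2) {
  K <- length(mu)
  # n x K matrix with entries lambda_k * N(y_i; mu_k, sigma^2)
  probs <- sapply(1:K, function(k) lambda[k] * dnorm(y, mean = mu[k], sd = sqrt(sigma2)))
  probs <- probs / rowSums(probs)  # normalize each row to get Pr[z_i = k | ...]
  # draw each z_i from its categorical full conditional
  apply(probs, 1, function(p) sample(1:K, 1, prob = p))
}
```

- The remaining full conditionals, for `\(\boldsymbol{\lambda}\)`, the `\(\mu_k\)`'s, and `\(\sigma^2\)`, are on the next slide.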
---

## Full conditionals

- Next, sample `\(\boldsymbol{\lambda} = (\lambda_1, \ldots, \lambda_K)\)` from
.block[
.small[
$$
`\begin{split}
\pi[\boldsymbol{\lambda} | \dots ] \equiv \textrm{Dirichlet}\left(\alpha_1 + n_1,\ldots,\alpha_K + n_K\right),\\
\end{split}`
$$
]
]
where `\(n_k = \sum\limits_{i=1}^n \mathbb{1}[z_i = k]\)`, the number of individuals assigned to cluster `\(k\)`.

--

- Sample the mean `\(\mu_k\)` for each cluster from
.block[
.small[
$$
`\begin{split}
\pi[\mu_k | \dots ] & \equiv \mathcal{N}(\mu_{k,n},\gamma^2_{k,n});\\
\gamma^2_{k,n} & = \dfrac{1}{ \dfrac{n_k}{\sigma^2} + \dfrac{1}{\gamma_0^2} }; \ \ \ \ \ \ \ \ \mu_{k,n} = \gamma^2_{k,n} \left[ \dfrac{n_k}{\sigma^2} \bar{y}_k + \dfrac{1}{\gamma_0^2} \mu_0 \right],
\end{split}`
$$
]
]
where `\(\bar{y}_k\)` is the mean of the `\(y_i\)`'s currently assigned to cluster `\(k\)`.

--

- Finally, sample `\(\sigma^2\)` from
.block[
.small[
$$
`\begin{split}
\pi(\sigma^2 | \dots ) & \equiv \mathcal{IG}\left(\dfrac{\nu_n}{2}, \dfrac{\nu_n\sigma_n^2}{2}\right);\\
\nu_n & = \nu_0 + n; \ \ \ \ \ \ \ \sigma_n^2 = \dfrac{1}{\nu_n} \left[ \nu_0 \sigma_0^2 + \sum^n_{i=1} (y_i - \mu_{z_i})^2 \right].\\
\end{split}`
$$
]
]

---

class: center, middle

# What's next?

### Move on to the readings for the next module!
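
---

## Parameter updates: illustrative R sketch

- As a companion to the full conditionals above (not part of the original module), here is a minimal sketch of one update of `\(\boldsymbol{\lambda}\)`, `\(\boldsymbol{\mu}\)`, and `\(\sigma^2\)` given the current indicators `\(z\)`; all names (`update_params`, `alpha`, `mu0`, `gamma20`, `nu0`, `sigma20`) are illustrative. Combined with the `\(z_i\)` update sketched earlier, this gives one full Gibbs scan.

```r
update_params <- function(y, z, K, alpha, mu0, gamma20, nu0, sigma20, sigma2) {
  n  <- length(y)
  nk <- tabulate(z, nbins = K)  # cluster sizes n_k

  # lambda | ... ~ Dirichlet(alpha_1 + n_1, ..., alpha_K + n_K), via normalized gammas
  g      <- rgamma(K, shape = alpha + nk, rate = 1)
  lambda <- g / sum(g)

  # mu_k | ... ~ N(mu_{k,n}, gamma^2_{k,n}), using the cluster means ybar_k
  ybar    <- sapply(1:K, function(k) if (nk[k] > 0) mean(y[z == k]) else 0)
  gamma2n <- 1 / (nk / sigma2 + 1 / gamma20)
  mun     <- gamma2n * (nk / sigma2 * ybar + mu0 / gamma20)
  mu      <- rnorm(K, mean = mun, sd = sqrt(gamma2n))

  # sigma^2 | ... ~ IG(nu_n / 2, nu_n * sigma^2_n / 2), drawn as 1 / Gamma
  nun    <- nu0 + n
  ss     <- sum((y - mu[z])^2)
  sigma2 <- 1 / rgamma(1, shape = nun / 2, rate = (nu0 * sigma20 + ss) / 2)

  list(lambda = lambda, mu = mu, sigma2 = sigma2)
}
```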