class: center, middle, inverse, title-slide

# STA 360/602L: Module 8.6

## Finite mixture models: multivariate continuous data

### Dr. Olanrewaju Michael Akande

---

## Finite mixture of univariate normals (recap)

- For a location-scale mixture of univariate normals, we can specify
  + `\(y_i | z_i \sim \mathcal{N}\left( \mu_{z_i}, \sigma^2_{z_i} \right)\)`, and
  + `\(\Pr(z_i = k) = \lambda_k\)`, that is, `\(p(z_i) = \prod\limits_{k=1}^K \lambda_k^{\mathbb{1}[z_i = k]}\)`.

- Priors:
  + `\(\pi(\boldsymbol{\lambda}) = \textrm{Dirichlet}(a_1,\ldots,a_K)\)`,
  + `\(\mu_k \sim \mathcal{N}(\mu_0,\gamma^2_0)\)`, for each `\(k = 1, \ldots, K\)`, and
  + `\(\sigma^2_k \sim \mathcal{IG}\left(\dfrac{\nu_0}{2}, \dfrac{\nu_0 \sigma_0^2}{2}\right)\)`, for each `\(k = 1, \ldots, K\)`.

---

## Finite mixture of multivariate normals

- It is relatively easy to extend this to the multivariate case.

--

- As with the univariate case, given a sufficiently large number of mixture components, a location-scale multivariate normal mixture model can be used to approximate any multivariate density.

--

- We have
$$
`\begin{split}
\textbf{y}_i & \overset{iid}{\sim} \sum\limits_{k = 1}^K \lambda_k \cdot \mathcal{N}_p(\boldsymbol{\mu}_k, \Sigma_k)
\end{split}`
$$

--

- Or equivalently,
$$
`\begin{split}
\textbf{y}_i | z_i, \boldsymbol{\mu}_{z_i}, \Sigma_{z_i} & \sim \mathcal{N}_p(\boldsymbol{\mu}_{z_i}, \Sigma_{z_i})\\
\Pr(z_i = k) & = \lambda_k, \ \ \text{that is,} \ \ p(z_i) = \prod\limits_{k=1}^K \lambda_k^{\mathbb{1}[z_i = k]}.\\
\end{split}`
$$

---

## Posterior inference

- We can then specify priors as
$$
`\begin{split}
\pi(\boldsymbol{\mu}_k) & = \mathcal{N}_p\left(\boldsymbol{\mu}_0, \Lambda_0 \right) \ \ \ \ \text{for } k = 1, \ldots, K; \\
\\
\pi(\Sigma_k) & = \mathcal{IW}_p\left(\nu_0, S_0\right) \ \ \ \ \text{for } k = 1, \ldots, K; \\
\\
\pi(\boldsymbol{\lambda}) & = \textrm{Dirichlet}(a_1,\ldots,a_K).\\
\end{split}`
$$

--

- We can also use the conjugate option for `\(\pi(\boldsymbol{\mu}_k, \Sigma_k)\)` to avoid specifying `\(\Lambda_0\)`, so that we have
$$
`\begin{split}
\pi(\boldsymbol{\mu}_k, \Sigma_k) & = \pi(\boldsymbol{\mu}_k | \Sigma_k) \cdot \pi(\Sigma_k)\\
& = \mathcal{N}_p\left(\boldsymbol{\mu}_0, \frac{1}{\kappa_0}\Sigma_k\right) \cdot \mathcal{IW}_p\left(\nu_0, S_0\right) \ \ \ \ \text{for } k = 1, \ldots, K; \\
\\
\pi(\boldsymbol{\lambda}) & = \textrm{Dirichlet}(a_1,\ldots,a_K).\\
\end{split}`
$$

--

- The Gibbs samplers for both options follow directly from what we have covered so far; a small sketch for the conjugate option follows the label-switching discussion.

---

## Label switching again

- To deal with label switching when fitting the model, we can constrain the order of the `\(\boldsymbol{\mu}_k\)`'s.

--

- Here are three of many approaches:

--

1. Constrain the prior on the `\(\boldsymbol{\mu}_k\)`'s to be
  `$$\boldsymbol{\mu}_k | \Sigma_k \sim \mathcal{N}_p\left(\boldsymbol{\mu}_0, \frac{1}{\kappa_0}\Sigma_k \right) \ \ \ \boldsymbol{\mu}_{k-1} < \boldsymbol{\mu}_k < \boldsymbol{\mu}_{k+1},$$`
  which does not always seem reasonable.

--

2. Relax option 1 above to order only the first component of the mean vectors:
  `$$\boldsymbol{\mu}_k | \Sigma_k \sim \mathcal{N}_p\left(\boldsymbol{\mu}_0, \frac{1}{\kappa_0}\Sigma_k \right) \ \ \ {\mu}_{1,k-1} < {\mu}_{1,k} < {\mu}_{1,k+1}.$$`

--

3. Try an ad-hoc fix: after sampling the `\(\boldsymbol{\mu}_k\)`'s, rearrange the labels to satisfy `\({\mu}_{1,k-1} < {\mu}_{1,k} < {\mu}_{1,k+1}\)` and reassign the labels on the `\(\Sigma_k\)`'s accordingly.
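
---

## Gibbs sampler sketch (conjugate option)

- Here is a minimal sketch in R of one Gibbs iteration for the conjugate option, including the ad-hoc relabeling fix from option 3. The function and argument names (`gibbs_step`, `mu` as a `\(K \times p\)` matrix, `Sigma` as a `\(p \times p \times K\)` array) are illustrative choices, and the `mvtnorm` package is assumed only for evaluating the multivariate normal density.

```r
# Minimal sketch of one Gibbs iteration for the finite mixture of multivariate
# normals with a conjugate normal-inverse-Wishart prior per component.
# y: n x p data matrix; z: length-n labels; lambda: length-K weights;
# mu: K x p matrix; Sigma: p x p x K array; a: Dirichlet hyperparameters;
# mu0 (length p), kappa0, nu0, S0 (p x p): prior hyperparameters.
library(mvtnorm)  # assumed available, for dmvnorm()

gibbs_step <- function(y, z, lambda, mu, Sigma, a, mu0, kappa0, nu0, S0) {
  n <- nrow(y); p <- ncol(y); K <- length(lambda)

  ## 1. Update labels: Pr(z_i = k | -) proportional to lambda_k * N_p(y_i; mu_k, Sigma_k)
  logprob <- sapply(1:K, function(k)
    log(lambda[k]) + dmvnorm(y, mean = mu[k, ], sigma = Sigma[, , k], log = TRUE))
  prob <- exp(logprob - apply(logprob, 1, max))   # stabilize before exponentiating
  prob <- prob / rowSums(prob)
  z <- apply(prob, 1, function(pr) sample(1:K, 1, prob = pr))

  ## 2. Update weights: lambda | z ~ Dirichlet(a_1 + n_1, ..., a_K + n_K)
  nk <- tabulate(z, nbins = K)
  g <- rgamma(K, shape = a + nk)
  lambda <- g / sum(g)

  ## 3. Update (mu_k, Sigma_k) from the normal-inverse-Wishart full conditionals
  for (k in 1:K) {
    if (nk[k] > 0) {
      yk <- y[z == k, , drop = FALSE]
      ybar <- colMeans(yk)
      Sk <- crossprod(sweep(yk, 2, ybar))           # within-component sum of squares
      kappan <- kappa0 + nk[k]; nun <- nu0 + nk[k]
      mun <- (kappa0 * mu0 + nk[k] * ybar) / kappan
      Sn <- S0 + Sk + (kappa0 * nk[k] / kappan) * tcrossprod(ybar - mu0)
    } else {                                        # empty component: draw from the prior
      kappan <- kappa0; nun <- nu0; mun <- mu0; Sn <- S0
    }
    # Sigma_k | - ~ IW(nu_n, S_n), drawn by inverting a Wishart(nu_n, S_n^{-1}) draw
    Sigma[, , k] <- solve(rWishart(1, df = nun, Sigma = solve(Sn))[, , 1])
    # mu_k | Sigma_k, - ~ N_p(mu_n, Sigma_k / kappa_n)
    mu[k, ] <- mun + t(chol(Sigma[, , k] / kappan)) %*% rnorm(p)
  }

  ## 4. Ad-hoc relabeling (option 3): order components by the first entry of mu_k
  ord <- order(mu[, 1])
  list(z = match(z, ord), lambda = lambda[ord],
       mu = mu[ord, , drop = FALSE], Sigma = Sigma[, , ord, drop = FALSE])
}
```

- In practice, you would call `gibbs_step()` repeatedly inside a loop over iterations and keep the draws after burn-in.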
---

## DP mixture of normals (teaser)

- To avoid setting `\(K\)` a priori, we can extend this finite mixture of normals to a .hlight[Dirichlet process (DP) mixture of normals].

--

- The first level of the model remains the same. That is,
$$
`\begin{split}
\textbf{y}_i | z_i, \boldsymbol{\mu}_{z_i}, \Sigma_{z_i} & \sim \mathcal{N}_p(\boldsymbol{\mu}_{z_i}, \Sigma_{z_i}) \ \ \ \ \text{for each }i;\\
\\
\pi(\boldsymbol{\mu}_k, \Sigma_k) & = \pi(\boldsymbol{\mu}_k | \Sigma_k) \cdot \pi(\Sigma_k)\\
\\
& = \mathcal{N}_p\left(\boldsymbol{\mu}_0, \frac{1}{\kappa_0}\Sigma_k\right) \cdot \mathcal{IW}_p\left(\nu_0, S_0\right) \ \ \ \ \text{for each } k.\\
\end{split}`
$$

---

## DP mixture of normals (teaser)

- For the prior on `\(\boldsymbol{\lambda} = (\lambda_1,\ldots,\lambda_K)\)`, use the following .hlight[stick-breaking representation of the Dirichlet process]:
$$
`\begin{split}
\Pr(z_i = k) & = \lambda_k;\\
\lambda_k & = V_k \prod\limits_{l < k} (1 - V_l) \ \ \text{for} \ \ k = 1, \ldots, \infty;\\
V_k & \overset{iid}{\sim} \text{Beta}(1, \alpha);\\
\alpha & \sim \text{Gamma}(a, b).\\
\end{split}`
$$

--

- As an approximation, use `\(\lambda_k = V_k \prod\limits_{l < k} (1 - V_l) \ \ \textrm{for} \ \ k = 1, \ldots, K^{\star}\)`, with the truncation level `\(K^{\star}\)` set to be as large as possible (a small sketch of these truncated weights appears on the appendix slide at the end).

--

- This specification forces the model to use only as many components as needed, and usually no more. Also, the Gibbs sampler is relatively straightforward.

--

- Other details are beyond the scope of this course, but I am happy to provide resources for those interested!

---

class: center, middle

# What's next?

### Well.........nothing!

### You made it to the end of this course.

### Hope you enjoyed the course and that you have learned a lot about Bayesian inference.
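
---

## Appendix: truncated stick-breaking weights (sketch)

- A minimal sketch in R of the truncated stick-breaking construction from the DP teaser slides. The values of `Kstar` and `alpha` are purely illustrative; in the full model, `\(\alpha\)` gets a Gamma prior rather than being fixed.

```r
# Draw truncated stick-breaking weights: lambda_k = V_k * prod_{l < k} (1 - V_l)
set.seed(360)
Kstar <- 30                                   # truncation level K* (illustrative)
alpha <- 1                                    # DP concentration parameter (fixed here for illustration)
V <- rbeta(Kstar, 1, alpha)                   # stick-breaking fractions V_k ~ Beta(1, alpha)
lambda <- V * cumprod(c(1, 1 - V[-Kstar]))    # weights lambda_1, ..., lambda_Kstar
round(lambda, 3)
sum(lambda)   # close to 1 for large Kstar; the leftover mass belongs to components beyond K*
```

- With `\(\alpha\)` small, the draws of `\(\lambda_k\)` decay quickly, which is why the model tends to use only a few components.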