For a location-scale mixture of univariate normals, we can specify
y_i | z_i \sim \mathcal{N}\left( \mu_{z_i}, \sigma^2_{z_i} \right), and
\Pr(z_i = k) = \lambda_k \equiv \prod\limits_{k=1}^K \lambda_k^{\mathbb{1}[z_i = k]}.
Priors:
\pi[\boldsymbol{\lambda}] = \textrm{Dirichlet}(a_1,\ldots,a_K),
\mu_k \sim \mathcal{N}(\mu_0,\gamma^2_0), for each k = 1, \ldots, K, and
\sigma^2_k \sim \mathcal{IG}\left(\dfrac{\nu_0}{2}, \dfrac{\nu_0 \sigma_0^2}{2}\right), for each k = 1, \ldots, K.
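As a quick check of the generative process, the two-stage specification above can be simulated directly: draw each label z_i from the mixture weights, then draw y_i given its label. A minimal sketch (the component values below are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative K = 3 mixture: weights lambda_k, means mu_k, variances sigma2_k.
lam = np.array([0.5, 0.3, 0.2])
mu = np.array([-3.0, 0.0, 4.0])
sigma2 = np.array([1.0, 0.5, 2.0])

n = 10_000
# First draw the latent labels z_i with Pr(z_i = k) = lambda_k ...
z = rng.choice(len(lam), size=n, p=lam)
# ... then draw y_i | z_i ~ N(mu_{z_i}, sigma2_{z_i}).
y = rng.normal(mu[z], np.sqrt(sigma2[z]))

# Sanity check: the sample mean should be close to sum_k lambda_k * mu_k.
print(y.mean(), lam @ mu)
```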
It is relatively easy to extend this to the multivariate case.
As with the univariate case, given a sufficiently large number of mixture components, a scale-location multivariate normal mixture model can be used to approximate any multivariate density.
We have \begin{split} \textbf{y}_i & \overset{iid}{\sim} \sum\limits_{k = 1}^K \lambda_k \cdot \mathcal{N}_p(\boldsymbol{\mu}_k, \Sigma_k) \end{split}
Or equivalently, \begin{split} \textbf{y}_i | z_i, \boldsymbol{\mu}_{z_i}, \Sigma_{z_i} & \sim \mathcal{N}_p(\boldsymbol{\mu}_{z_i}, \Sigma_{z_i})\\ \Pr(z_i = k) & = \lambda_k \equiv \prod\limits_{k=1}^K \lambda_k^{\mathbb{1}[z_i = k]}\\ \end{split}
We can then specify priors as \begin{split} \pi(\boldsymbol{\mu}_k) & = \mathcal{N}_p\left(\boldsymbol{\mu}_0, \Lambda_0 \right) \ \ \ \ \text{for } k = 1, \ldots, K; \\ \\ \pi(\Sigma_k) & = \mathcal{IW}_p\left(\nu_0, S_0\right) \ \ \ \ \text{for } k = 1, \ldots, K; \\ \\ \pi[\boldsymbol{\lambda}] & = \textrm{Dirichlet}(a_1,\ldots,a_K).\\ \end{split}
We can also just use the conjugate option for \pi(\boldsymbol{\mu}_k, \Sigma_k) to avoid specifying \Lambda_0, so that we have \begin{split} \pi(\boldsymbol{\mu}_k, \Sigma_k) & = \pi(\boldsymbol{\mu}_k | \Sigma_k) \cdot \pi(\Sigma_k)\\ & = \mathcal{N}_p\left(\boldsymbol{\mu}_0, \frac{1}{\kappa_0}\Sigma_k\right) \cdot \mathcal{IW}_p\left(\nu_0, S_0\right) \ \ \ \ \text{for } k = 1, \ldots, K; \\ \\ \pi[\boldsymbol{\lambda}] & = \textrm{Dirichlet}(a_1,\ldots,a_K).\\ \end{split}
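Under the conjugate option, a joint prior draw of (\boldsymbol{\mu}_k, \Sigma_k) factors the same way as the prior itself: first \Sigma_k from the inverse-Wishart, then \boldsymbol{\mu}_k given \Sigma_k. A sketch using scipy, with illustrative hyperparameter values:

```python
import numpy as np
from scipy.stats import invwishart

rng = np.random.default_rng(3)

p = 2
mu0 = np.zeros(p)
kappa0 = 1.0
nu0 = p + 2          # df must exceed p - 1 for the inverse-Wishart to be proper
S0 = np.eye(p)

# Sigma_k ~ IW_p(nu_0, S_0), then mu_k | Sigma_k ~ N_p(mu_0, Sigma_k / kappa_0).
Sigma_k = invwishart.rvs(df=nu0, scale=S0, random_state=rng)
mu_k = rng.multivariate_normal(mu0, Sigma_k / kappa0)

print(Sigma_k.shape, mu_k.shape)  # (2, 2) (2,)
```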
The Gibbs sampler for each option follows directly from what we have covered so far.
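To make the claim concrete, here is one possible full scan of the sampler for the univariate case, cycling through the full conditionals for z, \mu_k, \sigma^2_k, and \boldsymbol{\lambda}. This is a sketch, not a definitive implementation; the data and hyperparameter values are illustrative, and the hyperparameter names (mu_0, gamma2_0, nu_0, sigma2_0, a_k) follow the slides:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data from a well-separated 2-component mixture (illustrative).
y = np.concatenate([rng.normal(-2, 1, 150), rng.normal(3, 1, 100)])
n, K = len(y), 2

# Hyperparameters, as on the slides.
mu0, gamma2_0 = 0.0, 10.0
nu0, sigma2_0 = 1.0, 1.0
a = np.ones(K)

# Initial values.
mu = rng.normal(size=K)
sigma2 = np.ones(K)
lam = np.full(K, 1.0 / K)

for it in range(1000):
    # 1. z_i | ... : categorical, prob. proportional to lambda_k * N(y_i; mu_k, sigma2_k).
    logp = np.log(lam) - 0.5 * np.log(sigma2) - 0.5 * (y[:, None] - mu) ** 2 / sigma2
    p = np.exp(logp - logp.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    u = rng.random(n)
    z = (p.cumsum(axis=1) < u[:, None]).sum(axis=1)

    for k in range(K):
        yk = y[z == k]
        nk = len(yk)
        # 2. mu_k | ... : normal, precision-weighted conjugate update.
        prec = 1 / gamma2_0 + nk / sigma2[k]
        m = (mu0 / gamma2_0 + yk.sum() / sigma2[k]) / prec
        mu[k] = rng.normal(m, np.sqrt(1 / prec))
        # 3. sigma2_k | ... : inverse-gamma update (sample a gamma, invert).
        shape = (nu0 + nk) / 2
        scale = (nu0 * sigma2_0 + ((yk - mu[k]) ** 2).sum()) / 2
        sigma2[k] = 1 / rng.gamma(shape, 1 / scale)

    # 4. lambda | ... : Dirichlet(a_k + n_k).
    nk_all = np.bincount(z, minlength=K)
    lam = rng.dirichlet(a + nk_all)

print(np.sort(mu))  # should be near (-2, 3), up to label switching
```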
To avoid label switching when fitting the model, we can constrain the order of the \boldsymbol{\mu}_k's.
Here are three of many approaches:
1. Constrain the prior on the \boldsymbol{\mu}_k's to be \boldsymbol{\mu}_k | \boldsymbol{\Sigma}_k \sim \mathcal{N}_p(\boldsymbol{\mu}_0, \frac{1}{\kappa_0}\Sigma_k ) subject to the element-wise ordering \boldsymbol{\mu}_{k-1} < \boldsymbol{\mu}_k < \boldsymbol{\mu}_{k+1}, which does not always seem reasonable.
2. Relax option 1 above to order only the first component of the mean vectors: \boldsymbol{\mu}_k | \boldsymbol{\Sigma}_k \sim \mathcal{N}_p(\boldsymbol{\mu}_0, \frac{1}{\kappa_0}\Sigma_k ) subject to {\mu}_{1,k-1} < {\mu}_{1,k} < {\mu}_{1,k+1}.
3. Try an ad-hoc fix: after sampling the \boldsymbol{\mu}_k's, rearrange the labels to satisfy {\mu}_{1,k-1} < {\mu}_{1,k} < {\mu}_{1,k+1} and reassign the labels on \boldsymbol{\Sigma}_k accordingly.
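Option 3 amounts to a sort within each Gibbs iteration. A minimal sketch of that relabeling step, assuming current-iteration draws `mu` (K x p), `Sigma` (K x p x p), and `lam` (K,) — names chosen here for illustration:

```python
import numpy as np

def relabel(mu, Sigma, lam):
    """Reorder components so that the first coordinates of the mean
    vectors are increasing, permuting Sigma and lam to match.
    mu is (K, p), Sigma is (K, p, p), lam is (K,)."""
    order = np.argsort(mu[:, 0])
    return mu[order], Sigma[order], lam[order]

# Example: three 2-d components whose labels are out of order.
mu = np.array([[2.0, 0.0], [-1.0, 1.0], [0.5, -2.0]])
Sigma = np.stack([np.eye(2)] * 3)
lam = np.array([0.2, 0.5, 0.3])

mu, Sigma, lam = relabel(mu, Sigma, lam)
print(mu[:, 0])  # first coordinates are now increasing: -1.0, 0.5, 2.0
```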
To avoid setting K a priori, we can extend this finite mixture of normals to a Dirichlet process (DP) mixture of normals.
The first level of the model remains the same. That is, \begin{split} \textbf{y}_i | z_i, \boldsymbol{\mu}_{z_i}, \Sigma_{z_i} & \sim \mathcal{N}_p(\boldsymbol{\mu}_{z_i}, \Sigma_{z_i}) \ \ \ \ \text{for each }i;\\ \\ \pi(\boldsymbol{\mu}_k, \Sigma_k) & = \pi(\boldsymbol{\mu}_k | \Sigma_k) \cdot \pi(\Sigma_k)\\ \\ & = \mathcal{N}_p\left(\boldsymbol{\mu}_0, \frac{1}{\kappa_0}\Sigma_k\right) \cdot \mathcal{IW}_p\left(\nu_0, S_0\right) \ \ \ \ \text{for each } k.\\ \end{split}
For the prior on \boldsymbol{\lambda} = (\lambda_1,\ldots,\lambda_K), use the following stick breaking representation of the Dirichlet process. \begin{split} P(z_i = k) & = \lambda_k;\\ \lambda_k & = V_k \prod\limits_{l < k}^{} (1 - V_l) \ \ \text{for} \ \ k = 1, \ldots, \infty;\\ V_k & \overset{iid}{\sim} \text{Beta}(1, \alpha);\\ \alpha & \sim \text{Gamma}(a, b).\\ \end{split}
As an approximation, use \lambda_k = V_k \prod\limits_{l < k}^{} (1 - V_l) \ \ \textrm{for} \ \ k = 1, \ldots, K^{\star} with K^{\star} set to be as large as possible!
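The truncated stick-breaking construction is easy to sample directly: break off a Beta(1, \alpha) fraction of the remaining stick at each step, and give the last component all remaining mass so the weights sum to 1. A sketch with illustrative values of \alpha and K^{\star}:

```python
import numpy as np

rng = np.random.default_rng(2)

alpha, Kstar = 1.0, 25  # illustrative concentration and truncation level

# V_k ~ Beta(1, alpha) for k = 1, ..., K*.
V = rng.beta(1.0, alpha, size=Kstar)
# Truncation: the last stick takes all remaining mass,
# so the lambda_k sum to 1 (up to float error).
V[-1] = 1.0

# lambda_k = V_k * prod_{l < k} (1 - V_l).
remaining = np.concatenate([[1.0], np.cumprod(1 - V)[:-1]])
lam = V * remaining

print(lam.sum())           # 1, up to float error
print((lam > 0.01).sum())  # only a handful of components get appreciable weight
```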
This specification encourages the model to use only as many components as needed, and usually no more. The Gibbs sampler is also relatively straightforward.
Other details are beyond the scope of this course, but I am happy to provide resources for those interested!