First, we have $k$ Gaussian models, denoted $N(\mu_1, \sigma_1^2), N(\mu_2, \sigma_2^2), \dots, N(\mu_k, \sigma_k^2)$.
Data are sampled from the Gaussian $N(\mu_j, \sigma_j^2)$ in proportion $\phi_j$.
By definition, $\sum_{j=1}^{k} \phi_j = 1$, i.e., the proportions sum to 1.
$p(z^{(i)} = j) = \phi_j$ represents the probability of being in cluster $z = j$. With a hard assignment this is replaced by the indicator $1\{z^{(i)} = j\}$, which is 1 if $x^{(i)}$ belongs to cluster $j$ and 0 otherwise.
$p(x^{(i)} \mid z^{(i)} = j) = N(x^{(i)}; \mu_j, \sigma_j^2)$ represents the probability of obtaining $x^{(i)}$ from Gaussian distribution $j$.
Each $x^{(i)}$ thus corresponds to $k$ such values $z_j^{(i)}$, one per component $j$.
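As a concrete illustration of this generative story, here is a minimal sketch in NumPy that samples $n$ points from a one-dimensional mixture of $k$ Gaussians (the specific values of `phi`, `mu`, and `sigma` below are illustrative, not taken from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative mixture parameters (assumed, not from the text)
phi = np.array([0.5, 0.3, 0.2])      # mixing proportions phi_j, sum to 1
mu = np.array([-2.0, 0.0, 3.0])      # component means mu_j
sigma = np.array([0.5, 1.0, 0.8])    # component standard deviations sigma_j

n = 1000
# z^(i) ~ Multinomial(phi): pick which Gaussian each point comes from
z = rng.choice(len(phi), size=n, p=phi)
# x^(i) | z^(i) = j ~ N(mu_j, sigma_j^2)
x = rng.normal(loc=mu[z], scale=sigma[z])
```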
Basic parameters and distributions
input: $$ \{x^{(1)}, \dots, x^{(n)}\} $$
latent variable:
$$ \{z^{(1)}, \dots, z^{(n)}\}, \qquad z^{(i)} \in \{1, 2, \dots, k\} $$
In other words, each $z^{(i)}$ corresponds to one $x^{(i)}$. $z^{(i)}$ follows a multinomial distribution over the $k$ components: it indicates which of the $k$ Gaussians generated $x^{(i)}$, and the data points belonging to the different values of $z$, taken together, make up the whole data set $x$.
joint distribution
$$ p(x^{(i)}, z^{(i)}) = p(x^{(i)} \mid z^{(i)})\, p(z^{(i)}) $$
Where $z^{(i)} \sim \mathrm{Multinomial}(\phi)$, a multinomial distribution with $\phi_j \ge 0$ and $\sum_{j=1}^{k} \phi_j = 1$.
$$ x^{(i)} \mid z^{(i)} = j \sim N(\mu_j, \Sigma_j) $$
$z^{(i)}$ indicates which of the $k$ Gaussians each $x^{(i)}$ comes from.
$p(z)$ represents the probability of choosing cluster $z$; the density of the mixture model is obtained by summing $p(x \mid z)\, p(z)$ over all clusters.
$p(z^{(i)} = j)$ represents the probability of being in cluster $z = j$.
Here, $i$ denotes the index of the data, $x$ represents the data itself, $j$ denotes the index of the cluster, and $z$ represents the cluster.
The parameter $\phi_j$ gives $p(z^{(i)} = j)$, and $x^{(i)} \mid z^{(i)} = j \sim N(\mu_j, \Sigma_j)$, meaning that $x \mid z$ follows a particular Gaussian distribution, where $N(\cdot)$ denotes the Gaussian distribution.
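Putting these pieces together, the density of the mixture model is the marginal of $x$, obtained by summing the joint distribution over all $k$ components:

$$ p(x^{(i)}; \phi, \mu, \Sigma) = \sum_{j=1}^{k} p(x^{(i)} \mid z^{(i)} = j; \mu, \Sigma)\, p(z^{(i)} = j; \phi) = \sum_{j=1}^{k} \phi_j\, N(x^{(i)}; \mu_j, \Sigma_j) $$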
To address (1), we introduce $z$ and break the probability into a product of two factors, giving (2).
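Written out (a sketch based on the definitions above), this decomposition of the log-likelihood reads:

$$ \ell(\phi, \mu, \Sigma) = \sum_{i=1}^{n} \log p(x^{(i)}; \phi, \mu, \Sigma) = \sum_{i=1}^{n} \log \sum_{j=1}^{k} p(x^{(i)} \mid z^{(i)} = j; \mu, \Sigma)\, p(z^{(i)} = j; \phi) $$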
In this notation, what stands to the left of the ";" is a random variable and what stands to the right is a parameter, while "," simply separates items. In $x \mid z; \mu, \Sigma$, the quantity to the right of "|" is the conditioning event (here, the event $z = j$), and the quantity on the left, $x$, is the variable whose probability is evaluated under that condition.
If $z^{(i)}$ is known, that is, if we know which Gaussian distribution each $x^{(i)}$ comes from (put differently, when the component assignments are fixed), the log-likelihood simplifies to:

$$ \ell(\phi, \mu, \Sigma) = \sum_{i=1}^{n} \left[ \log p(x^{(i)} \mid z^{(i)}; \mu, \Sigma) + \log p(z^{(i)}; \phi) \right] \tag{3} $$
Maximizing (3) by taking partial derivatives and setting them to zero yields the following estimates (at this point, $j$ is fixed):

$$ \phi_j = \frac{1}{n} \sum_{i=1}^{n} 1\{z^{(i)} = j\}, \qquad \mu_j = \frac{\sum_{i=1}^{n} 1\{z^{(i)} = j\}\, x^{(i)}}{\sum_{i=1}^{n} 1\{z^{(i)} = j\}}, \qquad \Sigma_j = \frac{\sum_{i=1}^{n} 1\{z^{(i)} = j\}\, (x^{(i)} - \mu_j)(x^{(i)} - \mu_j)^{T}}{\sum_{i=1}^{n} 1\{z^{(i)} = j\}} $$
From the figure above, it can be seen that $z^{(i)}$ plays the role of a class label. Therefore, once $z^{(i)}$ is known, everything else can be derived. Since $z^{(i)}$ is not observed, we use the Expectation-Maximization (EM) method to estimate it.
EM (Expectation-Maximization) algorithm for density estimation
(E-Step) Guessing $z^{(i)}$: soft cluster assignment
For each $i, j$, set

$$ w_j^{(i)} := p(z^{(i)} = j \mid x^{(i)}; \phi, \mu, \Sigma) $$
This $w_j^{(i)}$ represents our guess for $z^{(i)}$. Because $z$ is the quantity we are trying to infer, it is placed on the left side of the "|".
$p(x^{(i)} \mid z^{(i)} = j; \mu, \Sigma)$ is the density of the Gaussian with mean $\mu_j$ and covariance $\Sigma_j$ evaluated at $x^{(i)}$: the probability of obtaining $x^{(i)}$ if it comes from Gaussian distribution $j$.
$p(z^{(i)} = j; \phi) = \phi_j$ represents the probability of coming from Gaussian model $j$.
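Combining these two quantities with Bayes' rule gives the explicit form of the E-step weights:

$$ w_j^{(i)} = \frac{p(x^{(i)} \mid z^{(i)} = j; \mu, \Sigma)\, \phi_j}{\sum_{l=1}^{k} p(x^{(i)} \mid z^{(i)} = l; \mu, \Sigma)\, \phi_l} $$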
(M-Step) Update the model's parameters based on the guess (maximum likelihood estimation, MLE), using the soft assignments.
Update the parameters:

$$ \phi_j := \frac{1}{n} \sum_{i=1}^{n} w_j^{(i)}, \qquad \mu_j := \frac{\sum_{i=1}^{n} w_j^{(i)}\, x^{(i)}}{\sum_{i=1}^{n} w_j^{(i)}}, \qquad \Sigma_j := \frac{\sum_{i=1}^{n} w_j^{(i)}\, (x^{(i)} - \mu_j)(x^{(i)} - \mu_j)^{T}}{\sum_{i=1}^{n} w_j^{(i)}} $$
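The sketch below puts the E-step and this M-step together for one iteration in the one-dimensional case, assuming NumPy and SciPy are available; the function name `em_step` and its argument names are illustrative, not from the notes:

```python
import numpy as np
from scipy.stats import norm

def em_step(x, phi, mu, sigma):
    """One EM iteration for a 1-D Gaussian mixture.

    x: (n,) data; phi: (k,) mixing proportions;
    mu, sigma: (k,) component means and standard deviations.
    """
    # E-step: responsibilities w[i, j] = p(z_i = j | x_i) via Bayes' rule
    densities = norm.pdf(x[:, None], loc=mu[None, :], scale=sigma[None, :])  # (n, k)
    joint = densities * phi[None, :]
    w = joint / joint.sum(axis=1, keepdims=True)

    # M-step: weighted maximum-likelihood updates
    nk = w.sum(axis=0)                      # effective number of points per component
    phi_new = nk / len(x)
    mu_new = (w * x[:, None]).sum(axis=0) / nk
    var_new = (w * (x[:, None] - mu_new[None, :]) ** 2).sum(axis=0) / nk
    return phi_new, mu_new, np.sqrt(var_new)
```

Iterating `em_step` from an initial guess until the log-likelihood stops improving gives the usual EM fit.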
As $\sigma^2 \to 0$, the algorithm reduces to k-means.
From a probabilistic perspective, a Gaussian density approaches an impulse (delta) function as $\sigma^2$ approaches zero: only positions near the center have significant values, and the rest become very small. Thus the density of a point is high under one specific Gaussian and very low under the others. Therefore, by Bayes' formula, the weights $w$ for that point are large only for that one Gaussian, while the others tend towards zero.
From a clustering perspective, taking $\Sigma_j = \sigma^2 I$ and letting $\sigma$ approach zero means the ellipsoidal clusters of the GMM become the spherical (circular) clusters of k-means; in other words, ellipsoids become spheres.
Comparison with k-means
Instead of using the hard cluster assignment $c^{(i)}$, we use the soft assignments $w_j^{(i)}$.
In the code, $w$ is an $m \times k$ matrix containing the probability of $x^{(i)}$ belonging to Gaussian distribution $j$. To convert this into a hard clustering, for each $x^{(i)}$ we look across all $k$ Gaussians, find the one with the highest probability, assign $x^{(i)}$ to it, and set all other probabilities to 0. The matrix is then reduced to an $m \times 1$ vector.
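A minimal sketch of that conversion, assuming $w$ is a NumPy array of shape $(m, k)$; the example values below are chosen purely for illustration:

```python
import numpy as np

# Example soft-assignment matrix w (m=4 points, k=3 Gaussians); rows sum to 1
w = np.array([[0.70, 0.20, 0.10],
              [0.10, 0.80, 0.10],
              [0.30, 0.30, 0.40],
              [0.05, 0.15, 0.80]])

# Index of the most probable Gaussian per point: the (m,) hard assignment vector
hard = np.argmax(w, axis=1)

# Equivalent one-hot view: 1 for the chosen component, 0 elsewhere
one_hot = np.zeros_like(w)
one_hot[np.arange(len(w)), hard] = 1.0
```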
GMM Solving
GMM can be solved using EM or by directly minimizing a loss function (the negative log-likelihood).
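For reference, scikit-learn's GaussianMixture class fits a GMM with EM; a minimal usage sketch (the toy data `X` below is assumed, not from the notes):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy 2-D data: two blobs (illustrative only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),
               rng.normal(5, 1, size=(100, 2))])

gm = GaussianMixture(n_components=2, covariance_type="full").fit(X)
soft = gm.predict_proba(X)   # soft assignments, analogous to w
hard = gm.predict(X)         # hard cluster labels via argmax
```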
The probability of $z$ and $x$ occurring together equals the probability of $z$ occurring multiplied by the conditional probability of $x$ given $z$; summing this joint probability over all possible values of $z$ gives the marginal probability of $x$: $p(x) = \sum_{z} p(z)\, p(x \mid z)$.
The expectation of $\log p(x, z)$ is each possible value weighted by the probability that it occurs, i.e., by the posterior probability of that outcome, such as $p(z = 1 \mid x)$ for the outcome $z = 1$.
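Restating this in symbols, with the expectation taken over the posterior of $z$:

$$ \mathbb{E}_{z \sim p(z \mid x)}\left[\log p(x, z)\right] = \sum_{j=1}^{k} p(z = j \mid x)\, \log p(x, z = j) $$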
Announcement: This blog content serves as class notes and is solely for sharing purposes. Some images and content are sourced from textbooks, teacher presentations, and the internet. If there are any copyright infringements, please contact aursus.blog@gmail.com for removal.