GMM (Gaussian Mixture Model)

What is a Gaussian Mixture Model?

First, we have k Gaussian models, denoted N(\mu_1, \sigma_1^2), N(\mu_2, \sigma_2^2), \ldots, N(\mu_k, \sigma_k^2).

We sample data from each Gaussian N(\mu_j, \sigma_j^2) in proportion \phi_j.

By construction, \sum_{j=1}^{k}\phi_j=1, i.e., the proportions sum to 1.

p(z^{(i)}=j)=\phi_j is the probability of belonging to cluster j. Under a hard assignment this becomes the indicator \mathbb{I}(z^{(i)}=j): the probability is 1 if x^{(i)} is in cluster j and 0 otherwise.

p(x^{(i)}|z^{(i)}=j)=p_{norm}(x^{(i)};\mu_j,\sigma_j) is the probability of obtaining x^{(i)} from Gaussian distribution j.

Each x^{(i)} corresponds to k values z^{(i)}_j, one per component.

Basic parameters and distributions

input: \{x^{(1)}, \ldots, x^{(n)}\}

latent variables:

\{z^{(1)}, \ldots, z^{(n)}\},\quad z^{(i)} \in \{1, 2, \ldots, k\}

In other words, each z^{(i)} corresponds to one x^{(i)}. z^{(i)} follows a multinomial distribution over the k components; piecing together the data generated by the different components yields the observed x.

joint distribution

p(x^{(i)}, z^{(i)})=p(x^{(i)}|z^{(i)})\,p(z^{(i)})

where z^{(i)} \sim \text{Multinomial}(\phi) is a multinomial distribution with \phi_j \geq 0 and \sum_{j=1}^{k}\phi_j=1, and

x^{(i)}|z^{(i)}=j \sim N(\mu_j, \Sigma_j)

z^{(i)} indicates which of the k Gaussians each x^{(i)} comes from.

p(z) is the probability of choosing cluster z, i.e., the mixing weight of that component.

p(z^{(i)}=j) is the probability of being in cluster z=j.

Here, i denotes the index of the data, x represents the data itself, j denotes the index of the cluster, and z represents the cluster.

The parameter \phi_j gives p(z^{(i)}=j), and x^{(i)}|z^{(i)}=j \sim N(\mu_j, \Sigma_j), meaning x given z follows the Gaussian distribution of that component, where N(\cdot) denotes the Gaussian distribution.
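As a concrete illustration of this generative process, here is a minimal sketch (the NumPy usage and the parameter values \phi, \mu, \Sigma are assumptions for illustration, not part of the original notes) that draws z^{(i)} from Multinomial(\phi) and then x^{(i)} from N(\mu_j, \Sigma_j):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters for a 2-D mixture with k = 3 components.
phi = np.array([0.5, 0.3, 0.2])                        # mixing proportions, sum to 1
mu = np.array([[0.0, 0.0], [3.0, 3.0], [-3.0, 2.0]])   # component means
Sigma = np.array([np.eye(2), 0.5 * np.eye(2), 2.0 * np.eye(2)])  # component covariances

n = 500
z = rng.choice(len(phi), size=n, p=phi)                # z^(i) ~ Multinomial(phi)
x = np.array([rng.multivariate_normal(mu[j], Sigma[j]) for j in z])  # x^(i) | z^(i)=j ~ N(mu_j, Sigma_j)
```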

likelihood of data

\ell(\phi, \mu, \Sigma) = \sum_{i=1}^{n}\log p(x^{(i)};\phi, \mu, \Sigma) \quad (1)\\ = \sum_{i=1}^{n}\log\sum_{j=1}^{k}p(x^{(i)}|z^{(i)}=j; \mu,\Sigma)\,p(z^{(i)}=j;\phi) \quad (2)

To evaluate (1), we introduce z and marginalize over it, splitting the probability into a product of two terms; this gives (2).

Here, the ';' separates variables (on its left) from parameters (on its right), and ',' simply lists items side by side. In x|z;\mu,\Sigma, the right side of '|' is the conditioning event (the event z=j), and the left side, x, is the variable whose probability is evaluated under that condition.

If z^{(i)} is known, that is, if we know which Gaussian each x^{(i)} comes from (in other words, j is fixed), the log-likelihood simplifies to:

\ell(\phi, \mu, \Sigma) = \sum_{i=1}^{n}\left[\log p(x^{(i)}|z^{(i)}; \mu,\Sigma)+\log p(z^{(i)};\phi)\right] \quad (3)

Maximizing (3) by taking partial derivatives with respect to each parameter and setting them to zero gives (with z^{(i)} known):

\phi_j = \frac{1}{n}\sum_{i=1}^{n} 1\{z^{(i)}=j\}

\mu_j = \frac{\sum_{i=1}^{n} 1\{z^{(i)}=j\}\, x^{(i)}}{\sum_{i=1}^{n} 1\{z^{(i)}=j\}}

\Sigma_j = \frac{\sum_{i=1}^{n} 1\{z^{(i)}=j\}\,(x^{(i)}-\mu_j)(x^{(i)}-\mu_j)^{T}}{\sum_{i=1}^{n} 1\{z^{(i)}=j\}}
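When the labels z^{(i)} are observed, these closed-form estimates are just per-cluster counts, means, and covariances. A minimal NumPy sketch (assuming x is an n × d array and z an n-vector of labels, e.g., from the sampling snippet above; the function name is hypothetical):

```python
import numpy as np

def mle_with_known_labels(x, z, k):
    """Closed-form MLE of (phi, mu, Sigma) when the labels z^(i) are observed."""
    n, d = x.shape
    phi = np.zeros(k)
    mu = np.zeros((k, d))
    Sigma = np.zeros((k, d, d))
    for j in range(k):
        mask = (z == j)                      # indicator 1{z^(i) = j}
        nj = mask.sum()                      # assumes every cluster has at least one point
        phi[j] = nj / n                      # fraction of points in cluster j
        mu[j] = x[mask].mean(axis=0)         # cluster mean
        diff = x[mask] - mu[j]
        Sigma[j] = diff.T @ diff / nj        # cluster covariance (MLE, divides by nj)
    return phi, mu, Sigma
```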

From the equations above, z^{(i)} plays the role of a class label: once z^{(i)} is known, everything else can be derived. Since z^{(i)} is not observed, we use the Expectation-Maximization (EM) algorithm to estimate it.

EM (Expectation-Maximization) algorithm for density estimation

(E-Step) Guessing z^{(i)} (cluster assignment)

For each i, j, set

w_j^{(i)} := p(z^{(i)}=j | x^{(i)}; \phi, \mu, \Sigma)

This w_j^{(i)} represents our guess for z^{(i)}. Because we are estimating z from the data, z appears on the left side of '|'.

Using Bayes’ rule, we can obtain:

p(zx)=p(xz)p(z)p(x)p(z|x)=\frac{p(x|z)p(z)}{p(x)}

p(z^{(i)}=j | x^{(i)}; \phi, \mu, \Sigma) = \frac{p(x^{(i)}|z^{(i)}=j; \mu, \Sigma)\, p(z^{(i)}=j; \phi)}{\sum_{l=1}^{k} p(x^{(i)}|z^{(i)}=l; \mu, \Sigma)\, p(z^{(i)}=l; \phi)}

f(x)=\frac{1}{\sqrt{(2\pi)^d \det \Sigma}} \exp\left(-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)\right), where d is the dimension of x.

p(x^{(i)}|z^{(i)}=j;\mu,\Sigma) is the density of the Gaussian distribution with mean \mu_j and covariance \Sigma_j; it represents the probability of obtaining x^{(i)} from Gaussian j.

p(z^{(i)}=j;\phi)=\phi_j is the probability of coming from Gaussian j.
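Putting the E-step together, here is a minimal sketch (assuming SciPy's multivariate_normal for the Gaussian density, and x, phi, mu, Sigma shaped as in the earlier snippets; not the notes' own code):

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(x, phi, mu, Sigma):
    """Compute responsibilities w[i, j] = p(z^(i)=j | x^(i); phi, mu, Sigma)."""
    n, k = x.shape[0], len(phi)
    w = np.zeros((n, k))
    for j in range(k):
        # Numerator of Bayes' rule: p(x | z=j) * p(z=j)
        w[:, j] = phi[j] * multivariate_normal.pdf(x, mean=mu[j], cov=Sigma[j])
    w /= w.sum(axis=1, keepdims=True)        # normalize by the denominator
    return w
```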

(M-Step) Update the model's parameters based on the guess (maximum likelihood estimation, MLE), using soft assignments.
Update the parameters:

\phi_j := \frac{1}{n}\sum_{i=1}^{n} w_j^{(i)}

\mu_j := \frac{\sum_{i=1}^{n} w_j^{(i)} x^{(i)}}{\sum_{i=1}^{n} w_j^{(i)}}

\Sigma_j := \frac{\sum_{i=1}^{n} w_j^{(i)} (x^{(i)}-\mu_j)(x^{(i)}-\mu_j)^{T}}{\sum_{i=1}^{n} w_j^{(i)}}
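The M-step mirrors the known-label formulas, with the indicators replaced by the responsibilities w_j^{(i)}. A minimal sketch under the same assumptions as the E-step snippet:

```python
import numpy as np

def m_step(x, w):
    """Update (phi, mu, Sigma) from the soft assignments w (shape n x k)."""
    n, d = x.shape
    k = w.shape[1]
    nk = w.sum(axis=0)                       # effective number of points per component
    phi = nk / n                             # updated mixing proportions
    mu = (w.T @ x) / nk[:, None]             # responsibility-weighted means
    Sigma = np.zeros((k, d, d))
    for j in range(k):
        diff = x - mu[j]
        Sigma[j] = (w[:, j, None] * diff).T @ diff / nk[j]  # weighted covariance
    return phi, mu, Sigma
```

Alternating e_step and m_step until the log-likelihood stops improving is the usual stopping rule.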

Restriction

Σj=σ2I\Sigma_j=\sigma^2I

As \sigma^2 \rightarrow 0, the algorithm reduces to k-means.

From a probabilistic perspective, a Gaussian density approaches an impulse (delta) function as \sigma^2 approaches zero: only positions near the center have significant values, and the rest become very small. For a given data point, one specific Gaussian then has a high density while the others are very low. By Bayes' formula, the responsibilities w therefore concentrate on a single Gaussian, and the others tend towards zero.

From a clustering perspective, with \Sigma_j=\sigma^2 I the ellipsoidal clusters of a general GMM become spherical (circular in 2-D), and as \sigma approaches zero the soft assignments harden into the assignments of k-means.

Comparison with k-means

Instead of using hard cluster assignments c^{(i)} as in k-means, we use soft assignments w_j^{(i)}.

In code, w is an n × k matrix containing the probability of each x^{(i)} belonging to each Gaussian j. To convert this to hard clustering, for each x^{(i)} we find the Gaussian with the highest probability, assign x^{(i)} to it, and set all other probabilities to 0; the matrix then reduces to an n × 1 vector of labels, as shown in the sketch below.
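A minimal sketch of that conversion (assuming w is the n × k responsibility matrix from the E-step snippet above):

```python
import numpy as np

labels = np.argmax(w, axis=1)                # index of the most likely Gaussian for each x^(i)
hard = np.zeros_like(w)
hard[np.arange(len(w)), labels] = 1.0        # one-hot version: same shape as w, 1 in the argmax column
```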

GMM Solving

A GMM can be solved with EM or by directly maximizing the likelihood (equivalently, minimizing the negative log-likelihood as a loss function).

Other

The marginal probability of observing x, summing over all clusters l:

P(x) = \sum_{l=1}^{k} P(x, z=l) = \sum_{l=1}^{k} P(x|z=l)\,P(z=l)
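For example, under the assumed SciPy-based setup used above, the mixture density at a point is the \phi-weighted sum of the component densities (function name hypothetical):

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_density(x_point, phi, mu, Sigma):
    """P(x) = sum_l P(x | z=l) P(z=l): the marginal density of the mixture at x_point."""
    return sum(phi[l] * multivariate_normal.pdf(x_point, mean=mu[l], cov=Sigma[l])
               for l in range(len(phi)))
```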

reference

https://mas-dse.github.io/DSE210/Additional Materials/gmm.pdf

什么是高斯混合模型 (What is a Gaussian mixture model)

https://www.bilibili.com/video/BV12d4y167ud/?spm_id_from=333.337.search-card.all.click

https://www.bilibili.com/video/BV1Dd4y167pZ/?spm_id_from=333.788.recommend_more_video.-1&vd_source=2029748cdca0ccea25511170cf7a00c3


Consider a mixture of two Gaussian models: the probability of selecting model 1 is \pi, and the probability of selecting model 2 is 1-\pi. Below is the probability density of x under the mixture.

p(x; \pi, \sigma_1, \sigma_2, \mu_1, \mu_2) = \pi\, p_{norm}(x; \mu_1, \sigma_1) + (1-\pi)\, p_{norm}(x; \mu_2, \sigma_2)

We take the logarithm: because \log(x) is a monotonically increasing function, maximizing the likelihood is equivalent to maximizing its logarithm.

\log L(x; \pi, \mu_1, \mu_2, \sigma_1, \sigma_2) = \sum_i \log\left(\pi\, p_{norm}(x_i; \mu_1, \sigma_1) + (1-\pi)\, p_{norm}(x_i; \mu_2, \sigma_2)\right)
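As a sketch of this observed-data log-likelihood (assuming 1-D data in a NumPy array x and SciPy's norm for p_{norm}; not code from the original notes):

```python
import numpy as np
from scipy.stats import norm

def log_likelihood(x, pi, mu1, sigma1, mu2, sigma2):
    """Observed-data log-likelihood of a two-component 1-D Gaussian mixture."""
    mix = pi * norm.pdf(x, loc=mu1, scale=sigma1) + (1 - pi) * norm.pdf(x, loc=mu2, scale=sigma2)
    return np.log(mix).sum()
```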

p(x, z) = p(x|z)\,p(z)

where,

z \in \{0, 1\},\quad p(z=1) = \pi,\\ p(x|z=1) = p_{norm}(x; \mu_1, \sigma_1),\\ p(x|z=0) = p_{norm}(x; \mu_2, \sigma_2)

The probability of z and x occurring together equals the probability of z times the conditional probability of x given z; summing the complete-data log-probability over all data points gives:

\log L(p(z, x; \pi, \mu_1, \mu_2, \sigma_1, \sigma_2)) = \sum_i \left[ z_i \log(\pi\, p_n(x_i; \mu_1, \sigma_1)) + (1 - z_i) \log((1-\pi)\, p_n(x_i; \mu_2, \sigma_2)) \right]

Taking the expectation of \log p(x,z) replaces each z_i with the probability that it equals 1, i.e., the posterior p(z=1|x).
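Concretely, that posterior follows from Bayes' rule; a sketch under the same two-component, SciPy-based assumptions as above:

```python
import numpy as np
from scipy.stats import norm

def posterior_z1(x, pi, mu1, sigma1, mu2, sigma2):
    """p(z=1 | x): the responsibility of component 1 for each point in x."""
    p1 = pi * norm.pdf(x, loc=mu1, scale=sigma1)         # pi * p_norm(x; mu1, sigma1)
    p2 = (1 - pi) * norm.pdf(x, loc=mu2, scale=sigma2)   # (1 - pi) * p_norm(x; mu2, sigma2)
    return p1 / (p1 + p2)
```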


Announcement: This blog content serves as class notes and is solely for sharing purposes. Some images and content are sourced from textbooks, teacher presentations, and the internet. If there are any copyright infringements, please contact aursus.blog@gmail.com for removal.