EM algorithm and GMM model

In statistics, [[Expectation–maximization algorithm|EM (expectation maximization)]] algorithm handles latent variables, while [[Mixture model#Gaussian mixture model|GMM]] is the Gaussian mixture model.
{{Multiple issues|{{unreferenced|date=July 2020}}{{lead missing|date=July 2020}}{{essay-like|date=July 2020}}}}
 
 
== Background ==
The picture below shows the [[red blood cell]] hemoglobin concentration and the red blood cell volume data of two groups of people, the Anemia group and the Control group (i.e. the group of people without [[Anemia]]). As expected, people with Anemia have lower red blood cell volume and lower red blood cell [[hemoglobin]] concentration than those without Anemia.
[[File:Labeled GMM.png|thumb|GMM model with labels]]
<math>x</math> is a [[random vector]] such as <math>x:=\big(\text{red blood cell volume}, \text{red blood cell hemoglobin concentration}\big)</math>, and from medical studies{{Citation needed|date=July 2022}} it is known that <math>x</math> are [[Normal distribution|normally distributed]] in each group, i.e. <math>x \sim \mathcal N(\mu, \Sigma)</math>.
 
Let <math>z</math> denote the group to which <math>x</math> belongs, with <math>z_i = 0</math> when <math>x_i</math> belongs to the Anemia group and <math>z_i=1</math> when <math>x_i</math> belongs to the Control group. Also <math>z \sim \operatorname{Categorical}(k, \phi)</math> where <math>k=2</math>, <math>\phi_j \geq 0,</math> and <math>\sum_{j=1}^k\phi_j=1</math>. See [[Categorical distribution]].
 
The following procedure can be used to estimate <math>\phi, \mu , \Sigma</math>.
 
A maximum likelihood estimation can be applied:
: <math>\ell(\phi,\mu,\Sigma)=\sum_{i=1}^m \log (p(x^{(i)};\phi,\mu,\Sigma))
=\sum_{i=1}^{m} \log \sum_{z^{(i)}=1}^{k} p\left(x^{(i)} \mid z^{(i)} ; \mu, \Sigma\right) p\left(z^{(i)} ; \phi\right)
</math>
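
A minimal numerical sketch of evaluating this marginal log-likelihood is shown below; it assumes [[NumPy]] and [[SciPy]], and the function name <code>gmm_log_likelihood</code> is illustrative only, not part of any standard library.

<syntaxhighlight lang="python">
import numpy as np
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, phi, mus, Sigmas):
    """Marginal log-likelihood of a GMM: sum_i log sum_j p(x_i | z_i = j) p(z_i = j)."""
    k = len(phi)
    # weighted component densities, one column per mixture component j
    weighted = np.stack(
        [phi[j] * multivariate_normal.pdf(X, mean=mus[j], cov=Sigmas[j]) for j in range(k)],
        axis=1)                                   # shape (m, k)
    return np.log(weighted.sum(axis=1)).sum()     # sum of log p(x_i) over the m observations
</syntaxhighlight>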
 
As the <math>z_i</math> for each <math>x_i</math> are known, the [[Log-likelihood function|log likelihood function]] can be simplified as below:
 
: <math>\ell(\phi, \mu, \Sigma)=\sum_{i=1}^{m} \log p\left(x^{(i)} \mid z^{(i)} ; \mu, \Sigma\right)+\log p\left(z^{(i)} ; \phi\right)</math>
 
Now the likelihood function can be maximized by taking the [[partial derivative]]s with respect to <math>\mu, \Sigma, \phi</math>, obtaining:
 
: <math>\phi_{j} =\frac{1}{m} \sum_{i=1}^{m} 1\{z^{(i)}=j\}</math>
 
: <math>\mu_j =\frac{\sum_{i=1}^{m} 1\left\{z^{(i)}=j\right\} x^{(i)}}{\sum_{i=1}^{m} 1\left\{z^{(i)}=j\right\}}</math>
 
: <math>\Sigma_j =\frac{\sum_{i=1}^{m} 1\left\{z^{(i)}=j\right\}\left(x^{(i)}-\mu_j\right)\left(x^{(i)}-\mu_j\right)^{T}}{\sum_{i=1}^{m} 1\left\{z^{(i)}=j\right\}}</math><ref name="Stanford CS229 Notes">{{cite web |last1=Ng |first1=Andrew |title=CS229 Lecture notes |url=https://cs229.stanford.edu/summer2023/cs229-notes8.pdf}}</ref>
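
The following short sketch (assuming NumPy; the helper name <code>mle_with_labels</code> is hypothetical) computes these three closed-form estimates from labelled data:

<syntaxhighlight lang="python">
import numpy as np

def mle_with_labels(X, z, k):
    """Closed-form MLE of (phi_j, mu_j, Sigma_j) when the group labels z are observed."""
    phi = np.array([np.mean(z == j) for j in range(k)])                     # fraction of points in group j
    mus = np.array([X[z == j].mean(axis=0) for j in range(k)])              # per-group sample mean
    Sigmas = np.array([np.cov(X[z == j].T, bias=True) for j in range(k)])   # per-group MLE covariance
    return phi, mus, Sigmas
</syntaxhighlight>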
 
If <math>z_i</math> is known, the estimation of the parameters turns out to be quite simple with [[maximum likelihood estimation]]. But if <math>z_i</math> is unknown it is much more complicated.<ref name="Machine Learning —Expectation-Maximization Algorithm (EM)">{{cite web |last1=Hui |first1=Jonathan |title=Machine Learning —Expectation-Maximization Algorithm (EM) |url=https://medium.com/@jonathan_hui/machine-learning-expectation-maximization-algorithm-em-2e954cb76959 |website=Medium |language=en |date=13 October 2019}}</ref>
[[File:Unlabeled GMM.png|thumb|GMM without labels]]
 
Since <math>z</math> is a [[latent variable]] (i.e. not observed), in the unlabelled scenario the expectation maximization [[algorithm]] is needed to estimate <math>z</math> as well as the other parameters. Generally, this problem is set as a GMM since the data in each group is normally distributed.
<ref name="Multivariate normal distribution">{{cite web |last1=Tong |first1=Y. L. |title=Multivariate normal distribution |url=https://en.wikipedia.org/wiki/Multivariate_normal_distribution |website=Wikipedia |language=en |date=2 July 2020}}</ref>{{Circular reference|date=July 2020}}
 
In [[machine learning]], the latent variable <math>z</math> is considered as a latent pattern lying under the data, which the observer is not able to see directly. <math>x_i</math> is the known data, while <math>\phi, \mu, \Sigma</math> are the parameters of the model. With the EM algorithm, some underlying pattern <math>z</math> in the data <math>x_i</math> can be found, along with the estimation of the parameters. The wide application of this circumstance in machine learning is what makes the EM algorithm so important.<ref name="Inference using EM algorithm">{{cite web |last1=Misra |first1=Rishabh |title=Inference using EM algorithm |url=https://towardsdatascience.com/inference-using-em-algorithm-d71cccb647bc |website=Medium |language=en |date=7 June 2020}}</ref>
 
[[File:GMM Training on artificial data.gif|thumb|alt=Animation of updates to a GMM at each update to the distribution in the EM algorithm.|GMM Training on artificial data]]
 
== EM algorithm in GMM ==
The EM algorithm consists of two steps: the E-step and the M-step. Firstly, the model parameters and the <math>z^{(i)}</math> can be randomly initialized. In the E-step, the [[algorithm]] tries to guess the value of <math>z^{(i)}</math> based on the parameters, while in the M-step, the algorithm updates the value of the [[Model parameter|model parameters]] based on the guess of <math>z^{(i)}</math> from the E-step. These two steps are repeated until convergence is reached.
 
The algorithm in GMM is:
Repeat until convergence: {
 
1. (E-step) For each <math>i, j</math>, set
 
<math>w_{j}^{(i)}:=p\left(z^{(i)}=j | x^{(i)} ; \phi, \mu, \Sigma\right)</math>
 
2. (M-step) Update the parameters

<math>\phi_{j} :=\frac{1}{m} \sum_{i=1}^{m} w_{j}^{(i)}</math>

<math>\mu_{j} :=\frac{\sum_{i=1}^{m} w_{j}^{(i)} x^{(i)}}{\sum_{i=1}^{m} w_{j}^{(i)}}</math>
<math>\Sigma_{j} :=\frac{\sum_{i=1}^{m} w_{j}^{(i)}\left(x^{(i)}-\mu_{j}\right)\left(x^{(i)}-\mu_{j}\right)^{T}}{\sum_{i=1}^{m} w_{j}^{(i)}}</math>
 
<ref name="Stanford CS229 Notes">{{cite web |last1=Ng |first1=Andrew |title=CS229 Lecture notes |url=httphttps://cs229.stanford.edu/notessummer2023/cs229-notes8.pdf}}</ref>
}
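
A compact Python sketch of this loop is given below. It follows the E-step and M-step updates above; the function name <code>em_gmm</code>, the initialization scheme and the fixed iteration count are illustrative choices rather than part of the algorithm itself.

<syntaxhighlight lang="python">
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, n_iter=100, seed=0):
    """EM for a Gaussian mixture: alternate the E-step and the M-step."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    # random initialization of the parameters
    phi = np.full(k, 1.0 / k)
    mus = X[rng.choice(m, size=k, replace=False)]      # k data points as initial means
    Sigmas = np.array([np.cov(X.T, bias=True)] * k)    # shared initial covariance
    for _ in range(n_iter):
        # E-step: w[i, j] = p(z_i = j | x_i; phi, mu, Sigma)
        w = np.stack(
            [phi[j] * multivariate_normal.pdf(X, mean=mus[j], cov=Sigmas[j]) for j in range(k)],
            axis=1)
        w /= w.sum(axis=1, keepdims=True)
        # M-step: re-estimate phi, mu, Sigma from the responsibilities w
        Nj = w.sum(axis=0)
        phi = Nj / m
        mus = (w.T @ X) / Nj[:, None]
        Sigmas = np.array(
            [((w[:, j, None] * (X - mus[j])).T @ (X - mus[j])) / Nj[j] for j in range(k)])
    return phi, mus, Sigmas
</syntaxhighlight>

In practice the loop is usually stopped when the log-likelihood (or the parameter estimates) changes by less than a small tolerance, rather than after a fixed number of iterations.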
<ref name="Stanford CS229 Notes">{{cite web |last1=Ng |first1=Andrew |title=CS229 Lecture notes |url=http://cs229.stanford.edu/notes/cs229-notes8.pdf}}</ref>
 
With [[Bayes' Rule|Bayes Rule]], the following result is obtained in the E-step:
 
<math>p\left(z^{(i)}=j | x^{(i)} ; \phi, \mu, \Sigma\right)=\frac{p\left(x^{(i)} | z^{(i)}=j ; \mu, \Sigma\right) p\left(z^{(i)}=j ; \phi\right)}{\sum_{l=1}^{k} p\left(x^{(i)} | z^{(i)}=l ; \mu, \Sigma\right) p\left(z^{(i)}=l ; \phi\right)}</math>
 
According to the GMM setting, the following formulas are obtained:
 
<math>p\left(x^{(i)} | z^{(i)}=j ; \mu, \Sigma\right)=\frac{1}{(2 \pi)^{n / 2}\left|\Sigma_{j}\right|^{1 / 2}} \exp \left(-\frac{1}{2}\left(x^{(i)}-\mu_{j}\right)^{T} \Sigma_{j}^{-1}\left(x^{(i)}-\mu_{j}\right)\right)</math>
 
<math>p\left(z^{(i)}=j ; \phi\right)=\phi_j</math>
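
As a small illustration, the two formulas above can be combined directly into the E-step posterior. In the sketch below (plain NumPy, with the hypothetical helper names <code>gaussian_pdf</code> and <code>posterior_z</code>) the Gaussian density is written out exactly as in the formula:

<syntaxhighlight lang="python">
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Multivariate normal density p(x | z = j; mu_j, Sigma_j), as in the formula above."""
    n = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (n / 2) * np.linalg.det(Sigma) ** 0.5
    return float(np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) / norm)

def posterior_z(x, phi, mus, Sigmas):
    """p(z = j | x; phi, mu, Sigma) for one point x, via Bayes' rule."""
    joint = np.array([phi[j] * gaussian_pdf(x, mus[j], Sigmas[j]) for j in range(len(phi))])
    return joint / joint.sum()   # normalize over the k components
</syntaxhighlight>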
 
In this way, a switch between the E-step and the M-step is possible, according to the randomly initialized parameters.
 
 
== References ==
{{Reflist}}
 
[[Category:Systems engineering]]
[[Category:Machine learning]]
[[Category:Regression models]]