A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models


Jeff A. Bilmes


Berkeley, California

April, 1998


http://lasa.epfl.ch/teaching/lectures/ML_Phd/Notes/GP-GMM.pdf





Maximum-likelihood

In the maximum-likelihood problem, our goal is to find the Θ that maximizes the likelihood L(Θ|X).

(Often we maximize log-likelihood instead because it is analytically easier.)



  • p — the density function governing the data (here, a Gaussian distribution)
  • Θ — its parameters; for a Gaussian, Θ = (μ, σ²)
  • N — size of dataset X
  • L(Θ|X) — likelihood function of the parameters given the data
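Putting these together, and assuming the N data points are drawn independently from p (as the tutorial does), the likelihood and the maximum-likelihood estimate take the standard form:

\[
L(\Theta \mid X) = p(X \mid \Theta) = \prod_{i=1}^{N} p(x_i \mid \Theta),
\qquad
\Theta^{*} = \arg\max_{\Theta} \, L(\Theta \mid X)
\]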



Basic EM

There are two main applications of the EM algorithm.

  • The first occurs when the data indeed has missing values, due to problems with or limitations of the observation process.
  • The second occurs when optimizing the likelihood function is analytically intractable but when the likelihood function can be simplified by assuming the existence of and values for additional but missing (or hidden) parameters.

The latter application is more common in the computational pattern recognition community.


Expectation step

We must assume a joint relationship between the missing and observed values, i.e., a complete-data likelihood p(X,Y|Θ); the expectation step computes the expected value of its log given the observed data and the current parameter estimates.
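In the usual notation, this expected complete-data log-likelihood is the Q-function:

\[
Q(\Theta, \Theta^{(i-1)})
= E\!\left[ \log p(X, Y \mid \Theta) \,\middle|\, X, \Theta^{(i-1)} \right]
= \int_{y \in \Upsilon} \log p(X, y \mid \Theta) \, f(y \mid X, \Theta^{(i-1)}) \, dy
\]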




  • f(y|X, Θ(i-1)) — the marginal distribution of the unobserved data, given the observed data X and the current parameter estimates Θ(i-1)
  • X — observed data (incomplete)
  • Y — missing (or hidden) information (unknown, random)
  • Υ — the space of values y


As an aside, consider h(θ,Y), where θ is a constant and Y is a random variable governed by some distribution fY(y).
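The Q-function is an expectation of exactly this form: averaging over Y yields a deterministic function of the parameter that can then be maximized,

\[
q(\theta) = E_{Y}\!\left[ h(\theta, Y) \right] = \int_{y} h(\theta, y) \, f_Y(y) \, dy
\]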



Maximization step

In the maximization step, we find the Θ that maximizes the expectation computed in the first step.
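That is, at iteration i the M-step computes

\[
\Theta^{(i)} = \arg\max_{\Theta} \; Q(\Theta, \Theta^{(i-1)})
\]

and the two steps are repeated. Each iteration is guaranteed not to decrease the log-likelihood, and the algorithm converges to a local maximum of the likelihood function.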




Finding Maximum Likelihood Mixture Densities Parameters via EM

The mixture-density parameter estimation problem is probably one of the most widely used applications of the EM algorithm in the computational pattern recognition community. In this case, we assume the following probabilistic model:
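Specifically, the assumed mixture density has the standard form

\[
p(x \mid \Theta) = \sum_{i=1}^{M} \alpha_i \, p_i(x \mid \theta_i)
\]

where the symbols are as follows.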



  • x — a single data point; the dataset is X = {x1, x2, …, xN}
  • Θ = (α1, …, αM, θ1, …, θM) — the full parameter set
  • pi — the ith component density function, parameterized by θi
  • αi — mixing coefficients (∑ αi = 1)


The incomplete-data log-likelihood expression for this density from the data X is given by:
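Under the i.i.d. assumption this is

\[
\log L(\Theta \mid X) = \log \prod_{i=1}^{N} p(x_i \mid \Theta)
= \sum_{i=1}^{N} \log\!\left( \sum_{j=1}^{M} \alpha_j \, p_j(x_i \mid \theta_j) \right)
\]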



  • N — size of dataset
  • M — number of mixtures (component densities mixed together)


However, this expression is difficult to optimize directly because it contains the log of a sum.


We consider X as incomplete, and posit the existence of unobserved data items Y = {y1, …, yN}.

(Y tells us which mixture generated each data item.)

We assume that yi ∈ {1, …, M} for each i, and yi = k if the ith sample was generated by the kth mixture component. If we knew the values of Y, the likelihood would become:
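In the standard form, with Y known no log of a sum remains:

\[
\log L(\Theta \mid X, Y)
= \sum_{i=1}^{N} \log\!\left( P(x_i \mid y_i)\, P(y_i) \right)
= \sum_{i=1}^{N} \log\!\left( \alpha_{y_i} \, p_{y_i}(x_i \mid \theta_{y_i}) \right)
\]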



But, the problem is that we do NOT KNOW the values of Y.

However, if we assume Y is a random vector, we can proceed.


First, we guess parameters Θg = (α1g, …, αMg, θ1g, …, θMg) for the likelihood L(Θg|X,Y).

Given Θg, we can easily compute pj(xi|θjg) for each i and j. In addition, the mixing parameters αj can be thought of as prior probabilities of each mixture component, that is, αj = p(component j). Therefore, using Bayes's rule, we can compute:
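Following the usual derivation, the per-sample posterior and (by independence) the joint posterior over the whole vector y are:

\[
p(y_i \mid x_i, \Theta^g)
= \frac{\alpha_{y_i}^{g} \, p_{y_i}(x_i \mid \theta_{y_i}^{g})}{p(x_i \mid \Theta^g)}
= \frac{\alpha_{y_i}^{g} \, p_{y_i}(x_i \mid \theta_{y_i}^{g})}{\sum_{k=1}^{M} \alpha_k^{g} \, p_k(x_i \mid \theta_k^{g})},
\qquad
p(y \mid X, \Theta^g) = \prod_{i=1}^{N} p(y_i \mid x_i, \Theta^g)
\]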




where y = (y1, …, yN) is an instance of the unobserved data, drawn independently for each sample.


p(y|X,Θg) is the desired marginal density, obtained by assuming the existence of the hidden variables and making a guess at the initial parameters of their distribution.


We can write Q(Θ,Θg) as:
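Starting from the definition of Q, substituting the complete-data log-likelihood, and using the Kronecker delta to pick out the component yi = l, this takes the form

\[
Q(\Theta, \Theta^g)
= \sum_{y \in \Upsilon} \log L(\Theta \mid X, y)\; p(y \mid X, \Theta^g)
= \sum_{l=1}^{M} \sum_{i=1}^{N} \log\!\left( \alpha_l \, p_l(x_i \mid \theta_l) \right)
\sum_{y \in \Upsilon} \delta_{l, y_i} \prod_{j=1}^{N} p(y_j \mid x_j, \Theta^g)
\]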



  • N — size of dataset
  • M — number of mixtures (component densities mixed together)
  • δl,yi — Kronecker delta: equal to 1 if yi = l and 0 otherwise (it picks out the component assumed to have generated xi)


This form may look fairly daunting, yet it can be greatly simplified. We first note that, for each l ∈ 1, …, M and each i, the inner sum over all configurations y collapses to the single posterior p(l|xi,Θg), since ∑yj p(yj|xj,Θg) = 1 for every j.
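Spelled out, the identity being used is

\[
\sum_{y \in \Upsilon} \delta_{l, y_i} \prod_{j=1}^{N} p(y_j \mid x_j, \Theta^g)
= \left( \prod_{j \neq i} \sum_{y_j = 1}^{M} p(y_j \mid x_j, \Theta^g) \right) p(l \mid x_i, \Theta^g)
= p(l \mid x_i, \Theta^g)
\]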


The posterior probability p(l|xi,Θg) is called a "responsibility".

(How responsible is component l for the data point xi?)


Using the above equation, we can rewrite Q(Θ,Θg) as:
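With that simplification, Q splits into a term involving only the αl and a term involving only the θl:

\[
Q(\Theta, \Theta^g)
= \sum_{l=1}^{M} \sum_{i=1}^{N} \log(\alpha_l)\, p(l \mid x_i, \Theta^g)
+ \sum_{l=1}^{M} \sum_{i=1}^{N} \log\!\left( p_l(x_i \mid \theta_l) \right) p(l \mid x_i, \Theta^g)
\]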



To maximize this expression, we can maximize the term containing αl and the term containing θl independently since they are not related.


To find the expression for αl, we introduce the Lagrange multiplier λ with the constraint that ∑l αl = 1, and solve the following equation:
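In the standard form, the stationarity condition and its immediate consequence are

\[
\frac{\partial}{\partial \alpha_l}
\left[ \sum_{l=1}^{M} \sum_{i=1}^{N} \log(\alpha_l)\, p(l \mid x_i, \Theta^g)
+ \lambda \left( \sum_{l=1}^{M} \alpha_l - 1 \right) \right] = 0
\quad\Longrightarrow\quad
\sum_{i=1}^{N} \frac{1}{\alpha_l}\, p(l \mid x_i, \Theta^g) + \lambda = 0
\]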




Summing both sides over l, we get that λ = -N, resulting in:
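The resulting update is the average responsibility assigned to component l:

\[
\alpha_l^{new} = \frac{1}{N} \sum_{i=1}^{N} p(l \mid x_i, \Theta^g)
\]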



For some distributions, it is possible to obtain analytical expressions for θl as functions of everything else. For example, assume d-dimensional Gaussian component distributions with mean μ and covariance matrix Σ, i.e., θ = (μ, Σ).
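In that case each component density is the familiar multivariate normal:

\[
p_l(x \mid \mu_l, \Sigma_l)
= \frac{1}{(2\pi)^{d/2} \, |\Sigma_l|^{1/2}}
\exp\!\left( -\tfrac{1}{2} (x - \mu_l)^{T} \Sigma_l^{-1} (x - \mu_l) \right)
\]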



Taking the log of pl(x|θl), ignoring any constant terms (since they disappear after taking derivatives), and substituting into the right side of Q(Θ,Θg), we get:
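Keeping only the terms that depend on θl = (μl, Σl), the expression to maximize is

\[
\sum_{l=1}^{M} \sum_{i=1}^{N}
\left[ -\tfrac{1}{2} \log |\Sigma_l|
- \tfrac{1}{2} (x_i - \mu_l)^{T} \Sigma_l^{-1} (x_i - \mu_l) \right]
p(l \mid x_i, \Theta^g)
\]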



Taking the derivative of the above with respect to μl and setting it equal to zero, we get:
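Since the derivative of -½(xi - μl)ᵀΣl⁻¹(xi - μl) with respect to μl is Σl⁻¹(xi - μl), the condition reads

\[
\sum_{i=1}^{N} \Sigma_l^{-1} (x_i - \mu_l)\, p(l \mid x_i, \Theta^g) = 0
\]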



with which we can easily solve for μl to obtain:
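The result is the responsibility-weighted sample mean:

\[
\mu_l^{new}
= \frac{\sum_{i=1}^{N} x_i \, p(l \mid x_i, \Theta^g)}
       {\sum_{i=1}^{N} p(l \mid x_i, \Theta^g)}
\]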



Then, taking the derivative of the above with respect to Σl⁻¹ and setting it equal to zero, we obtain:
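This is the responsibility-weighted sample covariance, computed about the new mean:

\[
\Sigma_l^{new}
= \frac{\sum_{i=1}^{N} p(l \mid x_i, \Theta^g)\, (x_i - \mu_l^{new})(x_i - \mu_l^{new})^{T}}
       {\sum_{i=1}^{N} p(l \mid x_i, \Theta^g)}
\]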



Summary

Summarizing, the estimates of the new parameters in terms of the old parameters are as follows:
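Collected together, the updates derived above are

\[
\alpha_l^{new} = \frac{1}{N} \sum_{i=1}^{N} p(l \mid x_i, \Theta^g),
\qquad
\mu_l^{new} = \frac{\sum_{i=1}^{N} x_i \, p(l \mid x_i, \Theta^g)}{\sum_{i=1}^{N} p(l \mid x_i, \Theta^g)},
\qquad
\Sigma_l^{new} = \frac{\sum_{i=1}^{N} p(l \mid x_i, \Theta^g)\,(x_i - \mu_l^{new})(x_i - \mu_l^{new})^{T}}{\sum_{i=1}^{N} p(l \mid x_i, \Theta^g)}
\]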


Note that the above equations perform both the expectation step and the maximization step simultaneously. The algorithm proceeds by using the newly derived parameters as the guess for the next iteration.
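As a concrete illustration, and not something taken from the tutorial itself, the following is a minimal NumPy sketch of one such iteration for a Gaussian mixture; the function name em_step and the array layout are arbitrary choices, and a practical implementation would work in log space and guard against singular covariance matrices.

    import numpy as np

    def em_step(X, alphas, mus, sigmas):
        """One EM iteration for a Gaussian mixture.

        X      : (N, d) data matrix
        alphas : (M,)   mixing coefficients, summing to 1
        mus    : (M, d) component means
        sigmas : (M, d, d) component covariance matrices
        Returns the updated (alphas, mus, sigmas).
        """
        N, d = X.shape
        M = len(alphas)

        # E-step: responsibilities p(l | x_i, Theta^g), stored as an (N, M) matrix.
        resp = np.zeros((N, M))
        for l in range(M):
            diff = X - mus[l]                                      # (N, d)
            inv = np.linalg.inv(sigmas[l])
            norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(sigmas[l]))
            mahal = np.sum(diff @ inv * diff, axis=1)              # squared Mahalanobis distances
            resp[:, l] = alphas[l] * norm * np.exp(-0.5 * mahal)
        resp /= resp.sum(axis=1, keepdims=True)                    # normalize over components

        # M-step: closed-form updates derived above.
        Nl = resp.sum(axis=0)                                      # effective count per component
        new_alphas = Nl / N
        new_mus = (resp.T @ X) / Nl[:, None]
        new_sigmas = np.zeros_like(sigmas)
        for l in range(M):
            diff = X - new_mus[l]
            new_sigmas[l] = (resp[:, l, None] * diff).T @ diff / Nl[l]
        return new_alphas, new_mus, new_sigmas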



Learning the parameters of an HMM, EM, and the Baum-Welch algorithm

Not ready yet



References