The Expectation Maximization Algorithm


Frank Dellaert


College of Computing, Georgia Institute of Technology

Technical Report, February 2002


https://smartech.gatech.edu/bitstream/handle/1853/3281/02-20.pdf





Intuitive Explanation of EM

We want to maximize the posterior probability of the parameters Θ = {θ1, …, θn} given the data U, marginalizing over J:


$$\Theta^* = \arg\max_{\Theta} P(\Theta|\textbf{U}) = \arg\max_{\Theta} \sum_{\textbf{J}} P(\Theta,\textbf{J}\,|\,\textbf{U}) \qquad (1)$$


Instead of finding the best J given an estimate Θ at each iteration, EM computes a distribution over the n-dimensional space $\mathcal{J}^n$.
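To make the idea of a soft assignment concrete, the small NumPy/SciPy sketch below (a hypothetical illustration, not code from the report) takes a 1-D two-component Gaussian mixture with a current guess Θᵗ and computes, for each data point, the full distribution over its hidden component label instead of committing to the single best label; the data values and parameters are made up.

    import numpy as np
    from scipy.stats import norm

    # Hypothetical illustration: soft vs. hard assignment of hidden labels
    # for a 1-D two-component Gaussian mixture at the current guess Theta^t.
    u = np.array([-2.1, -1.7, 0.2, 1.9, 2.4])   # observed data U (made up)
    weights = np.array([0.5, 0.5])               # mixing weights in Theta^t
    means = np.array([-2.0, 2.0])                # component means in Theta^t
    sigmas = np.array([1.0, 1.0])                # component std devs in Theta^t

    # P(u_i, j | Theta^t) for every data point i and component label j
    joint = weights * norm.pdf(u[:, None], means, sigmas)

    # Hard assignment: the single best label per point (what EM does NOT do)
    best_j = joint.argmax(axis=1)

    # Soft assignment: the full distribution P(j | u_i, Theta^t) per point
    responsibilities = joint / joint.sum(axis=1, keepdims=True)

    print(best_j)
    print(responsibilities.round(3))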



EM as Lower Bound Maximization

EM can be derived in many different ways, one of the most insightful being in terms of lower bound maximization (Neal and Hinton, 1998; Minka, 1998).


The goal is to maximize the posterior probability (1) of the parameters Θ given the data U, in the presence of hidden data J. Equivalently, we can maximize the logarithm of the joint distribution (which is proportional to the posterior):


$$\Theta^* = \arg\max_{\Theta} \log P(\textbf{U},\Theta) = \arg\max_{\Theta} \log \sum_{\textbf{J}} P(\textbf{U},\textbf{J},\Theta) \qquad (2)$$


Note that the key problem with maximizing (2) is that it involves the logarithm of a (big) sum, which is difficult to deal with.

To derive the bound, first trivially rewrite log P(U,Θ) as

$$\log P(\textbf{U},\Theta) = \log \sum_{\textbf{J}} P(\textbf{U},\textbf{J},\Theta) = \log \sum_{\textbf{J}} f^t(\textbf{J})\,\frac{P(\textbf{U},\textbf{J},\Theta)}{f^t(\textbf{J})}$$



where $f^t(\textbf{J})$ is an arbitrary probability distribution over the space $\mathcal{J}^n$ of the hidden variable J.

By Jensen's inequality, we have

$$\log P(\textbf{U},\Theta) \;\geq\; \sum_{\textbf{J}} f^t(\textbf{J})\,\log\frac{P(\textbf{U},\textbf{J},\Theta)}{f^t(\textbf{J})} \;=:\; B(\Theta;\Theta^t)$$

which defines the lower bound $B(\Theta;\Theta^t)$; its dependence on $\Theta^t$ enters through the choice of $f^t$.
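As a quick numerical sanity check of this inequality (not part of the report), the sketch below draws arbitrary positive values standing in for P(U,J,Θ) over a small discrete space of J, together with an arbitrary distribution f(J), and verifies that the resulting bound never exceeds log P(U,Θ):

    import numpy as np

    # Sanity check of the Jensen bound on made-up numbers (not from the report).
    rng = np.random.default_rng(0)
    p_joint = rng.random(8) + 1e-3      # stands in for P(U, J, Theta), J = 1..8
    f = rng.random(8) + 1e-3
    f /= f.sum()                        # arbitrary distribution f(J)

    log_objective = np.log(p_joint.sum())           # log P(U, Theta)
    lower_bound = np.sum(f * np.log(p_joint / f))   # B(Theta; Theta^t)

    assert lower_bound <= log_objective
    print(log_objective, lower_bound)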



Finding an Optimal Bound

EM tries to find the best bound, defined as the bound $B(\Theta;\Theta^t)$ that touches the objective function $\log P(\textbf{U},\Theta)$ at the current guess $\Theta^t$.



Introducing a Lagrange multiplier λ to enforce the constraint

$$\sum_{\textbf{J}} f^t(\textbf{J}) = 1,$$



the objective becomes

$$\lambda\Big(1 - \sum_{\textbf{J}} f^t(\textbf{J})\Big) + \sum_{\textbf{J}} f^t(\textbf{J})\,\log\frac{P(\textbf{U},\textbf{J},\Theta^t)}{f^t(\textbf{J})}$$



Taking the derivative with respect to $f^t(\textbf{J})$ and setting it to zero,

$$\log P(\textbf{U},\textbf{J},\Theta^t) - \log f^t(\textbf{J}) - 1 - \lambda = 0,$$

and solving for $f^t(\textbf{J})$ (using the constraint to eliminate λ) gives

$$f^t(\textbf{J}) = \frac{P(\textbf{U},\textbf{J},\Theta^t)}{\sum_{\textbf{J}} P(\textbf{U},\textbf{J},\Theta^t)} = P(\textbf{J}\,|\,\textbf{U},\Theta^t)$$




We see that the resulting optimal bound at $\Theta^t$ indeed touches the objective function:

$$B(\Theta^t;\Theta^t) = \sum_{\textbf{J}} P(\textbf{J}|\textbf{U},\Theta^t)\,\log\frac{P(\textbf{U},\textbf{J},\Theta^t)}{P(\textbf{J}|\textbf{U},\Theta^t)} = \sum_{\textbf{J}} P(\textbf{J}|\textbf{U},\Theta^t)\,\log P(\textbf{U},\Theta^t) = \log P(\textbf{U},\Theta^t)$$
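The same kind of numerical check as above (made-up numbers, not from the report) confirms that choosing $f^t(\textbf{J}) = P(\textbf{J}|\textbf{U},\Theta^t)$ makes the bound equal $\log P(\textbf{U},\Theta^t)$:

    import numpy as np

    # With the optimal f^t(J) = P(J | U, Theta^t) the bound touches the
    # objective at Theta^t (made-up stand-in numbers, not from the report).
    rng = np.random.default_rng(1)
    p_joint = rng.random(8) + 1e-3          # stands in for P(U, J, Theta^t)
    f_opt = p_joint / p_joint.sum()         # P(J | U, Theta^t)

    log_objective = np.log(p_joint.sum())   # log P(U, Theta^t)
    bound_at_theta_t = np.sum(f_opt * np.log(p_joint / f_opt))

    print(np.isclose(bound_at_theta_t, log_objective))   # True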



Maximizing the Bound


Substituting the optimal choice $f^t(\textbf{J}) = P(\textbf{J}|\textbf{U},\Theta^t)$ into the bound and expanding, we obtain

$$\begin{aligned}
B(\Theta;\Theta^t) &= \sum_{\textbf{J}} f^t(\textbf{J})\,\log\frac{P(\textbf{U},\textbf{J},\Theta)}{f^t(\textbf{J})} \\
&= <\log P(\textbf{U},\textbf{J},\Theta)> - <\log f^t(\textbf{J})> \\
&= <\log P(\textbf{U},\textbf{J}|\Theta)> + \log P(\Theta) - <\log f^t(\textbf{J})> \\
&= <\log P(\textbf{U},\textbf{J}|\Theta)> + \log P(\Theta) + H
\end{aligned} \qquad (3)$$


where

  • $<.>$ denotes the expectation with respect to $f^t(\textbf{J}) = P(\textbf{J}|\textbf{U},\Theta^t)$,

  • $P(\Theta)$ is the prior on the parameters $\Theta$, and

  • $H := -<\log f^t(\textbf{J})>$ is the entropy of the distribution $f^t(\textbf{J})$.


Since $H$ does not depend on $\Theta$, we consider the first two terms only:

$$\Theta^{t+1} = \arg\max_{\Theta} \Big[ <\log P(\textbf{U},\textbf{J}|\Theta)> + \log P(\Theta) \Big]$$



For your reference, on the third line of (3) the last term, $<\log f^t(\textbf{J})>$, does not depend on $\Theta$.

So, $<\log P(\textbf{U},\textbf{J},\Theta)>$ can be rewritten as $<\log P(\textbf{U},\textbf{J}|\Theta)> + \log P(\Theta)$.
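To make the M-step concrete, here is a minimal sketch (my own illustration, not the report's) that instantiates $<\log P(\textbf{U},\textbf{J}|\Theta)>$ for a 1-D two-component, unit-variance Gaussian mixture with a flat prior $P(\Theta)$, where only the first component's mean θ is free; a grid search over θ agrees with the familiar responsibility-weighted-mean update.

    import numpy as np
    from scipy.stats import norm

    # Illustrative M-step (assumptions: 1-D mixture of two unit-variance
    # Gaussians, equal mixing weights, flat prior; only component 1's mean
    # theta is free). Data and current guess are made up.
    rng = np.random.default_rng(0)
    u = np.concatenate([rng.normal(-2.0, 1.0, 100), rng.normal(2.0, 1.0, 100)])

    theta_t = -1.0                                       # current guess for mean 1
    joint = 0.5 * norm.pdf(u[:, None], [theta_t, 2.0], 1.0)
    r = joint / joint.sum(axis=1, keepdims=True)         # E-step: f^t(J) per point

    def expected_loglik(theta):
        # the theta-dependent part of <log P(U, J | theta)>
        return np.sum(r[:, 0] * norm.logpdf(u, theta, 1.0))

    grid = np.linspace(-4.0, 4.0, 2001)
    theta_grid = grid[np.argmax([expected_loglik(th) for th in grid])]
    theta_closed = np.sum(r[:, 0] * u) / np.sum(r[:, 0])  # weighted-mean update

    print(theta_grid, theta_closed)   # agree up to the grid resolution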


The EM algorithm

Since the bound (3) is expressed in terms of an expectation, the two steps of each EM iteration are:

  1. Expectation step: first find an optimal lower bound $B(\Theta;\Theta^t)$ at the current guess $\Theta^t$,
  2. Maximization step: and then maximize this bound to obtain an improved estimate $\Theta^{t+1}$.


The EM algorithm can thus be conveniently summarized as:

  • E-step: calculate $f^t(\textbf{J}) = P(\textbf{J}|\textbf{U},\Theta^t)$

  • M-step: find $\Theta^{t+1} = \arg\max_{\Theta} <\log P(\textbf{U},\textbf{J},\Theta)>$
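Putting the two steps together, the sketch below shows one way the iteration can look for a 1-D two-component Gaussian mixture with a flat prior on Θ (so the MAP and maximum-likelihood updates coincide); the model, data, and variable names are illustrative assumptions, not code from the report.

    import numpy as np
    from scipy.stats import norm

    # Minimal EM sketch (assumptions: 1-D two-component Gaussian mixture,
    # flat prior on Theta). Data and starting values are made up.
    rng = np.random.default_rng(0)
    u = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(2.0, 1.0, 300)])

    w = np.array([0.5, 0.5])        # mixing weights
    mu = np.array([-1.0, 1.0])      # component means
    sigma = np.array([1.0, 1.0])    # component standard deviations

    for t in range(50):
        # E-step: f^t(J) = P(J | U, Theta^t), the per-point responsibilities
        joint = w * norm.pdf(u[:, None], mu, sigma)       # P(u_i, j | Theta^t)
        r = joint / joint.sum(axis=1, keepdims=True)

        # M-step: maximize <log P(U, J | Theta)> (flat prior), in closed form
        n_k = r.sum(axis=0)
        w = n_k / len(u)
        mu = (r * u[:, None]).sum(axis=0) / n_k
        sigma = np.sqrt((r * (u[:, None] - mu) ** 2).sum(axis=0) / n_k)

    print(w.round(3), mu.round(3), sigma.round(3))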



References

Neal, R. M. and Hinton, G. E. (1998). A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. I. Jordan, editor, Learning in Graphical Models. Kluwer Academic Publishers.

Minka, T. P. (1998). Expectation-Maximization as lower bound maximization. Tutorial note.