The Expectation Maximization Algorithm


Frank Dellaert


College of Computing, Georgia Institute of Technology

Technical Report, February 2002


https://smartech.gatech.edu/bitstream/handle/1853/3281/02-20.pdf





Intuitive Explanation of EM

We want to maximize the posterior probability of the parameters Θ = {θ1, …, θn} given the data U, marginalizing over J:


$$\Theta^* = \arg\max_{\Theta} P(\Theta|\textbf{U}) = \arg\max_{\Theta} \sum_{\textbf{J}} P(\Theta,\textbf{J}\,|\,\textbf{U}) \qquad (1)$$


Instead of finding the best J given an estimate Θ at each iteration, EM computes a distribution over the n-dimensional space $\mathcal{J}^n$.
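To make the idea of a soft assignment concrete, the small NumPy/SciPy sketch below (a hypothetical illustration, not code from the report) takes a 1-D two-component Gaussian mixture with a current guess Θᵗ and computes, for each data point, the full distribution over its hidden component label instead of committing to the single best label; the data values and parameters are made up.

    import numpy as np
    from scipy.stats import norm

    # Hypothetical illustration: soft vs. hard assignment of hidden labels
    # for a 1-D two-component Gaussian mixture at the current guess Theta^t.
    u = np.array([-2.1, -1.7, 0.2, 1.9, 2.4])   # observed data U (made up)
    weights = np.array([0.5, 0.5])               # mixing weights in Theta^t
    means = np.array([-2.0, 2.0])                # component means in Theta^t
    sigmas = np.array([1.0, 1.0])                # component std devs in Theta^t

    # P(u_i, j | Theta^t) for every data point i and component label j
    joint = weights * norm.pdf(u[:, None], means, sigmas)

    # Hard assignment: the single best label per point (what EM does NOT do)
    best_j = joint.argmax(axis=1)

    # Soft assignment: the full distribution P(j | u_i, Theta^t) per point
    responsibilities = joint / joint.sum(axis=1, keepdims=True)

    print(best_j)
    print(responsibilities.round(3))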



EM as Lower Bound Maximization

EM can be derived in many different ways, one of the most insightful being in terms of lower bound maximization (Neal and Hinton, 1998; Minka, 1998).


The goal is to maximize the posterior probability (1) of the parameters Θ given the data U, in the presence of hidden data J. Equivalently, we can maximize the logarithm of the joint distribution (which is proportional to the posterior):


$$\Theta^* = \arg\max_{\Theta} \log P(\textbf{U},\Theta) = \arg\max_{\Theta} \log \sum_{\textbf{J}} P(\textbf{U},\textbf{J},\Theta) \qquad (2)$$


Note that the key problem with maximizing (2) is that it involves the logarithm of a (big) sum, which is difficult to deal with.

To derive the bound, first trivially rewrite log P(U,Θ) as

$$\log P(\textbf{U},\Theta) = \log \sum_{\textbf{J}} P(\textbf{U},\textbf{J},\Theta) = \log \sum_{\textbf{J}} f^t(\textbf{J})\,\frac{P(\textbf{U},\textbf{J},\Theta)}{f^t(\textbf{J})}$$



where $f^t(\textbf{J})$ is an arbitrary probability distribution over the space $\mathcal{J}^n$ of the hidden variable J.

By Jensen's inequality, we have

$$\log P(\textbf{U},\Theta) \;\geq\; \sum_{\textbf{J}} f^t(\textbf{J})\,\log\frac{P(\textbf{U},\textbf{J},\Theta)}{f^t(\textbf{J})} \;=:\; B(\Theta;\Theta^t)$$

which defines the lower bound $B(\Theta;\Theta^t)$; its dependence on $\Theta^t$ enters through the choice of $f^t$.
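As a quick numerical sanity check of this inequality (not part of the report), the sketch below draws arbitrary positive values standing in for P(U,J,Θ) over a small discrete space of J, together with an arbitrary distribution f(J), and verifies that the resulting bound never exceeds log P(U,Θ):

    import numpy as np

    # Sanity check of the Jensen bound on made-up numbers (not from the report).
    rng = np.random.default_rng(0)
    p_joint = rng.random(8) + 1e-3      # stands in for P(U, J, Theta), J = 1..8
    f = rng.random(8) + 1e-3
    f /= f.sum()                        # arbitrary distribution f(J)

    log_objective = np.log(p_joint.sum())           # log P(U, Theta)
    lower_bound = np.sum(f * np.log(p_joint / f))   # B(Theta; Theta^t)

    assert lower_bound <= log_objective
    print(log_objective, lower_bound)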



Finding an Optimal Bound

EM tries to find the best bound, defined as the bound $B(\Theta;\Theta^t)$ that touches the objective function $\log P(\textbf{U},\Theta)$ at the current guess $\Theta^t$.



Introducing a Lagrange multiplier λ to enforce the constraint

$$\sum_{\textbf{J}} f^t(\textbf{J}) = 1,$$



the objective becomes

$$\lambda\Big(1 - \sum_{\textbf{J}} f^t(\textbf{J})\Big) + \sum_{\textbf{J}} f^t(\textbf{J})\,\log\frac{P(\textbf{U},\textbf{J},\Theta^t)}{f^t(\textbf{J})}$$



Taking the derivative with respect to $f^t(\textbf{J})$ and setting it to zero,

$$\log P(\textbf{U},\textbf{J},\Theta^t) - \log f^t(\textbf{J}) - 1 - \lambda = 0,$$

and solving for $f^t(\textbf{J})$ (using the constraint to eliminate λ) gives

$$f^t(\textbf{J}) = \frac{P(\textbf{U},\textbf{J},\Theta^t)}{\sum_{\textbf{J}} P(\textbf{U},\textbf{J},\Theta^t)} = P(\textbf{J}\,|\,\textbf{U},\Theta^t)$$




We see that the resulting optimal bound at $\Theta^t$ indeed touches the objective function:

$$B(\Theta^t;\Theta^t) = \sum_{\textbf{J}} P(\textbf{J}|\textbf{U},\Theta^t)\,\log\frac{P(\textbf{U},\textbf{J},\Theta^t)}{P(\textbf{J}|\textbf{U},\Theta^t)} = \sum_{\textbf{J}} P(\textbf{J}|\textbf{U},\Theta^t)\,\log P(\textbf{U},\Theta^t) = \log P(\textbf{U},\Theta^t)$$
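The same kind of numerical check as above (made-up numbers, not from the report) confirms that choosing $f^t(\textbf{J}) = P(\textbf{J}|\textbf{U},\Theta^t)$ makes the bound equal $\log P(\textbf{U},\Theta^t)$:

    import numpy as np

    # With the optimal f^t(J) = P(J | U, Theta^t) the bound touches the
    # objective at Theta^t (made-up stand-in numbers, not from the report).
    rng = np.random.default_rng(1)
    p_joint = rng.random(8) + 1e-3          # stands in for P(U, J, Theta^t)
    f_opt = p_joint / p_joint.sum()         # P(J | U, Theta^t)

    log_objective = np.log(p_joint.sum())   # log P(U, Theta^t)
    bound_at_theta_t = np.sum(f_opt * np.log(p_joint / f_opt))

    print(np.isclose(bound_at_theta_t, log_objective))   # True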



Maximizing the Bound


Substituting the optimal choice $f^t(\textbf{J}) = P(\textbf{J}|\textbf{U},\Theta^t)$ into the bound and expanding, we obtain

$$\begin{aligned}
B(\Theta;\Theta^t) &= \sum_{\textbf{J}} f^t(\textbf{J})\,\log\frac{P(\textbf{U},\textbf{J},\Theta)}{f^t(\textbf{J})} \\
&= <\log P(\textbf{U},\textbf{J},\Theta)> - <\log f^t(\textbf{J})> \\
&= <\log P(\textbf{U},\textbf{J}|\Theta)> + \log P(\Theta) - <\log f^t(\textbf{J})> \\
&= <\log P(\textbf{U},\textbf{J}|\Theta)> + \log P(\Theta) + H
\end{aligned} \qquad (3)$$


where

  • $<.>$ denotes the expectation with respect to $f^t(\textbf{J}) = P(\textbf{J}|\textbf{U},\Theta^t)$,

  • $P(\Theta)$ is the prior on the parameters $\Theta$, and

  • $H := -<\log f^t(\textbf{J})>$ is the entropy of the distribution $f^t(\textbf{J})$.


Since $H$ does not depend on $\Theta$, we consider the first two terms only:

$$\Theta^{t+1} = \arg\max_{\Theta} \Big[ <\log P(\textbf{U},\textbf{J}|\Theta)> + \log P(\Theta) \Big]$$



For your reference, on the third line of (3) the last term, $<\log f^t(\textbf{J})>$, does not depend on $\Theta$.

So, $<\log P(\textbf{U},\textbf{J},\Theta)>$ can be rewritten as $<\log P(\textbf{U},\textbf{J}|\Theta)> + \log P(\Theta)$.
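To make the M-step concrete, here is a minimal sketch (my own illustration, not the report's) that instantiates $<\log P(\textbf{U},\textbf{J}|\Theta)>$ for a 1-D two-component, unit-variance Gaussian mixture with a flat prior $P(\Theta)$, where only the first component's mean θ is free; a grid search over θ agrees with the familiar responsibility-weighted-mean update.

    import numpy as np
    from scipy.stats import norm

    # Illustrative M-step (assumptions: 1-D mixture of two unit-variance
    # Gaussians, equal mixing weights, flat prior; only component 1's mean
    # theta is free). Data and current guess are made up.
    rng = np.random.default_rng(0)
    u = np.concatenate([rng.normal(-2.0, 1.0, 100), rng.normal(2.0, 1.0, 100)])

    theta_t = -1.0                                       # current guess for mean 1
    joint = 0.5 * norm.pdf(u[:, None], [theta_t, 2.0], 1.0)
    r = joint / joint.sum(axis=1, keepdims=True)         # E-step: f^t(J) per point

    def expected_loglik(theta):
        # the theta-dependent part of <log P(U, J | theta)>
        return np.sum(r[:, 0] * norm.logpdf(u, theta, 1.0))

    grid = np.linspace(-4.0, 4.0, 2001)
    theta_grid = grid[np.argmax([expected_loglik(th) for th in grid])]
    theta_closed = np.sum(r[:, 0] * u) / np.sum(r[:, 0])  # weighted-mean update

    print(theta_grid, theta_closed)   # agree up to the grid resolution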


The EM algorithm

Since the bound (3) is expressed in terms of an expectation, the two steps of each EM iteration are:

  1. Expectation step: first find an optimal lower bound $B(\Theta;\Theta^t)$ at the current guess $\Theta^t$,
  2. Maximization step: and then maximize this bound to obtain an improved estimate $\Theta^{t+1}$.


The EM algorithm can thus be conveniently summarized as:

  • E-step: calculate $f^t(\textbf{J}) = P(\textbf{J}|\textbf{U},\Theta^t)$

  • M-step: find $\Theta^{t+1} = \arg\max_{\Theta} <\log P(\textbf{U},\textbf{J},\Theta)>$
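Putting the two steps together, the sketch below shows one way the iteration can look for a 1-D two-component Gaussian mixture with a flat prior on Θ (so the MAP and maximum-likelihood updates coincide); the model, data, and variable names are illustrative assumptions, not code from the report.

    import numpy as np
    from scipy.stats import norm

    # Minimal EM sketch (assumptions: 1-D two-component Gaussian mixture,
    # flat prior on Theta). Data and starting values are made up.
    rng = np.random.default_rng(0)
    u = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(2.0, 1.0, 300)])

    w = np.array([0.5, 0.5])        # mixing weights
    mu = np.array([-1.0, 1.0])      # component means
    sigma = np.array([1.0, 1.0])    # component standard deviations

    for t in range(50):
        # E-step: f^t(J) = P(J | U, Theta^t), the per-point responsibilities
        joint = w * norm.pdf(u[:, None], mu, sigma)       # P(u_i, j | Theta^t)
        r = joint / joint.sum(axis=1, keepdims=True)

        # M-step: maximize <log P(U, J | Theta)> (flat prior), in closed form
        n_k = r.sum(axis=0)
        w = n_k / len(u)
        mu = (r * u[:, None]).sum(axis=0) / n_k
        sigma = np.sqrt((r * (u[:, None] - mu) ** 2).sum(axis=0) / n_k)

    print(w.round(3), mu.round(3), sigma.round(3))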



References

Neal, R. M. and Hinton, G. E. (1998). A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. I. Jordan, editor, Learning in Graphical Models. Kluwer Academic Publishers.

Minka, T. P. (1998). Expectation-Maximization as lower bound maximization. Tutorial note.