A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models


Jeff A. Bilmes


Berkeley, California

April, 1998


http://lasa.epfl.ch/teaching/lectures/ML_Phd/Notes/GP-GMM.pdf





Maximum-likelihood

In the maximum-likelihood problem, our goal is to find the Θ that maximizes the likelihood L(Θ|X).

(Often we maximize log-likelihood instead because it is analytically easier.)



  • p — the density function governing the data (here, a Gaussian distribution)
  • Θ — its parameters; for a Gaussian, Θ = (μ, σ²)
  • N — size of dataset X
  • L(Θ|X) — likelihood function of the parameters given the data
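Putting these together, and assuming the N data points are drawn independently from p (as the tutorial does), the likelihood and the maximum-likelihood estimate take the standard form:

\[
L(\Theta \mid X) = p(X \mid \Theta) = \prod_{i=1}^{N} p(x_i \mid \Theta),
\qquad
\Theta^{*} = \arg\max_{\Theta} \, L(\Theta \mid X)
\]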



Basic EM

There are two main applications of the EM algorithm.

  • The first occurs when the data indeed has missing values, due to problems with or limitations of the observation process.
  • The second occurs when optimizing the likelihood function is analytically intractable but when the likelihood function can be simplified by assuming the existence of and values for additional but missing (or hidden) parameters.

The latter application is more common in the computational pattern recognition community.


Expectation step

We must assume a joint relationship between the missing and observed values, i.e., a complete-data likelihood p(X,Y|Θ); the expectation step computes the expected value of its log given the observed data and the current parameter estimates.
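In the usual notation, this expected complete-data log-likelihood is the Q-function:

\[
Q(\Theta, \Theta^{(i-1)})
= E\!\left[ \log p(X, Y \mid \Theta) \,\middle|\, X, \Theta^{(i-1)} \right]
= \int_{y \in \Upsilon} \log p(X, y \mid \Theta) \, f(y \mid X, \Theta^{(i-1)}) \, dy
\]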




  • f(y|X, Θ(i-1)) — the marginal distribution of the unobserved data, given the observed data X and the current parameter estimates Θ(i-1)
  • X — observed data (incomplete)
  • Y — missing (or hidden) information (unknown, random)
  • Υ — the space of values y


As an aside, consider h(θ,Y), where θ is a constant and Y is a random variable governed by some distribution fY(y).
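The Q-function is an expectation of exactly this form: averaging over Y yields a deterministic function of the parameter that can then be maximized,

\[
q(\theta) = E_{Y}\!\left[ h(\theta, Y) \right] = \int_{y} h(\theta, y) \, f_Y(y) \, dy
\]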



Maximization step

In the maximization step, we find the Θ that maximizes the expectation computed in the first step.
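That is, at iteration i the M-step computes

\[
\Theta^{(i)} = \arg\max_{\Theta} \; Q(\Theta, \Theta^{(i-1)})
\]

and the two steps are repeated. Each iteration is guaranteed not to decrease the log-likelihood, and the algorithm converges to a local maximum of the likelihood function.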




Finding Maximum Likelihood Mixture Densities Parameters via EM

The mixture-density parameter estimation problem is probably one of the most widely used applications of the EM algorithm in the computational pattern recognition community. In this case, we assume the following probabilistic model:
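Specifically, the assumed mixture density has the standard form

\[
p(x \mid \Theta) = \sum_{i=1}^{M} \alpha_i \, p_i(x \mid \theta_i)
\]

where the symbols are as follows.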



  • x — a single data point; the dataset is X = {x1, x2, …, xN}
  • Θ = (α1, …, αM, θ1, …, θM) — the full parameter set
  • pi — the ith component density function, parameterized by θi
  • αi — mixing coefficients (∑ αi = 1)


The incomplete-data log-likelihood expression for this density from the data X is given by:
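Under the i.i.d. assumption this is

\[
\log L(\Theta \mid X) = \log \prod_{i=1}^{N} p(x_i \mid \Theta)
= \sum_{i=1}^{N} \log\!\left( \sum_{j=1}^{M} \alpha_j \, p_j(x_i \mid \theta_j) \right)
\]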



  • N — size of dataset
  • M — number of mixtures (component densities mixed together)


However, this expression is difficult to optimize directly because it contains the log of a sum.


We consider X as incomplete, and posit the existence of unobserved data items Y = {y1, …, yN}.

(Y tells us which mixture generated each data item.)

We assume that yi ∈ {1, …, M} for each i, and yi = k if the ith sample was generated by the kth mixture component. If we knew the values of Y, the likelihood would become:
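In the standard form, with Y known no log of a sum remains:

\[
\log L(\Theta \mid X, Y)
= \sum_{i=1}^{N} \log\!\left( P(x_i \mid y_i)\, P(y_i) \right)
= \sum_{i=1}^{N} \log\!\left( \alpha_{y_i} \, p_{y_i}(x_i \mid \theta_{y_i}) \right)
\]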



But, the problem is that we do NOT KNOW the values of Y.

However, if we assume Y is a random vector, we can proceed.


First, we guess parameters Θg = (α1g, …, αMg, θ1g, …, θMg) for the likelihood L(Θg|X,Y).

Given Θg, we can easily compute pj(xi|θjg) for each i and j. In addition, the mixing parameters αj can be thought of as prior probabilities of each mixture component, that is, αj = p(component j). Therefore, using Bayes's rule, we can compute:
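Following the usual derivation, the per-sample posterior and (by independence) the joint posterior over the whole vector y are:

\[
p(y_i \mid x_i, \Theta^g)
= \frac{\alpha_{y_i}^{g} \, p_{y_i}(x_i \mid \theta_{y_i}^{g})}{p(x_i \mid \Theta^g)}
= \frac{\alpha_{y_i}^{g} \, p_{y_i}(x_i \mid \theta_{y_i}^{g})}{\sum_{k=1}^{M} \alpha_k^{g} \, p_k(x_i \mid \theta_k^{g})},
\qquad
p(y \mid X, \Theta^g) = \prod_{i=1}^{N} p(y_i \mid x_i, \Theta^g)
\]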




where y = (y1, …, yN) is an instance of the unobserved data, drawn independently for each sample.


p(y|X,Θg) is the desired marginal density, obtained by assuming the existence of the hidden variables and making a guess at the initial parameters of their distribution.


We can write Q(Θ,Θg) as:
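Starting from the definition of Q, substituting the complete-data log-likelihood, and using the Kronecker delta to pick out the component yi = l, this takes the form

\[
Q(\Theta, \Theta^g)
= \sum_{y \in \Upsilon} \log L(\Theta \mid X, y)\; p(y \mid X, \Theta^g)
= \sum_{l=1}^{M} \sum_{i=1}^{N} \log\!\left( \alpha_l \, p_l(x_i \mid \theta_l) \right)
\sum_{y \in \Upsilon} \delta_{l, y_i} \prod_{j=1}^{N} p(y_j \mid x_j, \Theta^g)
\]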



  • N — size of dataset
  • M — number of mixtures (component densities mixed together)
  • δl,yi — Kronecker delta: equal to 1 if yi = l and 0 otherwise (it picks out the component assumed to have generated xi)


This form may look fairly daunting, yet it can be greatly simplified. We first note that, for each l ∈ 1, …, M and each i, the inner sum over all configurations y collapses to the single posterior p(l|xi,Θg), since ∑yj p(yj|xj,Θg) = 1 for every j.
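Spelled out, the identity being used is

\[
\sum_{y \in \Upsilon} \delta_{l, y_i} \prod_{j=1}^{N} p(y_j \mid x_j, \Theta^g)
= \left( \prod_{j \neq i} \sum_{y_j = 1}^{M} p(y_j \mid x_j, \Theta^g) \right) p(l \mid x_i, \Theta^g)
= p(l \mid x_i, \Theta^g)
\]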


The posterior probability p(l|xi,Θg) is called a "responsibility".

(How responsible is component l for the data point xi?)


Using the above equation, we can rewrite Q(Θ,Θg) as:
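With that simplification, Q splits into a term involving only the αl and a term involving only the θl:

\[
Q(\Theta, \Theta^g)
= \sum_{l=1}^{M} \sum_{i=1}^{N} \log(\alpha_l)\, p(l \mid x_i, \Theta^g)
+ \sum_{l=1}^{M} \sum_{i=1}^{N} \log\!\left( p_l(x_i \mid \theta_l) \right) p(l \mid x_i, \Theta^g)
\]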



To maximize this expression, we can maximize the term containing αl and the term containing θl independently since they are not related.


To find the expression for αl, we introduce the Lagrange multiplier λ with the constraint that ∑l αl = 1, and solve the following equation:
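In the standard form, the stationarity condition and its immediate consequence are

\[
\frac{\partial}{\partial \alpha_l}
\left[ \sum_{l=1}^{M} \sum_{i=1}^{N} \log(\alpha_l)\, p(l \mid x_i, \Theta^g)
+ \lambda \left( \sum_{l=1}^{M} \alpha_l - 1 \right) \right] = 0
\quad\Longrightarrow\quad
\sum_{i=1}^{N} \frac{1}{\alpha_l}\, p(l \mid x_i, \Theta^g) + \lambda = 0
\]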




Summing both sides over l, we get that λ = -N, resulting in:
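The resulting update is the average responsibility assigned to component l:

\[
\alpha_l^{new} = \frac{1}{N} \sum_{i=1}^{N} p(l \mid x_i, \Theta^g)
\]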



For some distributions, it is possible to obtain analytical expressions for θl as functions of everything else. For example, assume d-dimensional Gaussian component distributions with mean μ and covariance matrix Σ, i.e., θ = (μ, Σ).
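In that case each component density is the familiar multivariate normal:

\[
p_l(x \mid \mu_l, \Sigma_l)
= \frac{1}{(2\pi)^{d/2} \, |\Sigma_l|^{1/2}}
\exp\!\left( -\tfrac{1}{2} (x - \mu_l)^{T} \Sigma_l^{-1} (x - \mu_l) \right)
\]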



Taking the log of pl(x|θl), ignoring any constant terms (since they disappear after taking derivatives), and substituting into the right side of Q(Θ,Θg), we get:
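Keeping only the terms that depend on θl = (μl, Σl), the expression to maximize is

\[
\sum_{l=1}^{M} \sum_{i=1}^{N}
\left[ -\tfrac{1}{2} \log |\Sigma_l|
- \tfrac{1}{2} (x_i - \mu_l)^{T} \Sigma_l^{-1} (x_i - \mu_l) \right]
p(l \mid x_i, \Theta^g)
\]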



Taking the derivative of the above with respect to μl and setting it equal to zero, we get:
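Since the derivative of -½(xi - μl)ᵀΣl⁻¹(xi - μl) with respect to μl is Σl⁻¹(xi - μl), the condition reads

\[
\sum_{i=1}^{N} \Sigma_l^{-1} (x_i - \mu_l)\, p(l \mid x_i, \Theta^g) = 0
\]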



with which we can easily solve for μl to obtain:
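The result is the responsibility-weighted sample mean:

\[
\mu_l^{new}
= \frac{\sum_{i=1}^{N} x_i \, p(l \mid x_i, \Theta^g)}
       {\sum_{i=1}^{N} p(l \mid x_i, \Theta^g)}
\]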



Then, taking the derivative of the above with respect to Σl⁻¹ and setting it equal to zero, we obtain:
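This is the responsibility-weighted sample covariance, computed about the new mean:

\[
\Sigma_l^{new}
= \frac{\sum_{i=1}^{N} p(l \mid x_i, \Theta^g)\, (x_i - \mu_l^{new})(x_i - \mu_l^{new})^{T}}
       {\sum_{i=1}^{N} p(l \mid x_i, \Theta^g)}
\]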



Summary

Summarizing, the estimates of the new parameters in terms of the old parameters are as follows:
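Collected together, the updates derived above are

\[
\alpha_l^{new} = \frac{1}{N} \sum_{i=1}^{N} p(l \mid x_i, \Theta^g),
\qquad
\mu_l^{new} = \frac{\sum_{i=1}^{N} x_i \, p(l \mid x_i, \Theta^g)}{\sum_{i=1}^{N} p(l \mid x_i, \Theta^g)},
\qquad
\Sigma_l^{new} = \frac{\sum_{i=1}^{N} p(l \mid x_i, \Theta^g)\,(x_i - \mu_l^{new})(x_i - \mu_l^{new})^{T}}{\sum_{i=1}^{N} p(l \mid x_i, \Theta^g)}
\]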


Note that the above equations perform both the expectation step and the maximization step simultaneously. The algorithm proceeds by using the newly derived parameters as the guess for the next iteration.
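As a concrete illustration, and not something taken from the tutorial itself, the following is a minimal NumPy sketch of one such iteration for a Gaussian mixture; the function name em_step and the array layout are arbitrary choices, and a practical implementation would work in log space and guard against singular covariance matrices.

    import numpy as np

    def em_step(X, alphas, mus, sigmas):
        """One EM iteration for a Gaussian mixture.

        X      : (N, d) data matrix
        alphas : (M,)   mixing coefficients, summing to 1
        mus    : (M, d) component means
        sigmas : (M, d, d) component covariance matrices
        Returns the updated (alphas, mus, sigmas).
        """
        N, d = X.shape
        M = len(alphas)

        # E-step: responsibilities p(l | x_i, Theta^g), stored as an (N, M) matrix.
        resp = np.zeros((N, M))
        for l in range(M):
            diff = X - mus[l]                                      # (N, d)
            inv = np.linalg.inv(sigmas[l])
            norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(sigmas[l]))
            mahal = np.sum(diff @ inv * diff, axis=1)              # squared Mahalanobis distances
            resp[:, l] = alphas[l] * norm * np.exp(-0.5 * mahal)
        resp /= resp.sum(axis=1, keepdims=True)                    # normalize over components

        # M-step: closed-form updates derived above.
        Nl = resp.sum(axis=0)                                      # effective count per component
        new_alphas = Nl / N
        new_mus = (resp.T @ X) / Nl[:, None]
        new_sigmas = np.zeros_like(sigmas)
        for l in range(M):
            diff = X - new_mus[l]
            new_sigmas[l] = (resp[:, l, None] * diff).T @ diff / Nl[l]
        return new_alphas, new_mus, new_sigmas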



Learning the parameters of an HMM, EM, and the Baum-Welch algorithm

Not ready yet



References