In statistics, an expectation–maximization (EM) algorithm is an iterative method for finding maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved latent variables. The EM iteration alternates between performing an expectation (E) step, which creates a function for the expectation of the log-likelihood evaluated using the current estimate for the parameters, and a maximization (M) step, which computes parameters maximizing the expected log-likelihood found on the E step. These parameter-estimates are then used to determine the distribution of the latent variables in the next E step.



Mixture model

In statistics, a mixture model is a probabilistic model for representing the presence of sub-populations within an overall population, without requiring that an observed data-set should identify the sub-population to which an individual observation belongs. Formally a mixture model corresponds to the mixture distribution that represents the probability distribution of observations in the overall population.



[Animation] The EM algorithm fitting a two-component Gaussian mixture model to the Old Faithful data set, stepping from a random initialization to convergence.


Gaussian Mixture p(U)

http://en.wikipedia.org/wiki/Mixture_model#Gaussian_mixture_model

              p(Un) = Σj πj f(Un; μj, σj²)        (sum over j = 1, ... , J, with Σj πj = 1)


  • U: observed data (n = 1, ... , N)
  • f: normal distribution function (pdf)
  • πj: mixing coefficient (j = 1, ... , J)



EM for Gaussian Mixtures
  1. Initialize the Gaussian parameters (there are J Gaussians): means μj, covariances σj², and mixing coefficients πj
    (tip: use a k-means result to initialize: μj ← centroid of cluster j, σj² ← cov(cluster j), πj ← (number of data points in cluster j) / N)


  2. E Step: Assign each data point Un an assignment score γnj for each cluster j
    (γnj is called a "responsibility": how much is Gaussian j responsible for the data point Un?)

              γnj = πj f(Un; μj, σj²) / Σk πk f(Un; μk, σk²)

  3. M Step: Given the scores, re-estimate μj, σj², and πj for each cluster j

              Nj = Σn γnj
              μj = (1/Nj) Σn γnj Un
              σj² = (1/Nj) Σn γnj (Un − μj)²
              πj = Nj / N

  4. Evaluate the log-likelihood. If the likelihood (or the parameters) has converged, stop iterating.
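
As a concrete illustration of steps 1-4, here is a minimal NumPy sketch of EM for a one-dimensional Gaussian mixture with J components (the function name em_gmm_1d and the random initialization are our own choices, not taken from the original text):

    import numpy as np
    from scipy.stats import norm

    def em_gmm_1d(U, J, n_iter=100, tol=1e-6, seed=0):
        # Minimal EM for a one-dimensional Gaussian mixture with J components.
        U = np.asarray(U, dtype=float)
        N = len(U)
        rng = np.random.default_rng(seed)
        # 1. Initialization (random here; a k-means result could be used instead, as in the tip above)
        mu = rng.choice(U, size=J, replace=False)
        var = np.full(J, U.var())
        pi = np.full(J, 1.0 / J)
        prev_ll = -np.inf
        for _ in range(n_iter):
            # 2. E step: responsibilities gamma[n, j]
            dens = np.stack([pi[j] * norm.pdf(U, mu[j], np.sqrt(var[j])) for j in range(J)], axis=1)
            gamma = dens / dens.sum(axis=1, keepdims=True)
            # 3. M step: re-estimate mu, sigma^2 and pi from the responsibilities
            Nj = gamma.sum(axis=0)
            mu = (gamma * U[:, None]).sum(axis=0) / Nj
            var = (gamma * (U[:, None] - mu) ** 2).sum(axis=0) / Nj
            pi = Nj / N
            # 4. Evaluate the log-likelihood under the updated parameters; stop when it converges
            dens = np.stack([pi[j] * norm.pdf(U, mu[j], np.sqrt(var[j])) for j in range(J)], axis=1)
            ll = np.log(dens.sum(axis=1)).sum()
            if abs(ll - prev_ll) < tol:
                break
            prev_ll = ll
        return mu, var, pi

For instance, em_gmm_1d(waiting_times, J=2) would fit a two-component mixture of the kind shown for the Old Faithful data, assuming the waiting times have been loaded into a NumPy array.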



Terminology
Likelihood function

In statistics, a likelihood function (often simply the likelihood) is a function of the parameters of a statistical model, defined as follows: the likelihood of a set of parameter values θ given some observed outcomes x is equal to the probability of those observed outcomes given those parameter values, i.e.

        L(θ|x) = P(x|θ)

The likelihood L(θ|x) is often written as L(θ;x) to emphasize the fact that this value is not a conditional probability, because θ is a parameter and not a random variable.


Definition:

(Data x is generated from the probability distribution defined by a parameter θ.)

  • Let X be a random variable with a discrete probability distribution p depending on a parameter θ. Then the function

        L(θ|x) = pθ(x) = Pθ(X = x),

    considered as a function of θ, is called the likelihood function (of θ, given the outcome x of X).

  • Let X be a random variable with a continuous probability distribution with density function f depending on a parameter θ. Then the function

        L(θ|x) = fθ(x),

    considered as a function of θ, is called the likelihood function (of θ, given the outcome x of X).

Likelihood functions play a key role in statistical inference, especially methods of estimating a parameter from a set of statistics.

In non-technical parlance, "likelihood" is usually a synonym for "probability." But in statistical usage, a clear technical distinction is made depending on the roles of the outcome or parameter.

  • Use probability when describing a function of the outcome given a fixed parameter value.
    - "Given that I have flipped a coin 100 times and it is a fair coin, what is the probability of it landing heads-up every time?"
  • Use likelihood when describing a function of a parameter given a fixed outcome.
    - "Given that I have flipped a coin 100 times and it has landed heads-up 100 times, what is the likelihood that the coin is fair?"


Incomplete data

(Defined in <A. P. Dempster, N. M. Laird and D. B. Rubin - Maximum Likelihood from Incomplete Data via the EM Algorithm - Journal of the Royal Statistical Society, Series B (1977)>)

The term "incomplete data" in its general form implies the existence of two sample spaces, i.e., X and Y, and a many-one mapping from X to Y.


Missing data

In statistics, missing data, or missing values, occur when no data value is stored for the variable in the current observation. Missing data are a common occurrence and can have a significant effect on the conclusions that can be drawn from the data.

Missing data can occur because of nonresponse: no information is provided for several items or no information is provided for a whole unit.

Missing data reduce the representativeness of the sample and can therefore distort inferences about the population.


Latent variable

In statistics, latent variables (as opposed to observable variables), are variables that are not directly observed but are rather inferred (through a mathematical model) from other variables that are observed (directly measured). Mathematical models that aim to explain observed variables in terms of latent variables are called latent variable models. Latent variable models are used in many disciplines, including psychology, economics, machine learning/artificial intelligence, bioinformatics, natural language processing, management and the social sciences.



Algorithms

Given a statistical model consisting of a set X of observed data, a set of unobserved latent data or missing values Z, and a vector of unknown parameters θ, along with a likelihood function
L(θ;X,Z) = P(X,Z|θ), the maximum-likelihood estimate (MLE) of the unknown parameters is determined by the marginal likelihood of the observed data

        L(θ;X) = P(X|θ) = ΣZ P(X,Z|θ)        (the sum becomes an integral when Z is continuous)

However, this quantity is often intractable.

The EM algorithm seeks to find the maximum-likelihood estimate (MLE) of the marginal likelihood by iteratively applying the following two steps:


  • Expectation step
    Calculate the expected value of the log likelihood function, with respect to the conditional distribution of Z given X under the current estimate of the parameters θ.
  • Maximization step
    Find the parameters that maximize the expected log-likelihood found in the E step.
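
Written out (this is the standard formulation; the Q notation is ours):

        Q(θ|θ(t)) = E[ logL(θ;X,Z) | X, θ(t) ]        (E step)
        θ(t+1) = argmaxθ Q(θ|θ(t))                    (M step)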


If we know the value of the parameters θ, we can usually find the value of the latent variables Z by maximizing the log-likelihood over all possible values of Z, either simply by iterating over Z or through an algorithm such as the Viterbi algorithm for Hidden Markov models. Conversely, if we know the value of the latent variables Z, we can find an estimate of the parameters θ fairly easily, typically by simply grouping the observed data points according to the value of the associated latent variable and averaging the values, or some function of the values, of the points in each group. This suggests an iterative algorithm, in the case where both θ and Z are unknown:


  • First, initialize the parameters θ to some random values.
  • Compute the best value for Z given these parameter values.
  • Then, use the just-computed values of Z to compute a better estimate for the parameters θ. Parameters associated with a particular value of Z will use only those data points whose associated latent variable has that value.
  • Iterate steps 2 and 3 until convergence.


The algorithm as just described monotonically approaches a local minimum of the cost function, and is commonly called hard EM. The k-means algorithm is an example of this class of algorithms.
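
A minimal sketch of this hard-EM scheme in its k-means form (Euclidean distance, one cluster per point; the function name and interface are illustrative):

    import numpy as np

    def hard_em_kmeans(X, k, n_iter=100, seed=0):
        # Hard EM (k-means): X is an (N, D) array of points; each point is assigned to its
        # single best cluster, then the cluster centers are re-estimated.
        X = np.asarray(X, dtype=float)
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            # "E" step (hard): best completion Z = index of the nearest center for each point
            d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            Z = d.argmin(axis=1)
            # "M" step: each center becomes the mean of the points assigned to it
            new_centers = np.array([X[Z == j].mean(axis=0) if np.any(Z == j) else centers[j]
                                    for j in range(k)])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        return centers, Z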



Demonstration of the standard algorithm

Input: the number of clusters k, a database of records, and a stopping tolerance ε (> 0)

Output: a set of k clusters with weights that maximize the log-likelihood function

(1) Initialization

Initialize the parameters to some random values.



(2) Expectation step

(ref. Bayes' theorem)

For each database record x, compute the probability that x belongs to each cluster h (h = 1, 2, …, k).

The probability that a record x belongs to cluster A:

        P(A|x) = f(x; μA, σA) pA / Σh f(x; μh, σh) ph        (h = 1, 2, …, k)

f(x; μ, σ) is the probability density function of a multivariate normal distribution.

f(x; μA, σA) pA represents the normal density of cluster A weighted by the cluster probability pA.
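
As a small numerical illustration (the values are assumed for this note, not taken from the original): with two one-dimensional clusters where μA = 0, σA = 1, pA = 0.6 and μB = 4, σB = 1, pB = 0.4, a record x = 1 gives f(1; μA, σA) ≈ 0.242 and f(1; μB, σB) ≈ 0.0044, hence

        P(A|x) ≈ (0.242 × 0.6) / (0.242 × 0.6 + 0.0044 × 0.4) ≈ 0.99,

so x is assigned almost entirely to cluster A.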




(3) Maximization step

Update the mixture model parameters, weighting each record xi by its membership probability P(A|xi) (sums run over all N records in the database):

        pA = (1/N) Σi P(A|xi)
        μA = Σi P(A|xi) xi / Σi P(A|xi)
        σA = Σi P(A|xi) (xi − μA)(xi − μA)ᵀ / Σi P(A|xi)

(4) Iteration

Iterate steps 2 and 3 until convergence has been reached or the stopping criterion (tolerance ε) is satisfied.




MLE and EM
Maximum Likelihood Estimation, MLE

Estimate the parameters Θ given complete data X:

        ΘMLE = argmaxΘ logP(X|Θ)


Expectation Maximization, EM

Estimate the parameters Θ given incomplete data X and latent (hidden) variables Y:

        Θ = argmaxΘ logP(X|Θ) = argmaxΘ log ΣY P(X,Y|Θ)

Because Y is unknown, we cannot maximize this formula directly. Instead, the iterative method starts from initial parameters and obtains completions of the data using the current parameter estimates. It then treats these completions as if they were correct and applies the regular MLE procedure to estimate the next set of parameters.





k-means and EM algorithms

The k-means algorithm can be viewed as a special (hard-assignment) case of the general EM procedure.

The EM algorithm uses weighted (soft) assignments of the training examples rather than choosing the single best completion.


k-means algorithm
  • Hard clustering
    - An instance belongs to exactly one cluster
  • Based on Euclidean distance
  • Not robust to outliers or to differences in attribute value ranges

EM algorithm
  • Soft clustering
    - An instance belongs to several clusters, each with a membership probability
  • Based on probability density
  • Can handle both numeric and nominal attributes



Example

(http://www.nature.com/nbt/journal/v26/n8/full/nbt1406.html)

Consider a simple coin-flipping experiment in which we are given a pair of coins A and B of unknown biases, θA and θB, respectively (that is, on any given flip, coin A will land on heads with probability θA and tails with probability 1-θA and similarly for coin B). Our goal is to estimate θ = (θA, θB) by repeating the following procedure five times: randomly choose one of the two coins (with equal probability), and perform ten independent coin tosses with the selected coin. Thus, the entire procedure involves a total of 50 coin tosses. (below)
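
Under this setup, if coin A is used for a given set, the number of heads xi in the ten tosses is binomial (this is just the standard binomial formula, not reproduced from the figure):

        P(xi | coin A) = C(10, xi) θA^xi (1 − θA)^(10 − xi),

and similarly for coin B with θB.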


[Figure] Parameter estimation for complete and incomplete data


(a) Maximum likelihood estimation, MLE (for complete data)

(http://en.wikipedia.org/wiki/Maximum_likelihood_estimate)

For each set of ten tosses, the maximum likelihood procedure accumulates the counts of heads and tails for coins A and B separately. These counts are then used to estimate the coin biases.
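
In symbols, the maximum likelihood estimates are simply the observed proportions:

        θ̂A = (number of heads obtained using coin A) / (total number of flips using coin A),

and similarly for θ̂B.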


(b) Expectation maximization (for incomplete data)

Expectation Maximization attempts to find the parameters θ that maximize the log probability logP(x;θ) of the observed data.

  1. EM starts with an initial guess of the parameters.
    Initial parameters: θ(0) = (θA(0), θB(0)); at iteration t the current estimate is written θ(t) = (θA(t), θB(t)).
  2. In the E-step, a probability distribution over possible completions is computed using the current parameters. The counts shown in the table are the expected numbers of heads and tails according to this distribution.

    A vector of observed head counts x = (x1, x2, …, x5), where xi ∈ {0, 1, …, 10}.

    A vector of coin types z = (z1, z2, …, z5), where zi ∈ {A, B}.

    (We refer to z as hidden variables or latent factors.)


    Computing proportions of heads for each coin is no longer possible, because we don't know the coin used for each set of tosses. However, if we had some way of completing the data (in our case, guessing correctly which coin was used in each of the five sets), then we could reduce parameter estimation for this problem with incomplete data to maximum likelihood estimation with complete data.



  3. In the M-step, new parameters are determined using the current completions.
  4. After several repetitions of the E-step and M-step, the algorithm converges.
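
A minimal NumPy sketch of this coin-flipping EM (ten tosses per set; the function name, the default initial guesses, and the example head counts are illustrative, not taken from the figure):

    import numpy as np
    from scipy.stats import binom

    def coin_em(heads, n=10, theta_A=0.6, theta_B=0.5, n_iter=20):
        # EM for the two-coin example: heads[i] = number of heads observed in set i of n tosses.
        heads = np.asarray(heads, dtype=int)
        for _ in range(n_iter):
            # E step: posterior probability that each set was generated by coin A
            # (the two coins are chosen with equal prior probability)
            lik_A = binom.pmf(heads, n, theta_A)
            lik_B = binom.pmf(heads, n, theta_B)
            w_A = lik_A / (lik_A + lik_B)
            w_B = 1.0 - w_A
            # M step: re-estimate the biases from the expected head and tail counts
            theta_A = np.sum(w_A * heads) / np.sum(w_A * n)
            theta_B = np.sum(w_B * heads) / np.sum(w_B * n)
        return theta_A, theta_B

    # Example: five sets summarized by their head counts, e.g.
    # coin_em([5, 9, 8, 4, 7])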


(c) Mathematical foundations

How does the expectation maximization algorithm work? More importantly, why is it even necessary?


The expectation maximization algorithm is a natural generalization of maximum likelihood estimation to the incomplete data case. In particular, expectation maximization attempts to find the parameters θ that maximize the log probability logP(x;θ) of the observed data. Generally speaking, the optimization problem addressed by the expectation maximization algorithm is more difficult than the optimization used in maximum likelihood estimation.

In the complete data case, the objective function logP(x,z;θ) has a single global optimum, which can often be found in closed form (e.g., equation in (a)).

In contrast, in the incomplete data case the function logP(x;θ) has multiple local maxima and no closed form solution.


To deal with this, the expectation maximization algorithm reduces the difficult task of optimizing logP(x;θ) into a sequence of simpler optimization sub-problems, whose objective functions have unique global maxima that can often be computed in closed form. These sub-problems are chosen in a way that guarantees that their corresponding solutions θ(1), θ(2), … will converge to a local optimum of logP(x;θ).


More specifically, the expectation maximization algorithm alternates between two phases.

During the E-step, expectation maximization chooses a function gt that lower bounds logP(x;θ) everywhere, and for which gt(θ(t)) = logP(x;θ(t)).

During the M-step, the expectation maximization algorithm moves to a new parameter set θ(t+1), that maximizes gt. As the value of the lower-bound gt matches the objective function at θ(t), it follows that logP(x;θ(t)) = gt(θ(t)) ≤ gt(θ(t+1)) = logP(x;θ(t+1))—so the objective function monotonically increases during each iteration of expectation maximization!
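
One standard construction of such a gt (via Jensen's inequality, using the posterior over completions at the current parameters; the explicit form given here is ours) is

        gt(θ) = Σz P(z|x; θ(t)) log[ P(x,z; θ) / P(z|x; θ(t)) ]  ≤  logP(x;θ),

with equality at θ = θ(t), which is exactly the property used in the inequality above. Maximizing gt over θ amounts to maximizing the expected complete-data log-likelihood, since the denominator does not depend on θ.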


Other examples: EM examples.pdf



References