Kullback–Leibler divergence

In probability theory and information theory, the Kullback–Leibler divergence (also called information divergence, information gain, relative entropy, or KLIC) is a non-symmetric measure of the difference between two probability distributions P and Q. Specifically, the Kullback–Leibler divergence of Q from P, denoted D_KL(P‖Q), is a measure of the information lost when Q is used to approximate P: it is the expected number of extra bits required to code samples from P when using a code based on Q rather than a code based on P. Typically P represents the "true" distribution of data, observations, or a precisely calculated theoretical distribution, while Q represents a theory, model, description, or approximation of P.
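As a concrete illustration of this coding interpretation, the following is a minimal sketch in Python using base-2 logarithms, so the result is in bits; the function name kl_divergence and the toy distributions are illustrative choices, not part of any standard library:

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) in bits for discrete distributions given as arrays.

    Assumes p and q are valid probability vectors over the same support,
    with q(i) > 0 wherever p(i) > 0 (otherwise the divergence is infinite).
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with p(i) = 0 contribute nothing to the sum
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

# P plays the role of the "true" distribution, Q of an approximation to it.
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(kl_divergence(p, q))  # ~0.036 extra bits per sample when coding P with a code built for Q
```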

Although it is often intuited as a metric or distance, the KL divergence is not a true metric: it is not symmetric (the KL divergence from P to Q is generally not the same as the KL divergence from Q to P), and it does not satisfy the triangle inequality. However, its infinitesimal form, specifically its Hessian, gives a metric tensor known as the Fisher information metric.
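The asymmetry is easy to see numerically. A quick check using SciPy's scipy.stats.entropy (which computes the KL divergence when a second distribution is supplied) might look like this; the two example distributions are arbitrary:

```python
from scipy.stats import entropy

p = [0.9, 0.1]
q = [0.5, 0.5]

# entropy(p, q) computes D_KL(P || Q); swapping the arguments gives D_KL(Q || P).
print(entropy(p, q, base=2))  # ~0.531 bits
print(entropy(q, p, base=2))  # ~0.737 bits: a different value
```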

KL divergence is a special case of a broader class of divergences called f-divergences, and it also belongs to the class of Bregman divergences. It was originally introduced by Solomon Kullback and Richard Leibler in 1951 as the directed divergence between two distributions.
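For reference, an f-divergence has the general form below (discrete case shown), and the choice f(t) = t ln t recovers the KL divergence:

D_f(P \,\|\, Q) \;=\; \sum_i Q(i)\, f\!\left(\frac{P(i)}{Q(i)}\right),
\qquad
f(t) = t \ln t
\;\Longrightarrow\;
D_f(P \,\|\, Q) = \sum_i P(i) \ln \frac{P(i)}{Q(i)} = D_{\mathrm{KL}}(P \,\|\, Q).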



D_{\mathrm{KL}}(P \,\|\, Q) \;=\; \sum_i P(i) \ln \frac{P(i)}{Q(i)} \;=\; H(P, Q) - H(P),

where H(P, Q) is called the cross entropy of P and Q, and H(P) is the entropy of P.
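A minimal numerical sanity check of this identity, for an arbitrary pair of small discrete distributions, might look like the following sketch:

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])    # "true" distribution P
q = np.array([0.5, 0.25, 0.25])  # approximating distribution Q

H_p  = -np.sum(p * np.log2(p))      # entropy of P, in bits
H_pq = -np.sum(p * np.log2(q))      # cross entropy of P and Q, in bits
d_kl = np.sum(p * np.log2(p / q))   # KL divergence of Q from P, in bits

print(np.isclose(d_kl, H_pq - H_p))  # True: D_KL(P||Q) = H(P,Q) - H(P)
```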



Illustration of the Kullback–Leibler (KL) divergence for two normal distributions; the typical asymmetry of the KL divergence is clearly visible.


Computing the closed form

For many common families of distributions, the KL divergence between two distributions in the family can be derived in closed form. This can often be done most easily by writing the KL divergence in terms of expected values or in terms of information entropy:



D_{\mathrm{KL}}(P \,\|\, Q) \;=\; \mathbb{E}_P\!\left[\ln \frac{p(X)}{q(X)}\right] \;=\; H(P, Q) - H(P),

where H(P) = -\mathbb{E}_P[\ln p(X)] is the information entropy of P, and H(P, Q) is the cross entropy of P and Q.
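As a standard worked example, for two univariate normal distributions (as in the figure above) the closed form, in nats, is

D_{\mathrm{KL}}\!\left(\mathcal{N}(\mu_1,\sigma_1^2)\,\big\|\,\mathcal{N}(\mu_2,\sigma_2^2)\right)
\;=\; \ln\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}.

The sketch below implements this formula and checks it against direct numerical integration; the function name kl_normal and the parameter values are illustrative:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def kl_normal(mu1, sigma1, mu2, sigma2):
    """Closed-form D_KL( N(mu1, sigma1^2) || N(mu2, sigma2^2) ), in nats."""
    return (np.log(sigma2 / sigma1)
            + (sigma1**2 + (mu1 - mu2)**2) / (2 * sigma2**2)
            - 0.5)

# Compare against numerical integration of p(x) * ln(p(x) / q(x)).
mu1, s1, mu2, s2 = 0.0, 1.0, 1.0, 2.0
p = norm(mu1, s1)
q = norm(mu2, s2)
integral, _ = quad(lambda x: p.pdf(x) * np.log(p.pdf(x) / q.pdf(x)), -20, 20)

print(kl_normal(mu1, s1, mu2, s2))  # ~0.4431 nats
print(integral)                     # agrees to numerical precision
```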



Cross entropy

In information theory, the cross entropy between two probability distributions measures the average number of bits needed to identify an event drawn from a set of possibilities, when the coding scheme is based on a given probability distribution q rather than the "true" distribution p:

H(P, Q) \;=\; -\mathbb{E}_P[\ln q(X)] \;=\; -\sum_x p(x) \ln q(x) \;=\; H(P) + D_{\mathrm{KL}}(P \,\|\, Q).