In statistics and statistical physics, Gibbs sampling or a Gibbs sampler is a Markov chain Monte Carlo (MCMC) algorithm for obtaining a sequence of observations that are approximately distributed according to a specified multivariate probability distribution (i.e. the joint probability distribution of two or more random variables), when direct sampling from the joint distribution is difficult. This sequence can be used to approximate the joint distribution (e.g., to generate a histogram of the distribution); to approximate the marginal distribution of one of the variables, or of some subset of the variables (for example, the unknown parameters or latent variables); or to compute an integral (such as the expected value of one of the variables). Typically, some of the variables correspond to observations whose values are known and hence do not need to be sampled.


Gibbs sampling is commonly used as a means of statistical inference, especially Bayesian inference. It is a randomized algorithm (i.e. an algorithm that makes use of random numbers, and hence may produce different results each time it is run), and is an alternative to deterministic algorithms for statistical inference such as variational Bayes or the expectation-maximization algorithm (EM).


As with other MCMC algorithms, Gibbs sampling generates a Markov chain of samples, each of which is correlated with nearby samples. As a result, care must be taken if independent samples are desired, typically by thinning the resulting chain, i.e. keeping only every nth value (e.g. every 100th value). In addition (again, as with other MCMC algorithms), samples from the beginning of the chain (the burn-in period) may not accurately represent the desired distribution.
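
As a rough illustration of this post-processing, the sketch below discards a burn-in prefix and thins the remaining draws. The names samples, burn_in, and thin are illustrative choices, not part of any particular library.

    # Minimal sketch of burn-in removal and thinning for an MCMC chain.
    # `samples` is assumed to be a list of draws in the order they were generated;
    # the default values of `burn_in` and `thin` are illustrative, not prescribed.

    def postprocess_chain(samples, burn_in=1000, thin=100):
        """Drop the first `burn_in` draws, then keep every `thin`-th draw."""
        return samples[burn_in::thin]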


Implementation

Gibbs sampling, in its basic incarnation, is a special case of the Metropolis–Hastings algorithm. The point of Gibbs sampling is that, given a multivariate distribution, it is simpler to sample from a conditional distribution than to marginalize by integrating over the joint distribution.

Suppose we want to obtain k samples of X = {x1, ..., xn} from a joint distribution p(x1, ..., xn). Denote the ith sample by X(i) = {x1(i), ..., xn(i)}. We proceed as follows:

  • We begin with some initial value X(0) for each variable.
  • For each sample i = {1, ..., k}, sample each variable xj(i) from the conditional distribution p(xj|x1(i), ..., xj-1(i), xj+1(i-1), ..., xn(i-1)). That is, sample each variable from the distribution of that variable conditioned on all other variables, making use of the most recent values and updating the variable with its new value as soon as it has been sampled.


  • Initial Step
    X(0) = {x1(0), x2(0), ..., xn(0)}
  • Step #1
    x1(i+1) ~ p(x1|x2(i), x3(i), x4(i), ..., xn(i))
    x2(i+1) ~ p(x2|x1(i+1), x3(i), x4(i), ..., xn(i))
    x3(i+1) ~ p(x3|x1(i+1), x2(i+1), x4(i), ..., xn(i))
    ...
    xn(i+1) ~ p(xn|x1(i+1), x2(i+1), x3(i+1), ..., xn-1(i+1))
  • Step #2
    i ← i + 1, go back to Step #1
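
To make these steps concrete, here is a minimal Python sketch for a case where the full conditionals are available in closed form: a bivariate standard normal with correlation rho, for which x1 | x2 ~ Normal(rho·x2, 1 − rho²) and symmetrically for x2. The function name and parameter choices are illustrative, not part of the general algorithm.

    import math
    import random

    def gibbs_bivariate_normal(k, rho=0.8, burn_in=500):
        """Gibbs sampler for a bivariate standard normal with correlation rho.

        Full conditionals (known in closed form for this example):
            x1 | x2 ~ Normal(rho * x2, 1 - rho**2)
            x2 | x1 ~ Normal(rho * x1, 1 - rho**2)
        """
        sd = math.sqrt(1.0 - rho ** 2)   # conditional standard deviation
        x1, x2 = 0.0, 0.0                # initial value X(0)
        samples = []
        for i in range(k + burn_in):
            # Sample each variable from its conditional distribution,
            # always conditioning on the most recent value of the other.
            x1 = random.gauss(rho * x2, sd)
            x2 = random.gauss(rho * x1, sd)
            if i >= burn_in:             # discard the burn-in period
                samples.append((x1, x2))
        return samples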


The samples then approximate the joint distribution of all variables. Furthermore, the marginal distribution of any subset of variables can be approximated by simply examining the samples for that subset of variables, ignoring the rest. In addition, the expected value of any variable can be approximated by averaging over all the samples.
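
Continuing the sketch above, the marginal of one variable and the expected value of another can be estimated directly from the stored draws; the numbers below are illustrative choices.

    samples = gibbs_bivariate_normal(k=10000, rho=0.8)

    # Expected value of x1: average over all draws (close to 0 for this example).
    mean_x1 = sum(x1 for x1, _ in samples) / len(samples)

    # Marginal distribution of x2: ignore x1 and examine the x2 draws alone,
    # e.g. by building a histogram of them.
    x2_draws = [x2 for _, x2 in samples]

    print(mean_x1)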


Relation of conditional distribution and joint distribution

The conditional distribution of one variable given all the others is proportional to the joint distribution:

  p(xj | x1, ..., xj-1, xj+1, ..., xn) = p(x1, ..., xn) / p(x1, ..., xj-1, xj+1, ..., xn) ∝ p(x1, ..., xn)

"Proportional to" here means that the denominator does not depend on xj, so it is the same for every value of xj and acts only as a normalization constant. This is what makes Gibbs sampling practical: each conditional can be read off, up to normalization, directly from the joint distribution, without computing a marginalizing integral.
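
In practice this means a Gibbs update only needs the joint density up to a constant. The sketch below is an illustration rather than a general-purpose routine: it assumes xj is discrete with a small finite support, so the normalization can be done by direct summation, and joint, others, and support are hypothetical names.

    import random

    def sample_conditional(joint, others, support):
        """Sample one discrete variable xj from p(xj | all other variables).

        `joint(v, others)` returns the joint probability (up to a constant)
        with xj set to v; normalizing over `support` implements the
        proportionality above.
        """
        weights = [joint(v, others) for v in support]
        total = sum(weights)
        return random.choices(support, weights=[w / total for w in weights])[0]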