Probabilistic latent semantic analysis (PLSA), also known as probabilistic latent semantic indexing (PLSI, especially in information retrieval circles), is a statistical technique for the analysis of two-mode and co-occurrence data. In effect, one can derive a low-dimensional representation of the observed variables in terms of their affinity to certain hidden variables, just as in latent semantic analysis, from which PLSA evolved.

Compared to standard latent semantic analysis, which stems from linear algebra and downsizes the occurrence tables (usually via a singular value decomposition), probabilistic latent semantic analysis is based on a mixture decomposition derived from a latent class model.


The Aspect Model

Considering observations in the form of co-occurrences (w,d) of words and documents, PLSA models the probability of each co-occurrence as a mixture of conditionally independent multinomial distributions:

P(w, d) = Σ_c P(c) P(d|c) P(w|c) = P(d) Σ_c P(c|d) P(w|c)

The first formulation is the symmetric formulation, where w and d are both generated from the latent class c in similar ways (using the conditional probabilities P(d|c) and P(w|c)), whereas the second formulation is the asymmetric formulation, where, for each document d, a latent class is chosen conditioned on the document according to P(c|d), and a word is then generated from that class according to P(w|c). Although we have used words and documents in this example, the co-occurrence of any pair of discrete variables may be modelled in exactly the same way.

So, the number of parameters is equal to cd + wc, where c is the number of latent classes, d the number of documents and w the vocabulary size; the number of parameters therefore grows linearly with the number of documents. In addition, although PLSA is a generative model of the documents in the collection it is estimated on, it is not a generative model of new documents.

These parameters are learned using the EM algorithm.
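
To make the estimation concrete, here is a minimal NumPy sketch of plain EM for the asymmetric formulation. The function name plsa_em, the array layout and the dense three-way responsibility array are illustrative choices rather than any standard implementation, and are only practical for tiny corpora; a real implementation would work with sparse counts, monitor the log-likelihood for convergence, and usually temper the E-step (more on that further below).

    import numpy as np

    def plsa_em(counts, n_topics, n_iter=50, seed=0):
        """Plain EM for the asymmetric PLSA formulation.

        counts : (n_docs, n_words) array of word counts n(d, w)
        returns: P(c|d) with shape (n_docs, n_topics) and
                 P(w|c) with shape (n_topics, n_words)
        """
        rng = np.random.default_rng(seed)
        n_docs, n_words = counts.shape

        # Random initialisation of the two conditional distributions.
        p_c_given_d = rng.random((n_docs, n_topics))
        p_c_given_d /= p_c_given_d.sum(axis=1, keepdims=True)
        p_w_given_c = rng.random((n_topics, n_words))
        p_w_given_c /= p_w_given_c.sum(axis=1, keepdims=True)

        for _ in range(n_iter):
            # E-step: responsibilities P(c|d, w), proportional to P(c|d) * P(w|c).
            resp = p_c_given_d[:, None, :] * p_w_given_c.T[None, :, :]
            resp /= resp.sum(axis=2, keepdims=True) + 1e-12

            # M-step: re-estimate both distributions from the expected
            # counts n(d, w) * P(c|d, w).
            expected = counts[:, :, None] * resp
            p_w_given_c = expected.sum(axis=0).T
            p_w_given_c /= p_w_given_c.sum(axis=1, keepdims=True) + 1e-12
            p_c_given_d = expected.sum(axis=1)
            p_c_given_d /= p_c_given_d.sum(axis=1, keepdims=True) + 1e-12

        return p_c_given_d, p_w_given_c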



Plate notation representing the PLSA model: d is the document index variable, c is a word's topic drawn from the document's topic distribution P(c|d), and w is a word drawn from the word distribution of that word's topic, P(w|c). d and w are observable variables, while the topic c is a latent variable.



Why PLSI is preferred over LSI

Here is a quick explanation of why “latent semantic indexing” (LSI) is unlikely to be used as-is in modern search engines like Google. It’s an old technique with limitations, and solutions to those limitations have been developed in the years since its introduction to IR research.


First, very quickly, a no-maths definition of LSI as a refresher:

“Semantic” = meaning

“Latent” = present but hidden

LSI = The analysis of the hidden meaning of words and how often they occur in a document.

Its advantage over simple keyword analysis (Boolean search = true or false) is that it can infer meaning that is not evident from the words themselves, and match documents that other methods would not normally match. For example, “computer”, “PC” and “laptop” are all connected. Documents are grouped together even if it is not obvious that they are connected, because a “latent semantic space” is created.

It uses vectors in a high-dimensional vector space (lots of them). It creates a term-document matrix from all the documents, which is then factored into three matrices using SVD (singular value decomposition); the second matrix is a diagonal matrix that houses the singular values of the original matrix. This means that sets of terms or documents can be represented as d-dimensional vectors. Using the cosine of the angle between these vectors gives an easy-to-calculate similarity measure between any two sets of terms and/or documents. It can be used in any language because of the way it is constructed.
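
To make the linear-algebra picture concrete, here is a minimal NumPy sketch on a made-up three-document corpus: it builds the term-document matrix, factors it with SVD, keeps k = 2 dimensions, and compares documents with the cosine of the angle between their vectors. The corpus, the variable names and the choice of k are purely illustrative.

    import numpy as np

    # A made-up toy corpus; real collections have thousands of terms and documents.
    docs = [
        "the laptop is a portable computer",
        "a pc is a personal computer",
        "the cat sat on the mat",
    ]

    # Term-document matrix: one row per term, one column per document.
    vocab = sorted({w for d in docs for w in d.split()})
    A = np.array([[d.split().count(t) for d in docs] for t in vocab], dtype=float)

    # SVD factors A into three matrices; the middle one is the diagonal
    # matrix that houses the singular values mentioned above.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)

    # Keep only k dimensions: every document becomes a k-dimensional vector.
    k = 2
    doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T

    def cosine(a, b):
        """Cosine of the angle between two vectors, used as the similarity measure."""
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    print(cosine(doc_vectors[0], doc_vectors[1]))  # the two computer-related documents
    print(cosine(doc_vectors[0], doc_vectors[2]))  # a computer document vs. the unrelated one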


But... there are big problems with LSI:

  1. The resulting dimensions can be very difficult to interpret, which leads to mistakes: it’s unclear what the resulting similarities between terms really mean.
  2. The input is a bag-of-words, so all text structure information is lost.
  3. A compound term (e.g. “bull-headed”) is treated as two separate terms.
  4. Ambiguous terms create noise in the vector space.
  5. There’s no principled way to choose the optimal dimensionality of the vector space.
  6. Recomputing the SVD for dynamic (frequently updated) collections is computationally expensive.

So instead we use Probabilistic latent semantic indexing (PLSI).

It is better:

  1. It has a more robust statistical foundation and provides a proper generative data model.
  2. It uses the EM algorithm (expectation–maximization), typically in a tempered variant that avoids over-fitting (parameters that become too specific to noise) – this makes it far more flexible (see the sketch after this list).
  3. It can deal with domain-specific synonymy and with polysemous words.
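
One common way to implement the over-fitting control mentioned in point 2 is tempered EM, where the E-step numerator is dampened by an exponent beta < 1 that is annealed on held-out data. The sketch below only illustrates the idea, reusing the array layout of the hypothetical plsa_em sketch earlier on this page; exactly which factors are tempered and how beta is scheduled varies between write-ups.

    def tempered_e_step(p_c_given_d, p_w_given_c, beta=0.8):
        """Tempered E-step: the responsibilities' numerator is raised to beta.

        beta < 1 smooths P(c|d, w) and so reduces over-fitting;
        beta = 1 recovers the plain E-step.
        """
        resp = (p_c_given_d[:, None, :] * p_w_given_c.T[None, :, :]) ** beta
        return resp / (resp.sum(axis=2, keepdims=True) + 1e-12)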

PLSI differs from LSI:

  • It analyses co-occurrences of words and documents.
  • It tells us which terms make up a topic (a term-by-topic matrix – a topic is also called an “aspect”), and which topics are in which documents (a topic-by-document matrix).
  • It uses aspects to represent queries.
  • LSI uses a Gaussian model (normal distribution), which can generate negative values, and you can’t have a negative number of words in a document – PLSI uses a multinomial model (a probability distribution over counts), which works better.

In short:

  • The order of the words is lost (but results are still good due to word co-occurrence).
  • Documents can be represented by numeric vectors in a space of words.
  • It’s an unsupervised method (no human labelling necessary).
  • It retrieves topics.
  • It uses aspects of LSI.
  • Each query uses the cosine similarity metric to find the similarity between vectors.
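
Since PLSI is not a generative model of unseen documents or queries (see the aspect-model section above), a query is usually handled by “folding it in”: a few EM iterations that keep P(w|c) fixed and fit only the query’s topic mixture, which is then compared to the documents’ P(c|d) vectors with the cosine measure from the last bullet. The sketch below is an illustration under those assumptions, reusing the hypothetical p_c_given_d and p_w_given_c arrays from the plsa_em sketch earlier; query_counts is an assumed word-count vector for the query over the same vocabulary.

    import numpy as np

    def fold_in_query(query_counts, p_w_given_c, n_iter=30, seed=0):
        """Estimate a topic mixture P(c|q) for an unseen query by running EM
        with the word distributions P(w|c) held fixed ("folding in")."""
        rng = np.random.default_rng(seed)
        n_topics = p_w_given_c.shape[0]
        p_c_given_q = rng.random(n_topics)
        p_c_given_q /= p_c_given_q.sum()
        for _ in range(n_iter):
            # E-step: P(c|q, w) for every vocabulary word.
            resp = p_c_given_q[None, :] * p_w_given_c.T      # (n_words, n_topics)
            resp /= resp.sum(axis=1, keepdims=True) + 1e-12
            # M-step: only the query's topic mixture is updated.
            p_c_given_q = (query_counts[:, None] * resp).sum(axis=0)
            p_c_given_q /= p_c_given_q.sum() + 1e-12
        return p_c_given_q

    def rank_documents(p_c_given_q, p_c_given_d):
        """Rank documents by cosine similarity between topic mixtures."""
        sims = (p_c_given_d @ p_c_given_q) / (
            np.linalg.norm(p_c_given_d, axis=1) * np.linalg.norm(p_c_given_q) + 1e-12
        )
        return np.argsort(-sims), sims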

Thomas Hofmann invented it; you can read his SIGIR 1999 paper, “Probabilistic Latent Semantic Indexing”, if you want more information.

It is currently used not only in standard IR but also, for example, in collaborative filtering and blog search.

LSI is still a fundamental, standard technique, but PLSI is preferred for the reasons above. Saying that a search engine is using LSI is like saying that a car runs on petrol/gas – it’s not environmentally friendly and it’s expensive. LSI likewise has drawbacks, and newer techniques are being used to counter them.



References