Salton

Information Retrieval can be defined as clustering a document collection into a relevant subcollection and an irrelevant one.

  • Intra-cluster similarity: tf factor (term frequency)
  • Inter-cluster dissimilarity: idf factor (inverse document frequency)


Term frequency (tf) is based on a simple assumption: "frequent terms are more informative than rare terms within a document."

Inverse document frequency (idf) is based on a simple assumption: "rare terms are more informative than frequent terms across documents."


Term frequency tf

The term frequency tf_{t,d} of term t in document d is defined as the number of times that t occurs in d.



We want to use tf when computing query-document match scores, but raw term frequency is not what we want:

  • A document with 10 occurrences of a term is more relevant than a document with one occurrence of that term, but not 10 times more relevant.
  • Relevance does not increase proportionally with term frequency.


Instead, we can use the log frequency weight of term t in d:

    w_{t,d} = 1 + log(tf_{t,d})   if tf_{t,d} > 0
    w_{t,d} = 0                   otherwise

The score is 0 if none of the query terms is present in the document.

Simply, we can calculate the score for a document-query pair by summing over the terms t that appear in both q and d:

    score(q, d) = Σ_{t ∈ q ∩ d} (1 + log(tf_{t,d}))

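As a sketch of this scoring, assuming base-10 logarithms and pre-tokenized documents (neither is fixed by the notes):

```python
import math
from collections import Counter

def log_tf_score(query_terms, doc_terms):
    """score(q, d) = sum of (1 + log10(tf_{t,d})) over terms t in both q and d."""
    tf = Counter(doc_terms)
    return sum(1 + math.log10(tf[t]) for t in set(query_terms) if tf[t] > 0)
```

A document with 10 occurrences of a query term scores 1 + log10(10) = 2 for that term, not 10 times the score of a single occurrence.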
Or, we can use the normalized term frequency, dividing each term's frequency by that of the most frequent term in the document:

    ntf_{t,d} = tf_{t,d} / max_{t'} tf_{t',d}


Inverse document frequency idf

The document frequency df_t of term t is defined as the number of documents that contain t; it is a measure of how uninformative t is.



We define the idf (inverse document frequency) of a term t by

    idf_t = log(N / df_t)

where N is the total number of documents in the collection.

We use log(N/df_t) instead of N/df_t to dampen the effect of idf, because informativeness does not increase proportionally with inverse document frequency (N/df_t). (The base of the log is immaterial.)

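A minimal sketch of df and idf, again with base-10 logs assumed and documents represented as sets of terms (the example terms are illustrative):

```python
import math

def idf(term, docs):
    """idf_t = log10(N / df_t), where df_t counts the documents containing t."""
    df = sum(1 for d in docs if term in d)
    return math.log10(len(docs) / df)

docs = [{"the", "car"}, {"the"}, {"the", "insurance"}]
# "the" appears in all 3 documents: idf = log10(3/3) = 0 (uninformative).
# "insurance" appears in only 1:    idf = log10(3/1) ≈ 0.477 (informative).
```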


tf-idf weighting

The tf-idf weight of a term is the product of its tf weight and its idf weight:

    tf-idf_{t,d} = tf_{t,d} × idf_t

It is the best-known weighting scheme in information retrieval.

We often use the product of the logarithmically scaled term frequency weight and the idf weight as the tf-idf weight of a term:

    w_{t,d} = (1 + log(tf_{t,d})) × log(N / df_t)

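Combining the two pieces, a sketch of this weight (base-10 logs assumed; documents as token lists):

```python
import math
from collections import Counter

def tf_idf(term, doc, docs):
    """w_{t,d} = (1 + log10(tf_{t,d})) * log10(N / df_t); 0 if t is absent from d."""
    tf = Counter(doc)[term]
    if tf == 0:
        return 0.0
    df = sum(1 for d in docs if term in d)
    return (1 + math.log10(tf)) * math.log10(len(docs) / df)
```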

Vector Model

(http://sens.tistory.com/299)

When tf_{i,j} is the term frequency of term t_i in document d_j and idf_i is the inverse document frequency of term t_i:

  • Document term-weighting scheme, using the normalized term frequency f_{i,j} = freq_{i,j} / max_l freq_{l,j}:

        w_{i,j} = f_{i,j} × idf_i

  • Representing queries as vectors in the vector space using the query-term weighting scheme:

        w_{i,q} = (0.5 + 0.5 × freq_{i,q} / max_l freq_{l,q}) × idf_i

  • Ranking documents according to their proximity to the query in vector space
    (proximity: cosine similarity between vectors):

        sim(d_j, q) = (d_j · q) / (|d_j| × |q|)


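The cosine proximity itself is short to state in code (plain Python lists as vectors, an assumed representation):

```python
import math

def cosine(u, v):
    """sim = (u . v) / (|u| |v|); defined as 0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0
```

Cosine similarity depends only on the angle between the vectors, so a long document is not favored merely for repeating terms.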

Example
  • documents (the number after each term is its frequency in the document)
    d1 = {intelligent: 2, information: 1, agent: 2}
    d2 = {information: 2, travel: 3, agent: 1}
    d3 = {intelligent: 1, mobile: 3, robot: 3}
  • query
    q = {mobile: 1, agent: 1}
  • a sequence of all terms
    k = <intelligent, information, agent, travel, mobile, robot>






sim(d3, q) is the largest, so document d3 is most relevant to the query q.

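The whole example can be sketched end to end with the weighting schemes above (base-10 logs and the dict representation are assumptions; the exact scores depend on these choices, but the ranking is the one stated):

```python
import math

terms = ["intelligent", "information", "agent", "travel", "mobile", "robot"]
docs = {
    "d1": {"intelligent": 2, "information": 1, "agent": 2},
    "d2": {"information": 2, "travel": 3, "agent": 1},
    "d3": {"intelligent": 1, "mobile": 3, "robot": 3},
}
query = {"mobile": 1, "agent": 1}

N = len(docs)
idf = {t: math.log10(N / sum(1 for d in docs.values() if t in d)) for t in terms}

def doc_vector(freq):
    # w_{i,j} = (freq_{i,j} / max_l freq_{l,j}) * idf_i
    m = max(freq.values())
    return [freq.get(t, 0) / m * idf[t] for t in terms]

def query_vector(freq):
    # w_{i,q} = (0.5 + 0.5 * freq_{i,q} / max_l freq_{l,q}) * idf_i
    m = max(freq.values())
    return [(0.5 + 0.5 * freq[t] / m) * idf[t] if t in freq else 0.0 for t in terms]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

qv = query_vector(query)
sims = {name: cosine(doc_vector(freq), qv) for name, freq in docs.items()}
# "mobile" is rare (df = 1, high idf) and frequent in d3, so d3 ranks first.
```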

