Salton

Information Retrieval can be defined as clustering a document collection into a relevant subcollection and an irrelevant one.

  • Intra-cluster similarity: tf factor (term frequency)
  • Inter-cluster dissimilarity: idf factor (inverse document frequency)


Term frequency (tf) is based on a simple assumption: "frequent terms are more informative than rare terms within a document."

Inverse document frequency (idf) is based on a simple assumption: "rare terms are more informative than frequent terms across documents."


Term frequency tf

The term frequency tf_{t,d} of term t in document d is defined as the number of times that t occurs in d.



We want to use tf when computing query-document match scores, but raw term frequency is not what we want:

  • A document with 10 occurrences of a term is more relevant than a document with one occurrence of that term, but not 10 times more relevant.
  • Relevance does not increase proportionally with term frequency.


Instead, we can use the log frequency weight of term t in d:

    w_{t,d} = 1 + log(tf_{t,d})   if tf_{t,d} > 0
    w_{t,d} = 0                   otherwise

The score is 0 if none of the query terms is present in the document.

Simply, we can calculate the score for a document-query pair by summing over the terms t that appear in both q and d:

    score(q, d) = Σ_{t ∈ q ∩ d} (1 + log(tf_{t,d}))

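As a sketch of this scoring, assuming base-10 logarithms and pre-tokenized documents (neither is fixed by the notes):

```python
import math
from collections import Counter

def log_tf_score(query_terms, doc_terms):
    """score(q, d) = sum of (1 + log10(tf_{t,d})) over terms t in both q and d."""
    tf = Counter(doc_terms)
    return sum(1 + math.log10(tf[t]) for t in set(query_terms) if tf[t] > 0)
```

A document with 10 occurrences of a query term scores 1 + log10(10) = 2 for that term, not 10 times the score of a single occurrence.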
Or, we can use the normalized term frequency, dividing each term's frequency by that of the most frequent term in the document:

    ntf_{t,d} = tf_{t,d} / max_{t'} tf_{t',d}


Inverse document frequency idf

The document frequency df_t of term t is defined as the number of documents that contain t; it is a measure of how uninformative t is.



We define the idf (inverse document frequency) of a term t by

    idf_t = log(N / df_t)

where N is the total number of documents in the collection.

We use log(N/df_t) instead of N/df_t to dampen the effect of idf, because informativeness does not increase proportionally with inverse document frequency (N/df_t). (The base of the log is immaterial.)

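A minimal sketch of df and idf, again with base-10 logs assumed and documents represented as sets of terms (the example terms are illustrative):

```python
import math

def idf(term, docs):
    """idf_t = log10(N / df_t), where df_t counts the documents containing t."""
    df = sum(1 for d in docs if term in d)
    return math.log10(len(docs) / df)

docs = [{"the", "car"}, {"the"}, {"the", "insurance"}]
# "the" appears in all 3 documents: idf = log10(3/3) = 0 (uninformative).
# "insurance" appears in only 1:    idf = log10(3/1) ≈ 0.477 (informative).
```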


tf-idf weighting

The tf-idf weight of a term is the product of its tf weight and its idf weight:

    tf-idf_{t,d} = tf_{t,d} × idf_t

It is the best-known weighting scheme in information retrieval.

We often use the product of the logarithmically scaled term frequency weight and the idf weight as the tf-idf weight of a term:

    w_{t,d} = (1 + log(tf_{t,d})) × log(N / df_t)

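Combining the two pieces, a sketch of this weight (base-10 logs assumed; documents as token lists):

```python
import math
from collections import Counter

def tf_idf(term, doc, docs):
    """w_{t,d} = (1 + log10(tf_{t,d})) * log10(N / df_t); 0 if t is absent from d."""
    tf = Counter(doc)[term]
    if tf == 0:
        return 0.0
    df = sum(1 for d in docs if term in d)
    return (1 + math.log10(tf)) * math.log10(len(docs) / df)
```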

Vector Model

(http://sens.tistory.com/299)

When tf_{i,j} is the term frequency of term t_i in document d_j and idf_i is the inverse document frequency of term t_i:

  • Document term-weighting scheme, using the normalized term frequency f_{i,j} = freq_{i,j} / max_l freq_{l,j}:

        w_{i,j} = f_{i,j} × idf_i

  • Representing queries as vectors in the vector space using the query-term weighting scheme:

        w_{i,q} = (0.5 + 0.5 × freq_{i,q} / max_l freq_{l,q}) × idf_i

  • Ranking documents according to their proximity to the query in vector space
    (proximity: cosine similarity between vectors):

        sim(d_j, q) = (d_j · q) / (|d_j| × |q|)


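The cosine proximity itself is short to state in code (plain Python lists as vectors, an assumed representation):

```python
import math

def cosine(u, v):
    """sim = (u . v) / (|u| |v|); defined as 0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0
```

Cosine similarity depends only on the angle between the vectors, so a long document is not favored merely for repeating terms.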

Example
  • documents (the number after each term is its frequency in the document)
    d1 = {intelligent: 2, information: 1, agent: 2}
    d2 = {information: 2, travel: 3, agent: 1}
    d3 = {intelligent: 1, mobile: 3, robot: 3}
  • query
    q = {mobile: 1, agent: 1}
  • a sequence of all terms
    k = <intelligent, information, agent, travel, mobile, robot>






sim(d3, q) is the largest, so document d3 is most relevant to the query q.

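The whole example can be sketched end to end with the weighting schemes above (base-10 logs and the dict representation are assumptions; the exact scores depend on these choices, but the ranking is the one stated):

```python
import math

terms = ["intelligent", "information", "agent", "travel", "mobile", "robot"]
docs = {
    "d1": {"intelligent": 2, "information": 1, "agent": 2},
    "d2": {"information": 2, "travel": 3, "agent": 1},
    "d3": {"intelligent": 1, "mobile": 3, "robot": 3},
}
query = {"mobile": 1, "agent": 1}

N = len(docs)
idf = {t: math.log10(N / sum(1 for d in docs.values() if t in d)) for t in terms}

def doc_vector(freq):
    # w_{i,j} = (freq_{i,j} / max_l freq_{l,j}) * idf_i
    m = max(freq.values())
    return [freq.get(t, 0) / m * idf[t] for t in terms]

def query_vector(freq):
    # w_{i,q} = (0.5 + 0.5 * freq_{i,q} / max_l freq_{l,q}) * idf_i
    m = max(freq.values())
    return [(0.5 + 0.5 * freq[t] / m) * idf[t] if t in freq else 0.0 for t in terms]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

qv = query_vector(query)
sims = {name: cosine(doc_vector(freq), qv) for name, freq in docs.items()}
# "mobile" is rare (df = 1, high idf) and frequent in d3, so d3 ranks first.
```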

