Latent Semantic Indexing, LSI

2013. 3. 20. 19:16

Topic Model

(http://en.wikipedia.org/wiki/Topic_model)

In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. Intuitively, given that a document is about a particular topic, one would expect particular words to appear in the document more or less frequently: "dog" and "bone" will appear more often in documents about dogs, "cat" and "meow" will appear in documents about cats, and "the" and "is" will appear equally in both. A document typically concerns multiple topics in different proportions; thus, in a document that is 10% about cats and 90% about dogs, there would probably be about 9 times more dog words than cat words. A topic model captures this intuition in a mathematical framework, which allows examining a set of documents and discovering, based on the statistics of the words in each, what the topics might be and what each document's balance of topics is.
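The 10% cats / 90% dogs intuition can be checked with a few lines of arithmetic. The sketch below is only an illustration: the two topic word distributions and their numbers are made up for this example, and a real topic model would have to estimate them from data.

```python
# Hypothetical per-topic word distributions (illustrative numbers only).
topics = {
    "dogs": {"dog": 0.4, "bone": 0.2, "the": 0.2, "is": 0.2},
    "cats": {"cat": 0.4, "meow": 0.2, "the": 0.2, "is": 0.2},
}

# A document that is 90% about dogs and 10% about cats.
mixture = {"dogs": 0.9, "cats": 0.1}


def word_probabilities(topics, mixture):
    """Mix the topic distributions according to the document's topic weights."""
    probs = {}
    for topic, weight in mixture.items():
        for word, p in topics[topic].items():
            probs[word] = probs.get(word, 0.0) + weight * p
    return probs


probs = word_probabilities(topics, mixture)
dog_mass = probs["dog"] + probs["bone"]   # probability of a dog-specific word
cat_mass = probs["cat"] + probs["meow"]   # probability of a cat-specific word
ratio = dog_mass / cat_mass               # roughly 9 with these numbers

print(f"dog words are about {ratio:.1f} times as likely as cat words")
```

Words such as "the" and "is" keep the same probability whatever the mixture, because both hypothetical topics assign them the same weight.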

Latent Semantic Indexing, LSI

Latent semantic indexing (LSI) is an indexing and retrieval method that uses a mathematical technique called singular value decomposition (SVD) to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text. LSI is based on the principle that words that are used in the same contexts tend to have similar meanings. A key feature of LSI is its ability to extract the conceptual content of a body of text by establishing associations between those terms that occur in similar contexts.
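As a minimal sketch of the SVD step described above, the following builds a term-document count matrix from a tiny made-up corpus and truncates its SVD to k dimensions. The corpus, the choice k = 2, and all variable names are assumptions for illustration; real systems usually apply term weighting (for example tf-idf) before the decomposition.

```python
import numpy as np

# Tiny hypothetical corpus: two documents about dogs, two about cats.
docs = [
    "dog bone dog",
    "dog bone pet",
    "cat meow cat",
    "cat meow pet",
]

# Term-document matrix A: one row per term, one column per document.
vocab = sorted({w for d in docs for w in d.split()})
A = np.array([[d.split().count(t) for d in docs] for t in vocab], dtype=float)

# Thin SVD: A = U @ diag(s) @ Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k largest singular values (the "concept" dimensions).
k = 2
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

term_vectors = U_k * s_k     # one k-dimensional row per term
doc_vectors = Vt_k.T * s_k   # one k-dimensional row per document
A_k = U_k @ np.diag(s_k) @ Vt_k   # rank-k approximation of A

print(vocab)
print(np.round(doc_vectors, 3))
```

Terms and documents end up in the same k-dimensional space, which is what lets LSI compare them by context rather than by exact word match.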

Latent Semantic Indexing can be said to have brought an important advance to document retrieval.

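Because similarity is compared using only the dimension-reduced vectors (the "right" singular vectors from the SVD algorithm), LSI makes a Vector Space Model (VSM) retrieval model faster to compute, and it can also be used to compute the similarity between feature vectors that represent word meanings. In other words, it can be used to cluster word meanings. Beyond documents, it could also be applied to image similarity models.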
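A rough sketch of this comparison, under the assumption that cosine similarity is the measure used: documents are compared through their reduced vectors (rows of V_k scaled by the singular values) and terms through theirs (rows of U_k scaled the same way). The corpus, k = 2, and the helper name cosine are made up for illustration.

```python
import numpy as np

# Tiny hypothetical corpus: two documents about dogs, two about cats.
docs = [
    "dog bone dog",
    "dog bone pet",
    "cat meow cat",
    "cat meow pet",
]
vocab = sorted({w for d in docs for w in d.split()})
A = np.array([[d.split().count(t) for d in docs] for t in vocab], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
term_vectors = U[:, :k] * s[:k]    # reduced vector for each term
doc_vectors = Vt[:k, :].T * s[:k]  # reduced vector for each document


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# Document similarity in the reduced space.
same_topic = cosine(doc_vectors[0], doc_vectors[1])    # dog doc vs dog doc
other_topic = cosine(doc_vectors[0], doc_vectors[2])   # dog doc vs cat doc

# Term similarity in the reduced space (a basis for clustering word meanings).
dog, bone, cat = (vocab.index(t) for t in ("dog", "bone", "cat"))
dog_bone = cosine(term_vectors[dog], term_vectors[bone])
dog_cat = cosine(term_vectors[dog], term_vectors[cat])

print(f"doc0 vs doc1: {same_topic:.3f}, doc0 vs doc2: {other_topic:.3f}")
print(f"dog vs bone: {dog_bone:.3f}, dog vs cat: {dog_cat:.3f}")
```

Each comparison touches only k numbers per vector instead of one per vocabulary term, which is where the speed advantage over the full vector space comes from.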

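LSI does not merely record the keywords a document contains; it examines the document collection as a whole to find which documents contain similar words. LSI regards documents that share many words as semantically close, and documents that share few words as semantically distant.

Although the LSI algorithm does not understand what words mean, it can appear surprisingly intelligent because it recognizes the patterns that the words form.

When you search an LSI-indexed DB, the search engine looks up the similarity values it has precomputed for every word and returns the documents it judges to fit the query best. Two documents can be semantically very close even if they share no keywords at all, so LSI treats such documents as useful results even when they do not match exactly.

An ordinary keyword search can fail because nothing matches, whereas LSI often returns relevant documents that do not contain the keyword at all.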

For example: in an AR news wire DB, a search for Saddam Hussein returns documents on the Gulf War, UN sanctions, the oil embargo, and Iraq that do not contain the Iraqi president's name at all.
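The effect can be reproduced on a toy scale. The sketch below is an illustration under stated assumptions, not a description of any real search engine: the five-document corpus is invented, k = 2 is chosen by hand, and the query is mapped into the reduced space with the standard fold-in formula q_k = q^T U_k S_k^{-1}. The helper names fold_in and cosine are hypothetical.

```python
import numpy as np

# Hypothetical corpus: three related news-like documents and two unrelated ones.
docs = [
    "hussein iraq",
    "iraq oil embargo",
    "oil embargo sanctions",   # related, but never mentions "hussein"
    "cat meow",
    "cat pet meow",
]
vocab = sorted({w for d in docs for w in d.split()})
index = {t: i for i, t in enumerate(vocab)}
A = np.array([[d.split().count(t) for d in docs] for t in vocab], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k, s_k, V_k = U[:, :k], s[:k], Vt[:k, :].T   # rows of V_k are documents


def fold_in(query):
    """Map a query into the k-dimensional space: q_k = q^T U_k S_k^{-1}."""
    q = np.zeros(len(vocab))
    for w in query.split():
        if w in index:
            q[index[w]] += 1.0
    return (q @ U_k) / s_k


def cosine(u, v):
    """Cosine similarity, returning 0.0 for a zero vector."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0


query = "hussein"
q_k = fold_in(query)
lsi_scores = [cosine(q_k, V_k[j]) for j in range(len(docs))]
keyword_hits = [query in d.split() for d in docs]

for d, hit, score in zip(docs, keyword_hits, lsi_scores):
    print(f"{d!r:28} keyword match: {hit!s:5}  LSI score: {score:.2f}")
```

In this toy setup only the first document contains the query word, yet the other two news-like documents score high because they are linked to it through shared words such as "iraq", "oil", and "embargo", while the cat documents score near zero.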

Example

The red line shows the low-rank approximation.
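The figure is not reproduced here. Assuming it shows data points together with a fitted line, a rank-1 approximation of a small 2-D data set behaves the same way: every point is replaced by its projection onto a single direction through the origin. The points below are made up for illustration.

```python
import numpy as np

# Hypothetical 2-D points that lie roughly along one direction.
points = np.array([
    [1.0, 1.1],
    [2.0, 1.9],
    [3.0, 3.2],
    [4.0, 3.9],
])

U, s, Vt = np.linalg.svd(points, full_matrices=False)

# Rank-1 approximation: keep only the largest singular value.
rank1 = s[0] * np.outer(U[:, 0], Vt[0])

# All approximated points lie on the line spanned by Vt[0].
direction = Vt[0]

# Frobenius-norm error of the approximation; with two singular values
# it equals the discarded one, s[1].
error = np.linalg.norm(points - rank1)

print("direction of the line:", np.round(direction, 3))
print("approximation error:", round(float(error), 4))
```

The data is not centered here, so the line passes through the origin; centering first would give the usual PCA line instead.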
