kNN Classifier
A method for classifying objects (documents) based on the closest exemplary instances (documents) in the attribute-vector (document-vector) space
A type of instance-based learning, or lazy learning, where the classification function is only approximated locally and all computation is deferred until classification is actually performed.
An object is classified by a majority vote of its k nearest neighbors.
ex) If k = 1, then the object is simply assigned to the class of its nearest neighbor.
[ The classification result changes depending on the value of k ]
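A minimal sketch of this voting rule in plain Python (the function and variable names are illustrative, not from the source):

```python
import math
from collections import Counter

def knn_classify(query, instances, labels, k=3):
    """Classify `query` by a majority vote among its k nearest instances."""
    # Indices of training instances, sorted by Euclidean distance to the query.
    order = sorted(range(len(instances)),
                   key=lambda i: math.dist(query, instances[i]))
    # Majority vote among the k closest neighbors.
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# With k = 1 the query is simply assigned the class of its nearest neighbor.
```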
kNN - Revisited
Important Features
- All instances (documents) correspond to points in an n-dimensional Euclidean space.
- Classification is delayed until a new instance (document) arrives.
- Classification is done by comparing the vector of the new document point with those of the stored ones.
- The target function (classification function) may be discrete- or real-valued.
(It depends on how the instances are represented in the space.)
Measuring k-Nearest Neighbors
An arbitrary instance (document) is represented by a vector <a1, a2, …, an>
- ai denotes the i-th attribute (e.g., the i-th component of a tf×idf document vector)
- all instances exist in an n-dimensional vector space
- hence all instances correspond to vector points in an n-dimensional Euclidean space
Euclidean distance: $d(x_i, x_j) = \sqrt{\sum_{r=1}^{n} \left(a_r(x_i) - a_r(x_j)\right)^2}$
Absolute (Manhattan) distance: $d(x_i, x_j) = \sum_{r=1}^{n} \left|a_r(x_i) - a_r(x_j)\right|$
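Both measures are easy to state in code; the sketch below is illustrative, with hypothetical tf×idf vectors:

```python
import math

def euclidean_distance(x, y):
    # Square root of the sum of squared attribute differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def absolute_distance(x, y):
    # Sum of absolute attribute differences (Manhattan distance).
    return sum(abs(a - b) for a, b in zip(x, y))

doc1, doc2 = [0.1, 0.8, 0.3], [0.4, 0.2, 0.3]   # illustrative tf-idf vectors
print(euclidean_distance(doc1, doc2))   # ≈ 0.6708
print(absolute_distance(doc1, doc2))    # ≈ 0.9
```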
Scaling Attribute Values
Each attribute value may be scaled by normalization in order to mitigate the effect of differing value ranges across dimensions. For example, with min-max normalization:
$a_i' = \frac{a_i - \min_i}{\max_i - \min_i}$
where $\min_i$ and $\max_i$ are the minimum and maximum values of the i-th attribute over all training instances.
Attribute scaling often improves the accuracy of kNN classification.
(If, on a basic two-dimensional plane, the scale along the x axis is much larger than along the y axis, the x axis dominates the distance and misclassification can result. Normalization ensures both axes influence the result equally.)
Example
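A minimal sketch of min-max scaling, with purely illustrative data (the values are not from the source):

```python
def min_max_scale(instances):
    """Rescale every attribute to [0, 1] so that no single dimension
    dominates the distance computation."""
    n = len(instances[0])
    mins = [min(x[i] for x in instances) for i in range(n)]
    maxs = [max(x[i] for x in instances) for i in range(n)]
    return [[(x[i] - mins[i]) / (maxs[i] - mins[i]) if maxs[i] > mins[i] else 0.0
             for i in range(n)] for x in instances]

# Attribute 0 spans a far larger range than attribute 1; unscaled, it would
# dominate every distance computation.
data = [[1000, 0.2], [2000, 0.9], [1500, 0.5]]
print(min_max_scale(data))   # [[0.0, 0.0], [1.0, 1.0], [0.5, ≈0.43]]
```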
Distance-Weighted Nearest Neighbor Algorithm
Assign weights to the neighbors in inverse proportion to their distances from the query.
(Typically, the weight is the inverse square of the distance, or simply the inverse of the distance.)
By doing so,
- The nearer a neighbor is to the query, the more it influences the result.
- If we set k to the total number of instances, every instance is a neighbor and influences the result for a particular query.
(This global variant is called Shepard's method, but it incurs a large classification-time overhead.)
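A minimal sketch of the weighted vote, assuming inverse-square weights (names and data are illustrative):

```python
import math
from collections import defaultdict

def weighted_knn_classify(query, instances, labels, k=5):
    """Distance-weighted kNN: each of the k nearest neighbors votes with
    weight 1/d^2, so nearer neighbors influence the result more."""
    neighbors = sorted(zip(instances, labels),
                       key=lambda pair: math.dist(query, pair[0]))
    scores = defaultdict(float)
    for x, label in neighbors[:k]:
        d = math.dist(query, x)
        if d == 0:                 # exact match: adopt its class directly
            return label
        scores[label] += 1.0 / d ** 2
    return max(scores, key=scores.get)

# Setting k = len(instances) weights every training instance in the vote,
# which corresponds to the global (Shepard's-method) variant above.
```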
Voronoi Diagram
An example of the decision surface (classification function) formed by a set of instances under 1-nearest-neighbor classification: each cell of the diagram is the region of points closer to one stored instance than to any other.
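A small illustrative sketch: labelling grid points by their nearest stored instance traces out the Voronoi cells (the points and grid below are assumptions, not from the source):

```python
import math

def nearest_instance(point, instances):
    # Index of the stored instance closest to `point` (the 1-NN rule).
    return min(range(len(instances)),
               key=lambda i: math.dist(point, instances[i]))

# Label each point of a coarse grid by its nearest instance; the contiguous
# regions that appear are the cells of the Voronoi diagram.
instances = [(1.0, 1.0), (4.0, 1.0), (2.5, 4.0)]
for y in range(5, -1, -1):
    print("".join(str(nearest_instance((x, y), instances)) for x in range(6)))
```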
Conclusion
- kNN classification is a highly effective inductive classification method, robust to noisy training data and capable of handling complex target functions.
- The target function for the whole attribute space may be described as a combination of less complex local approximations.
- Learning is very simple.
(almost nothing to do except transforming each instance to numeric values and scaling it)
- But classification may be time-consuming, especially when k is large.
(This is why this kind of method is called "lazy learning".)