kNN Classifier
A method for classifying objects (documents) based on the closest exemplary instances (documents) in the attribute-vector (document-vector) space
A type of instance-based learning, or lazy learning, where the classification function is only approximated locally and all computation is deferred until classification is actually performed.
An object is classified by a majority vote of its k nearest neighbors.
ex) If k = 1, then the object is simply assigned to the class of its nearest neighbor.
[ The classification result changes depending on the value of k ]
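A minimal sketch of this voting rule in plain Python (the function and variable names are illustrative, not from the source):

```python
import math
from collections import Counter

def knn_classify(query, instances, labels, k=3):
    """Classify `query` by a majority vote among its k nearest instances."""
    # Indices of training instances, sorted by Euclidean distance to the query.
    order = sorted(range(len(instances)),
                   key=lambda i: math.dist(query, instances[i]))
    # Majority vote among the k closest neighbors.
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# With k = 1 the query is simply assigned the class of its nearest neighbor.
```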
kNN - Revisited
Important Features
- All instances (documents) correspond to points in an n-dimensional Euclidean space.
- Classification is delayed until a new instance (document) arrives.
- Classification is done by comparing the vector of the new document point with those of the stored ones.
- The target function (classification function) may be discrete- or real-valued.
(It depends on how the instances are represented in the space.)
Measuring k-Nearest Neighbors
An arbitrary instance (document) is represented by a vector <a1, a2, …, an>
- ai denotes the i-th attribute (e.g., the i-th component of a tf×idf document vector)
- all instances exist in an n-dimensional vector space
- hence all instances correspond to vector points in an n-dimensional Euclidean space
Euclidean distance: $d(x_i, x_j) = \sqrt{\sum_{r=1}^{n} \left(a_r(x_i) - a_r(x_j)\right)^2}$
Absolute (Manhattan) distance: $d(x_i, x_j) = \sum_{r=1}^{n} \left|a_r(x_i) - a_r(x_j)\right|$
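Both measures are easy to state in code; the sketch below is illustrative, with hypothetical tf×idf vectors:

```python
import math

def euclidean_distance(x, y):
    # Square root of the sum of squared attribute differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def absolute_distance(x, y):
    # Sum of absolute attribute differences (Manhattan distance).
    return sum(abs(a - b) for a, b in zip(x, y))

doc1, doc2 = [0.1, 0.8, 0.3], [0.4, 0.2, 0.3]   # illustrative tf-idf vectors
print(euclidean_distance(doc1, doc2))   # ≈ 0.6708
print(absolute_distance(doc1, doc2))    # ≈ 0.9
```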
Scaling Attribute Values
Each attribute value may be scaled by normalization in order to mitigate the effect of differing value ranges across dimensions. For example, with min-max normalization:
$a_i' = \frac{a_i - \min_i}{\max_i - \min_i}$
where $\min_i$ and $\max_i$ are the minimum and maximum values of the i-th attribute over all training instances.
Attribute scaling often improves the accuracy of kNN classification.
(If, on a basic two-dimensional plane, the scale along the x axis is much larger than along the y axis, the x axis dominates the distance and misclassification can result. Normalization ensures both axes influence the result equally.)
Example
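A minimal sketch of min-max scaling, with purely illustrative data (the values are not from the source):

```python
def min_max_scale(instances):
    """Rescale every attribute to [0, 1] so that no single dimension
    dominates the distance computation."""
    n = len(instances[0])
    mins = [min(x[i] for x in instances) for i in range(n)]
    maxs = [max(x[i] for x in instances) for i in range(n)]
    return [[(x[i] - mins[i]) / (maxs[i] - mins[i]) if maxs[i] > mins[i] else 0.0
             for i in range(n)] for x in instances]

# Attribute 0 spans a far larger range than attribute 1; unscaled, it would
# dominate every distance computation.
data = [[1000, 0.2], [2000, 0.9], [1500, 0.5]]
print(min_max_scale(data))   # [[0.0, 0.0], [1.0, 1.0], [0.5, ≈0.43]]
```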
Distance-Weighted Nearest Neighbor Algorithm
Assign weights to the neighbors in inverse proportion to their distances from the query.
(Typically, the weight is the inverse square of the distance, or simply the inverse of the distance.)
By doing so,
- The nearer a neighbor is to the query, the more it influences the result.
- If we set k to the total number of instances, every instance is a neighbor and influences the result for a particular query.
(This global variant is called Shepard's method, but it incurs a large classification-time overhead.)
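A minimal sketch of the weighted vote, assuming inverse-square weights (names and data are illustrative):

```python
import math
from collections import defaultdict

def weighted_knn_classify(query, instances, labels, k=5):
    """Distance-weighted kNN: each of the k nearest neighbors votes with
    weight 1/d^2, so nearer neighbors influence the result more."""
    neighbors = sorted(zip(instances, labels),
                       key=lambda pair: math.dist(query, pair[0]))
    scores = defaultdict(float)
    for x, label in neighbors[:k]:
        d = math.dist(query, x)
        if d == 0:                 # exact match: adopt its class directly
            return label
        scores[label] += 1.0 / d ** 2
    return max(scores, key=scores.get)

# Setting k = len(instances) weights every training instance in the vote,
# which corresponds to the global (Shepard's-method) variant above.
```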
Voronoi Diagram
An example of the decision surface (classification function) formed by a set of instances under 1-nearest-neighbor classification: each cell of the diagram is the region of points closer to one stored instance than to any other.
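A small illustrative sketch: labelling grid points by their nearest stored instance traces out the Voronoi cells (the points and grid below are assumptions, not from the source):

```python
import math

def nearest_instance(point, instances):
    # Index of the stored instance closest to `point` (the 1-NN rule).
    return min(range(len(instances)),
               key=lambda i: math.dist(point, instances[i]))

# Label each point of a coarse grid by its nearest instance; the contiguous
# regions that appear are the cells of the Voronoi diagram.
instances = [(1.0, 1.0), (4.0, 1.0), (2.5, 4.0)]
for y in range(5, -1, -1):
    print("".join(str(nearest_instance((x, y), instances)) for x in range(6)))
```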
Conclusion
- kNN classification is a highly effective inductive classification method, robust to noisy training data and capable of handling complex target functions.
- The target function for the whole attribute space may be described as a combination of less complex local approximations.
- Learning is very simple.
(almost nothing to do except transforming each instance to numeric values and scaling it)
- But classification may be time-consuming, especially when k is large.
(This is why this kind of method is called "lazy learning".)