Nearest neighbor classifiers versus random forests and support vector machines
Abstract
Recent experimental studies have shown that the support vector machine and the random forest are 'stand-out' models that outperform most other supervised learning methods in benchmark evaluations on real-world data sets. Studies have also established that both of these models are closely related to nearest neighbor classifiers. Given this connection, one would expect similar performance from a nearest neighbor classifier; instead, it lags far behind the support vector machine and the random forest in most evaluations. This is surprising because the nearest neighbor classifier has great theoretical potential: it can model complex decision boundaries with low bias, and its asymptotic error rate (in the infinite-sample limit) is guaranteed to be no more than twice the Bayes error rate. In this paper, we examine this apparent disconnect between the theoretical promise and the empirical performance of the nearest neighbor classifier. We argue that the random forest and the support vector machine can be viewed as eager variants of the (lazy) nearest neighbor classifier. Inspired by the connections among nearest neighbor classifiers, support vector machines, and random forests, we propose a nearest neighbor classifier that is competitive with these models.
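For reference, the asymptotic guarantee invoked above is the classical Cover–Hart bound; a minimal statement, with notation introduced here only for illustration (R^* the Bayes error rate, R_NN the asymptotic error rate of the one-nearest-neighbor rule, and M the number of classes), is

\[
R^{*} \;\le\; R_{\mathrm{NN}} \;\le\; R^{*}\!\left(2 - \frac{M}{M-1}\,R^{*}\right) \;\le\; 2R^{*}.
\]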