Have You Tried Using a ‘Nearest Neighbor Search’?

A Nearest Neighbor Search is perhaps the simplest procedure you might conceive of if presented with a machine-learning-type problem while under the influence of some sort of generalized “veil of ignorance”. Though there exist slightly more complicated variations in the algorithm, the basic principle of all of them is effectively the same.

Editor’s Update: Although this algorithm can be a useful starting point to understand concepts, it is worth noting that it can have trouble reaching the high recall levels we demand for eDiscovery. For more on this topic please see the excellent article TAR, Proportionality, and Bad Algorithms (1-NN) by Bill Dimm.

Extract from an article by Gregory Stein (Published on the Caches to Caches Blog)

Roughly a year and a half ago, I had the privilege of taking a graduate “Introduction to Machine Learning” course under the tutelage of the fantastic Professor Leslie Kaelbling. While I learned a great deal over the course of the semester, there was one minor point that she made to the class which stuck with me more than I expected it to at the time: before using a really fancy or sophisticated or “in-vogue” machine learning algorithm to solve your problem, try a simple Nearest Neighbor Search first.

Let’s say I gave you a bunch of data points, each with a location in space and a value, and then asked you to predict the value of a new point in space. What would you do? Perhaps the values of you data are binary (just +s and -s) and you’ve heard of Support Vector Machines. Should you give that a shot? Maybe the values are continuously valued (anything on the real number line) and you feel like giving Linear Regression a whirl. Or you’ve heard of the ongoing Deep Learning Revolution and feel sufficiently bold as to tackle this problem with a Neural Net.

…or you could simply find the closest point in your dataset to the one you’re interested in and offer up the value of this “nearest neighbor”. A Nearest Neighbor Search is perhaps the simplest procedure you might conceive of if presented with a machine-learning-type problem while under the influence of some sort of generalized “veil of ignorance”. Though there exist slightly more complicated variations in the algorithm, the basic principle of all of them is effectively the same.

Additional Reading

Source: ComplexDiscovery