Have You Tried Using a ‘Nearest Neighbor Search’?

A Nearest Neighbor Search is perhaps the simplest procedure you might conceive of if presented with a machine-learning-type problem while under the influence of some sort of generalized “veil of ignorance”. Though there exist slightly more complicated variations in the algorithm, the basic principle of all of them is effectively the same.

Editor’s Update: Although this algorithm can be a useful starting point to understand concepts, it is worth noting that it can have trouble reaching the high recall levels we demand for eDiscovery. For more on this topic please see the excellent article TAR, Proportionality, and Bad Algorithms (1-NN) by Bill Dimm.

Extract from an article by Gregory Stein (Published on the Caches to Caches Blog)

Roughly a year and a half ago, I had the privilege of taking a graduate “Introduction to Machine Learning” course under the tutelage of the fantastic Professor Leslie Kaelbling. While I learned a great deal over the course of the semester, there was one minor point that she made to the class which stuck with me more than I expected it to at the time: before using a really fancy or sophisticated or “in-vogue” machine learning algorithm to solve your problem, try a simple Nearest Neighbor Search first.

Let’s say I gave you a bunch of data points, each with a location in space and a value, and then asked you to predict the value of a new point in space. What would you do? Perhaps the values of you data are binary (just +s and -s) and you’ve heard of Support Vector Machines. Should you give that a shot? Maybe the values are continuously valued (anything on the real number line) and you feel like giving Linear Regression a whirl. Or you’ve heard of the ongoing Deep Learning Revolution and feel sufficiently bold as to tackle this problem with a Neural Net.

…or you could simply find the closest point in your dataset to the one you’re interested in and offer up the value of this “nearest neighbor”. A Nearest Neighbor Search is perhaps the simplest procedure you might conceive of if presented with a machine-learning-type problem while under the influence of some sort of generalized “veil of ignorance”. Though there exist slightly more complicated variations in the algorithm, the basic principle of all of them is effectively the same.

Additional Reading

Source: ComplexDiscovery

A Matter of Pricing? A Running Update of Semi-Annual eDiscovery Pricing Survey Responses

First administered in December of 2018 and conducted four times during the last two years with 334 individual responses, the semi-annual eDiscovery Pricing Survey highlights pricing on selected collection, processing, and review tasks. The aggregate results of all surveys as shared in the provided comparative charts may be helpful for understanding pricing and its impact on purchasing behavior on selected services over time.



Access the Results Now!

ComplexDiscovery combines original industry research with curated expert articles to create an informational resource that helps legal, business, and information technology professionals better understand the business and practice of data discovery and legal discovery.

All contributions are invested to support the development and distribution of ComplexDiscovery content. Contributors can make as many article contributions as they like, but will not be asked to register and pay until their contribution reaches $5.