Editor’s Note: As highlighted by analytics professor Jonathan Choi, just because datasets may be small doesn’t mean that they are not valuable. In the age of big-data-driven eDiscovery, we often neglect the power of small data. However, if considered and used effectively, even the smallest of datasets may provide great value. Contained in this post is a compilation of informational article extracts that may be helpful for those seeking to learn more about the art and science of considering small data in the sphere of data discovery and legal discovery.
An extract from an article by Gary Klein, Ph.D., via Psychology Today
Let’s Reverse Our Strategy For Data Collection
Currently, the Big Data bandwagon continues to pick up momentum: Take advantage of all the data sources available to us via mobile devices, aerial and remote sensing, cameras, microphones, wireless sensor networks, and the like. The data are there, just waiting to be harvested in order to spot trends and find correlations. The enormous volume of data forces us to use various forms of computer-based search and analysis, including Machine Learning. The Big Data approach is exciting as it lets us take massive amounts of information into account. The Big Data approach is also unsettling as we face our insignificance and admit that the algorithms and smart machines know so much more than we ever can.
Previously, I have described some reasons to be uneasy about Big Data: the way Big Data analytics will follow existing trends but miss subtle yet important changes in the situation that render those trends obsolete. That essay also raised the issue of missing data. People sometimes notice that something did NOT happen, and the absence of an event helps us make sense of a situation. Big Data typically covers events that did happen and ignores events that did not occur, even though these non-occurrences can be significant.
This essay, however, is not about limitations in Big Data.
Instead, I want to suggest that we move in the opposite direction: try to collect as little data as possible, ideally just a single data point, but a data point that swings a decision. Rather than getting drowned in data overload, there are times when the right observation will bring ambiguous cues into focus.
An extract from an article by Jonathan Choi via JChoi Solutions
Don’t Undervalue Small Data
Since it became popular, big data has taken the business world by storm. It seems to have infiltrated organizations of all sizes across all industries, and there is no denying that big data is the key driver behind a large number of commercial successes. However, that’s not what I want to talk about today. With the value of big data featured everywhere, I want to talk about the value in small data. Not every organization has the luxury of working with big data, but with a different approach, we can actually gain a lot of value from small data.
There is Big Value in Small Data
I see a lot of parallels between small data and a new startup venture; we are often obsessed with growth and scale before we get things right. One common piece of advice for startups is to do things that don’t scale, with the famous example of the AirBnB founders, who visited their early customers one by one to learn about how and why they used the product. The principle is the same when it comes to small data. Because your dataset is small enough, you can look at individual data points, investigate individual outliers, and try to understand how an individual customer or process is represented in your data. This approach is not scalable, but it is a great way to really understand your data. When it comes to extracting value from data, there is no substitute for in-depth understanding, so spend more time with the data while you still can.
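The workflow described above can be sketched in a few lines. This is a hypothetical illustration, not from the article: the records, the `find_outliers` helper, and the threshold are all invented for demonstration, and the point is that with a small dataset you can flag an outlier and then read its full record rather than aggregating it away.

```python
# Hypothetical small dataset: few enough rows that every outlier
# can be inspected individually.
records = [
    {"customer": "A", "orders": 12},
    {"customer": "B", "orders": 9},
    {"customer": "C", "orders": 87},  # the outlier worth a closer look
    {"customer": "D", "orders": 11},
]

def find_outliers(rows, key, threshold=1.5):
    """Return rows more than `threshold` population std devs from the mean of `key`."""
    vals = [r[key] for r in rows]
    mean = sum(vals) / len(vals)
    std = (sum((v - mean) ** 2 for v in vals) / len(vals)) ** 0.5
    return [r for r in rows if std and abs(r[key] - mean) > threshold * std]

# With small data, the next step is human: pull each flagged record
# and investigate the individual customer behind it.
for row in find_outliers(records, "orders"):
    print(row)
```

On a dataset of four rows this prints only customer C; the unscalable but valuable step is the conversation that follows.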
Recently I read Todd Rose’s book “The End of Average,” which talks about the science of individuality. Todd described how traditional methods start by aggregating the data before analyzing the results, and through that process, we end up with an average that seemingly fits everyone but actually fits no one. The same can be said for big data; sometimes we are so obsessed with using more and more data that we overlook the value of an individual within the large dataset. It’s somewhat ironic that we strive to achieve personalization through the lens of big data rather than small data. Because we have access to a huge amount of data, we default to aggregating large datasets before really trying to understand what is going on, and we run the risk of overlooking important insights. The science of individuality offers an alternative: by first analyzing individuals to find insights and then applying those insights to the large dataset, we may come to a very different conclusion. So indeed, there is big value hidden in the small dataset if you know how to find it.
An extract from an article by Cathy O’Neil via Bloomberg Media
Bigger Data Isn’t Always Better Data
When making a decision such as whether to hire, insure or lend to someone, is more data better? Actually, when it comes to fairness, the opposite is often true.
Consider a recent Harvard Business Review experiment, which involved sending 316 fake applications to the largest U.S. law firms. All the applicants were among the top 1 percent of students at their schools, but other information — such as their names, college clubs and hobbies — provided hints about their gender and social class.
The result: Upper-class males were four times as likely to get a callback as other candidates, including upper-class women. This suggests that even among equally qualified candidates, the added information gave potential employers something not to like, such as a lower-class background, a bad “cultural fit” or the possibility that a woman might decide to have children and leave the firm.
Proponents of big data tend to believe that such problems can be addressed by handing the decision-making over to an impartial computer. The idea is that with enough information, perhaps ranging from Facebook likes to ZIP codes, an algorithm should be able to choose the objectively best candidates.
Yet algorithms can be as flawed as the humans they replace — and the more data they use, the more opportunities arise for those flaws to emerge.
An extract from an article by Robert Boscacci via Medium
Small Data, and Getting It
Why Even Bother with Small N?
Big data is fine and good: As the sample size of our dataset (n) approaches infinity, we can make increasingly confident and general assertions, based on increasingly nuanced aberrations and trends in that data.
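The “confidence grows with n” claim above has a concrete statistical shape: the standard error of a sample mean shrinks only as 1/√n, so quadrupling the sample size merely halves the uncertainty. A minimal sketch, with an assumed population standard deviation of 10:

```python
import math

def standard_error(sigma, n):
    """Standard error of a sample mean: sigma / sqrt(n).
    Confidence improves with n, but with diminishing returns."""
    return sigma / math.sqrt(n)

# Each 4x increase in sample size only halves the standard error:
for n in [100, 400, 1600, 6400]:
    print(n, standard_error(10.0, n))  # 1.0, 0.5, 0.25, 0.125
```

This diminishing return is part of why a modest, well-understood sample can be a reasonable trade against the cost of acquiring a huge one.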
Large sample sizes are a major key in many contexts, e.g., rocket science: Precision is king, academic reputations are at stake, and crossing some statistical confidence threshold might validate a lifetime of investigation into something as pivotal to our existence as, like, the big bang theory. But Big Data can be too big for some applications, such as intuitive conceptualization and storytelling. There are also real barriers to acquiring huge datasets of high quality.
Here I quote Scientific American’s Emilie Reas, who references “Topographic Representation of Numerosity in the Human Parietal Cortex” in her piece “Our Brains Have a Map for Numbers”:
[There is] a small brain area [which] represents numerosity along a continuous “map.” Just as we organize numbers along a mental “number line,” with one at the left, increasing in magnitude to the right, so is quantity mapped onto space in the brain. One side of this brain region responds to small numbers, the adjacent region to larger numbers, and so on, with numeric representations increasing to the far end.
[These Researchers at Utrecht University in the Netherlands] found that the parietal cortex map represented relative, not absolute, quantities.
While considering currency in factors of ten, for example, all we care about is more or fewer commas. We don’t visualize the correct-sized pile of hundred-dollar bills while pricing something expensive; we just compare with the market price (essentially just a symbol, not even a number we can conceptualize) of other similar things. We understand deltas. We feeble beings have to abstract large (sets of) numbers away to external apparatuses in order to manipulate them, and especially to derive meaning from them. We reduce dimensions, we bin, and we plot data to bring it down to a scale we can digest and act upon. We make Big Data small, to see it and its implicit patterns.
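The binning the passage describes can be sketched directly. This is a hypothetical illustration with invented data: a hundred thousand simulated values are reduced to ten bin counts, the kind of summary a human can actually read.

```python
import random

random.seed(42)
# Stand-in for "Big Data": 100,000 values nobody can eyeball directly.
values = [random.gauss(50, 15) for _ in range(100_000)]

def bin_counts(data, lo, hi, n_bins):
    """Count values into n_bins equal-width bins over [lo, hi);
    values outside the range are dropped."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in data:
        if lo <= v < hi:
            counts[int((v - lo) // width)] += 1
    return counts

# 100,000 points reduced to 10 digestible numbers — small enough to plot,
# narrate, and act upon.
histogram = bin_counts(values, 0, 100, 10)
```

Plotting `histogram` as a bar chart is exactly the "make Big Data small" move: the scale drops from the dataset's size to the number of bins.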