Relatively Speaking: Predictive Coding Technologies and Protocols Survey Results


Article by Rob Robinson

The Predictive Coding Technologies and Protocols Survey

Initiated on August 31, 2018, the Predictive Coding Technologies and Protocols Survey is a non-scientific survey designed to provide a general understanding of how data discovery and legal discovery professionals within the eDiscovery ecosystem use predictive coding technologies and protocols. The survey was open from August 31 through September 15, 2018, with individuals invited to participate directly by ComplexDiscovery and indirectly through industry website, blog, and newsletter mentions.

The survey had two primary educational objectives:

  • To provide a consolidated listing of potential predictive coding technology and protocol definitions. While not all-inclusive or comprehensive, the listing was vetted with selected industry predictive coding experts for completeness and accuracy, and thus appears suitable for use in educational efforts.
  • To ask eDiscovery ecosystem professionals about their usage and preferences of predictive coding platforms, technologies, and protocols.

The survey offered responders an opportunity to provide predictive coding background information, including their primary predictive coding platform, and posed three specific questions:

  • Which predictive coding technologies are utilized by your eDiscovery platform?
  • If you use k-Nearest Neighbor Classifier (k-NN), do you use 1-NN or k-NN with k>1?
  • Which technology-assisted review protocols are utilized in your delivery of predictive coding?

Closed on September 15, 2018, the survey had 31 responders.

Key Results and Observations

Primary Predictive Coding Platform (Chart 1)

  • More than 80% of responders (80.64%) reported having a specific primary platform for predictive coding.
  • Responders named 12 different platforms as their primary predictive coding platform.
  • Relativity was listed as the primary predictive coding platform by more than 30% of responders (32.25%).
  • The top three survey primary predictive coding platforms accounted for more than 50% of responses (51.61%).
  • Just under 20% of responders (19.35%) reported having no single primary platform for predictive coding.

Primary Predictive Coding Technologies Used (Chart 2)

  • All listed predictive coding technologies, plus one technology written in by a responder, were used by at least one survey responder.
  • Active Learning was the most used predictive coding technology, with more than 70% of responders (70.96%) reporting that they use it in their predictive coding efforts.
  • Approximately 32% of responders (32.25%) use only one predictive coding technology in their predictive coding efforts.
  • Approximately 68% of responders (67.74%) use more than one predictive coding technology in their predictive coding efforts.

Usage of k-Nearest Neighbor Classifier (Charts 3 and 4)

  • Approximately 25% of responders (25.80%) reported use of k-Nearest Neighbor Classifier as a predictive coding technology.
  • Of those responders using k-Nearest Neighbor Classifier as a predictive coding technology, 50% use 1-NN, and 50% use k-NN with k>1.

Technology-Assisted Review Protocols for Predictive Coding Used (Chart 5)

  • All listed technology-assisted review protocols for predictive coding were used by at least one survey responder.
  • Continuous Active Learning (CAL) was the most used predictive coding protocol, with more than 87% of responders (87.09%) reporting that they use it in their predictive coding efforts.
  • Just under half of responders (48.38%) use only one predictive coding protocol in their predictive coding efforts.
  • Just over half of responders (51.61%) use more than one predictive coding protocol in their predictive coding efforts.

Predictive Coding Technology and Protocol Survey Responder Overview (Chart 6)

  • Approximately 54% of responders (54.83%) are from technology provider organizations.
  • Approximately 32% of responders (32.25%) are from law firms.

Survey Charts


Chart 1: Name of Primary Predictive Coding Platform (31 Responders)

Chart 2: Which predictive coding technologies are utilized by your eDiscovery platform? (31 Responders)

Chart 3: Usage of k-Nearest Neighbor Classifier (k-NN) (31 Responders)

Chart 4: If you use k-Nearest Neighbor Classifier (k-NN), do you use 1-NN or k-NN with k>1? (8 Responders)

Chart 5: Which technology-assisted review protocols are utilized in your delivery of predictive coding? (31 Responders)

Chart 6: Survey Responder Overview (31 Responders)

Survey Background Information

A Working List of Predictive Coding Technologies (1,2,3,4)

Aggregated from electronic discovery experts based on professional publications and personal conversations, the working list below identifies machine learning technologies that have been applied, or have the potential to be applied, to the discipline of eDiscovery to facilitate predictive coding. While not all-inclusive, this working list is designed to provide a reference point for identified predictive coding technologies and may over time include additions, adjustments, and amendments based on feedback from experts and organizations applying and implementing these mainstream technologies in their specific eDiscovery platforms. Two brief illustrative code sketches follow the list.

Listed in Alphabetical Order

  • Active Learning: A process, typically iterative, whereby an algorithm is used to select documents that should be reviewed for training based on a strategy to help the classification algorithm learn efficiently.
  • Decision Tree: A step-by-step method of distinguishing between relevant and non-relevant documents, depending on what combination of words (or other features) they contain. A Decision Tree to identify documents pertaining to financial derivatives might first determine whether or not a document contained the word “swap.” If it did, the Decision Tree might then determine whether or not the document contained “credit,” and so on. A Decision Tree may be created either through knowledge engineering or machine learning.
  • k-Nearest Neighbor Classifier (k-NN): A classification algorithm that analyzes the k example documents that are most similar (nearest) to the document being classified in order to determine the best classification for the document. If k is too small (e.g., k=1), it may be extremely difficult to achieve high recall.
  • Latent Semantic Analysis (LSA): A mathematical representation of documents that treats highly correlated words (i.e., words that tend to occur in the same documents) as being, in a sense, equivalent or interchangeable. This equivalency or interchangeability can allow algorithms to identify documents as being conceptually similar even when they aren’t using the same words (e.g., because synonyms may be highly correlated), though it also discards some potentially useful information and can lead to undesirable results caused by spurious correlations.
  • Logistic Regression: A state-of-the-art supervised learning algorithm for machine learning that estimates the probability that a document is relevant, based on the features that it contains. In contrast to the Naïve Bayes algorithm, Logistic Regression identifies features that discriminate between relevant and non-relevant documents.
  • Naïve Bayesian Classifier: A system that examines the probability that each word in a new document came from the word distribution derived from trained responsive documents or trained non-responsive documents. The system is naïve in the sense that it assumes that all words are independent of one another.
  • Neural Network: An Artificial Neural Network (ANN) is a computational model based on the structure and functions of biological neural networks, processing information in a manner analogous to the human brain. It comprises a large number of connected processing units that work together to process information.
  • Probabilistic Latent Semantic Analysis (PLSA): Similar in spirit to LSA, but uses a probabilistic model that is expected to achieve better results.
  • Random Forests: An ensemble learning method for classification, regression, and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. Random decision forests correct for decision trees’ habit of overfitting to their training set.
  • Relevance Feedback: An active learning process in which the documents with the highest likelihood of relevance are coded by a human and added to the training set.
  • Support Vector Machine: A mathematical approach that seeks to find a line that separates responsive from non-responsive documents so that, ideally, all of the responsive documents are on one side of the line and all of the non-responsive ones are on the other side.
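
To make the mechanics of several of these supervised learning technologies concrete, below is a minimal sketch in Python using scikit-learn, assuming an invented toy document set and hypothetical reviewer coding decisions. It trains several of the listed classifiers, including 1-NN and k-NN with k>1, on identical training data; it is illustrative only, and commercial predictive coding platforms implement these techniques with their own feature engineering and workflows.

```python
# Illustrative sketch only: toy comparison of several classifiers named in
# the list above, using scikit-learn on a tiny hypothetical document set.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC

# Hypothetical training documents coded by a human reviewer:
# 1 = relevant, 0 = non-relevant.
train_docs = [
    "credit default swap exposure report",
    "interest rate swap confirmation",
    "quarterly derivatives trading summary",
    "office holiday party planning",
    "cafeteria menu for next week",
    "parking garage access update",
]
train_labels = [1, 1, 1, 0, 0, 0]

# Unreviewed documents to be classified.
new_docs = ["swap counterparty credit risk memo", "company picnic signup sheet"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)
X_new = vectorizer.transform(new_docs)

classifiers = {
    "Logistic Regression": LogisticRegression(),
    "Naive Bayes": MultinomialNB(),
    "1-NN (k=1)": KNeighborsClassifier(n_neighbors=1),
    "k-NN (k=3)": KNeighborsClassifier(n_neighbors=3),
    "Random Forest": RandomForestClassifier(n_estimators=100),
    "Linear SVM": LinearSVC(),
}

# Each classifier learns from the same coding decisions and labels the
# unreviewed documents as relevant (1) or non-relevant (0).
for name, clf in classifiers.items():
    clf.fit(X_train, train_labels)
    print(name, clf.predict(X_new))
```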
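
Likewise, a common open-source approximation of LSA applies a truncated singular value decomposition to a TF-IDF matrix; the brief sketch below, again with invented documents, shows how documents sharing correlated vocabulary can score as similar in the reduced concept space. This is a sketch under those assumptions, not a description of any particular platform's implementation.

```python
# Illustrative sketch only: LSA approximated via truncated SVD of a TF-IDF
# matrix. Documents with correlated vocabulary land near each other in the
# reduced "concept" space even without exact word overlap.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Invented toy documents for illustration.
docs = [
    "credit default swap agreement",
    "derivative swap exposure",
    "credit derivative exposure",
    "holiday party menu",
    "cafeteria menu planning",
]

X = TfidfVectorizer().fit_transform(docs)

# Project the documents into a low-dimensional latent semantic space.
lsa = TruncatedSVD(n_components=2, random_state=0)
X_lsa = lsa.fit_transform(X)

# Similarity of the first document to all documents in the latent space;
# conceptually related documents score high even with partial word overlap.
print(cosine_similarity(X_lsa[:1], X_lsa)[0].round(2))
```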

General TAR Protocols (5,6,7,8)

Additionally, these technologies are generally employed as part of a TAR protocol that determines how the technologies are used. Examples of TAR protocols include the following (a simplified sketch of a CAL loop appears after the list):

  • Simple Passive Learning (SPL): In SPL methods, the teacher (i.e., human operator) selects the documents to be used as training examples; the learner is trained using these examples and, once sufficiently trained, is used to label every document in the collection as relevant or non-relevant. Generally, the documents labeled as relevant by the learner are re-reviewed manually. This manual review represents a small fraction of the collection, and hence a small fraction of the time and cost of an exhaustive manual review.
  • Simple Active Learning (SAL): In SAL methods, after the initial training set, the learner selects the documents to be reviewed and coded by the teacher, and used as training examples, and continues to select examples until it is sufficiently trained. Typically, the documents the learner chooses are those about which the learner is least certain, and therefore from which it will learn the most. Once sufficiently trained, the learner is then used to label every document in the collection. As with SPL, the documents labeled as relevant are generally re-reviewed manually.
  • Continuous Active Learning (CAL): In CAL, the TAR method developed, used, and advocated by Maura R. Grossman and Gordon V. Cormack, after the initial training set, the learner repeatedly selects the next-most-likely-to-be-relevant documents (that have not yet been considered) for review, coding, and training, and continues to do so until it can no longer find any more relevant documents. There is generally no second review because by the time the learner stops learning, all documents deemed relevant by the learner have already been identified and manually reviewed.
  • Scalable Continuous Active Learning (S-CAL): The essential difference between S-CAL and CAL is that for S-CAL, only a finite sample of documents from each successive batch is selected for labeling, and the process continues until the collection—or a large random sample of the collection—is exhausted. Together, the finite samples form a stratified sample of the document population, from which a statistical estimate of ρ (the prevalence of relevant documents) may be derived.
  • Hybrid Multimodal Method: An approach developed by the e-Discovery Team (Ralph Losey) that includes all types of search methods, with primary reliance placed on predictive coding and the use of high-ranked documents for continuous active training.
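
To illustrate the shape of the CAL protocol described above, the minimal sketch below runs a simplified active learning loop: seed training, repeated selection of the highest-scoring unreviewed documents, simulated human coding, and a deliberately naive stopping rule. The collection, the simulated_reviewer function, and the stopping rule are invented for illustration and are far simpler than the Grossman-Cormack method or any production implementation.

```python
# Illustrative sketch only: a highly simplified Continuous Active Learning
# (CAL) loop. The learner repeatedly surfaces the document it currently
# scores as most likely relevant, a (simulated) human codes it, and the
# model is retrained until review stops finding relevant documents.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def simulated_reviewer(doc):
    # Hypothetical stand-in for a human reviewer's coding decision.
    return 1 if "swap" in doc or "derivative" in doc else 0

# Invented toy collection for illustration.
collection = [
    "credit default swap exposure report",
    "interest rate swap confirmation",
    "derivative trading summary for Q3",
    "office holiday party planning",
    "cafeteria menu for next week",
    "parking garage access update",
    "swap novation agreement draft",
    "team offsite agenda",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(collection)

# Seed training set: one known relevant and one known non-relevant document.
reviewed = {0: 1, 3: 0}
batch_size = 1

while len(reviewed) < len(collection):
    train_idx = list(reviewed)
    model = LogisticRegression().fit(X[train_idx], [reviewed[i] for i in train_idx])
    unreviewed = [i for i in range(len(collection)) if i not in reviewed]
    # Rank unreviewed documents by predicted probability of relevance.
    scores = model.predict_proba(X[unreviewed])[:, 1]
    order = np.argsort(scores)[::-1]
    next_batch = [unreviewed[j] for j in order[:batch_size]]
    labels = {i: simulated_reviewer(collection[i]) for i in next_batch}
    reviewed.update(labels)
    # Deliberately naive stopping rule: stop when the top-ranked documents
    # are all coded non-relevant (real CAL criteria are more sophisticated).
    if not any(labels.values()):
        break

found = [i for i, label in reviewed.items() if label == 1]
print("Relevant documents found:", found)
```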

Survey Background References

(1) Grossman, M. and Cormack, G. (2013). The Grossman-Cormack Glossary of Technology-Assisted Review. [ebook] Federal Courts Law Review. Available at: http://www.fclr.org/fclr/articles/html/2010/grossman.pdf [Accessed 31 Aug. 2018].

(2) Dimm, B. (2018). Expertise on Predictive Coding. [email].

(3) Roitblat, H. (2013). Introduction to Predictive Coding. [ebook] OrcaTec. Available at: https://theolp.wildapricot.org/Resources/Documents/Introduction%20to%20Predictive%20Coding%20-%20Herb%20Roitblat.pdf [Accessed 31 Aug. 2018].

(4) Tredennick, J. and Pickens, J. (2017). Deep Learning in E-Discovery: Moving Past the Hype. [online] Catalystsecure.com. Available at: https://catalystsecure.com/blog/2017/07/deep-learning-in-e-discovery-moving-past-the-hype/ [Accessed 31 Aug. 2018].

(5) Grossman, M. and Cormack, G. (2017). Technology-Assisted Review in Electronic Discovery. [ebook] Available at: https://judicialstudies.duke.edu/wp-content/uploads/2017/07/Panel-1_TECHNOLOGY-ASSISTED-REVIEW-IN-ELECTRONIC-DISCOVERY.pdf [Accessed 31 Aug. 2018].

(6) Grossman, M. and Cormack, G. (2016). Continuous Active Learning for TAR. [ebook] Practical Law. Available at: https://pdfs.semanticscholar.org/ed81/f3e1d35d459c95c7ef60b1ba0b3a202e4400.pdf [Accessed 31 Aug. 2018].

(7) Grossman, M. and Cormack, G. (2016). Scalability of Continuous Active Learning for Reliable High-Recall Text Classification. [ebook] Available at: https://plg.uwaterloo.ca/~gvcormac/scal/cormackgrossman16a.pdf [Accessed 3 Sep. 2018].

(8) Losey, R., Sullivan, J. and Reichenberger, T. (2015). e-Discovery Team at TREC 2015 Total Recall Track. [ebook] Available at: https://trec.nist.gov/pubs/trec24/papers/eDiscoveryTeam-TR.pdf [Accessed 1 Sep. 2018].

(9) Dimm, B. (2018). TAR, Proportionality, and Bad Algorithms (1-NN). [online] Clustify Blog – eDiscovery, Document Clustering, Predictive Coding, Information Retrieval, and Software Development. Available at: https://blog.cluster-text.com/2018/08/13/tar-proportionality-and-bad-algorithms-1-nn/ [Accessed 31 Aug. 2018].

(10) Robinson, R. (2013). Running Results: Predictive Coding One-Question Provider Implementation Survey. [online] ComplexDiscovery: eDiscovery Information. Available at: https://complexdiscovery.com/2013/03/05/running-results-predictive-coding-one-question-provider-implementation-survey/ [Accessed 31 Aug. 2018].

(11) Robinson, R. (2018). A Running List: Top 100+ eDiscovery Providers. [online] ComplexDiscovery: eDiscovery Information. Available at: https://complexdiscovery.com/2017/01/19/28252/ [Accessed 31 Aug. 2018].

(12) Robinson, R. (2018). Predictive Coding Technologies and Protocols: Overview and Survey. [online] ComplexDiscovery: eDiscovery Information. Available at: https://complexdiscovery.com/predictive-coding-technologies-and-protocols-overview-and-survey/ [Accessed 16 Sep. 2018].

 
