Predictive Coding Technologies and Protocols: Spring 2019 Survey

The Predictive Coding Technologies and Protocols Survey is a non-scientific semi-annual survey designed to help provide a general understanding of the current application of predictive coding technologies, protocols, workflows, and uses by data discovery and legal discovery professionals.

As your opinion is essential in helping form a complete picture of the interest and impact of predictive coding in eDiscovery, please do take the time to complete this short, anonymized survey, as the results will help legal, business, and technology professionals in the eDiscovery ecosystem better understand the current use of predictive coding.

Editor’s Note: This is the second semi-annual Predictive Coding Technologies and Protocols survey conducted by ComplexDiscovery. The first survey provided detailed feedback from 31 legal, business, and technology professionals on the use of specific machine learning technologies in predictive coding and also highlighted the usage of those machine learning technologies as part of example technology-assisted review protocols. This second iteration expands the survey beyond technology and protocol usage by adding questions on workflows and specific uses for technology-assisted review.

Predictive Coding Technologies and Protocols (Survey Backgrounder)

As defined in The Grossman-Cormack Glossary of Technology-Assisted Review (1), Predictive Coding is an industry-specific term generally used to describe a technology-assisted review process involving the use of a machine learning algorithm to distinguish relevant from non-relevant documents, based on a subject matter expert’s coding of a training set of documents. This definition of predictive coding provides a baseline description that identifies one particular function that a general set of commonly accepted machine learning algorithms may use in a technology-assisted review (TAR).

With the growing awareness and use of predictive coding in the legal arena today, it is increasingly important for electronic discovery professionals to have a general understanding of the technologies that may be implemented in electronic discovery platforms to facilitate predictive coding of electronically stored information. This general understanding is essential because each potential algorithmic approach has advantages and disadvantages that may affect the efficiency and efficacy of predictive coding.

To help in developing this general understanding of predictive coding technologies and to provide an opportunity for electronic discovery providers to share the technologies and protocols they use in and with their platforms to accomplish predictive coding, the following working lists of predictive coding technologies and TAR protocols are provided for your use. Working lists on predictive coding workflows and uses are also included for your consideration as they help define how the predictive coding technologies and TAR protocols are implemented and used.

Additionally, a simple four-question eDiscovery provider implementation survey is shared to gather information on how leading eDiscovery providers combine technologies and protocols to conduct predictive coding.

A Working List of Predictive Coding Technologies (1,2,3,4)

Aggregated from electronic discovery experts based on professional publications and personal conversations, provided below is a non-exhaustive working list of identified machine learning technologies that have been applied or have the potential to be applied to the discipline of eDiscovery to facilitate predictive coding. This working list is designed to provide a reference point for identified predictive coding technologies and may over time include additions, adjustments, and amendments based on feedback from experts and organizations applying and implementing these mainstream technologies in their specific eDiscovery platforms.

Listed in Alphabetical Order

  • Active Learning: A process, typically iterative, whereby an algorithm is used to select documents that should be reviewed for training based on a strategy to help the classification algorithm learn efficiently.
  • Decision Tree: A step-by-step method of distinguishing between relevant and non-relevant documents, depending on what combination of words (or other features) they contain. A Decision Tree to identify documents pertaining to financial derivatives might first determine whether or not a document contained the word “swap.” If it did, the Decision Tree might then determine whether or not the document contained “credit,” and so on. A Decision Tree may be created either through knowledge engineering or machine learning.
  • k-Nearest Neighbor Classifier (k-NN): A classification algorithm that analyzes the k example documents that are most similar (nearest) to the document being classified in order to determine the best classification for the document. If k is too small (e.g., k=1), it may be extremely difficult to achieve high recall.
  • Latent Semantic Analysis (LSA): A mathematical representation of documents that treats highly correlated words (i.e., words that tend to occur in the same documents) as being, in a sense, equivalent or interchangeable. This equivalency or interchangeability can allow algorithms to identify documents as being conceptually similar even when they aren’t using the same words (e.g., because synonyms may be highly correlated), though it also discards some potentially useful information and can lead to undesirable results caused by spurious correlations.
  • Logistic Regression: A state-of-the-art supervised learning algorithm for machine learning that estimates the probability that a document is relevant, based on the features that it contains. In contrast to the Naïve Bayes algorithm, Logistic Regression identifies features that discriminate between relevant and non-relevant documents.
  • Naïve Bayesian Classifier: A system that examines the probability that each word in a new document came from the word distribution derived from responsive training documents or from non-responsive training documents. The system is naïve in the sense that it assumes that all words are independent of one another.
  • Neural Network: An Artificial Neural Network (ANN) is a computational model based on the structure and functions of biological neural networks, loosely mirroring the way the human brain processes information. It consists of a large number of connected processing units that work together to process information.
  • Probabilistic Latent Semantic Analysis (PLSA): This is similar in spirit to LSA but it uses a probabilistic model to achieve results that are expected to be better.
  • Random Forests: An ensemble learning method for classification, regression, and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. Random decision forests correct for decision trees’ habit of overfitting to their training set.
  • Relevance Feedback: An active learning process in which the documents with the highest likelihood of relevance are coded by a human, and added to the training set.
  • Support Vector Machine: A mathematical approach that seeks to find a line that separates responsive from non-responsive documents so that, ideally, all of the responsive documents are on one side of the line and all of the non-responsive ones are on the other side.
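To make one of the listed technologies concrete, the following is a minimal sketch of a Naïve Bayesian Classifier as described above: it compares how likely each word in a new document is under the word distribution of responsive versus non-responsive training documents, treating words as independent. The function names, the whitespace tokenizer, the add-one smoothing, and the sample documents are illustrative assumptions, not taken from any eDiscovery platform, and the sketch assumes the training set contains at least one document of each class.

```python
import math
from collections import Counter


def train_naive_bayes(labeled_docs):
    """Build per-class word counts from (text, is_relevant) pairs coded by a reviewer."""
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = Counter()
    for text, is_relevant in labeled_docs:
        doc_counts[is_relevant] += 1
        word_counts[is_relevant].update(text.lower().split())
    vocabulary = set(word_counts[True]) | set(word_counts[False])
    return word_counts, doc_counts, vocabulary


def relevance_probability(text, model):
    """Estimate P(relevant | words), assuming words are independent given the class."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    log_scores = {}
    for label in (True, False):
        # Class prior: the fraction of training documents carrying this label.
        log_score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # words never seen in training carry no evidence
            # Add-one smoothing (an assumption) avoids zero probabilities.
            log_score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            )
        log_scores[label] = log_score
    # Convert the two log scores into a normalized probability.
    top = max(log_scores.values())
    weights = {label: math.exp(s - top) for label, s in log_scores.items()}
    return weights[True] / (weights[True] + weights[False])


# Hypothetical training set coded by a subject matter expert.
training_set = [
    ("credit swap agreement", True),
    ("swap pricing for credit derivatives", True),
    ("lunch menu friday", False),
    ("team lunch schedule", False),
]
model = train_naive_bayes(training_set)
print(relevance_probability("credit swap", model))
print(relevance_probability("lunch friday", model))
```

In this toy setup the first document scores well above 0.5 and the second well below it. A production system would use richer features and far larger training sets.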

General TAR Protocols (5,6,7,8)

Additionally, these technologies are generally employed as part of a TAR protocol which determines how the technologies are used. Examples of TAR protocols include:

Listed in Alphabetical Order

  • Continuous Active Learning (CAL): In CAL, the TAR method developed, used, and advocated by Maura R. Grossman and Gordon V. Cormack, after the initial training set, the learner repeatedly selects the next-most-likely-to-be-relevant documents (that have not yet been considered) for review, coding, and training, and continues to do so until it can no longer find any more relevant documents. There is generally no second review because, by the time the learner stops learning, all documents deemed relevant by the learner have already been identified and manually reviewed.
  • Hybrid Multimodal Method: An approach developed by the e-Discovery Team (Ralph Losey) that includes all types of search methods, with primary reliance placed on predictive coding and the use of high-ranked documents for continuous active training.
  • Scalable Continuous Active Learning (S-CAL): The essential difference between S-CAL and CAL is that for S-CAL, only a finite sample of documents from each successive batch is selected for labeling, and the process continues until the collection—or a large random sample of the collection—is exhausted. Together, the finite samples form a stratified sample of the document population, from which a statistical estimate of ρ (the prevalence of relevant documents) may be derived.
  • Simple Active Learning (SAL): In SAL methods, after the initial training set, the learner selects the documents to be reviewed and coded by the teacher, and used as training examples, and continues to select examples until it is sufficiently trained. Typically, the documents the learner chooses are those about which the learner is least certain, and therefore from which it will learn the most. Once sufficiently trained, the learner is then used to label every document in the collection. As with SPL, the documents labeled as relevant are generally re-reviewed manually.
  • Simple Passive Learning (SPL): In simple passive learning (“SPL”) methods, the teacher (i.e., human operator) selects the documents to be used as training examples; the learner is trained using these examples, and once sufficiently trained, is used to label every document in the collection as relevant or non-relevant. Generally, the documents labeled as relevant by the learner are re-reviewed manually. This manual review represents a small fraction of the collection, and hence a small fraction of the time and cost of an exhaustive manual review.
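The CAL protocol described above can be sketched as a simple loop: train on everything coded so far, send the highest-scoring unreviewed documents to the reviewer, and repeat. This is only an illustration under stated assumptions. The toy word-weight learner, the `review` callback standing in for human coding, the batch size, and the stopping rule (stop when a batch yields no relevant documents) are all hypothetical choices and not the method specified by Grossman and Cormack.

```python
from collections import Counter


def train(labeled):
    """Toy learner (an assumption): a word gains weight when it appears in a
    relevant document and loses weight when it appears in a non-relevant one."""
    weights = Counter()
    for text, is_relevant in labeled.values():
        for word in set(text.lower().split()):
            weights[word] += 1 if is_relevant else -1
    return weights


def score(text, weights):
    """Sum the learned weights of the distinct words in a document."""
    return sum(weights[word] for word in set(text.lower().split()))


def continuous_active_learning(collection, review, seed_ids, batch_size=2):
    """collection: dict of doc_id -> text; review: callable(doc_id) -> bool."""
    labeled = {d: (collection[d], review(d)) for d in seed_ids}
    while len(labeled) < len(collection):
        weights = train(labeled)  # retrain on all coding decisions so far
        unreviewed = [d for d in collection if d not in labeled]
        # Rank unreviewed documents from most to least likely relevant.
        unreviewed.sort(key=lambda d: score(collection[d], weights), reverse=True)
        found = 0
        for doc_id in unreviewed[:batch_size]:
            is_relevant = review(doc_id)
            labeled[doc_id] = (collection[doc_id], is_relevant)
            found += is_relevant
        if found == 0:  # hypothetical stopping rule
            break
    relevant_ids = {d for d, (_, rel) in labeled.items() if rel}
    return relevant_ids, len(labeled)


# Hypothetical collection; truly_relevant stands in for the human reviewer.
collection = {
    1: "credit swap agreement",
    2: "credit default swap pricing",
    3: "swap counterparty credit risk",
    4: "lunch menu friday",
    5: "team lunch schedule",
    6: "parking garage notice",
    7: "interest rate swap confirmation",
    8: "holiday party schedule",
    9: "office parking permit renewal",
    10: "cafeteria lunch hours",
}
truly_relevant = {1, 2, 3, 7}
found, reviewed = continuous_active_learning(
    collection, lambda d: d in truly_relevant, seed_ids=[1, 4]
)
print(sorted(found), reviewed)
```

Because every document the learner surfaces is coded by the reviewer as part of the loop, there is no separate second-pass review, which matches the CAL description above. SAL would differ mainly in the ranking step, selecting the documents the learner is least certain about rather than those scored highest.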

TAR Workflows (9)

TAR workflows represent the practical application of predictive coding technologies and protocols to define approaches to completing predictive coding tasks. Three examples of TAR workflows include:

  • TAR 1.0 involves a training phase followed by a review phase, with a control set used to determine the optimal point to switch from training to review. The system no longer learns once the training phase is completed. The control set is a random set of documents that have been reviewed and marked as relevant or non-relevant. The control set documents are not used to train the system. They are used to assess the system’s predictions so training can be terminated when the benefits of additional training no longer outweigh its cost. Training can be with randomly selected documents, known as Simple Passive Learning (SPL), or it can involve documents chosen by the system to optimize learning efficiency, known as Simple Active Learning (SAL).
  • TAR 2.0 uses an approach called Continuous Active Learning (CAL), meaning that there is no separation between training and review: the system continues to learn throughout. While many approaches may be used to select documents for review, a significant component of CAL is many iterations of predicting which documents are most likely to be relevant, reviewing them, and updating the predictions. Unlike TAR 1.0, TAR 2.0 tends to be very efficient even when prevalence is low. Since there is no separation between training and review, TAR 2.0 does not require a control set. Generating a control set can involve reviewing a large number of non-relevant documents (especially when prevalence is low), so avoiding control sets is desirable.
  • TAR 3.0 requires a high-quality conceptual clustering algorithm that forms narrowly focused clusters of fixed size in concept space. It applies the TAR 2.0 methodology to just the cluster centers, which ensures that a diverse set of potentially relevant documents is reviewed. Once no more relevant cluster centers can be found, the reviewed cluster centers are used as training documents to make predictions for the full document population. There is no need for a control set: the system is well-trained when no additional relevant cluster centers can be found. Analysis of the cluster centers that were reviewed provides an estimate of the prevalence and the number of non-relevant documents that would be produced if documents were produced based purely on the predictions without human review. The user can decide to produce documents (not identified as potentially privileged) without review, similar to SAL from TAR 1.0 (but without a control set), or can decide to review documents that have too much risk of being non-relevant (which can be used as additional training for the system, i.e., CAL). The key point is that the user has the information needed to decide how to proceed after completing review of the cluster centers that are likely to be relevant, and nothing done before that point is invalidated by the decision (compare to starting with TAR 1.0, reviewing a control set, finding that the predictions aren’t good enough to produce documents without review, and then switching to TAR 2.0, which renders the control set virtually useless).
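As a rough illustration of how a TAR 1.0 control set is used, and why building one becomes costly at low prevalence, the sketch below estimates recall and precision from a coded control set and computes the expected number of randomly sampled documents needed to obtain a target count of relevant ones. The function names, the relevance-score cutoff, and the sample figures are hypothetical. The expected-size calculation is simply the target count divided by prevalence, which assumes documents are drawn at random from a large collection.

```python
def control_set_metrics(control_set, cutoff):
    """control_set: list of (predicted_score, is_relevant) pairs, where the
    relevance coding comes from human review and the score from the system.
    Documents scoring at or above cutoff are treated as predicted relevant."""
    true_pos = sum(1 for s, rel in control_set if s >= cutoff and rel)
    false_pos = sum(1 for s, rel in control_set if s >= cutoff and not rel)
    false_neg = sum(1 for s, rel in control_set if s < cutoff and rel)
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    return recall, precision


def expected_control_set_size(relevant_needed, prevalence):
    """Expected number of random documents to review before relevant_needed
    relevant documents have been seen, given the collection's prevalence."""
    return relevant_needed / prevalence


# Hypothetical control set: system scores paired with reviewer coding.
control_set = [
    (0.9, True), (0.8, True), (0.7, False),
    (0.4, True), (0.2, False), (0.1, False),
]
recall, precision = control_set_metrics(control_set, cutoff=0.5)
print(recall, precision)

# Reaching 100 relevant control documents at 10% versus 1% prevalence.
print(expected_control_set_size(100, 0.10))
print(expected_control_set_size(100, 0.01))
```

Under these assumptions, a tenfold drop in prevalence means roughly ten times as many documents must be reviewed for the control set, most of them non-relevant, which is the cost that TAR 2.0 and TAR 3.0 avoid.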

TAR Uses (10)

TAR technologies, protocols, and workflows can be used effectively to help eDiscovery professionals accomplish many data discovery and legal discovery tasks. Nine commonly considered examples of TAR use include:

  • Identification of Relevant Documents
  • Early Case Assessment/Investigation
  • Prioritization for Review
  • Categorization (By Issues, For Confidentiality or Privacy)
  • Privilege Review
  • Quality Control and Quality Assurance
  • Review of Incoming Productions
  • Disposition/Trial Preparation
  • Information Governance and Data Disposition

Predictive Coding Technologies and Protocols (Survey)

A Four-Question Survey (11,12,13,14)

Provided below is a link to a simple four-question survey designed to capture the current application of technologies, protocols, workflows, and uses of predictive coding in the eDiscovery ecosystem.

Legal, information technology, and business professionals involved in organizational activities involving the use of predictive coding are encouraged to complete the short four-question survey.


Results of the survey (excluding responder contact information) will be aggregated and published on the ComplexDiscovery blog for usage by the eDiscovery community.


(1) Grossman, M. and Cormack, G. (2013). The Grossman-Cormack Glossary of Technology-Assisted Review. [ebook] Federal Courts Law Review. Available at: [Accessed 31 Aug. 2018].

(2) Dimm, B. (2018). Expertise on Predictive Coding. [email].

(3) Roitblat, H. (2013). Introduction to Predictive Coding. [ebook] OrcaTec. Available at: [Accessed 31 Aug. 2018].

(4) Tredennick, J. and Pickens, J. (2017). Deep Learning in E-Discovery: Moving Past the Hype. [online] Available at: [Accessed 31 Aug. 2018].

(5) Grossman, M. and Cormack, G. (2017). Technology-Assisted Review in Electronic Discovery. [ebook] Available at: [Accessed 31 Aug. 2018].

(6) Grossman, M. and Cormack, G. (2016). Continuous Active Learning for TAR. [ebook] Practical Law. Available at: [Accessed 31 Aug. 2018].

(7) Grossman, M. and Cormack, G. (2016). Scalability of Continuous Active Learning for Reliable High-Recall Text Classification. [ebook] Available at: [Accessed 3 Sep. 2018].

(8) Losey, R., Sullivan, J. and Reichenberger, T. (2015). e-Discovery Team at TREC 2015 Total Recall Track. [ebook] Available at: [Accessed 1 Sep. 2018].

(9) Dimm, B. (2016). TAR 3.0 Performance. [online] Clustify Blog – eDiscovery, Document Clustering, Predictive Coding, Information Retrieval, and Software Development. Available at: [Accessed 18 Feb. 2019].

(10) Electronic Discovery Reference Model (EDRM) (2019). Technology Assisted Review (TAR) Guidelines. [online] Available at: [Accessed 18 Feb. 2019].

(11) Dimm, B. (2018). TAR, Proportionality, and Bad Algorithms (1-NN). [online] Clustify Blog – eDiscovery, Document Clustering, Predictive Coding, Information Retrieval, and Software Development. Available at: [Accessed 31 Aug. 2018].

(12) Robinson, R. (2013). Running Results: Predictive Coding One-Question Provider Implementation Survey. [online] ComplexDiscovery: eDiscovery Information. Available at: [Accessed 31 Aug. 2018].

(13) Robinson, R. (2018). A Running List: Top 100+ eDiscovery Providers. [online] ComplexDiscovery: eDiscovery Information. Available at: [Accessed 31 Aug. 2018].

(14) Robinson, R. (2018). Relatively Speaking: Predictive Coding Technologies and Protocols Survey Results. [online] ComplexDiscovery: eDiscovery Information. Available at: [Accessed 18 Feb. 2019].

ComplexDiscovery is an online publication that highlights data and legal discovery insight and intelligence ranging from original research to aggregated news for use by business, information technology, and legal professionals. The highly targeted publication seeks to increase the collective understanding of readers regarding data and legal discovery information and issues and to provide an objective resource for considering trends, technologies, and services related to electronically stored information.

ComplexDiscovery OÜ is a technology marketing firm providing strategic planning and tactical execution expertise in support of data and legal discovery organizations. Registered as a private limited company in the European Union country of Estonia, one of the most digitally advanced countries in the world, ComplexDiscovery OÜ operates virtually worldwide to deliver marketing consulting and services.
