Missing Something? Topic Modeling in eDiscovery

The basic idea behind topic modeling, according to eDiscovery expert and author Herbert Roitblat, is that documents consist of words that are derived from some mixture of topics. The goal of eDiscovery, argues the author, is to get the information contained in a collection of documents, not to get the documents themselves. The documents are just a means to reach the information.

en flag
nl flag
et flag
fi flag
fr flag
de flag
pt flag
ru flag
es flag

Content Assessment: Topic Modeling in eDiscovery Paper by Herbert Roitblat

Information - 98%
Insight - 98%
Relevance - 100%
Objectivity - 98%
Authority - 100%

99%

Excellent

A short percentage-based assessment of the qualitative benefit of the recent post sharing Herb Roitblat's paper on topic modeling in eDiscovery.

Editor’s Note: As an author, contributor, and speaker on eDiscovery, Herbert Roitblat is a technology entrepreneur, inventor, and expert who needs no introduction to serious professionals in the eDiscovery ecosystem. Currently serving as Principal Data Scientist at Mimecast, he is a recognized expert in areas ranging from cognitive science and information retrieval to eDiscovery and machine learning. His recently published paper on topical modeling in eDiscovery calls attention to the search process in legal discovery and highlights that a computer-assisted search process is not only reasonable, but it is also complete when measured by topics.

Is there something I’m missing? Topic Modeling in eDiscovery

By Herbert Roitblat, Ph.D.

Abstract

In legal eDiscovery, the parties are required to search through their electronically stored information to find documents that are relevant to a specific case. Negotiations over the scope of these searches are often based on a fear that something will be missed. This paper continues an argument that discovery should be based on identifying the facts of a case. If a search process is less than complete (if it has Recall less than 100%), it may still be complete in presenting all of the relevant available topics. In this study, Latent Dirichlet Allocation* was used to identify 100 topics from all of the known relevant documents. The documents were then categorized to about 80% Recall (i.e., 80% of the relevant documents were found by the categorizer, designated the hit set and 20% were missed, designated the missed set). Despite the fact that less than all of the relevant documents were identified by the categorizer, the documents that were identified contained all of the topics derived from the full set of documents. This same pattern held whether the categorizer was a naïve Bayes categorizer trained on a random selection of documents or a Support Vector Machine trained with Continuous Active Learning (which focuses evaluation on the most-likely-to-be-relevant documents). No topics were identified in either categorizer’s missed set that were not already seen in the hit set. Not only is a computer-assisted search process reasonable (as required by the Federal Rules of Civil Procedure), it is also complete when measured by topics.


Review the Complete Paper (PDF) Shared with Permission

Topic Modeling in eDiscovery – Herbert Roitblat Ph.D

Read the original paper via arXiv® (Cornell University)


* Background: [Latent Dirichlet Allocation – Wikipedia] In natural language processing, the latent Dirichlet allocation (LDA) is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. For example, if observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word’s presence is attributable to one of the document’s topics. LDA is an example of a topic model and belongs to the machine learning toolbox and in wider sense to the artificial intelligence toolbox. [Peter Gustav Lejeune Dirichlet – Wikipedia] Johann Peter Gustav Lejeune Dirichlet was a German mathematician who made deep contributions to number theory (including creating the field of analytic number theory), and to the theory of Fourier series and other topics in mathematical analysis; he is credited with being one of the first mathematicians to give the modern formal definition of a function. Dirichlet also first stated the pigeonhole principle.

Additional Reading

Source: ComplexDiscovery

Have a Request?

If you have information or offering requests that you would like to ask us about, please let us know and we will make our response to you a priority.

ComplexDiscovery is an online publication that highlights data and legal discovery insight and intelligence ranging from original research to aggregated news for use by business, information technology, and legal professionals. The highly targeted publication seeks to increase the collective understanding of readers regarding data and legal discovery information and issues and to provide an objective resource for considering trends, technologies, and services related to electronically stored information.

ComplexDiscovery OÜ is a technology marketing firm providing strategic planning and tactical execution expertise in support of data and legal discovery organizations. Registered as a private limited company in the European Union country of Estonia, one of the most digitally advanced countries in the world, ComplexDiscovery OÜ operates virtually worldwide to deliver marketing consulting and services.

Business as Unusual? Eighteen Observations on eDiscovery Business Confidence in the Summer of 2020

The results of the recent Summer 2020 eDiscovery Business Confidence Survey present the unfortunate and continuing impact of COVID-19 on the business of eDiscovery. However, for these pandemic-driven results to be fully understood, they should be viewed through the contextual lens of the results of all nineteen surveys that have been administered to eDiscovery professionals since the inception of the eDiscovery Business Confidence Survey in early 2016.



Check Out the Observations Now!

Interested in Contributing?

ComplexDiscovery combines original industry research with curated expert articles to create an informational resource that helps legal, business, and information technology professionals better understand the business and practice of data discovery and legal discovery.

All contributions are invested to support the development and distribution of ComplexDiscovery content. Contributors can make as many article contributions as they like, but will not be asked to register and pay until their contribution reaches $5.

Mitratech Acquires Acuity ELM

According to Mike Williams, CEO of Mitratech, “We came to the...

Veritas Acquires Globanet

“By integrating Globanet’s technology into our digital compliance portfolio, we’re making...

Five Great Reads on eDiscovery for September 2020

From cloud forensics and cyber defense to social media and surveys,...

Time for a Change? FTC Proposes Changes to HSR Act Premerger Notification Rules

The Federal Trade Commission, with the support of the Department of...

A Running List: Top 100+ eDiscovery Providers

Based on a compilation of research from analyst firms and industry...

The eDisclosure Systems Buyers Guide – 2020 Edition (Andrew Haslam)

Authored by industry expert Andrew Haslam, the eDisclosure Buyers Guide continues...

The Race to the Starting Line? Recent Secure Remote Review Announcements

Not all secure remote review offerings are equal as the apparent...

Enabling Remote eDiscovery? A Snapshot of DaaS

Desktop as a Service (DaaS) providers are becoming important contributors to...

Home or Away? New eDiscovery Collection Market Sizing and Pricing Considerations

One of the key home (onsite) or away (remote) decisions that...

Revisions and Decisions? New Considerations for eDiscovery Secure Remote Reviews

One of the key revision and decision areas that business, legal,...

A Macro Look at Past and Projected eDiscovery Market Size from 2012 to 2024

From a macro look at past estimations of eDiscovery market size...

An eDiscovery Market Size Mashup: 2019-2024 Worldwide Software and Services Overview

While the Compound Annual Growth Rate (CAGR) for worldwide eDiscovery software...

Festive or Restive? The Fall 2020 eDiscovery Business Confidence Survey

Since January 2016, 2,189 individual responses to nineteen quarterly eDiscovery Business...

Casting a Wider Net? Predictive Coding Technologies and Protocols Survey – Fall 2020 Results

The Predictive Coding Technologies and Protocols Survey is a non-scientific semi-annual...

Business as Unusual? Eighteen Observations on eDiscovery Business Confidence in the Summer of 2020

Based on the aggregate results of nineteen past eDiscovery Business Confidence...

A Growing Concern? Budgetary Constraints and the Business of eDiscovery

In the summer of 2020, 56% of respondents viewed budgetary constraints...

Mitratech Acquires Acuity ELM

According to Mike Williams, CEO of Mitratech, “We came to the...

Veritas Acquires Globanet

“By integrating Globanet’s technology into our digital compliance portfolio, we’re making...

An eDiscovery Holiday Season Down Under? Macquarie Prepares Nuix for IPO

According to John Beveridge, writing for Small Caps, Macquarie holds a...

ayfie to Acquire Haive

According to Johannes Stiehler, CEO of ayfie Group AS, “This acquisition...

Five Great Reads on eDiscovery for September 2020

From cloud forensics and cyber defense to social media and surveys,...

Five Great Reads on eDiscovery for August 2020

From predictive coding and artificial intelligence to antitrust investigations and malware,...

Five Great Reads on eDiscovery for July 2020

From business confidence and operational metrics to data protection and privacy...

Five Great Reads on eDiscovery for June 2020

From collection market size updates to cloud outsourcing guidelines, the June...