Content Assessment: Topic Modeling in eDiscovery Paper by Herbert Roitblat

Information - 98%
Insight - 98%
Relevance - 100%
Objectivity - 98%
Authority - 100%

99%

Excellent

A short percentage-based assessment of the qualitative benefit of the recent post sharing Herbert Roitblat's paper on topic modeling in eDiscovery.

Editor’s Note: As an author, contributor, and speaker on eDiscovery, Herbert Roitblat is a technology entrepreneur, inventor, and expert who needs no introduction to serious professionals in the eDiscovery ecosystem. Currently serving as Principal Data Scientist at Mimecast, he is a recognized expert in areas ranging from cognitive science and information retrieval to eDiscovery and machine learning. His recently published paper on topic modeling in eDiscovery calls attention to the search process in legal discovery and argues that a computer-assisted search process is not only reasonable but also complete when measured by topics.

Is there something I’m missing? Topic Modeling in eDiscovery

By Herbert Roitblat, Ph.D.

Abstract

In legal eDiscovery, the parties are required to search through their electronically stored information to find documents that are relevant to a specific case. Negotiations over the scope of these searches are often based on a fear that something will be missed. This paper continues an argument that discovery should be based on identifying the facts of a case. If a search process is less than complete (if it has Recall less than 100%), it may still be complete in presenting all of the relevant available topics. In this study, Latent Dirichlet Allocation* was used to identify 100 topics from all of the known relevant documents. The documents were then categorized to about 80% Recall (i.e., 80% of the relevant documents were found by the categorizer, designated the hit set, and 20% were missed, designated the missed set). Although the categorizer identified fewer than all of the relevant documents, the documents it did identify contained all of the topics derived from the full set of documents. The same pattern held whether the categorizer was a naïve Bayes categorizer trained on a random selection of documents or a Support Vector Machine trained with Continuous Active Learning (which focuses evaluation on the most-likely-to-be-relevant documents). No topics were identified in either categorizer’s missed set that were not already present in its hit set. Not only is a computer-assisted search process reasonable (as required by the Federal Rules of Civil Procedure), it is also complete when measured by topics.
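To make the study’s design concrete, below is a minimal Python sketch (not the paper’s code) of the topic-coverage check the abstract describes: derive topics from all known relevant documents, split those documents by the categorizer’s outcome into a hit set and a missed set, and test whether any topic appears only among the missed documents. The toy corpus, the hypothetical hit mask, the three-topic model (the study used 100 topics), and the 0.05 presence cutoff are all illustrative assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import numpy as np

# Toy stand-in for the known relevant documents (the study used a real corpus).
relevant_docs = [
    "merger agreement draft terms",
    "merger negotiation price terms",
    "pipeline safety audit report",
    "pipeline inspection audit findings",
    "quarterly earnings revenue forecast",
    "revenue forecast earnings call",
]
# True where a (hypothetical) categorizer retrieved the document.
hit_mask = np.array([True, True, True, False, True, False])

# Derive topics from the FULL set of relevant documents, as in the study
# (the paper used 100 topics; 3 fits this toy corpus).
X = CountVectorizer().fit_transform(relevant_docs)
doc_topics = LatentDirichletAllocation(
    n_components=3, random_state=0
).fit_transform(X)  # rows: documents, columns: topic proportions

# Treat a topic as "present" in a document when its proportion clears an
# assumed 0.05 cutoff (the paper's presence criterion may differ).
present = doc_topics > 0.05
in_hits = present[hit_mask].any(axis=0)
in_missed = present[~hit_mask].any(axis=0)

# Topics seen only in the missed set would be genuine omissions; the paper
# reports finding none in either categorizer's missed set.
only_missed = np.where(in_missed & ~in_hits)[0]
print("Topics appearing only in the missed set:", only_missed.tolist())
```

The design point mirrors the paper’s argument: completeness is assessed over topics rather than documents, so the question is whether the hit set’s topic coverage subsumes the missed set’s, not whether Recall reaches 100%.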


Review the Complete Paper (PDF) Shared with Permission

Topic Modeling in eDiscovery – Herbert Roitblat, Ph.D.

Read the original paper via arXiv® (Cornell University)


* Background: [Latent Dirichlet Allocation – Wikipedia] In natural language processing, latent Dirichlet allocation (LDA) is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. For example, if observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word’s presence is attributable to one of the document’s topics. LDA is an example of a topic model and belongs to the machine learning toolbox and, in a wider sense, to the artificial intelligence toolbox. [Peter Gustav Lejeune Dirichlet – Wikipedia] Johann Peter Gustav Lejeune Dirichlet was a German mathematician who made deep contributions to number theory (including creating the field of analytic number theory) and to the theory of Fourier series and other topics in mathematical analysis; he is credited with being one of the first mathematicians to give the modern formal definition of a function. Dirichlet also first stated the pigeonhole principle.
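For readers new to topic models, the following toy example illustrates LDA’s core behavior: it learns topics as distributions over words, with each document modeled as a mixture of those topics. It is a sketch using scikit-learn’s LatentDirichletAllocation on an invented four-document corpus unrelated to the paper’s data; with two topics, the heaviest words of each learned topic tend to separate the legal documents from the IT documents.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Invented corpus: two documents about court proceedings, two about IT systems.
docs = [
    "court filing motion judge ruling",
    "judge ruling appeal court motion",
    "server backup email archive storage",
    "email archive server storage backup",
]
vec = CountVectorizer()
X = vec.fit_transform(docs)

# Fit a two-topic model; each row of components_ holds one topic's word weights.
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

words = vec.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [words[i] for i in topic.argsort()[::-1][:3]]  # three heaviest words
    print(f"Topic {k}: {', '.join(top)}")
```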

Additional Reading

Source: ComplexDiscovery


Have a Request?

If you have questions about our information or offerings, please let us know, and we will make responding to you a priority.

ComplexDiscovery OÜ is a widely recognized digital publication focused on providing detailed insights into the fields of cybersecurity, information governance, and eDiscovery. Based in Estonia, a hub for digital innovation, ComplexDiscovery OÜ upholds rigorous standards of journalistic integrity, delivering nuanced analyses of global trends, technology advancements, and the eDiscovery sector. The publication connects intricate legal technology issues with the broader narrative of international business and current events, offering its readership valuable insights for informed decision-making.

For the latest in law, technology, and business, visit ComplexDiscovery.com.


Generative Artificial Intelligence and Large Language Model Use

ComplexDiscovery OÜ recognizes the value of GAI and LLM tools in streamlining content creation processes and enhancing the overall quality of its research, writing, and editing efforts. To this end, ComplexDiscovery OÜ regularly employs GAI tools, including ChatGPT, Claude, DALL-E 2, Grammarly, Midjourney, and Perplexity, to assist, augment, and accelerate the development and publication of both new and revised content in published posts and pages, an effort initiated in late 2022.

ComplexDiscovery also provides a ChatGPT-powered AI article assistant for its users. This feature leverages LLM capabilities to generate relevant and valuable insights related to specific page and post content published on ComplexDiscovery.com. By offering this AI-driven service, ComplexDiscovery OÜ aims to create a more interactive and engaging experience for its users, while highlighting the importance of responsible and ethical use of GAI and LLM technologies.