Data Lakes: An Important Technological Approach for Data and Legal Discovery



Editor’s Note: Data lakes provide an architectural approach for storing high-volume, high-velocity, and high-variety data. This storage approach is of increasing interest to business, information technology, and legal professionals as they seek to deal with increasing volumes and types of data coupled with the challenge of interrogating, identifying, and indexing data so it can be analyzed to help organizations uncover insight for business benefit, compliance obligations, and litigation requirements. Provided in this post is a compilation of informational article extracts that may be helpful for those seeking to learn more about the benefit of data lakes and their potential in the sphere of data discovery and legal discovery.


An extract from an article by Jennifer Zaino via BizTech

Data Lakes Prove Key to Modern Data Platforms

What Is a Data Lake? 

Data lakes store data of any type in its raw form, much as a real lake provides a habitat where all types of creatures can live together.

A data lake is an architecture for storing high-volume, high-velocity, high-variety, as-is data in a centralized repository for Big Data and real-time analytics. And the technology is an attention-getter: The global data lakes market is expected to grow at a rate of 28 percent between 2017 and 2023.

Companies can pull in vast amounts of data — structured, semistructured and unstructured — in real time into a data lake, from anywhere. Data can be ingested from Internet of Things sensors, clickstream activity on a website, log files, social media feeds, videos and online transaction processing (OLTP) systems, for instance. There are no constraints on where the data hails from, but it’s a good idea to use metadata tagging to add some level of organization to what’s ingested, so that relevant data can be surfaced for queries and analysis.
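The metadata-tagging idea above can be sketched in a few lines. This is an illustrative toy, not any particular product's API: the tag names, the in-memory catalog, and the `ingest` helper are all assumptions made for the example.

```python
import time
import uuid

def ingest(raw_bytes: bytes, source: str, content_type: str, catalog: list) -> str:
    """Store raw data as-is and record descriptive tags in a simple catalog."""
    object_id = str(uuid.uuid4())
    # The raw payload is kept untouched ("as-is"); only the catalog entry
    # is structured, so relevant data can be surfaced later for queries.
    catalog.append({
        "id": object_id,
        "source": source,             # e.g. "iot-sensor", "clickstream", "oltp"
        "content_type": content_type,
        "ingested_at": time.time(),
        "size_bytes": len(raw_bytes),
    })
    return object_id

catalog = []
ingest(b'{"temp": 21.5}', source="iot-sensor", content_type="application/json", catalog=catalog)
ingest(b"GET /home 200", source="web-log", content_type="text/plain", catalog=catalog)

# Surface only the relevant data for analysis by filtering on tags.
sensor_objects = [e for e in catalog if e["source"] == "iot-sensor"]
print(len(sensor_objects))  # 1
```

The point is that even though the payloads themselves stay raw, a thin layer of tags is enough to keep the lake navigable.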

“To ensure that a lake doesn’t become a swamp, it’s very helpful to provide a catalog that makes data visible and accessible to the business, as well as to IT and data management professionals,” says Doug Henschen, vice president and principal analyst at Constellation Research.

Data Lakes vs. Data Warehouses

Data lakes should not be confused with data warehouses. Where data lakes store raw data, warehouses store current and historical data in an organized fashion.

IT teams and data engineers should think of a data warehouse as a highly structured environment, where racks and containers are clearly labeled and similar items are stacked together for supply chain efficiency.

The difference between a data lake and a data warehouse primarily pertains to analytics.

Data warehouses are best for analyzing structured data quickly and with great accuracy and transparency for managerial or regulatory purposes. Meanwhile, data lakes are primed for experimentation, explains Kelle O’Neal, founder and CEO of management consulting firm First San Francisco Partners.

With a data lake, businesses can quickly load a variety of data types from multiple sources and engage in ad hoc analysis. Or, a data team could leverage machine learning in a data lake to find “a needle in a haystack,” O’Neal says.

“The rapid inclusion of new data sets would never be possible in a traditional data warehouse, with its data model–specific structures and its constraints on adding new sources or targets,” O’Neal says.

Data warehouses follow a “schema on write” approach, which entails defining a schema for data before being able to write it to the database. Online analytical processing (OLAP) technology can be used to analyze and evaluate data in a warehouse, enabling fast responses to complex analytical queries.
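Schema on write can be illustrated with SQLite (bundled with Python) standing in for a warehouse. The table and column names are invented for the example; the behavior shown is generic to any schema-on-write store: structure is declared first, and writes must conform to it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Schema on write: the table structure is declared before any data arrives...
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")

# ...and every write must conform to that structure.
conn.execute("INSERT INTO sales VALUES ('EMEA', 1250.0)")
conn.execute("INSERT INTO sales VALUES ('APAC', 980.0)")

# A row that doesn't fit the declared schema is rejected at write time.
rejected = False
try:
    conn.execute("INSERT INTO sales VALUES ('EMEA', 1250.0, 'extra-column')")
except sqlite3.OperationalError:
    rejected = True

total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(total)  # 2230.0
```

That up-front rigidity is exactly what makes warehouse queries fast and trustworthy, and also what makes adding a new data source slow.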

Data lakes take a “schema on read” approach, where the data is structured and transformed only when it is ready to be used. For this reason, it’s a snap to bring in new data sources, and users don’t have to know in advance the questions they want to answer. With lakes, “different types of analytics on your data — like SQL queries, Big Data analytics, full-text search, real-time analytics and machine learning — can be used to uncover insights,” according to Amazon. Moreover, data lakes are capable of real-time actions based on algorithm-driven analytics.
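By contrast, schema on read can be sketched as follows. The records and the `clicks_by_user` query are hypothetical; the point is that heterogeneous raw records coexist in the lake, and structure is imposed only at query time.

```python
import json

# Raw, heterogeneous records land in the lake exactly as produced;
# no schema is declared at write time.
raw_lake = [
    '{"user": "ann", "clicks": 3}',
    '{"user": "bob", "clicks": 7, "referrer": "search"}',
    '{"sensor": "t-17", "temp": 21.5}',  # a completely different shape, stored side by side
]

def clicks_by_user(lake):
    """Schema on read: project only the fields this particular question needs."""
    out = {}
    for line in lake:
        record = json.loads(line)
        if "user" in record:
            out[record["user"]] = record.get("clicks", 0)
    return out

print(clicks_by_user(raw_lake))  # {'ann': 3, 'bob': 7}
```

Because no write-time schema constrains the lake, the sensor record required no migration to add, and a different query tomorrow can impose a different structure on the same raw data.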

Businesses may use both data lakes and data warehouses. The decision about which to use turns on “understanding and optimizing what the different solutions do best,” O’Neal says.

Read the complete article at Data Lakes Prove Key to Modern Data Platforms


An extract from an article by Bernard Marr via Forbes

What Is A Data Lake? A Super-Simple Explanation For Anyone

Some mistakenly believe that a data lake is just the 2.0 version of a data warehouse. While the two are similar, they are different tools that should be used for different purposes. James Dixon, the CTO of Pentaho, is credited with naming the concept of a data lake. He uses the following analogy:

“If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”

A data lake holds data in an unstructured way, with no hierarchy or organization among the individual pieces of data. It holds data in its rawest form—neither processed nor analyzed. Additionally, a data lake accepts and retains data from all sources, supports all data types, and applies schemas (the way data is structured in a database) only when the data is ready to be used.

Read the complete article at What Is A Data Lake? A Super-Simple Explanation For Anyone


An extract from an article by Lance Weaver via Equinix

Why Companies Are Jumping Into Data Lakes

Data lakes are a developing entity, and the industry hasn’t coalesced around a single, universally accepted definition. A consensus definition, derived from the consultation of several different sources, follows: “A data lake is a storage mechanism designed to facilitate the colocation and use of many different types of data, including data that is defined using various schemata, structural frameworks, blobs, and other files.”

The hope is that a data lake will make it possible for an enterprise to gain new business insights by accumulating large amounts of data, in the format chosen by each workload, and then make it easy to process using big data analytics, cross-workload analysis, reporting, research, and even some forms of transactional workloads.

The movement toward the implementation of data lakes is at the intersection of several trends. One is a move by cloud service providers who are seeking to innovate and provide new storage products.

Another trend sees enterprises experiencing fundamental shifts in the sources of their data and how they are using it. Data now comes from many types of end user-focused devices and systems, while still being generated and processed by traditional systems. Efforts are underway to combine all of this structured and unstructured data, regardless of its form or original intent, making it easier to join with other systems of record. That’s where data lakes come in.

Read the complete article at Why Companies Are Jumping Into Data Lakes


An extract from an article by Michael Lappin via Nuix

Finding Structure for Your Unstructured Data Using Data Lakes

Why Fill The Lake?

Generally, we’ve seen a mix of proactive and reactive drivers pushing companies toward creating and filling a data lake.

  • Ongoing eDiscovery: The most popular driver we see with companies is frustration with slowness or lack of accuracy completing iterative eDiscovery tasks. These tasks include searching and producing old data for custodians on legal hold.
  • Migration or Extraction from Legacy Email Archives: Large email archives are very common and often unmanageable. Many folks believe you need to extract the data—or at least the part of it that makes sense (by a custodian or by date)—index it, and prepare it for discovery, governance, or migration to a new platform like Microsoft Office 365.
  • Legal Hold Management: Legal hold management is linked to the previous drivers and it often seems to take the form of removing hundreds or even thousands of old holds and reducing them to a reasonable, manageable number.
  • Data Privacy and Information Governance: Recent regulations around the world have led to a new interest in information governance. The most publicized of these, the European Union’s General Data Protection Regulation (GDPR), requires companies to answer data subjects’ access requests and to delete personal information upon request under its ‘right to be forgotten’ provisions. Along with this, the California Consumer Privacy Act (CCPA) has introduced similar protections in the US that are likely to spread to other states.
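The "extract the part that makes sense, by a custodian or by date" step in the drivers above can be sketched as a simple selection over message metadata. The archive entries, field names, and `select_for_discovery` helper here are hypothetical; a real email archive would expose this metadata through its own export tooling or API.

```python
from datetime import date

# Hypothetical archive entries standing in for exported message metadata.
archive = [
    {"custodian": "j.smith", "sent": date(2016, 3, 1), "subject": "Q1 forecast"},
    {"custodian": "j.smith", "sent": date(2019, 6, 12), "subject": "Renewal terms"},
    {"custodian": "a.jones", "sent": date(2018, 1, 5), "subject": "Lunch"},
]

def select_for_discovery(messages, custodian, start, end):
    """Keep only one custodian's messages within a date range,
    ready to be indexed for discovery, governance, or migration."""
    return [
        m for m in messages
        if m["custodian"] == custodian and start <= m["sent"] <= end
    ]

hits = select_for_discovery(archive, "j.smith", date(2018, 1, 1), date(2020, 1, 1))
print([m["subject"] for m in hits])  # ['Renewal terms']
```

Scoping the extraction this way, before indexing, is what keeps the resulting lake from inheriting the unmanageability of the archive it came from.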

Read the complete article at Finding Structure for Your Unstructured Data Using Data Lakes


Additional Reading

Source: ComplexDiscovery

A Matter of Pricing? A Running Update of Semi-Annual eDiscovery Pricing Survey Responses

First administered in December of 2018 and conducted four times during the last two years with 334 individual responses, the semi-annual eDiscovery Pricing Survey highlights pricing on selected collection, processing, and review tasks. The aggregate results of all surveys as shared in the provided comparative charts may be helpful for understanding pricing and its impact on purchasing behavior on selected services over time.


