Data Lakes: An Important Technological Approach for Data and Legal Discovery



Editor’s Note: Data lakes provide an architectural approach for storing high-volume, high-velocity, and high-variety data. This storage approach is of increasing interest to business, information technology, and legal professionals as they contend with growing volumes and types of data and with the challenge of interrogating, identifying, and indexing that data so it can be analyzed to uncover insight for business benefit, compliance obligations, and litigation requirements. This post compiles extracts from informational articles that may be helpful for those seeking to learn more about the benefits of data lakes and their potential in the spheres of data discovery and legal discovery.

An extract from an article by Jennifer Zaino via BizTech

Data Lakes Prove Key to Modern Data Platforms

What Is a Data Lake? 

Data lakes store data of any type in its raw form, much as a real lake provides a habitat where all types of creatures can live together.

A data lake is an architecture for storing high-volume, high-velocity, high-variety, as-is data in a centralized repository for Big Data and real-time analytics. And the technology is an attention-getter: The global data lakes market is expected to grow at a rate of 28 percent between 2017 and 2023.

Companies can pull in vast amounts of data — structured, semistructured and unstructured — in real time into a data lake, from anywhere. Data can be ingested from Internet of Things sensors, clickstream activity on a website, log files, social media feeds, videos and online transaction processing (OLTP) systems, for instance. There are no constraints on where the data hails from, but it’s a good idea to use metadata tagging to add some level of organization to what’s ingested, so that relevant data can be surfaced for queries and analysis.

“To ensure that a lake doesn’t become a swamp, it’s very helpful to provide a catalog that makes data visible and accessible to the business, as well as to IT and data management professionals,” says Doug Henschen, vice president and principal analyst at Constellation Research.
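The metadata tagging and cataloging described above can be sketched in a few lines. This is a minimal, illustrative example (the class, paths, and tag names are all hypothetical, not drawn from any cited vendor's product): each raw object landed in the lake is registered with tags so relevant data can later be surfaced for queries.

```python
class LakeCatalog:
    """Maps raw-object paths to metadata tags (source, type, ingest date)."""

    def __init__(self):
        self.entries = []

    def register(self, path, **tags):
        # Tag at ingest time; the raw object itself is stored unchanged.
        self.entries.append({"path": path, "tags": tags})

    def find(self, **criteria):
        # Surface objects whose tags match every criterion.
        return [
            e["path"] for e in self.entries
            if all(e["tags"].get(k) == v for k, v in criteria.items())
        ]

catalog = LakeCatalog()
catalog.register("s3://lake/raw/clickstream/2021-07-01.json",
                 source="web", kind="clickstream", ingested="2021-07-01")
catalog.register("s3://lake/raw/iot/sensor-42.avro",
                 source="iot", kind="telemetry", ingested="2021-07-01")

print(catalog.find(source="iot"))  # paths of IoT-tagged objects
```

Without this kind of visibility layer, ingested objects accumulate with no way to answer "what do we have about X?", which is how a lake becomes the swamp Henschen warns about.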

Data Lakes vs. Data Warehouses

Data lakes should not be confused with data warehouses. Where data lakes store raw data, warehouses store current and historical data in an organized fashion.

IT teams and data engineers should think of a data warehouse as a highly structured environment, where racks and containers are clearly labeled and similar items are stacked together for supply chain efficiency.

The difference between a data lake and a data warehouse primarily pertains to analytics.

Data warehouses are best for analyzing structured data quickly and with great accuracy and transparency for managerial or regulatory purposes. Meanwhile, data lakes are primed for experimentation, explains Kelle O’Neal, founder and CEO of management consulting firm First San Francisco Partners.

With a data lake, businesses can quickly load a variety of data types from multiple sources and engage in ad hoc analysis. Or, a data team could leverage machine learning in a data lake to find “a needle in a haystack,” O’Neal says.

“The rapid inclusion of new data sets would never be possible in a traditional data warehouse, with its data model–specific structures and its constraints on adding new sources or targets,” O’Neal says.

Data warehouses follow a “schema on write” approach, which entails defining a schema for data before being able to write it to the database. Online analytical processing (OLAP) technology can be used to analyze and evaluate data in a warehouse, enabling fast responses to complex analytical queries.
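The "schema on write" idea can be made concrete with a small sketch (using SQLite purely for illustration; the table and column names are invented): the schema must be declared before any data is written, and every write is validated against it at write time.

```python
import sqlite3

# Schema-on-write: the table's schema is declared up front...
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sales (
        order_id   INTEGER PRIMARY KEY,
        amount     REAL    NOT NULL,
        order_date TEXT    NOT NULL
    )
""")

# ...a well-formed row is accepted...
conn.execute("INSERT INTO sales VALUES (1, 19.99, '2021-07-01')")

# ...and a row that violates the declared schema is rejected at write time.
try:
    conn.execute("INSERT INTO sales (order_id) VALUES (2)")
except sqlite3.IntegrityError as e:
    print("rejected:", e)

# The rigid structure is what makes fast, accurate analytical queries possible.
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(total)  # 19.99
```

The trade-off is exactly the one O'Neal describes: adding a new source means changing the declared schema first.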

Data lakes take a “schema on read” approach, where the data is structured and transformed only when it is ready to be used. For this reason, it’s a snap to bring in new data sources, and users don’t have to know in advance the questions they want to answer. With lakes, “different types of analytics on your data — like SQL queries, Big Data analytics, full-text search, real-time analytics and machine learning — can be used to uncover insights,” according to Amazon. Moreover, data lakes are capable of real-time actions based on algorithm-driven analytics.
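By contrast, "schema on read" can be sketched as follows (an illustrative toy, with an in-memory buffer standing in for lake storage and invented record fields): heterogeneous records land raw with no upfront schema, and structure is imposed only when a question is asked of the data.

```python
import io
import json

# Ingest: heterogeneous records are written as-is, no schema declared.
raw_lake = io.StringIO()
for record in (
    {"user": "a", "event": "click", "url": "/home"},   # clickstream
    {"sensor": 42, "temp_c": 21.5},                    # IoT reading
    {"user": "b", "event": "click", "url": "/pricing"},
):
    raw_lake.write(json.dumps(record) + "\n")

# Read: only now is a schema applied, projecting just the fields this
# analysis needs; records that don't fit the reader's schema are skipped.
raw_lake.seek(0)
clicks = [
    (r["user"], r["url"])
    for r in map(json.loads, raw_lake)
    if r.get("event") == "click"
]
print(clicks)  # [('a', '/home'), ('b', '/pricing')]
```

Because no schema gates ingestion, the IoT reading could be added without touching anything, and a future analysis can impose a different schema on the same raw records.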

Businesses may use both data lakes and data warehouses. The decision about which to use turns on “understanding and optimizing what the different solutions do best,” O’Neal says.

Read the complete article at Data Lakes Prove Key to Modern Data Platforms

An extract from an article by Bernard Marr via Forbes

What Is A Data Lake? A Super-Simple Explanation For Anyone

Some mistakenly believe that a data lake is just the 2.0 version of a data warehouse. While they are similar, they are different tools that should be used for different purposes. James Dixon, the CTO of Pentaho, is credited with naming the concept of a data lake. He uses the following analogy:

“If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”

A data lake holds data in an unstructured way, with no hierarchy or organization among the individual pieces of data. It holds data in its rawest form: not yet processed or analyzed. Additionally, a data lake accepts and retains all data from all data sources, supports all data types, and applies schemas (the way the data is stored in a database) only when the data is ready to be used.

Read the complete article at What Is A Data Lake? A Super-Simple Explanation For Anyone

An extract from an article by Lance Weaver via Equinix

Why Companies Are Jumping Into Data Lakes

Data lakes are a developing technology, and the industry hasn’t coalesced around a single, universally accepted definition. A consensus definition, derived from several different sources, follows: “A data lake is a storage mechanism designed to facilitate the colocation and use of many different types of data, including data defined using various schemata, structural frameworks, blobs, and other files.”

The hope is that a data lake will make it possible for an enterprise to gain new business insights by accumulating large amounts of data, in the format chosen by each workload, and then make it easy to process using big data analytics, cross-workload analysis, reporting, research, and even some forms of transactional workloads.

The movement toward the implementation of data lakes is at the intersection of several trends. One is a move by cloud service providers who are seeking to innovate and provide new storage products.

Another trend sees enterprises experiencing fundamental shifts in the sources of their data and in how they use it. Data now comes from many types of end-user-focused devices and systems, while still being generated and processed by traditional systems. Efforts are underway to combine all of this structured and unstructured data, regardless of its form or original intent, and to make it easier to join with other systems of record. That’s where data lakes come in.

Read the complete article at Why Companies Are Jumping Into Data Lakes

An extract from an article by Michael Lappin via Nuix

Finding Structure for Your Unstructured Data Using Data Lakes

Why Fill The Lake?

Generally, we’ve seen a mix of proactive and reactive drivers pushing companies toward creating and filling a data lake.

  • Ongoing eDiscovery: The most popular driver we see with companies is frustration with slowness or lack of accuracy completing iterative eDiscovery tasks. These tasks include searching and producing old data for custodians on legal hold.
  • Migration or Extraction from Legacy Email Archives: Large email archives are very common and often unmanageable. Many organizations believe they need to extract the data, or at least the portion that makes sense (by custodian or by date), then index it and prepare it for discovery, governance, or migration to a new platform such as Microsoft Office 365.
  • Legal Hold Management: Legal hold management is linked to the previous drivers and it often seems to take the form of removing hundreds or even thousands of old holds and reducing them to a reasonable, manageable number.
  • Data Privacy and Information Governance: Recent regulations around the world have led to a new interest in information governance. The most publicized of these, the European Union’s General Data Protection Regulation (GDPR), contains measures requiring companies to answer data subjects’ access requests and to delete personal information upon request under its ‘right to be forgotten’ provisions. Along with this, the California Consumer Privacy Act (CCPA) has introduced similar protections in the US that are likely to spread to other states.
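The interplay of the drivers above can be sketched in a small toy example (every name, field, and record here is hypothetical): once a lake's contents are indexed, a data subject's access request is a lookup, and a 'right to be forgotten' deletion can be honored while still preserving records under an active legal hold.

```python
# Hypothetical index over a lake's contents; fields are illustrative.
lake_index = [
    {"id": 1, "subject": "alice@example.com", "hold": False, "doc": "email"},
    {"id": 2, "subject": "alice@example.com", "hold": True,  "doc": "contract"},
    {"id": 3, "subject": "bob@example.com",   "hold": False, "doc": "invoice"},
]

def subject_access_request(index, subject):
    """Return every indexed record the lake holds about a data subject."""
    return [r for r in index if r["subject"] == subject]

def right_to_be_forgotten(index, subject):
    """Delete a subject's records, except those under an active legal hold."""
    kept, removed = [], []
    for r in index:
        if r["subject"] == subject and not r["hold"]:
            removed.append(r)
        else:
            kept.append(r)
    return kept, removed

print(len(subject_access_request(lake_index, "alice@example.com")))  # 2
lake_index, removed = right_to_be_forgotten(lake_index, "alice@example.com")
print([r["id"] for r in removed])  # [1] -- record 2 stays, it is on hold
```

None of this is feasible against an unindexed archive, which is why indexing the lake is the common thread across the eDiscovery, archive-migration, legal-hold, and privacy drivers.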

Read the complete article at Finding Structure for Your Unstructured Data Using Data Lakes


Source: ComplexDiscovery


