Editor’s Note: Data lakes provide an architectural approach for storing high-volume, high-velocity, and high-variety data. The approach is of growing interest to business, information technology, and legal professionals, who face increasing volumes and types of data and must interrogate, identify, and index that data so it can be analyzed for business insight, compliance obligations, and litigation requirements. This post compiles extracts from informational articles that may help those seeking to learn more about the benefits of data lakes and their potential in data discovery and legal discovery.
An extract from an article by Jennifer Zaino via BizTech
Data Lakes Prove Key to Modern Data Platforms
What Is a Data Lake?
Data lakes store data of any type in its raw form, much as a real lake provides a habitat where all types of creatures can live together.
A data lake is an architecture for storing high-volume, high-velocity, high-variety, as-is data in a centralized repository for Big Data and real-time analytics. And the technology is an attention-getter: The global data lakes market is expected to grow at a rate of 28 percent between 2017 and 2023.
Companies can pull in vast amounts of data — structured, semistructured and unstructured — in real time into a data lake, from anywhere. Data can be ingested from Internet of Things sensors, clickstream activity on a website, log files, social media feeds, videos and online transaction processing (OLTP) systems, for instance. There are no constraints on where the data hails from, but it’s a good idea to use metadata tagging to add some level of organization to what’s ingested, so that relevant data can be surfaced for queries and analysis.
“To ensure that a lake doesn’t become a swamp, it’s very helpful to provide a catalog that makes data visible and accessible to the business, as well as to IT and data management professionals,” says Doug Henschen, vice president and principal analyst at Constellation Research.
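As a rough illustration of the metadata tagging and catalog idea described above, the following Python sketch registers raw files with a source and a set of tags, then surfaces them by tag. The `Catalog` and `CatalogEntry` classes, the file paths, and the tag names are all hypothetical and invented for this example; real data catalogs are far richer.

```python
from dataclasses import dataclass


@dataclass
class CatalogEntry:
    """One raw object in the lake plus the metadata that describes it."""
    path: str    # where the raw file lives (hypothetical path)
    source: str  # the system that produced it
    tags: set    # free-form labels used to find it later


class Catalog:
    """A toy in-memory catalog; an assumption for illustration only."""

    def __init__(self):
        self._entries = []

    def register(self, path, source, tags):
        # The data itself is stored as-is; only the metadata is recorded here.
        self._entries.append(CatalogEntry(path, source, set(tags)))

    def find(self, tag):
        # Surface every raw object carrying the requested tag.
        return [entry.path for entry in self._entries if tag in entry.tags]


catalog = Catalog()
catalog.register("raw/iot/2019-10-01.json", "iot-sensors", ["telemetry", "json"])
catalog.register("raw/web/clicks.log", "website", ["clickstream", "log"])
catalog.register("raw/oltp/orders.csv", "oltp", ["orders", "csv"])

print(catalog.find("clickstream"))  # ['raw/web/clicks.log']
```

Without the tags, the three files would be indistinguishable blobs; with them, a query can locate relevant data without inspecting file contents.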
Data Lakes vs. Data Warehouses
Data lakes should not be confused with data warehouses. Where data lakes store raw data, warehouses store current and historical data in an organized fashion.
IT teams and data engineers should think of a data warehouse as a highly structured environment, where racks and containers are clearly labeled and similar items are stacked together for supply chain efficiency.
The difference between a data lake and a data warehouse primarily pertains to analytics.
Data warehouses are best for analyzing structured data quickly and with great accuracy and transparency for managerial or regulatory purposes. Meanwhile, data lakes are primed for experimentation, explains Kelle O’Neal, founder and CEO of management consulting firm First San Francisco Partners.
With a data lake, businesses can quickly load a variety of data types from multiple sources and engage in ad hoc analysis. Or, a data team could leverage machine learning in a data lake to find “a needle in a haystack,” O’Neal says.
“The rapid inclusion of new data sets would never be possible in a traditional data warehouse, with its data model–specific structures and its constraints on adding new sources or targets,” O’Neal says.
Data warehouses follow a “schema on write” approach, which entails defining a schema for data before being able to write it to the database. Online analytical processing (OLAP) technology can be used to analyze and evaluate data in a warehouse, enabling fast responses to complex analytical queries.
Data lakes take a “schema on read” approach, where the data is structured and transformed only when it is ready to be used. For this reason, it’s a snap to bring in new data sources, and users don’t have to know in advance the questions they want to answer. With lakes, “different types of analytics on your data — like SQL queries, Big Data analytics, full-text search, real-time analytics and machine learning — can be used to uncover insights,” according to Amazon. Moreover, data lakes are capable of real-time actions based on algorithm-driven analytics.
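The contrast between “schema on write” and “schema on read” can be sketched in a few lines of Python. This is a simplified illustration, not how any particular warehouse or lake product works: the `WAREHOUSE_SCHEMA` dictionary, the function names, and the sample records are all assumptions made up for the example.

```python
import json

# Schema on write: the structure is fixed before any data is stored.
WAREHOUSE_SCHEMA = {"order_id": int, "amount": float}  # hypothetical schema


def write_to_warehouse(table, record):
    """Reject any record that does not match the predefined schema."""
    if set(record) != set(WAREHOUSE_SCHEMA):
        raise ValueError("record fields do not match the schema")
    for name, expected_type in WAREHOUSE_SCHEMA.items():
        if not isinstance(record[name], expected_type):
            raise ValueError(f"field {name!r} has the wrong type")
    table.append(record)


# Schema on read: raw data is stored untouched and interpreted at query time.
def write_to_lake(lake, raw_line):
    """Accept anything; no validation happens at write time."""
    lake.append(raw_line)


def read_amounts(lake):
    """Apply structure only now, keeping lines that parse and carry an amount."""
    amounts = []
    for line in lake:
        try:
            doc = json.loads(line)
        except json.JSONDecodeError:
            continue  # not JSON; irrelevant to this particular question
        if isinstance(doc, dict) and "amount" in doc:
            amounts.append(float(doc["amount"]))
    return amounts


warehouse = []
write_to_warehouse(warehouse, {"order_id": 1, "amount": 9.5})

lake = []
write_to_lake(lake, '{"order_id": 1, "amount": 9.5}')
write_to_lake(lake, '{"sensor": "t1", "temp": 21.4}')
write_to_lake(lake, "free-text log line")
write_to_lake(lake, '{"order_id": 2, "amount": "20", "coupon": "FALL"}')

print(read_amounts(lake))  # [9.5, 20.0]
```

The last lake record would be rejected by `write_to_warehouse` (extra field, string amount), yet the lake accepts it and the reader still extracts a usable value. The trade-off is that every reader must handle messy input itself.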
Businesses may use both data lakes and data warehouses. The decision about which to use turns on “understanding and optimizing what the different solutions do best,” O’Neal says.
An extract from an article by Bernard Marr via Forbes
What Is A Data Lake? A Super-Simple Explanation For Anyone
Some mistakenly believe that a data lake is just the 2.0 version of a data warehouse. While they are similar, they are different tools that should be used for different purposes. James Dixon, the CTO of Pentaho, is credited with naming the concept of a data lake. He uses the following analogy:
“If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”
A data lake holds data in an unstructured way, and there is no hierarchy or organization among the individual pieces of data. It holds data in its rawest form: it’s not processed or analyzed. Additionally, a data lake accepts and retains all data from all data sources and supports all data types; schemas (the way the data is stored in a database) are applied only when the data is ready to be used.
An extract from an article by Lance Weaver via Equinix
Why Companies Are Jumping Into Data Lakes
Data lakes are a developing entity, and the industry hasn’t coalesced around a single, universally accepted definition. A consensus definition, derived from the consultation of several different sources, follows: “A data lake is a storage mechanism designed to facilitate the colocation and use of many different types of data, including data that is date-defined using various schemata, structural frameworks, blobs, and other files.”
The hope is that a data lake will make it possible for an enterprise to gain new business insights by accumulating large amounts of data, in the format chosen by each workload, and then make it easy to process using big data analytics, cross-workload analysis, reporting, research, and even some forms of transactional workloads.
The movement toward the implementation of data lakes is at the intersection of several trends. One is a move by cloud service providers who are seeking to innovate and provide new storage products.
Another trend sees enterprises experiencing fundamental shifts in the sources of their data and how they are using it. The data is now coming from many types of end user-focused devices and systems and is still being generated and processed by traditional systems. Efforts are underway to combine all of this structured and unstructured data, regardless of its form or original intent, making it easier to join with other systems of record. That’s where data lakes come in.
An extract from an article by Michael Lappin via Nuix
Finding Structure for Your Unstructured Data Using Data Lakes
Why Fill The Lake?
Generally, we’ve seen a mix of proactive and reactive drivers pushing companies toward creating and filling a data lake.
- Ongoing eDiscovery: The most common driver we see among companies is frustration with slowness or inaccuracy in completing iterative eDiscovery tasks. These tasks include searching and producing old data for custodians on legal hold.
- Migration or Extraction from Legacy Email Archives: Large email archives are very common and often unmanageable. Many folks believe you need to extract the data, or at least the part of it that makes sense (by custodian or by date), index it, and prepare it for discovery, governance, or migration to a new platform like Microsoft Office 365.
- Legal Hold Management: Legal hold management is linked to the previous drivers and it often seems to take the form of removing hundreds or even thousands of old holds and reducing them to a reasonable, manageable number.
- Data Privacy and Information Governance: Recent regulations around the world have led to a new interest in information governance. The most publicized of these, the European Union’s General Data Protection Regulation (GDPR), contains measures requiring companies to answer data subjects’ access requests and to delete their information upon request under its ‘right to be forgotten’ provisions. Along with this, the California Consumer Privacy Act (CCPA) has introduced similar protections in the US that are likely to spread to other states.