In our Data Lake vs Data Warehouse blog, we explored the differences between two of the leading data management solutions for enterprises over the last decade. We highlighted the key capabilities of data lakes and data warehouses with real examples of enterprises using both solutions to support data analytics use cases in their daily operations. Ultimately, we showed that data warehouses and data lakes each have their own unique advantages and use cases, and that they can coexist in the enterprise tech stack.
Now, with enterprises seeking to maximize the value of their data and decrease time-to-insights, we’re seeing the emergence of a new data management architecture that’s picking up steam in the world of big data: the data lakehouse.
Image Source: Google Trends
Google Trends data shows that searchers around the world are curious about data lakehouse solutions.
In this week’s blog post, we’re taking a close look at the gap between data lakehouse ideals and execution in today’s market. We explore the key features that a data lakehouse should have in 2021, the crucial pain points that need to be addressed, and where modern data lakehouse solutions are missing the mark. Finally, we’ll weigh in on whether you should invest in a data lakehouse, or go a different route to future-proof your enterprise data analytics strategy.
What is a Data Lakehouse?
A data lakehouse is a unified platform for data storage, management, and analytics that theoretically combines the best features of a data lake and a data warehouse, while addressing and removing the pain points that characterize each of these solutions on its own.
Data lakehouses should give organizations the flexibility to store data at the lowest cost, selectively structure data when it makes sense, and run queries against structured, semi-structured, and unstructured data alike.
The data lakehouse should make it faster, easier, and cheaper for enterprises to extract value from their data by delivering analytics on the full dataset in the lake, without end users needing to build data pipelines to move, transform or apply structure to the data, and without dealing with database provisioning, configuration, or other administrative tasks that cause headaches along the way.
Let’s quickly refresh our understanding of data lakes, data warehouses, and their key capabilities, to highlight some of the design elements we’d expect to see in a data lakehouse solution that delivers the best of both worlds: data lake storage philosophy, plus data warehouse functionality.
Data Warehouse Functionality
Data warehouse systems emerged in the late 1980s to help organizations start leveraging operational data to support business decision-making. A data warehouse is a data management system where organizations can store cleansed and structured data from operational systems to support business intelligence applications. The data warehouse was purpose-built to solve specific use cases, typically requiring high performance and high concurrency.
In a data warehouse architecture, data is extracted from operational systems, cleaned, transformed, structured in a staging area, and eventually loaded into the data warehouse itself. This time-consuming and costly process is known as ETL (extract-transform-load).
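To make the ETL steps concrete, here is a minimal Python sketch. It is an illustration under assumptions, not any vendor's pipeline: the `RAW_ORDERS` sample rows, the `orders` table, and the function names are all hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Hypothetical rows as they might arrive from an operational system.
RAW_ORDERS = [
    {"order_id": "1001", "customer": " Alice ", "amount": "19.99"},
    {"order_id": "1002", "customer": "BOB", "amount": "n/a"},  # unparseable amount
    {"order_id": "1003", "customer": "carol", "amount": "5.00"},
]


def extract():
    """Extract: read records from the source system (here, an in-memory list)."""
    return list(RAW_ORDERS)


def transform(rows):
    """Transform: clean values, enforce types, and drop rows that fail validation."""
    cleaned = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # discard rows whose amount cannot be parsed
        cleaned.append((int(row["order_id"]), row["customer"].strip().title(), amount))
    return cleaned


def load(conn, rows):
    """Load: write the structured rows into a warehouse table with a fixed schema."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl():
    """Run the three stages in order against an in-memory stand-in warehouse."""
    conn = sqlite3.connect(":memory:")
    load(conn, transform(extract()))
    return conn


if __name__ == "__main__":
    warehouse = run_etl()
    for row in warehouse.execute("SELECT customer, amount FROM orders ORDER BY order_id"):
        print(row)
```

Note that the row with the unparseable amount never reaches the warehouse. That is the trade-off of transforming before loading: the table is clean, but anything the transform step rejects is lost to later analysis.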
Once the structured data reaches the warehouse, it can be queried and analyzed using business intelligence (BI) tools. In some cases, subsets of data are loaded into data marts that cater to the BI needs of specific departments within the organization. Traditional data warehousing solutions were built to support relational queries in SQL on structured data.
Image Source: Wikipedia
The basic architecture of a data warehouse
The biggest drawbacks of the data warehouse? It requires a rigid design and structured data, and it can take years before it’s in production and delivering value.
But there are key functions of the data warehouse that the data lakehouse aims to replicate, in order to deliver the analytic capabilities that businesses today depend on:
- Data Discovery — A set of capabilities for discovering, classifying, labeling, and reporting on data
- Metadata Management — Metadata acts as an index or table of contents for data
- Security and Access Controls — Role-based access controls (RBAC), encryption, and other measures for managing access and securing sensitive data
- Data Governance — Capabilities for establishing and enforcing data retention policies, recording and visualizing data lineages, and auditing the security of sensitive data
- SQL Queries — The predominant query language for storing, manipulating, and retrieving data stored in a relational database
- Batch and Stream Data Processing — The ability to process large volumes of data at one time, as well as analyze streaming data in near real time
- Support for BI Tools — Connecting to downstream business intelligence tools that analysts may use to visualize data and extract insights
Data Lake Storage
A data lake is a centralized storage repository where data from multiple tools, and in multiple formats, is stored in its raw or original structure. The concept of a data lake emerged in 2010 to address data warehousing pain points that were beginning to surface as organizations experienced digital transformation and big data growth:
- High data storage costs that ballooned as data volumes grew. Organizations needed a more cost-effective storage repository for big data in multiple formats.
- Discarded data stifles innovation. As data volumes grew, high storage costs and time-consuming ETL meant that organizations only warehoused the most interesting data and discarded the rest. This practice stifled innovation and limited the discovery of novel use cases.
- Pre-processing limits data utility. Organizations needed a new kind of storage repository for data in raw, unstructured format, that could be queried in novel ways.
- Building data pipelines adds cost and complexity. IT teams needed to provision resources and configure data pipelines to move and transform data prior to analysis. As data volumes grew, this process became even more complex, time-consuming, and expensive. Organizations needed a way to extract insights from their data at scale without the ongoing pains of provisioning and configuring databases and pipelines.
To address these pain points, James Dixon imagined data lakes as a new data management system that could easily ingest data in raw formats, support storage and querying on multiple data models and by multiple engines, and make it easier for all members of an organization to access and experiment with data.
Image Source: Dremio
Data lake architecture featuring Azure Data Lake Storage.
In reality, organizations that implemented data lakes often found themselves stuck with a swamp — it’s easy to get data into it, but because it serves as a dumping ground for such diverse datasets, it becomes hard to give the right access and analytic tools to the right users so they can get value out.
Yet again, there are benefits from the data lake concept that data lakehouses aim to learn from: namely, leveraging inexpensive cloud object storage and incorporating a multi-model or multi-engine approach.
A data lakehouse should reflect the following elements of the data lake storage philosophy:
- Unified, centralized data storage for all kinds of enterprise data
- Support for multiple data types — structured, unstructured, and semi-structured
- Easy raw data ingestion — using a schema-on-read approach to apply schema at query time (instead of before ingestion) makes it quick and easy to ingest raw data with no pre-processing and no ETL
- Cost-effective data storage — deployed in cloud storage like Amazon S3 or Google Cloud Storage
- Decoupled storage and compute — such that data storage and querying can scale separately
- Support for multiple query types — including Full Text Search, SQL, NoSQL, and Machine Learning, applied directly to data in storage
- Democratized data access — making it easy for anyone in an organization to query data and create visualizations using their preferred tools
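The schema-on-read idea from the list above can be sketched in a few lines of Python. This is only an illustration: the sample log lines, the `ACCESS_LOG_SCHEMA` mapping, and the function names are assumptions made for the example, not part of any specific product.

```python
import json

# Raw events are stored exactly as they arrive; no schema is enforced on write.
RAW_LINES = [
    '{"ts": "2021-06-01T12:00:00Z", "status": "200", "path": "/home"}',
    '{"ts": "2021-06-01T12:00:05Z", "status": "500", "path": "/checkout", "error": "timeout"}',
    "not json at all",
]

# A schema here is a mapping of field name -> type cast, chosen by the reader.
ACCESS_LOG_SCHEMA = {"ts": str, "status": int, "path": str}


def read_with_schema(lines, schema):
    """Apply a schema at query time, yielding only the records that fit it."""
    for line in lines:
        try:
            record = json.loads(line)
            yield {field: cast(record[field]) for field, cast in schema.items()}
        except (json.JSONDecodeError, KeyError, TypeError, ValueError):
            continue  # the raw line stays in storage; it just doesn't match this schema


def count_server_errors(lines):
    """Example query: count records whose HTTP status is 500 or above."""
    return sum(1 for r in read_with_schema(lines, ACCESS_LOG_SCHEMA) if r["status"] >= 500)
```

Because the raw lines are never rewritten, a different reader can later apply a different schema (for example `{"error": str}`) to the same stored data without re-ingesting anything.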
Data Lakehouse: The Best of Both Worlds?
A data lakehouse should act as a centralized, cloud-native data store, ingesting all types of data in raw formats. Users should be able to query the data directly in the lakehouse repository with no ETL process and no data movement. Access should be democratized, with support for multiple query types (Full Text Search, SQL, NoSQL, and Machine Learning) and multiple front-end consumption tools (e.g., Tableau, Kibana, Looker, Python, TensorFlow). Batch and stream data processing should both be supported.
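As a rough picture of what "multiple query types" over one dataset means, the sketch below exposes the same hypothetical log records through both a SQL interface and a simple full-text search. Everything here is assumed for illustration: the `RECORDS` sample, the table layout, and the toy inverted index. A real lakehouse would aim to serve both access paths over the stored objects themselves, whereas this example builds small in-memory copies purely to stay self-contained.

```python
import sqlite3
from collections import defaultdict

# Hypothetical log records sitting in one shared store.
RECORDS = [
    {"id": 1, "service": "checkout", "latency_ms": 120, "message": "payment accepted"},
    {"id": 2, "service": "checkout", "latency_ms": 950, "message": "payment gateway timeout"},
    {"id": 3, "service": "search", "latency_ms": 40, "message": "query served from cache"},
]


def build_sql_view(records):
    """Relational access: expose the records as a table for SQL queries."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE logs (id INTEGER, service TEXT, latency_ms INTEGER, message TEXT)"
    )
    conn.executemany(
        "INSERT INTO logs VALUES (:id, :service, :latency_ms, :message)", records
    )
    return conn


def build_text_index(records):
    """Search access: a tiny inverted index from word -> set of record ids."""
    index = defaultdict(set)
    for rec in records:
        for word in rec["message"].lower().split():
            index[word].add(rec["id"])
    return index


def search(index, *terms):
    """Return the ids of records whose message contains every term."""
    hits = [index.get(term.lower(), set()) for term in terms]
    return sorted(set.intersection(*hits)) if hits else []
```

An analyst could ask the SQL view for average latency per service, while an engineer searches the same records for "payment timeout". The point is that neither user should have to move or reshape the data first.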
The lakehouse should also have data discovery, metadata management, and data governance capabilities. It should support access controls and security rules to maintain data integrity and security.
When all these features come together, enterprises should get the best of both worlds: data warehouse functionality with data lake storage characteristics — but how often does this really happen?
Today’s Data Lakehouse Solutions: The Devil’s in the Details
Look at the landscape of data lakehouse solutions available today and you’ll find a series of data platforms that deliver plenty of house, and not enough lake when it comes to managing enterprise data. They offer strong warehouse functionality but don’t really follow the data lake philosophy. In most cases, it’s because these platforms started as data warehousing solutions and then added support for data lakes later on.
They tell a big story, but in practice, data lakehouses often struggle when it comes to streamlining the data pipelining and ETL processes required to get data in the right place and/or format for analytics. The complexity of getting them up and running at scale can be a deterrent for many organizations.
Let’s take a closer look at three examples of data lakehouse platforms.
Databricks
Founded by the original creators of the open source Apache Spark project, Databricks offers a managed Apache Spark service that’s marketed as a data lakehouse platform.
The Databricks lakehouse architecture consists of a public cloud storage repository (data lake), an integrated storage layer with ACID transaction support (Delta Lake), and an analytics engine (Delta Engine) that supports business intelligence, data science, and machine learning use cases.
Image Source: Towards Data Science
Databricks lakehouse platform architecture.
Databricks offers most of the data warehousing functionalities we might expect to see in a data lakehouse platform, with support for metadata management, batch and stream data processing for multi-structured datasets, data discovery, secure access controls, and SQL analytics.
In terms of delivering on the key elements of the data lake storage philosophy, Databricks recently introduced their Auto Loader, which automates ETL and data ingestion and uses data sampling to infer schema for several data types. Alternatively, users can use Delta Live Tables to construct ETL pipelines between their public cloud data lake and Delta Lake.
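To give a feel for what "using data sampling to infer schema" can mean in general, here is a small Python sketch. It is explicitly not the Databricks Auto Loader API or its algorithm. The function names, the type labels (`long`, `double`, `string`, `boolean`), and the sampling strategy are all assumptions made for this example.

```python
import json
import random


def infer_type(values):
    """Pick the narrowest type label that fits all sampled values for one field."""
    kinds = {type(v) for v in values if v is not None}
    if not kinds:
        return "string"
    if kinds <= {bool}:
        return "boolean"
    if kinds <= {int}:
        return "long"
    if kinds <= {int, float}:
        return "double"
    return "string"  # fall back to string for mixed or nested values


def infer_schema(json_lines, sample_size=100, seed=0):
    """Infer a field -> type mapping from a random sample of JSON lines."""
    lines = list(json_lines)
    rng = random.Random(seed)
    sample = lines if len(lines) <= sample_size else rng.sample(lines, sample_size)
    observed = {}
    for line in sample:
        for field, value in json.loads(line).items():
            observed.setdefault(field, []).append(value)
    return {field: infer_type(values) for field, values in sorted(observed.items())}
```

Sampling keeps inference cheap on large inputs, but it also means a field or type that appears only outside the sample can be missed, which is one reason inferred schemas usually still need review by an engineer.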
Databricks checks all of the boxes on paper, but building out the complete data lakehouse takes a lot of manual work from expert engineers who have to set up the solution and build its data pipelines. Further, the solution becomes increasingly complex at scale. It’s not as simple as it sounds.
Snowflake
Snowflake started out as a data warehouse solution built on top of cloud infrastructure. The platform consists of a centralized storage repository that sits on top of AWS, Microsoft Azure, or Google Cloud Platform (GCP) public cloud storage. Next, there’s a multi-cluster compute layer where users can spin up a virtual data warehouse and run SQL queries against their data storage. This architecture supports the decoupling of storage and compute resources, so enterprises can scale the two separately as needed.
Finally, Snowflake delivers a service layer that includes metadata cataloging, resource management, data governance, transactions, and more.
Image Source: Snowflake
Architectural overview of the Snowflake cloud data platform
The platform does a great job at delivering data warehouse functionality, including BI tool integrations, metadata management, access controls, and SQL queries.
But Snowflake is limited to a single relational query engine based on SQL. This makes it easy to manage but less flexible, and doesn’t fulfill the multi-model data lake vision.
And Snowflake requires enterprises to load data from their cloud storage into a centralized storage layer before it can be queried or analyzed. The data pipelining process is manual, requiring upfront ETL, provisioning, and structuring of the data before it can be analyzed. These manual processes are exacerbated at scale. Snowflake’s data lakehouse is another solution that fits the bill on paper but, in practice, contradicts the data lake philosophy of easy data ingestion.
Azure Synapse Analytics
Azure Synapse Analytics is a unified data management platform that integrates existing cloud services to deliver limitless data warehousing and big data analytics.
Though it has not been marketed with the term “data lakehouse”, Azure Synapse Analytics offers many key features of a data lakehouse solution. Users benefit from a low-cost cloud data repository, support for batch processing and data streaming, and multiple query types (SQL, NoSQL, and Machine Learning) that execute against data stored in the storage layer.
Image Source: James Serra
Azure Synapse Analytics architecture.
With Azure Synapse Analytics, there’s no centralized storage repository for big data. Instead, users can run federated queries on multiple Azure data stores, including Cosmos DB, ADLS Gen2, Spark tables, and relational databases.
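The general idea of a federated query, one statement that spans more than one data store, can be sketched with Python's built-in `sqlite3` module. This is only an analogy and says nothing about how Azure Synapse executes queries: the two in-memory databases, the `crm` alias, and the table names are all hypothetical.

```python
import sqlite3


def open_federated_connection():
    """Create two separate in-memory databases reachable from one connection."""
    conn = sqlite3.connect(":memory:")  # first store, addressed as "main"
    conn.execute("ATTACH DATABASE ':memory:' AS crm")  # second, independent store

    conn.execute("CREATE TABLE main.orders (customer_id INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO main.orders VALUES (?, ?)", [(1, 20.0), (1, 5.0), (2, 7.5)]
    )

    conn.execute("CREATE TABLE crm.customers (customer_id INTEGER, name TEXT)")
    conn.executemany(
        "INSERT INTO crm.customers VALUES (?, ?)", [(1, "Alice"), (2, "Bob")]
    )
    return conn


def revenue_by_customer(conn):
    """One SQL statement that joins tables living in different stores."""
    return conn.execute(
        "SELECT c.name, SUM(o.amount) "
        "FROM main.orders AS o JOIN crm.customers AS c USING (customer_id) "
        "GROUP BY c.name ORDER BY c.name"
    ).fetchall()
```

The appeal is that the data stays where it lives. The cost, in any federated design, is that the query engine has to reconcile stores with different formats and performance characteristics at query time.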
Azure Synapse also requires dedicated SQL pools to create a persistent SQL database for warehousing. This results in tightly coupled storage and compute, a feature that contradicts the data lake philosophy of decoupled storage and compute.
Finally, SQL queries in Azure Synapse Analytics work by pushing data in real time from the storage layer into relational tables with columnar storage, which supports high performance SQL queries at scale. But, again, the manual work associated with creating data pipelines to get the data structured for analysis becomes painful and complex, especially at scale.
The need to move and duplicate data before running analytics goes against data lake storage philosophy and increases time, cost, and complexity for users.
Do You Need a Data Lakehouse?
The emergence of data lakehouse solutions reflects a wider trend in big data: the integration of data storage and analytics in unified data platforms that help enterprises maximize the value of their data while minimizing the time, cost, and complexity of extracting that value.
So, does that mean that you should build or invest in a data lakehouse for your organization? Not necessarily.
Platforms like Databricks, Snowflake, and Azure Synapse Analytics have all been associated with the data lakehouse concept, yet they offer different capabilities and show a collective tendency to act more like a data warehouse than a true data lake. Enterprises should be skeptical of what it means when a solution is advertised as a “data lakehouse”.
To choose the right data platform that will grow with organizations into the future, enterprises need to look past marketing phrases like “data lakehouse” and instead investigate the capabilities that each platform provides.
Enterprises don’t need a data lakehouse per se — they need cost-performant querying, multi-model analytics, and data governance functionality on top of an accessible data lake.
(Warning, shameless plug): Many enterprises today looking to fulfill the data lake promise are looking to ChaosSearch, which delivers on data lake storage philosophy and architecture while providing the multi-model analytics and data warehousing functionality to make it work. Because ChaosSearch plugs directly into Amazon S3 or Google Cloud Storage and processes all data transformations virtually, not physically, all of the complexities associated with creating pipelines or ETL processes to get data ready for analytics go away.
Time to Future-Proof Your Analytics Stack? Think Time, Cost, and Complexity.
Data lakes represent the ideal paradigm for data storage, but extracting value from data without the right functionality can be more trouble than it’s worth. The conversation about future-proofing your analytics stack ultimately comes back to time, cost, and complexity. Enterprises should carefully compare the features and benefits of a range of cloud-native data solutions before determining the best way forward.
Questions to ask that can guide your data management strategy and platform evaluation:
- How can I maximize storage and analytics capabilities while minimizing time, cost, and complexity?
- Where can I get the best cost savings on the indexing and querying I need to run my business?
- Which solution is the simplest to deploy? Which has the fewest moving parts?
- Which vendors are making it easy to access data lakehouse features and benefits with the least management overhead?
- Who has the best multi-model querying capabilities, with support for relational, non-relational (including search), and ML workloads?
At ChaosSearch, we’re committed to fulfilling the data lake promise with a single unified platform that minimizes the time, cost, and complexity of extracting value from your data lake.