Data Lake Architecture & The Future of Log Analytics
Organizations in 2021 are leveraging log analytics in the cloud for a variety of use cases, including network security and application monitoring, user behavior analysis, forensic network investigation, and supporting compliance with local and industry-specific regulations.
But with enterprise data growing at astronomical rates, organizations are finding it increasingly difficult to efficiently store, secure, and access their log files.
The average company with over 1,000 employees now pulls data from more than 400 sources to support its business intelligence and analytics initiatives, with the top 20% of companies now capturing data from over 1,000 sources (IDG).
That’s a huge amount of log data that needs to be collected, stored, and indexed before it can be analyzed.
As organizations expand their presence in the cloud and generate growing volumes of event logs each day, data lakes are once again being considered by CIOs as an attractive data storage option. Data lake solutions are designed to support cost-effective high-volume data storage and expanded data access within organizations (also known as data democratization), leading to increased data utilization and value creation.
In this blog post, we’re taking a closer look at the role of data lakes and data lake architecture in the future of log analytics. We’ll:
- Explain what data lakes are and describe the four key features that characterize data lake solutions.
- Look at three different approaches to data lake architecture and what makes them effective for storing large volumes of log files for analysis.
- Comment on the future of log analytics and propose a simple data lake architecture that will help organizations maximize the value of their log data.
Let’s dive in!
What is a Data Lake and How is it Different from a Data Warehouse?
The term “data lake” was coined around 2010 by James Dixon, then CTO of Pentaho Corporation.
At this time, organizations involved with big data analytics were utilizing data warehouses for large-scale storage of processed data.
Data marts were also deployed, enabling individual business units to access warehoused data pertaining to their department.
From Dixon’s perspective, data marts were preventing organizations from reaching their full potential for big data utilization.
Data and information were siloed because each department could only access data in its own data mart, while other areas of the data warehouse remained opaque and inaccessible.
Data marts were also accused of stifling innovation because they presented users with structured data derived from raw data - but not the raw data itself.
A lack of access to raw data, Dixon believed, prevented users from transforming the data in alternative ways to extract new insights or develop new use cases.
Dixon’s concept of a data lake is based on the idea of storing data in its raw form and broadening access to break down silos and drive innovation. Here are Dixon’s own words describing how a data lake would differ from a data mart:
“If you think of a Data Mart as a store of bottled water, cleansed and packaged and structured for easy consumption, the Data Lake is a large body of water in a more natural state. The contents of the Data Lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.” (James Dixon)
Today, a data lake is defined as a data storage repository that centralizes, organizes, and protects large amounts of structured, semi-structured, and unstructured data from multiple sources. Unlike data warehouses that follow a schema-on-write approach (data is structured as it enters the warehouse), data lakes follow a schema-on-read approach where data can be structured at query-time based on user needs.
As a result, organizations using a data lake have more flexibility to analyze their data in new ways, extract new insights, and uncover valuable new use cases for enterprise data.
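To make the schema-on-read idea concrete, here is a minimal sketch in Python. The sample log lines, the field names, and the read_with_schema helper are all hypothetical and only illustrate the principle: the raw records are stored once, untouched, and each reader applies its own structure at query time.

```python
import json

# Raw events are kept exactly as they arrived; no schema is imposed at write time.
# (Hypothetical sample data for illustration.)
RAW_LOG_LINES = [
    '{"ts": "2021-03-01T12:00:00Z", "status": 200, "path": "/login", "ms": 41}',
    '{"ts": "2021-03-01T12:00:01Z", "status": 500, "path": "/cart", "ms": 912}',
    '{"ts": "2021-03-01T12:00:02Z", "status": 200, "path": "/cart", "ms": 87}',
]


def read_with_schema(lines, schema):
    """Apply a schema at read time: keep only the requested fields, cast to the requested types."""
    for line in lines:
        record = json.loads(line)
        yield {field: cast(record[field]) for field, cast in schema.items()}


# Two consumers structure the same raw data in different ways.
error_view = [
    row
    for row in read_with_schema(RAW_LOG_LINES, {"status": int, "path": str})
    if row["status"] >= 500
]
latency_view = list(read_with_schema(RAW_LOG_LINES, {"path": str, "ms": float}))
```

In a schema-on-write system, the second view would only be possible if someone had decided to keep the latency field when the data was loaded. Here, a new use case just means a new schema passed in at read time.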
How Does Data Lake Architecture Work?
Data lakes can be designed and architected in different ways. Integrating with existing enterprise software tools, they can deliver the capabilities that help companies store and analyze log files at scale.
Four Key Functions of an Enterprise Data Lake Solution
- Data Ingest - Data lake solutions should be able to ingest structured and unstructured data from a variety of sources. A key benefit of data lakes is that they create a centralized repository where users can access and analyze many different types of enterprise data.
- Data Storage - Data lake solutions should allow for cost-effective data storage with unlimited capacity to scale. Public cloud service providers offer the best economies of scale for data storage, making them an ideal storage backing for a data lake solution.
- Data Indexing - Data lake solutions should be able to catalog or index data without moving the data. This keeps data centralized instead of siloed and avoids costly data egress fees from public cloud service providers.
- Data Analysis - Data lake solutions connect enterprise log data with data analytics, visualization, and business intelligence tools, allowing organizations to analyze their data and extract insights that drive value creation.
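The idea of indexing data without moving it can be sketched as follows. This is an illustration under stated assumptions, not how any particular product works: a local temporary file stands in for an object in cloud storage, and the build_index and search helpers are hypothetical. The index stores only byte offsets into the original file, so a search reads the matching lines in place.

```python
import os
import tempfile
from collections import defaultdict


def build_index(path):
    """Map each lowercase token to the byte offsets of the lines that contain it."""
    index = defaultdict(list)
    with open(path, "rb") as f:
        offset = 0
        for line in f:
            for token in set(line.decode("utf-8").lower().split()):
                index[token].append(offset)
            offset += len(line)
    return index


def search(path, index, token):
    """Return matching lines by seeking into the original file; no copy of the data is made."""
    results = []
    with open(path, "rb") as f:
        for offset in index.get(token.lower(), []):
            f.seek(offset)
            results.append(f.readline().decode("utf-8").rstrip("\n"))
    return results


# A temporary file stands in for a stored log object (hypothetical sample data).
with tempfile.NamedTemporaryFile("wb", suffix=".log", delete=False) as tmp:
    tmp.write(b"GET /login 200\nPOST /cart 500\nGET /cart 200\n")
    log_path = tmp.name

index = build_index(log_path)
cart_hits = search(log_path, index, "/cart")
error_hits = search(log_path, index, "500")
os.remove(log_path)
```

The index is small relative to the data and can be rebuilt at any time, because the log file itself remains the single source of truth.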
Based on these four functions, we identify the key components of data lake architecture.
Five Key Components of Data Lake Architecture
- Data Sources - Applications that generate enterprise data.
- Data Ingest Layer - Software that captures enterprise data and moves it into the storage layer.
- Data Storage Layer - The storage system that backs the data lake, such as a cloud object store.
- Catalog/Index Layer - Software that cleans, prepares, and transforms data to create indexed views without moving the data.
- Client Layer - Software that enables data analysis, visualization, and insight development.
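The five components above can be wired together in a toy, in-memory sketch. Everything here is an assumption made for illustration: the function names, the sample events, and the dictionary that stands in for an object store are hypothetical, and a real deployment would replace each piece with a managed service or dedicated tool.

```python
import json
from collections import defaultdict


# 1. Data sources: applications emitting events (hypothetical sample data).
def app_events():
    yield "web", {"level": "INFO", "msg": "login ok"}
    yield "web", {"level": "ERROR", "msg": "cart failed"}
    yield "db", {"level": "WARN", "msg": "slow query"}


# 2. Ingest layer: move raw events into storage unchanged.
def ingest(events, storage):
    for source, event in events:
        storage[f"logs/{source}.jsonl"].append(json.dumps(event))


# 3. Storage layer: a dict of object key -> raw lines, standing in for an object store.
storage = defaultdict(list)


# 4. Catalog/index layer: describe what is stored without copying it.
def build_catalog(storage):
    catalog = {}
    for key, lines in storage.items():
        fields = set()
        for line in lines:
            fields.update(json.loads(line))
        catalog[key] = {"rows": len(lines), "fields": sorted(fields)}
    return catalog


# 5. Client layer: use the catalog to pick relevant objects, then read them in place.
def query(storage, catalog, field, value):
    for key, meta in catalog.items():
        if field in meta["fields"]:
            for line in storage[key]:
                record = json.loads(line)
                if record.get(field) == value:
                    yield key, record


ingest(app_events(), storage)
catalog = build_catalog(storage)
errors = list(query(storage, catalog, "level", "ERROR"))
```

Note that the catalog holds only metadata (row counts and field names). The raw events stay in the storage layer, which is the property that lets the architectures described below add indexing and analysis on top of the same repository.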
What are the Three Types of Data Lake Architecture?
How do different data lake solutions incorporate the key components of data lake architecture to deliver on these critical functionalities?
Most vendors have adopted one of three main approaches to data lake architecture.
Data Lake Architecture: The Template Approach
In 2019, AWS released a new data lake solution known as “Data Lake on AWS”.
This solution uses a template-based approach to data lake architecture that automatically configures existing AWS services to support data lake functionality, such as tagging, sharing, transforming, accessing, and governing data in a centralized repository.
The template approach championed by AWS cuts down on manual configuration, allowing users to set up their data lake in as little as 30 minutes.
Image Source: Amazon Web Services
In this data lake architecture, users access the data lake through a console secured by Amazon Cognito (a user authentication service). Data is ingested via services like Amazon CloudWatch that capture log and event data from throughout the cloud environment. Amazon S3 acts as the data storage repository, while metadata is managed in DynamoDB. Data is cataloged with AWS Glue and can be searched using Amazon OpenSearch Service or analyzed with Amazon Athena.
A template-based approach can make data lakes easier to configure, but complexity and IT overhead are significant issues with data lake architectures that have so many moving parts.
Data Lake Architecture: The “Lakehouse” Approach
Another approach to data lake architecture involves combining the features of a data warehouse and a data lake into a modified architecture that’s been termed a “Lakehouse.”
Data platforms like Databricks and Snowflake fall into this category, as do some data warehousing services like Google BigQuery and Amazon Redshift Spectrum.
Image Source: Medium
In the Data Lakehouse architecture pictured here, you’ll notice many of the same data lake architecture components we’ve already mentioned. Data is ingested from a variety of sources into Amazon S3 buckets, Hadoop HDFS, or another cloud object store.
Data can be queried using Apache Drill or Amazon Athena without moving it out of the data lake's storage repository.
Delta Lake (created by Databricks) provides an additional open format storage layer and allows users to perform ETL processes and run BI workloads directly on the data lake.
The data lakehouse approach has its benefits, but it also introduces a high level of complexity that can result in poor data quality and performance degradation.
High complexity also makes it challenging for non-IT users to utilize data, ultimately preventing organizations from reaching the promised land of data democratization.
Data Lake Architecture: The Cloud Data Platform Approach
The third approach to data lake architecture - and also our favorite - is the cloud data platform approach.
In this architecture, a self-service data lake engine sits on top of a cloud-based data repository, delivering key features that help organizations achieve data lake benefits and realize the full value of their data.
Image Source: ChaosSearch
In the data lake architecture reimagined here, raw data is produced by applications (either on-prem or in the cloud) and ingested into Amazon S3 buckets with services like Amazon CloudWatch or a log aggregator tool like Logstash.
ChaosSearch runs as a managed service in the cloud, allowing organizations to:
- Automatically discover, normalize, and index data in Amazon S3 at scale.
- Index data with extreme compression for ultra cost-effective data storage.
- Store data in a proprietary, universal format called Data Edge that removes the need to transform data in different ways to satisfy alternative use cases.
- Perform textual searches and relational queries on indexed data.
- Effectively orchestrate indexing, searching, and querying operations to optimize performance and avoid degradations.
- Clean, prepare, and virtually transform data without moving it out of Amazon S3 buckets, eliminating the ETL process and avoiding data egress fees.
- Analyze indexed data directly in Amazon S3 with data visualization and BI analytics tools.
ChaosSearch delivers a simplified approach to data lake architecture that unlocks the full potential of Amazon S3 as a large-scale storage repository for enterprise log data.
Data Lake Architecture & The Future of Log Analytics
As enterprise log data continues to grow, organizations will need to start future-proofing their log analytics initiatives with data storage solutions that enable log data indexing and analysis at scale.
Data lakes are a natural fit here – they can ingest large volumes of event logs, ramp up log data storage with limitless capacity, index log data for querying, and feed log data into visualization tools to drive insights.
The three data lake architectures we showcased all deliver on these core capabilities – but only cloud data platforms offer a fully optimized architecture that reduces management complexity and minimizes technical overhead.
If you’re on your way to producing more event logs than you can analyze, or if you’re already there, it’s time to think about a data lake solution that delivers hassle-free and cost-effective performance at scale.