
ChaosSearch Blog


How to Get Started with a Security Data Lake

Modern, data-driven enterprise SecOps teams use Security Information and Event Management (SIEM) software solutions to aggregate security logs, detect anomalies, hunt for security threats, and enable rapid response to security incidents.

SIEMs enable accurate, near real-time detection of security threats, but today's SIEM solutions were never designed to handle the volume of security log data that modern organizations generate daily.

As daily log ingestion grows, the cost of ingesting, processing, analyzing, and retaining security logs in the SIEM rises sharply. SIEM query performance can also deteriorate, delaying critical insights and increasing the organization’s vulnerability to cyber attacks.

SecOps teams may try to salvage SIEM performance by reducing the data retention window for security logs, but doing so negatively impacts long-term use cases for security data and can further damage the organization's security posture.

Thankfully, there’s another way to store and manage security data that can help enterprise SecOps teams overcome these challenges, augment or replace existing SIEM solutions, and bolster their security posture: a Security Data Lake.

In this blog, we’ll explain the unique features and benefits of security data lakes, show how a security data lake can complement your existing SIEM solution, and offer practical guidance for getting started with a Security Data Lake.


What is a Security Data Lake?

A Security Data Lake is a centralized repository for aggregating, storing, and analyzing enterprise security data. Security Data Lakes leverage the inherent advantages of data lake architecture to deliver cost-effective data storage and log analytics for SecOps teams at scale, including:

  • Schema-on-Read Approach - SIEM solutions typically ingest data in a structured schema or auto-normalize security data upon ingestion (an approach known as schema-on-write), while data lakes ingest raw data in its source format - whether structured, unstructured or semi-structured - and apply schema only at query time (an approach known as schema-on-read).
  • Loosely Coupled Storage and Compute - The schema-on-read approach results in a loose coupling of the storage and compute resources needed to maintain your security data lake. Being able to ingest raw data without applying schema or auto-normalizing on the front-end makes it faster, cheaper, and easier to ingest logs into your Security Data Lake.
  • Fewer Data Restrictions - While SIEM tools are relatively narrow in the types of data they can accept, a Security Data Lake can store and aggregate security data from multiple data sources to provide a more comprehensive view of enterprise security. This can include security logs from the organization’s IT infrastructure, along with user access logs, threat intelligence, and other types of security data.
  • Multi-Model Analytics - A security data lake gives enterprise SecOps teams the flexibility to query their data in several different ways, including SQL/relational querying, full-text search, and Artificial Intelligence (AI) or Machine Learning (ML) tools.
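The schema-on-read approach described above can be sketched in a few lines: raw log events are stored exactly as emitted, and a schema (field selection and typing) is applied only when a query runs. A minimal illustration in Python (the field names and sample events are hypothetical):

```python
import json

# Raw events land in the lake exactly as emitted -- no upfront normalization.
raw_events = [
    '{"ts": "2024-05-01T12:00:00Z", "src_ip": "10.0.0.5", "action": "login", "status": "failure"}',
    '{"ts": "2024-05-01T12:00:02Z", "src_ip": "10.0.0.5", "action": "login", "status": "failure"}',
    '{"ts": "2024-05-01T12:01:00Z", "src_ip": "10.0.0.9", "action": "login", "status": "success"}',
]

def query(events, schema, predicate):
    """Apply a schema (field -> type) and a filter at read time."""
    rows = []
    for line in events:
        record = json.loads(line)  # parse only when the query runs
        row = {field: cast(record[field]) for field, cast in schema.items()}
        if predicate(row):
            rows.append(row)
    return rows

# The schema exists only in the query, not in the storage layer.
failed_logins = query(
    raw_events,
    schema={"src_ip": str, "status": str},
    predicate=lambda r: r["status"] == "failure",
)
print(failed_logins)
```

Because no schema was written at ingest time, a different query tomorrow can project entirely different fields from the same stored bytes.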

The unique characteristics of data lake architecture make Security Data Lakes an excellent tool for cost-effectively storing and retaining security data at scale. Enterprise SecOps teams can continue using a trusted SIEM tool for near real-time anomaly detection and threat hunting, while the addition of a Security Data Lake allows for cost-effective long-term retention of security data and enables long-term security log analytics use cases like advanced persistent threat (APT) detection and root cause analysis.

Read: Integrating Observability into Your Security Data Lake Workflows

 

Getting Started with a Security Data Lake

Getting started with a security data lake involves choosing the right data lake architecture and software components based on your organization’s unique needs, capabilities, and circumstances.

Below, we share some of the key decision points you’ll encounter and our best practical advice for getting started with a security data lake.

 

1. Choose your Data Lake Storage and Platform

Modern security data lakes are deployed in the cloud, as public cloud infrastructure offers the most durable, scalable, and cost-effective storage backing for your security data lake. Enterprise SecOps teams can choose from data lake solutions offered by public cloud providers (e.g. AWS Data Lake, Amazon Security Lake, Google Data Lake), or by 3rd-party SaaS vendors like Snowflake or ChaosSearch.

 

Amazon Security Lake Reference Architecture


 

When choosing a data lake solution, enterprise SecOps teams should compare solutions in terms of overall complexity and management overhead, total cost of ownership (TCO), and ease of integration with existing systems and data sources.

 

2. Choose Your Data Sources

Once you’ve chosen a data lake solution, the next step is to identify sources of data for your security data lake. SecOps teams will want to capture security logs from cloud-based applications and services (including IAM services and network security tools), along with web servers, endpoint devices, and any on-prem network infrastructure. Threat intelligence from public/private feeds or cooperating organizations may also be ingested into the security data lake.
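One practical way to keep this step concrete is a simple inventory that maps each source to where its logs will live in the lake. The source names and prefixes below are purely illustrative, not a fixed standard:

```python
# A hypothetical inventory of security data sources feeding the lake.
# Names and object-store prefixes are illustrative only.
DATA_SOURCES = {
    "cloudtrail":   {"kind": "cloud_audit",  "prefix": "logs/aws/cloudtrail/"},
    "vpc_flow":     {"kind": "network",      "prefix": "logs/aws/vpcflow/"},
    "iam_access":   {"kind": "identity",     "prefix": "logs/iam/"},
    "web_servers":  {"kind": "application",  "prefix": "logs/nginx/"},
    "edr_agents":   {"kind": "endpoint",     "prefix": "logs/edr/"},
    "threat_intel": {"kind": "intelligence", "prefix": "feeds/ti/"},
}

def prefixes_for(kind):
    """List object-store prefixes for one category of source."""
    return [s["prefix"] for s in DATA_SOURCES.values() if s["kind"] == kind]

print(prefixes_for("network"))
```

An inventory like this also becomes the checklist for the ingestion and cataloging steps that follow.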

 

3. Configure Data Ingestion

Next, you’ll need to set up and configure a process for ingesting data from the various data sources into your data lake. Cloud-based data ingestion tools include open-source options like Fluentd, Logstash and Apache Kafka, 3rd-party SaaS solutions like Wavefront, and public cloud services like Amazon Kinesis.

It’s important to choose a data lake solution that allows you to ingest data in its raw unstructured format and apply schema at query time - otherwise, you’ll have to deal with the up-front cost and complexity of transforming your data before it enters your data lake.
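The core pattern shared by shippers like Fluentd, Logstash, and Kinesis-based pipelines is buffering raw lines into size-bounded batches before flushing them to storage, without parsing or normalizing them. A simplified sketch of that buffering logic (not any one tool's actual implementation):

```python
def batch_raw_lines(lines, max_batch_bytes=1_000_000):
    """Group raw log lines into size-bounded, newline-delimited batches.

    Log shippers follow the same basic pattern before flushing to object
    storage: no parsing, no schema -- the lines stay byte-for-byte intact.
    """
    batches, current, size = [], [], 0
    for line in lines:
        encoded = line.encode("utf-8")
        # Flush the current batch if adding this line would exceed the cap.
        if current and size + len(encoded) + 1 > max_batch_bytes:
            batches.append("\n".join(current))
            current, size = [], 0
        current.append(line)
        size += len(encoded) + 1  # +1 for the newline separator
    if current:
        batches.append("\n".join(current))
    return batches
```

In a real pipeline each returned batch would become one object written to S3 or GCS; keeping lines raw is what preserves the schema-on-read flexibility discussed above.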

 

4. Catalog or Index Your Data

When you ingest large volumes of data into your data lake, you run the risk of creating a data swamp: a disorganized, poorly-maintained data lake that’s difficult to navigate and analyze. Cataloging or indexing security data as it enters your data lake helps you stay organized and keep track of the valuable data you’re storing.

AWS Glue and Google Cloud Data Catalog are public cloud services that deliver data cataloging capabilities on their respective platforms. SecOps teams can also implement an open-source tool like Apache Atlas, or take advantage of proprietary data indexing technology offered by 3rd-party vendors like ChaosSearch.
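At its core, a data catalog is a lookup structure: which objects exist, where they came from, and which fields they contain, so analysts can locate relevant data without scanning the whole lake. A toy sketch of the idea (services like AWS Glue or Apache Atlas maintain this at far greater scale and fidelity):

```python
from collections import defaultdict

class LakeCatalog:
    """A toy data catalog: tracks each stored object and the fields it
    contains, so queries can target relevant objects instead of scanning
    the entire lake."""

    def __init__(self):
        self.objects = {}                    # object key -> metadata
        self.field_index = defaultdict(set)  # field name -> object keys

    def register(self, key, source, fields):
        self.objects[key] = {"source": source, "fields": sorted(fields)}
        for field in fields:
            self.field_index[field].add(key)

    def objects_with_field(self, field):
        return sorted(self.field_index.get(field, set()))

catalog = LakeCatalog()
catalog.register("logs/vpcflow/2024/05/01.json", "vpc_flow",
                 {"src_ip", "dst_ip", "bytes"})
catalog.register("logs/nginx/2024/05/01.json", "web",
                 {"src_ip", "uri", "status"})
print(catalog.objects_with_field("src_ip"))
```

Registering objects as they arrive, rather than after the fact, is what keeps the lake from drifting into a swamp.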

 

5. Connect to Analytics and Visualization Tools

At this point, you should have security logs and other data streaming from your chosen data sources into your data lake platform. The next step is to connect your data lake to analytics, BI, and data visualization tools that allow your SecOps team to explore, transform, filter, and analyze the data for insights into your organization’s security posture.

The most sophisticated data lake solutions offer multi-model analytics capabilities, allowing SecOps teams to run full-text search, SQL queries, or ML workloads on their data.
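"Multi-model" simply means the same stored events can be queried through different paradigms. A compact demonstration using Python's standard library, with SQL via an in-memory SQLite table standing in for relational querying and a substring scan standing in for full-text search (the events are invented for illustration):

```python
import sqlite3

# The same raw events, queried two ways: relational (SQL) and text search.
events = [
    {"ts": "2024-05-01T12:00:00Z", "user": "alice", "message": "failed login from 10.0.0.5"},
    {"ts": "2024-05-01T12:00:30Z", "user": "bob",   "message": "password changed"},
    {"ts": "2024-05-01T12:01:00Z", "user": "alice", "message": "failed login from 10.0.0.5"},
]

# Relational model: project the events into a table and run SQL.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE logs (ts TEXT, user TEXT, message TEXT)")
con.executemany("INSERT INTO logs VALUES (?, ?, ?)",
                [(e["ts"], e["user"], e["message"]) for e in events])
(count,) = con.execute(
    "SELECT COUNT(*) FROM logs WHERE user = 'alice'").fetchone()

# Search model: scan message text for a term, search-engine style.
hits = [e for e in events if "failed login" in e["message"]]

print(count, len(hits))
```

In a production lake, each model is backed by a purpose-built engine (SQL engines, search indexes, ML frameworks) rather than in-process scans, but the principle is the same: one copy of the data, many query models.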

 


 

Optimize Your Security Data Lake Architecture with ChaosSearch

Ready to build your security data lake?

With ChaosSearch, it takes just minutes to stand up a cost-effective data lake that reduces your SIEM costs and enhances visibility of your enterprise security environment with unlimited data retention to support long-term security analytics use cases.

 


ChaosSearch Security Data Lake Reference Architecture

 

ChaosSearch attaches directly to AWS or GCP, transforming your Amazon S3 or GCS storage backing into a hot security data lake. Once your security logs are ingested into cloud object storage, our proprietary Chaos Index® technology indexes the data 60X faster than Elasticsearch and with up to 20X data compression.

From there, you can use our Chaos Refinery® tool to virtually filter, transform, and query security logs to hunt for APTs or investigate the root cause of a security incident. Building your security data lake with ChaosSearch can help you reduce SIEM costs and increase visibility of your enterprise security posture with low management overhead and TCO.

 

Ready to learn more?

Want to see just how easy it is to stand up a security data lake using ChaosSearch? Read our ChaosSearch for SecOps Solution Brief to learn more about how ChaosSearch enables scalable log analytics for security operations and threat hunting.

 


About the Author, David Bunting

David Bunting is the Director of Demand Generation at ChaosSearch, the cloud data platform simplifying log analysis, cloud-native security, and application insights. Since 2019 David has worked tirelessly to bring ChaosSearch’s revolutionary technology to engineering teams, garnering the company such accolades as the Data Breakthrough Award and Cybersecurity Excellence Award. A veteran of LogMeIn and OutSystems, David has spent 20 years creating revenue growth and developing teams for SaaS and PaaS solutions. More posts by David Bunting