Why Log Data Retention Windows Fail
If you’re using Elasticsearch as part of an ELK stack for log analytics, you’ll need to manage the size of your indexed log data to keep query performance acceptable.
Elasticsearch indices have no built-in size limit, but a larger index takes longer to query and costs more to store. Performance degradation is common with large Elasticsearch indices, and queries against them can even crash Elasticsearch by exhausting the available heap memory on the node.
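As a quick way to see which indices are growing, Elasticsearch’s cat indices API reports per-index document counts and on-disk size (shown here in console syntax, sorted largest first; the column selection is just one reasonable choice):

```
GET _cat/indices?v&s=store.size:desc&h=index,docs.count,store.size
```

Watching these figures over time is usually the first signal that an index is approaching a size where retention limits need to be considered.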
To avoid these pitfalls, you’ll need to implement policies in Elasticsearch that constrain the index size by shortening the data retention window for your logs. But as we’ll explain, reducing data retention can lead to lost insights and prevent you from leveraging valuable use cases for retrospective log data.
In this week’s blog, we explore why data retention windows fail, what happens when they do, and how to maximize the value of your log data.
What is a Data Retention Window?
Data retention policies establish how long your log data will be retained in the index before it is automatically expired and deleted. In Elasticsearch, you can proactively manage the size of your indices by setting a data retention policy for your indexed log data.
Your data retention policy also determines how much retrospective log data will be available for analysis at any given time. A data retention period of 90 days means that developers and security teams will have access to a rolling 90-day window of indexed log data for analytics purposes. That’s your data retention window.
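In Elasticsearch, this kind of policy is typically expressed through index lifecycle management (ILM). A minimal sketch of a 90-day delete policy follows, in console syntax; the policy name and retention age are illustrative, and a real policy would usually pair this delete phase with a hot phase and rollover:

```json
PUT _ilm/policy/logs-90d-retention
{
  "policy": {
    "phases": {
      "delete": {
        "min_age": "90d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
```

Once attached to an index template via the `index.lifecycle.name` setting, this policy automatically deletes indices after they age past 90 days.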
Why Do Log Data Retention Windows Fail?
Shorter data retention windows result in lower data storage costs, but the trade-off is that DevSecOps teams quickly lose the ability to analyze older log data that could support long-term log analytics use cases, from advanced persistent threat detection to long-term user trends and root cause analysis.
Longer data retention windows provide deeper access to retrospective log data and better support for long-term log analytics use cases, but they also result in higher storage costs and ELK stack performance degradation as indexed data is retained for longer periods of time.
This trade-off between low storage cost and analytics value is the main reason why log data retention windows fail. You want to retain as much log data as possible to support long-term log analytics use cases, but the more you expand your digital footprint, the more log data you generate, and the more costly it becomes to make all of that data available for analytics over longer windows of time.
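To see why the costs escalate, here’s a back-of-the-envelope sketch in Python. The daily ingest volume and per-GB price are hypothetical assumptions for illustration, not figures from this article:

```python
# Rough steady-state storage cost for a rolling log retention window.
# Assumptions (hypothetical): 50 GB/day of indexed logs and $0.10 per
# GB-month for hot, query-ready storage (replicas and overhead included).
DAILY_INGEST_GB = 50
PRICE_PER_GB_MONTH = 0.10

def monthly_storage_cost(retention_days: int) -> float:
    """GB held at steady state multiplied by the per-GB-month price."""
    stored_gb = DAILY_INGEST_GB * retention_days
    return stored_gb * PRICE_PER_GB_MONTH

for days in (30, 90, 365, 730):
    print(f"{days:>4}-day window: {DAILY_INGEST_GB * days:>6} GB held, "
          f"~${monthly_storage_cost(days):,.2f}/month")
```

With these illustrative numbers, stretching retention from 30 days to two years multiplies the steady-state storage bill roughly 24x, which is why teams under budget pressure shorten the window first.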
So you compromise.
You stop collecting logs from applications and services you deem “non-essential”. For “low importance” applications, you retain logs for just 30 days. Eventually, you start shortening the data retention window for mission-critical applications - from 2 years down to 1 year, then 6 months, and maybe even down to 3 months or less.
Before you know it, the only logs you’re storing for the long term are those needed to satisfy regulatory requirements, and the vast majority of your logs are being discarded or expired from storage before you can ever glean their full value.
Ultimately, log data retention windows fail when they limit your ability to get the maximum value from your logs.
What Happens When Log Data Retention Windows Fail?
When you shorten your data retention windows to drive down storage costs or preserve Elasticsearch querying performance, you start to lose out on log analytics use cases that depend on the availability of long-term data.
Advanced Persistent Threat Detection
Industry research shows that enterprise organizations take more than 200 days on average to identify a data breach. At that rate, even a six-month data retention window would mean that for roughly half of all breaches, the relevant logs have already expired by the time the breach is identified.
Retaining security logs for longer periods of time enables a more comprehensive approach to security log analysis and helps your security team detect long-term persistent threats.
Root Cause and Forensic Analysis
In the aftermath of an application failure or security incident, DevSecOps teams may want to forensically analyze historical log data to uncover (and eventually mitigate) the root cause of the problem. But if the root cause of your incident took place 130 days ago and you’re only retaining 60 days of logs, the valuable information you need is already gone.
The more you can open up the data retention window, the more you’ll succeed at root cause investigations that depend on the availability of retrospective data.
Long-term Application and User Trends
The holiday season is here, and it’s the busiest time of the year for eCommerce in particular. Digital retailers can prepare for peak demand season by analyzing application performance data and user trends from previous years. These analyses can help you answer questions like:
- Which days were the busiest shopping days on your website last year?
- How much did website traffic increase during peak demand hours?
- Did users experience slow page-load times or other performance degradation during peak demand times?
- Which products, services, or pages had the highest conversion rates last holiday season?
- Which holiday promotion drove the most engagement last year?
If you’re only retaining log data for 3 months, you’re throwing away valuable application and user session logs that could otherwise help you improve your marketing, optimize cloud resource configuration to enhance the customer experience, and drive revenue.
Enterprises Need Real-Time Visibility and Long-Term Analytics
Total observability of your applications, security posture, and IT environment requires a combination of real-time (short-term) alerting and long-term analytics capabilities.
Enterprise developers already depend on application performance monitoring (APM) and observability tools (e.g. New Relic, Sematext) to capture application telemetry (logs, metrics, and traces) in real time and alert on anomalies or performance issues that warrant a review.
On the security side, security information and event management (SIEM) tools (e.g. Datadog, Exabeam) provide real-time anomaly detection and alert on potential security threats.
These capabilities are essential for DevSecOps teams, but APM, observability, and SIEM tools were not designed for long-term analysis, and using them at scale quickly becomes expensive and complex to manage. Gaining total observability of your network security posture and application performance requires a centralized log management (CLM) tool that maintains a small storage footprint for your logs and supports long-term security, DevOps, and cloud log analysis use cases.
ChaosSearch Removes Restrictive Limits on Data Retention
ChaosSearch replaces your ELK stack with a cloud data platform that enables log analytics at scale and eliminates the need for restrictive data retention windows.
What’s the secret? Our proprietary indexing technology known as Chaos Index®.
Chaos Index massively reduces the size of your log data, providing up to 20x compression while still fully indexing it.
Combining the extreme data compression of Chaos Index with cost-effective cloud data retention in AWS or Google Cloud means that you can store your indexed log data perpetually, without having to set a data retention window or delete any of your data.
Once your data is indexed, you’ll be able to perform text search or relational queries against your logs, and successfully execute on long-term log analytics use cases with no data movement and no re-indexing.
Ready to learn more?
Schedule a demo of ChaosSearch to learn how our fully managed log analytics platform lowers the cost of analyzing log data at scale with unlimited queries and data retention.
Download the Brief: The Future of Data: A special Raconteur report published in The Times
Read the Case Study: Fast-Growing SaaS Scales Log Analytics with Huge Cost Savings