5 AWS Logging Tips and Best Practices
If you’re an AWS user, you’re probably familiar with some of Amazon’s native services available for logging and monitoring, such as CloudWatch and CloudTrail. With that said, log management can get complicated quickly, especially if you’re dealing with a high volume of logs from AWS Lambda functions or a multi-cloud/hybrid cloud environment. AWS logging tips we’ll walk you through can certainly help.
Getting logging right starts with how well you know the strengths and limitations of native AWS monitoring tools. Many times, you can build a cloud-native observability stack that complements these tools in areas where they become difficult to navigate. Here are five AWS logging best practices you can follow for analyzing logs more effectively.
1. Understand the limitations and differences between CloudTrail and CloudWatch
CloudWatch is Amazon's primary monitoring and logging service built into the AWS cloud. It can collect logs and metrics from all of your AWS services and workloads – from CloudTrail and EKS, to Route53, to individual EC2 instances running the CloudWatch agent, to Lambda functions. CloudWatch, through CloudWatch Logs Insights, provides basic search and analytics capabilities, such as visualizations, to help you interpret log and metrics data. And it lets you configure alarms, which can alert you to anomalies or sudden changes in workload performance patterns.
While AWS CloudWatch works well for basic monitoring and alerts, on its own, it may not be the best solution for log data at scale. Limitations include user interface and scalability issues can hold users back from utilizing logs in CloudWatch for troubleshooting use cases.
CloudTrail is different from CloudWatch because it keeps a record of every API call within your AWS account (vs. the application- and service-level logs and metrics in CloudWatch). CloudTrail gives you deeper visibility into the activity within your account, helping you see who did what and when. You can use the CloudTrail logs not only to track the security of the user access but also for operational troubleshooting.
Until recently, the native way provided by AWS to analyze CloudTrail data was a simple UI that lets you run some basic searches on events over the last 90 days worth of data. However in January, AWS announced CloudTrail Data Lake, an additional service to store and enable querying of CloudTrail data using the familiar SQL query language.
Even with both of these tools in play, you may still face some of these common AWS monitoring challenges.
2. Ingest and store raw log data within AWS S3 as a data lake
The first step to effective log analysis is setting up your AWS data lake the right way. Configure it to ingest and store raw data in its source format – before any cleaning, processing, or data transformation takes place.
Storing data in its raw format gives you the opportunity to interrogate your data and generate new use cases for enterprise data. The scalability and cost-effectiveness of Amazon S3 data storage means that you can retain your data in the cloud for long periods of time, and query it months, or even years down the road.
Storing everything in its raw format also means that nothing is lost. As a result, your AWS data lake becomes the single source of truth for all the raw data you ingest.
One thing to note: Amazon S3 offers different classes of cloud storage, each one cost-optimized for a specific access frequency or use case. Generally speaking, Amazon S3 Standard is a solid option for your data ingest bucket, where you’ll be sending raw structured and unstructured data from both your cloud and on-prem applications.
3. Don’t trade off data retention for more effective analysis
Data retention policies establish how long your log data will be retained in the index before it is automatically expired and deleted. These policies also determine how much historic log data will be available for analysis at any given time. So, a data retention period of 90 days means that developers and security teams will have access to a rolling 90-day window of indexed log data for analysis.
Shorter data retention windows result in lower data storage costs, but the trade-off is that you quickly lose access to analyze older log data that could support long-term log analytics use cases ranging from advanced persistent threat detection to incident investigation, long-term user trends, root cause analysis, among others. On the other hand, retaining log data for longer provides deeper access to retrospective log data and better support for long-term log analytics use cases. If you’re using tools like CloudWatch or the ELK stack – costs and performance can degrade as log data volume increases. So many people delete data they otherwise need for analysis.
That doesn’t have to be the case. Gaining total observability of your network security posture and application performance requires a centralized log management solution that enforces a small storage footprint for your logs and supports security operations, DevOps, and CloudOps use cases.
4. Avoid relying entirely on general-purpose observability solutions for all log management needs
Many companies address their observability needs of full, centralized visibility into metrics, logs and traces by buying a holistic application performance management (APM) solution from one vendor, such as DataDog.
However, if you do that, you may find that while centralizing all telemetry in a single platform works well at first, it creates significant challenges at scale. The underlying data technologies for monitoring, troubleshooting and trend analysis/reporting are fundamentally different, often leading to ballooning costs, reduced data retention, increased operational burden and limited ability to answer relevant analytics questions.
To avoid these challenges, a better approach may be to build a best-of-breed observability solution made up of open source tools and open APIs. The Cloud Native Computing Foundation(CNCF) provides a good basis from which to start with its OpenTelemetry project. This open-source project is a collection of tools, APIs, and SDKs you can use to instrument, generate, collect, and export telemetry data (metrics, logs, and traces) to help you analyze your applications’ performance and behavior.
So, instead of relying on a single tool to explore all telemetry, a cloud-native observability stack might look something like this.
5. Collect and analyze all log data centrally
As mentioned above, many companies have a multi-cloud or hybrid cloud strategy that makes log analysis in CloudWatch difficult to manage. Even if all of your workloads run in AWS, you may not be able to collect and analyze as much data from them as you would prefer, since CloudWatch only supports specific predefined log and metrics types.
To extend your log analytics strategy beyond CloudWatch, deploy a logging solution like ChaosSearch that gives more control over the log and metrics data you expose, as well as how you interact with it. There are typically two ways to do this:
- You can use CloudWatch with ChaosSearch by exporting CloudWatch logs, moving them to S3 and indexing these logs and other log data stored in S3.
- You can bypass CloudWatch and push logs directly into ChaosSearch. From there, the platform can index any data stored in S3 in the log, JSON, or CSV format. There is a huge ecosystem of log shippers and tools to transport data to cloud object storage (Amazon S3) from Logstash and beats, Fluentd, Fluentbit to Vector, Segment.io, Cribl.io or programmatically from Boto3.
Making the most of your AWS logs
To sum up, it’s tempting to go for the default AWS tools like CloudWatch and CloudTrail to manage your log data, but they may not be enough – particularly if you have a complex serverless or microservices-based architecture, or if you are operating in a multi-cloud or hybrid cloud environment. There are many reasons why this is true – from scalability and usability issues to cost. To make the most of your AWS logs, ideally you should centralize them (along with log files from other sources) using AWS S3 as a data lake. From there, you can rely on a best-of-breed observability approach to understand all of your telemetry – including logs, metrics and traces.
Read the Blog: 10 Essential Cloud DevOps Tools for AWS
Listen to the Podcast: Making the World's AWS Bills Less Daunting
Check out the Whitepaper: Build a DataOps Foundation for Agile Data Analytics