Analyzing Historical Log Data with CHAOSSEARCH – Part 1: Archive Your Data to S3
In recent years, many companies have sprung up in the log analytics and monitoring market. It’s no surprise — as more and more complicated software and services are released, the need for deep visibility into those systems increases. With the rise of cloud computing and serverless architectures, centralizing metrics and event data is key to understanding the state of a deployment, as well as diagnosing any issues that arise in real time.
Solutions are a mix of proprietary and open source stacks. Many of the latter wrap around the ELK stack, and many of the former use Elasticsearch as their data store as well. Elasticsearch is wonderful for this use case, but its indices tend to explode in size as data ingestion grows, which leads many vendors to offer short retention periods: your log data is often accessible within their systems for only 7, 14, or maybe 30 days. After this period, many users have no choice but to delete or archive their data, because they simply can’t afford the cost or complexity of longer-term storage. Deleted data has no value. Archived data has value, but at a cost.
CHAOSSEARCH seeks to bridge this gap by allowing search and analytics on top of your archived data, while exposing the Elasticsearch API that we all love. Yes, this means you can use Kibana too.
When it comes to choosing a cloud storage offering, Object Storage is king. More specifically, Amazon’s Simple Storage Service (S3) is king. It offers the perfect platform for archiving your historical log data. S3 is:
- Scalable: offering ‘unlimited’ storage
- Available: 99.99% over a given year
- Durable: to ‘11 9s’
- Cheap. Really cheap. Like 2 cents a month per GB cheap
And you can lower this price further by using lower-cost storage classes:
- Infrequently Accessed data — https://aws.amazon.com/s3/storage-classes/
- Reduced Redundancy storage — https://aws.amazon.com/s3/reduced-redundancy/
To use one of these storage classes, either set it on each object with the CLI or an SDK, or edit the object’s properties in the console.
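As a minimal sketch of the SDK route, the helper below uploads a file and sets its storage class in the same call. It assumes Boto3 (the AWS SDK for Python) and takes an already-created S3 client as an argument; the function name and the small allow-list of classes are illustrative choices, not part of any CHAOSSEARCH tooling.

```python
# Storage classes named in this post; S3 defines others as well.
ALLOWED_CLASSES = {"STANDARD", "STANDARD_IA", "REDUCED_REDUNDANCY"}


def upload_with_storage_class(s3_client, path, bucket, key, storage_class="STANDARD_IA"):
    """Upload a local file and set its storage class in one call."""
    if storage_class not in ALLOWED_CLASSES:
        raise ValueError(f"unexpected storage class: {storage_class}")
    # upload_file passes ExtraArgs through to the underlying upload request.
    s3_client.upload_file(path, bucket, key, ExtraArgs={"StorageClass": storage_class})
    return {"Bucket": bucket, "Key": key, "StorageClass": storage_class}
```

With credentials configured, a call might look like `upload_with_storage_class(boto3.client("s3"), "ALog.log", "my-bucket", "ALog.log")`.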
You can even automatically move older data sets to one of the above storage classes after a predefined time has elapsed, or archive data to Amazon Glacier, an even cheaper storage option.
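A lifecycle rule of this kind can be sketched with Boto3 as follows. The rule ID, the 30-day and 90-day thresholds, and the function names are assumptions picked for illustration; adjust them to your own retention needs.

```python
def build_archive_lifecycle(prefix="", ia_after_days=30, glacier_after_days=90):
    """Return a lifecycle configuration that tiers objects down as they age."""
    if glacier_after_days <= ia_after_days:
        raise ValueError("Glacier transition must come after the Infrequent Access one")
    return {
        "Rules": [
            {
                "ID": "archive-old-logs",  # hypothetical rule name
                "Filter": {"Prefix": prefix},  # empty prefix = every object in the bucket
                "Status": "Enabled",
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


def apply_archive_lifecycle(s3_client, bucket, **kwargs):
    """Attach the rule to a bucket (this replaces any existing lifecycle configuration)."""
    config = build_archive_lifecycle(**kwargs)
    s3_client.put_bucket_lifecycle_configuration(Bucket=bucket, LifecycleConfiguration=config)
    return config
```

Keeping the dictionary builder separate from the API call makes it easy to inspect or review the rule before applying it to a real bucket.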
This functionality can also be enabled in the management console, under bucket management.
Whether you choose to implement any or all of these lifecycle rules, CHAOSSEARCH will continue to provide the same level of visibility into your historical data.
Because of these considerations, we’ve built CHAOSSEARCH with a cloud-first mentality, and native S3 support. By building a solution specifically with object storage in mind, we leverage the simplicity and cost of S3. For the first time, your historical archives can be readily analyzed right where they live — with familiar tools and concepts, such as relational modeling and Kibana text search and visualizations. A combination of your ‘Hot’ storage solution for real-time monitoring and alerts, and CHAOSSEARCH for long-term data analytics, provides visibility into all past and current events within your stack.
If you are already archiving your data to S3, give us a try and gain insight into what you have. If you aren’t, you’ve got options.
A few vendors offer archiving as part of the data lifecycle — usually to S3. So if you are using Loggly, Logz.io, or Sumo Logic, you have options to send directly to S3:
- Loggly — s3-bucket-archives
- Logz.io — elk-role-based-access-s3-fluentd/
- Sumo Logic — Configure-Data-Forwarding-from-Sumo-Logic-to-S3
If you are using something else that doesn’t offer this service, you can easily dump whatever you like into S3 using the AWS CLI and just a few commands. The CLI maps most of the AWS console functionality to your command line, so you need to specify which service you want to use: S3, as opposed to, say, EC2. S3 functionality comes in two layers:
- s3api: this is the baseline REST API natively supported by S3
- s3: this layer performs more advanced functionality composed of s3api calls
The first thing you need is a ‘bucket’ to store things in.
A bucket is a container for storing objects (files). Creating one is part of the standard REST API, so we can do it with the s3api layer:
> aws s3api create-bucket --bucket my-bucket
A bucket can also be created from the S3 section of the management console.
Note that the bucket name (my-bucket in this case) must be globally unique. This is because AWS optionally maps a public URL to your bucket: the new bucket can be addressed either with a virtual-hosted-style URL (bucket name in the hostname) or with a path-style URL (bucket name in the path).
This is quite convenient for things like static web hosting, which is beyond the scope of this post.
Now that we have a bucket, we probably want to put something in it. If you have a single file, perhaps a log file, there are a couple of ways to do this. We can use the s3api again to put an object in our bucket (assuming it is no larger than 5 GB, the limit for a single PUT):
> aws s3api put-object --bucket my-bucket --key ALog.log --body ALog.log
Or through the management console, by clicking on your bucket and choosing Upload.
Either way, this puts the blob data in ALog.log into your bucket under a key of the same name. You can get objects in a similar fashion. You are probably thinking that this is all well and good, but you’ll have to wrap this thing in a script to handle any large files or bulk requests. That’s where the s3 layer comes in. It hides some of the ugliness of multipart uploads and multi-file uploads and downloads.
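To give a sense of the ugliness being hidden, here is a rough sketch of a multipart upload written by hand against the low-level API through Boto3. It is not how the CLI is implemented internally; it only shows the create, upload-part, complete sequence. The function name and the 8 MiB part size are assumptions, and the sketch assumes a non-empty file.

```python
PART_SIZE = 8 * 1024 * 1024  # 8 MiB; S3 requires at least 5 MiB for every part but the last


def multipart_upload(s3_client, path, bucket, key, part_size=PART_SIZE):
    """Upload a file in fixed-size parts, aborting the upload if any step fails."""
    upload_id = s3_client.create_multipart_upload(Bucket=bucket, Key=key)["UploadId"]
    parts = []
    try:
        with open(path, "rb") as f:
            part_number = 1  # part numbers start at 1
            while True:
                chunk = f.read(part_size)
                if not chunk:
                    break
                resp = s3_client.upload_part(
                    Bucket=bucket, Key=key, UploadId=upload_id,
                    PartNumber=part_number, Body=chunk,
                )
                # S3 needs each part's ETag back when the upload is completed.
                parts.append({"ETag": resp["ETag"], "PartNumber": part_number})
                part_number += 1
        s3_client.complete_multipart_upload(
            Bucket=bucket, Key=key, UploadId=upload_id,
            MultipartUpload={"Parts": parts},
        )
    except Exception:
        # Abort so S3 does not keep (and bill for) the orphaned parts.
        s3_client.abort_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id)
        raise
    return len(parts)
```

In practice you would rarely write this yourself: the s3 layer of the CLI, and Boto3’s own `upload_file`, switch to multipart for large files on your behalf.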
The copy (cp) command can be used to copy files between your local machine and S3. It can even copy files from one bucket to another.
Uploading our log with this command looks like this:
> aws s3 cp ALog.log s3://my-bucket
We can also download it somewhere:
> aws s3 cp s3://my-bucket/ALog.log .
Or copy it between buckets:
> aws s3 cp s3://my-bucket/ALog.log s3://my-other-bucket/ALog.log
If I have an entire directory of log files that I want to archive, the CLI can handle that too with the sync command. It translates key prefixes into folder names and vice versa, allowing you to keep a local filesystem in sync with an S3 bucket.
This will copy anything in the file tree under my-logs-dir to a corresponding key in my-bucket:
> aws s3 sync my-logs-dir s3://my-bucket
We can also reverse the source and destination to achieve the opposite:
> aws s3 sync s3://my-bucket my-logs-dir
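The folder-to-key translation that sync performs can be illustrated with a short Python sketch. This is a simplification under stated assumptions: unlike the real sync command, it uploads every file rather than only new or changed ones, and the function names are made up for the example. It expects a Boto3 S3 client to be passed in.

```python
import os


def local_files_to_keys(root, prefix=""):
    """Map each file under root to the S3 key that mirrors its relative path."""
    mapping = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            rel = os.path.relpath(full_path, root)
            # S3 keys always use "/" regardless of the local path separator.
            mapping[full_path] = prefix + rel.replace(os.sep, "/")
    return mapping


def upload_tree(s3_client, root, bucket, prefix=""):
    """Upload every file under root; returns the sorted list of keys written."""
    mapping = local_files_to_keys(root, prefix)
    for full_path, key in mapping.items():
        s3_client.upload_file(full_path, bucket, key)
    return sorted(mapping.values())
```

A file at `my-logs-dir/2018/app.log` would end up under the key `2018/app.log`, which the console then displays as a `2018` folder.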
The console also allows uploads, even in batches, from the bucket view. However, the maximum file size is limited to 78GB, versus 5TB through the API. I wouldn’t recommend using the console for anything over a few GB.
Done playing around? You can delete everything in the bucket like so:
> aws s3 rm s3://my-bucket --recursive
Or by selecting your bucket and choosing ‘Empty’ (or ‘Delete’ to also remove the bucket itself) in the console.
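For scripted cleanup, the same emptying step might look like this with Boto3. The function name is hypothetical, and the sketch only removes current objects: on a bucket with versioning enabled, older versions would remain.

```python
def empty_bucket(s3_client, bucket):
    """Delete every object in a bucket, one listing page at a time."""
    deleted = 0
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        # "Contents" is absent when a page (or the whole bucket) has no objects.
        objects = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if not objects:
            continue
        # A listing page holds at most 1,000 keys, which is also the delete_objects limit.
        s3_client.delete_objects(Bucket=bucket, Delete={"Objects": objects, "Quiet": True})
        deleted += len(objects)
    return deleted
```

The return value is the number of keys submitted for deletion, which is handy for logging how much was cleaned up.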
Note that the CLI commands in this post don’t specify any region information, so they fall back to your configured default region (us-east-1 here). There are also many more options you can specify for these operations to customize security, storage classes, encryption, and more.
If you want to include your uploads as part of a Python script or your infrastructure tooling, look into the Boto3 library. Boto3 is the AWS SDK for Python. It provides access to much of the functionality you get in the management console, including S3 features.
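As a starting point, a Boto3 script might create a client pinned to a region and then create the bucket there. The helper names below are illustrative. The one S3-specific wrinkle shown is that regions other than us-east-1 need an explicit location constraint when creating a bucket.

```python
def make_s3_client(region="us-east-1"):
    """Create an S3 client pinned to a region (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helper below can be exercised without it
    return boto3.client("s3", region_name=region)


def create_bucket_in_region(s3_client, bucket, region="us-east-1"):
    """Create a bucket, adding the location constraint S3 expects outside us-east-1."""
    params = {"Bucket": bucket}
    if region != "us-east-1":
        # us-east-1 is the default and must not be passed as a LocationConstraint.
        params["CreateBucketConfiguration"] = {"LocationConstraint": region}
    s3_client.create_bucket(**params)
    return params
```

From there, `client = make_s3_client("eu-west-1")` followed by `create_bucket_in_region(client, "my-bucket", "eu-west-1")` and `client.upload_file("ALog.log", "my-bucket", "ALog.log")` would mirror the CLI walkthrough above in a different region.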
For some users, a local upload just isn’t going to cut it. When you have terabytes of data, it may actually be faster to ship a physical piece of hardware to AWS with your data on it, which they can then load into S3. For this, AWS has implemented Snowball and Snowmobile. Snowball is essentially a storage appliance shipped to you. You load your data onto it and then ship it back. Snowmobile is literally a truck for exabyte-scale migrations. Now that is Big Data.
For those running local, on-premises ELK stacks, you have a couple of options for archiving. You can use Logstash to output data directly to S3 with its S3 output plugin.
Logstash can also be used to export CSV from Elasticsearch (CHAOSSEARCH supports JSON and CSV).
There are also some third-party tools, such as Elasticdump, that let you dump to different formats locally.
With the data local, you’re free to use your method of choice to move it to an S3 bucket, such as the CLI commands above.
Once your data is in S3, the next step is to use CHAOSSEARCH to discover and organize it so that it can be queried and searched in a relational context and with Kibana through our Elasticsearch API. That will be the topic of our next blog post, so stay tuned!