Sometimes, the data you want to analyze lives in AWS S3 buckets by default. If that’s the case for the data you need to work with, good on you: You can easily ingest it into an analytics tool that integrates with S3.
But what if you have a data source — such as logs generated by applications running in a Kubernetes cluster — that isn’t stored natively in S3? Can you manage and analyze that data in a cost-efficient, scalable way?
The answer is yes, you can. You just need to build an efficient pipeline for moving that data into S3, from which you can ingest it into an analytics solution. And while setting up a pipeline like this requires a little extra work, modern tooling makes it quite simple.
To prove the point, here’s a look at how to build a pipeline for getting Kubernetes logs into S3 using Logstash, the popular open source log pipeline tool.
The Conventional Approach to Kubernetes Logs
You may be thinking: I already have a way to push my Kubernetes log data into a log analytics tool. I run a sidecar container (or, more likely, a bunch of sidecar containers) that aggregates the logs for me and delivers them directly to my analytics platform.
You can certainly take this approach — which tends to be the default strategy for managing Kubernetes logs.
But there is a big downside to using a sidecar container to push logs directly to an analytics platform: You are restricted to whichever storage service or architecture that platform offers.
Usually, that storage is not as cheap or scalable as a service like S3 — a fact that becomes especially problematic if you want to keep your logs on hand for a long time, or you plan to process logs in batches.
For purposes like those, you need storage that costs pennies per gigabyte and that scales indefinitely.
Streamlining Kubernetes Logging with S3
What if, instead of relying on sidecars to move data directly into a log analytics platform, you could push your Kubernetes application logs into S3, then use S3 as a data lake that efficiently houses your logs until you’re ready to analyze them? That’s the flexible approach that ChaosSearch enables. By making it easy to work with any and all data types directly from S3, ChaosSearch allows teams to leverage S3 as an endlessly scalable and affordable data lake.
The only big question to solve is how you get your Kubernetes logs into S3 in the first place.
Fortunately, there are a number of ways to do this using free tools (after all, as an open source platform, Kubernetes is all about choice and freedom). One of the most straightforward approaches is to use Logstash, the open source data collector, to push Kubernetes logs to S3.
In this article, we’ll walk through the steps required to push Kubernetes logs to S3 using Logstash (with some help from Filebeat, an open source log shipper).
As we’ll see, it’s very easy to build a pipeline that moves logs from your pod into S3 using open source tools, especially if you take advantage of Helm charts, which make the tools a snap to deploy into your cluster.
Pushing Logs to S3
The process for building our pipeline is quite simple. First, we’ll install Filebeat, which will pull application logs from the nodes where they live by default. Then, we’ll set up Logstash and configure it to receive logs from Filebeat. Finally, we’ll configure Logstash to push logs to S3.
Set up Filebeat
First, let’s deploy Filebeat in our cluster.
Again, Helm charts make this easy to do. We start by adding the Elastic Helm repository, which hosts the Filebeat chart:
helm repo add elastic https://helm.elastic.co
Now, we can deploy Filebeat with a simple command:
helm install filebeat elastic/filebeat
You’ll need to configure Filebeat for your cluster, of course, by setting the appropriate annotations for your deployment and DaemonSet in Filebeat’s values.yaml file. Filebeat’s GitHub repository details the relevant parameters.
Set Up Logstash
Next, let’s deploy Logstash, again with the help of a Helm chart. The Helm repository we need is the same as for Filebeat, so it should already be enabled if you followed the instructions in the preceding section. So go ahead and install Logstash with:
helm install logstash elastic/logstash
Here again, you’ll need to configure a values.yaml file to fit your cluster.
Configure Filebeat to Push to Logstash
By default, the Filebeat deployment we installed with Helm is set up to push logs directly to Elasticsearch. We want the logs to go to Logstash instead, so we need to remove the output.elasticsearch: section from the Filebeat configuration in the chart’s values.yaml file and replace it with an output.logstash: section that matches our Logstash setup. It should look something like this:
output.logstash:
  hosts: ["184.108.40.206", "220.127.116.11", "18.104.22.168"]
Define the IP addresses to match the hosts where Logstash is listening. If you have configured a Logstash port other than the default (which is 5044), append it to each address in host:port form.
For details on other parameters that you can set for this configuration, refer to the Filebeat documentation.
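For context, here is a minimal sketch of how that output section might sit inside a values file for the Filebeat chart. It assumes a 7.x release of the elastic/filebeat chart, where the Filebeat configuration lives under daemonset.filebeatConfig (older chart versions use a top-level filebeatConfig key). The file name and the logstash-logstash Service name are placeholders; check kubectl get svc for the real name in your cluster.

```yaml
# Hypothetical filebeat-values.yaml; key names assume a 7.x elastic/filebeat chart.
daemonset:
  filebeatConfig:
    filebeat.yml: |
      filebeat.inputs:
        - type: container
          paths:
            - /var/log/containers/*.log
          processors:
            - add_kubernetes_metadata:
                host: ${NODE_NAME}
                matchers:
                  - logs_path:
                      logs_path: "/var/log/containers/"

      # Replaces the default output.elasticsearch section.
      output.logstash:
        # Assumed in-cluster Service name and the default Beats port.
        hosts: ["logstash-logstash:5044"]
```

A file like this would be applied with something along the lines of helm upgrade --install filebeat elastic/filebeat -f filebeat-values.yaml.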
Configure Logstash to Receive Logs from Filebeat
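In addition to telling Filebeat how to ship logs to Logstash, we also need to tell Logstash how to accept logs from Filebeat. We do that by editing Logstash’s pipeline configuration, a .conf file that typically lives in the /etc/logstash/conf.d directory of your Logstash server. (The pipelines.yml file only lists which pipeline files to load. If you deployed with the Helm chart, the pipeline configuration is set in values.yaml instead.) Add a section like this: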
input { beats { port => 5044 } }
This tells Logstash to use its Beats input plugin to listen on port 5044 for incoming data.
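Inside a cluster, Filebeat can only reach that port if a Kubernetes Service exposes it. The fragment below is a sketch of how that might look in the Logstash chart’s values file, assuming the service key offered by the 7.x elastic/logstash chart. The port name beats is just a label.

```yaml
# Hypothetical logstash-values.yaml fragment; assumes the elastic/logstash chart's `service` key.
service:
  type: ClusterIP
  ports:
    - name: beats        # label only; any valid port name works
      port: 5044         # port that Filebeat connects to
      protocol: TCP
      targetPort: 5044   # port the Beats input listens on inside the pod
```

With a ClusterIP Service, Filebeat pods in the same cluster can address Logstash by Service name rather than by node IP.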
Configure Logstash to Push to S3
The last step is to configure Logstash’s S3 output plugin, which enables Logstash to push logs to an S3 bucket that you specify.
To enable the plugin, open your pipeline configuration file back up and add a section like the following:
output { s3 {
  region => "eu-west-1"
  bucket => "your_bucket"
} }
Depending on your S3 configuration, you’ll likely need to define other parameters as well, especially if you restrict access to your storage bucket. The Logstash documentation has all the details.
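As an illustration, the whole pipeline could be supplied through the Logstash chart’s values file, roughly as sketched below. This assumes the logstashPipeline and extraEnvs keys of the 7.x elastic/logstash chart. The prefix, the rotation settings, and the logstash-s3-credentials Secret with its key names are all placeholders, not values from this walkthrough. The S3 output plugin falls back to the standard AWS credential sources (such as environment variables or an instance role) when no keys are set in the pipeline itself.

```yaml
# Hypothetical logstash-values.yaml fragment; key names assume a 7.x elastic/logstash chart.
logstashPipeline:
  logstash.conf: |
    input {
      beats {
        port => 5044
      }
    }
    output {
      s3 {
        region => "eu-west-1"
        bucket => "your_bucket"
        # Assumed key prefix inside the bucket.
        prefix => "kubernetes-logs/"
        # Write one JSON event per line and gzip each object.
        codec => "json_lines"
        encoding => "gzip"
        # Rotate and upload the current file every 5 minutes.
        time_file => 5
      }
    }

# Credentials come from the environment; the Secret name and keys are placeholders.
extraEnvs:
  - name: AWS_ACCESS_KEY_ID
    valueFrom:
      secretKeyRef:
        name: logstash-s3-credentials
        key: access-key-id
  - name: AWS_SECRET_ACCESS_KEY
    valueFrom:
      secretKeyRef:
        name: logstash-s3-credentials
        key: secret-access-key
```

Using the logstash.conf key replaces the default pipeline file that ships in the Logstash image, so the Beats input is only declared once.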
Congratulations: With the help of Filebeat and Logstash, you’re now pushing your Kubernetes application logs into S3. From there, you can ingest and analyze them using ChaosSearch whenever you’re ready — and thanks to S3’s scalable, cost-effective storage architecture, you won’t have to worry about being hit with crazy storage bills or maxing out your Kubernetes log storage capacity, no matter how much you scale up your log analytics operation.