
How to Create an S3 Bucket with AWS CLI

Intro

I don’t want to manage an Elasticsearch cluster. That was one of the main reasons I joined CHAOSSEARCH: to be rid of the accidental complexity of ES, and to let others do the same. But my job is as an SRE, and logs will be created, those logs need to be searched, and that data needs to be stored somewhere. So let’s pick apart some of the options for quickly getting data into S3 so it can be indexed by CHAOSSEARCH, and try to avoid resorting to a fat client like Logstash or Fluentd.

 

Table Stakes

Most Amazon Services

Most AWS services have a nice way to ship their logs to S3, most of the time as newline-delimited JSON objects, gzipped into a timestamped file. They are wonderful and easy to work with.

Examples: CloudTrail, ALB access logs, VPC Flow Logs, and S3 server access logs can all be delivered straight to a bucket this way.
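
If you want a feel for what those delivered files look like, you can stream one straight to your terminal without saving it. The key below is a made-up placeholder; substitute one of your own delivered log files:

aws s3 cp s3://chaos-blog-test-bucket/some/prefix/2019/01/01/logfile.json.gz - | zcat | head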

For those who are playing with infrastructure as code, tools like Terraform let you easily set up the buckets for common AWS services.

Bonus points: Terraform resource for CloudTrail and ALBs
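
If you’d rather stay in the CLI for this post, here’s a rough sketch of pointing CloudTrail at a bucket without Terraform. The trail name is mine, and the bucket must already have a bucket policy that allows CloudTrail to write to it (see the CloudTrail docs), or create-trail will fail:

# assumes chaos-blog-test-bucket already has a CloudTrail-friendly bucket policy
aws cloudtrail create-trail --name chaos-demo-trail --s3-bucket-name chaos-blog-test-bucket
aws cloudtrail start-logging --name chaos-demo-trail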

 

Watch this quick demo to learn how to analyze CloudTrail logs with ChaosSearch:

 

The AWS CLI

The AWS CLI should be standard issue in a cloud user’s toolbox, so we’ll keep this short. pip install awscli or brew install awscli on your Mac, apt-get install awscli or pip install awscli on a Linux box, or use one of the other methods described in the AWS CLI installation docs.
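
Before going further, a quick sanity check that the CLI and your credentials actually work never hurts:

aws --version                  # is the CLI installed and on your PATH?
aws configure                  # set your access key, secret key, and default region
aws sts get-caller-identity    # do the credentials actually work?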

Once it’s installed, you can use aws s3 cp or aws s3 sync to do a one-time shot of moving data up to be indexed, or you can keep following along with this post and use the CLI to start shipping data from an instance.
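
For example, a one-shot sketch, assuming the chaos-blog-test-bucket we create below:

# copy a single file up
aws s3 cp /var/log/syslog s3://chaos-blog-test-bucket/manual/syslog

# or mirror just the rotated, gzipped logs from a directory
aws s3 sync /var/log/ s3://chaos-blog-test-bucket/manual/ --exclude "*" --include "*.gz"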

 

An instance that ships

We’re going to create an S3 bucket, an IAM role that can write into that bucket, and an instance profile. Once the profile is created, it can be attached to an instance to use the permissions in that role to push data to the created S3 bucket. You can add that IAM role to an existing instance to push its logs to the bucket, or create a new instance that uses this profile.

Warning: These instructions are mostly CLI. Some instructions include links to AWS docs for doing the same with the AWS console.

 

Do you have an existing instance?

We’re going to assume you have an instance running already, and you just need some help making a bucket and giving the instance permissions to write to it.

Do you not have an instance? That’s pretty out of scope for this doc. I’m sorry. Check out the AWS Quickstart - Launch Instance and you should have an instance in no time.

 

The Bucket

Create a bucket to push your logs to.

See also: AWS Quick Start Guide: Back Up Your Files to Amazon Simple Storage Service

  • Create a test bucket:
aws s3 mb s3://chaos-blog-test-bucket
  • Did you get an error? S3 buckets are global, so if someone else has created a bucket with the same name, like the author of this blog post, you’re going to have to substitute your own bucket name for chaos-blog-test-bucket for the rest of this post.
  • List the objects in the bucket:
aws s3 ls s3://chaos-blog-test-bucket

WOW LOOK! A bucket with nothing in it. It’s not impressive yet, but you have to have a bucket to get in the game.
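
One optional wrinkle: mb creates the bucket in your CLI’s configured default region. If you care where the bucket lands, you can be explicit and then double-check; the region below is just an example:

aws s3 mb s3://chaos-blog-test-bucket --region us-east-2
aws s3api get-bucket-location --bucket chaos-blog-test-bucket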

 

The IAM Role

You need an Identity and Access Management (IAM) role to attach IAM policies to. Roles are a collection of policies; policies are a collection of permissions. In a saner world, it’d be easy to create a role that has some named policies attached to it. We do not live in that world. The AWS tooling makes it easier to create the role, then create the policy and attach it to that role on creation.

Are you confused? I’m sorry; check out IAM roles for Amazon EC2, which lays this out pretty well. But for the CLI inclined…

Note: Don’t forget to update the name of the bucket you choose for all the following commands. Remember, the bucket namespace is global across accounts, so we can’t both have a bucket with the same name.

  • Create a role with no permissions:
aws iam create-role --role-name WriteToBucket_Role --assume-role-policy-document '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"Service":"ec2.amazonaws.com"},"Action":"sts:AssumeRole"}]}'
  • Create a policy that can write into that bucket, and attach it to the role we just created:
aws iam put-role-policy --role-name WriteToBucket_Role --policy-name WriteToBucket_policy --policy-document '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Action":"s3:*","Resource":["arn:aws:s3:::chaos-blog-test-bucket","arn:aws:s3:::chaos-blog-test-bucket/*"]}]}'
  • Create an instance profile for you to attach to an instance:
aws iam create-instance-profile --instance-profile-name WriteToBucket_profile
  • Attach the role to the profile:
aws iam add-role-to-instance-profile --instance-profile-name WriteToBucket_profile --role-name WriteToBucket_Role
  • Attach profile to your running instance. Here you’ll have to know the instance id of the instance you’re adding the IAM profile to:
aws ec2 associate-iam-instance-profile --instance-id YOUR_INSTANCE_ID --iam-instance-profile Name="WriteToBucket_profile"
  • Go check out your AWS console, and you can see the instance profile is now associated with the instance.

Your instance should now be able to write into that bucket.
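
You can double-check the wiring from the CLI, too (substitute your own instance id):

aws iam get-instance-profile --instance-profile-name WriteToBucket_profile
aws ec2 describe-iam-instance-profile-associations --filters Name=instance-id,Values=YOUR_INSTANCE_ID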

 

Try it out:

  • ssh to your instance
  • Create an empty file:
touch x
  • Copy the new empty file to the bucket:
aws s3 cp x s3://chaos-blog-test-bucket
  • You should now be able to see the file in the bucket:
aws s3 ls s3://chaos-blog-test-bucket
  • If the copy fails, double check the IAM permissions, and that the instance has the IAM role attached in the AWS console.
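
A handy way to debug from the instance itself is to ask STS who you are; if the profile is attached correctly, the ARN in the output should be an assumed-role ARN containing WriteToBucket_Role:

aws sts get-caller-identity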

 

Automation

Hooray, a thing in the bucket. What a marvel. Let’s do something more impressive and less one-off / manual. Now that we have an instance that can write to the bucket easily, let’s automate getting some data there.

 

The AWS CLI + Cron

A cron job that uses the AWS CLI to copy a file to a bucket on a schedule.

MMMMMM….cron. Is this gross? Yes? Maybe? I’m not sure. I do know writing a scheduler is hard. I do know I’m not smarter than Paul Vixie. Cron is on every Unix machine I’ve ever touched, so why not use it for this POC?

Let’s try a quick example uploading /var/log/syslog to an S3 bucket. Are you still on that instance that can write to the bucket? Great. Try running this.

echo "*/5 * * * * root /usr/bin/aws s3 cp /var/log/syslog s3://chaos-blog-test-bucket/${HOSTNAME}/" | sudo tee /etc/cron.d/upload_syslog

Now, every five minutes, /var/log/syslog will be uploaded to s3://chaos-blog-test-bucket/${HOSTNAME}/syslog in the test bucket.

Depending on your OS/distro, you can check the cron log with tail -f /var/log/cron or journalctl -fu crond. You should see an execution in the log within 5 minutes.
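
A slightly less chatty variation, assuming your logs get rotated and gzipped into /var/log/, is to sync only the rotated files once an hour instead of copying syslog every five minutes:

echo "0 * * * * root /usr/bin/aws s3 sync /var/log/ s3://chaos-blog-test-bucket/${HOSTNAME}/ --exclude \"*\" --include \"*.gz\"" | sudo tee /etc/cron.d/upload_rotated_logs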

 

The AWS CLI + Watchdog

Watchdog is a Python library that watches a directory for changes and fires events when something happens. watchmedo is the CLI tool that ships with watchdog and executes commands on file creation / modification / deletion. They’re both pretty awesome.

Let’s do something a bit more dynamic than the cron job.

  • Install watchdog and watchmedo:
python3 -m pip install watchdog[watchmedo]
  • Upload syslog every time it changes:
watchmedo shell-command \
   --patterns="syslog" \
   --interval=10 \
   --wait \
   --command="/usr/bin/aws s3 cp \${watch_src_path} s3://chaos-blog-test-bucket/watchdog/${HOSTNAME}/syslog" \
   /var/log/

Now every time /var/log/syslog is written to, s3://chaos-blog-test-bucket/watchdog/${HOSTNAME}/syslog will be updated.

Yes, doing this for syslog is wasteful. It’s going to upload syslog a lot, one would even say constantly. Maybe this is a better method for a log that only updates a few times a day. Or maybe the pattern should be for *.gz and only ship rotated logs.
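
Here’s a sketch of that rotated-logs idea, with everything else held the same as the command above:

watchmedo shell-command \
   --patterns="*.gz" \
   --interval=30 \
   --wait \
   --command="/usr/bin/aws s3 cp \${watch_src_path} s3://chaos-blog-test-bucket/watchdog/${HOSTNAME}/rotated/" \
   /var/log/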

 

Watchdog is cool!

Yes, watchdog is cool. Let’s use the Python watchdog module with some standard boto3 to do programmatically what watchmedo was doing. This Python code is pretty hacky / POC level, but it should get you started on your way to playing with watchdog and boto3 to push interesting data into S3.

  • Watch a directory
  • If a file changes, put the file in a dict
  • Every X seconds, loop through that dict, and upload each file to the S3 bucket hardcoded in the global var s3_bucket

Hat tip: Bruno Rocha’s blog, where I got a good example of using watchdog.

#!/usr/bin/env python3
 
import boto3
import os
import socket
import time
from fnmatch import fnmatch
from botocore.exceptions import ClientError
from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer
 
##
## Globals
##
 
s3_bucket       = "chaos-blog-test-bucket" # what bucket to upload to
dir_to_watch    = "/var/log/"           # what directory to watch for new files
file_regex      = "*syslog"           # what files should match and be uploaded
s3_prefix       = socket.gethostname()     # this string will be prefixed to all uploaded file names
upload_interval = 30                       # how long to sleep between uploads
 
# global file list
file_queue = dict()
 
##
## file modification handler
##
 
class FileEventHandler(FileSystemEventHandler):
    def on_any_event(self, event):
        print(event)
        if fnmatch(event.src_path, file_regex):
            print ("File matched")
            if event.__class__.__name__ == 'FileModifiedEvent' or event.__class__.__name__ == 'FileCreatedEvent':
                print("Will try to upload %s" % event.src_path)
                file_queue[event.src_path] = "discovered"
            if event.__class__.__name__ == 'FileDeletedEvent':
                if event.src_path in file_queue:
                    print("File was deleted before it could be uploaded: %s" % event.src_path)
                    del(file_queue[event.src_path])
        else:
            print("Not following %s" % event.src_path)
 
##
## Upload files to S3
##
 
def upload_files():
    print("Fired upload_files")
 
    sesh = boto3.session.Session()
    s3_client = sesh.client('s3')
 
    tmp_files = dict(file_queue)
    for file in tmp_files:
        del(file_queue[file])
 
        print ("File: %s" % file)
 
        if os.path.isfile(file):
            print("Uploading %s (%i bytes)" % (file, os.path.getsize(file)))
            with open(file, 'rb') as data:
                s3_client.upload_fileobj(data, s3_bucket, "{}/{}".format(s3_prefix, file))
 
            print("Uploaded: %s" % file)
        else:
            print("%s isn't a file" % file)
 
##
## Main
##
 
event_handler = FileEventHandler()
observer = Observer()
observer.schedule(event_handler, dir_to_watch, recursive=False)
observer.start()
 
# stay awake...
try:
    while True:
        print("Sleeping %i seconds" % upload_interval)
        time.sleep(upload_interval)
        upload_files()
except KeyboardInterrupt:
    print(file_queue)
    print("Attempting one last upload before exiting.")
    upload_files()
    observer.stop()
observer.join()

Play around with it. Change the file or the mask or the interval. This Python script weighs in at about 50 MB of RAM. It is not super robust, but it’s a cute demo, and surprisingly powerful.
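
To try it yourself, save the script as something like watch_and_upload.py (the name is mine), make sure boto3 and watchdog are installed, and leave it running in the background:

python3 -m pip install boto3 watchdog
nohup python3 watch_and_upload.py >> /tmp/watch_and_upload.log 2>&1 &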

 

Wrapping up

It’s pretty common to think you have to ship logs with a purpose-built log shipper, but there are a few lighter weight options to move files on a schedule or as they change. Except for the AWS shippers, none of the above options are perfect, but they should be able to get you started, and they should help get your gears turning and some bits flowing.

What’s your favorite way to ship files to S3? Hit me up: @platformpatrick.

Preview: In the next post, we’re going to hack on Libbeat and use Filebeat to write directly to S3. Is Filebeat a fat client? Maybe, but it is wayyyyy skinnier than Logstash and Fluentd.


 
