
ChaosSearch Blog


The New Best Way to Index and Query JSON Logs

JSON has become the de facto standard format for capturing and storing log data - and for good reason. Structured logging with JSON reduces the cost and complexity of extracting valuable insights from log data.

While plain text logs can support use cases like application performance monitoring and security analysis, writing logs in an unstructured format makes it time-consuming and costly for data engineers to parse, index, and analyze those logs. With JSON, logs are parsed automatically and stored in a structured format that can be converted into a table (tabularized) to support querying and analytics applications.

But despite the benefits of storing logs as JSON, data engineers still encounter technical challenges and trade-offs when indexing and querying JSON logs with complex data structures (e.g. multiple nesting levels, nested arrays, and sibling arrays).

To eliminate those trade-offs, ChaosSearch has introduced JSON Flex®, a new platform capability that unlocks the full potential of JSON logging.

In this blog post, you’ll discover how analysis of custom and nested JSON poses a challenge for data engineers, how enterprises are solving this problem today, and why JSON Flex is the new best way to index and query JSON logs.

 


 

What is the JSON Analysis Challenge?

As JSON logs increase in complexity — with nested objects, nested arrays, and multiple nested levels — fully indexing those logs results in either:

  1. Row Explosion, which increases index time and leads to prohibitively high data storage costs, or
  2. Column Explosion, which makes it increasingly difficult and time-consuming to write clear queries that generate useful insights.

To understand why this is the case, let’s explain the concept of JSON Flattening.

 

What is JSON Flattening?

To make JSON logs available for querying and analytics, engineers convert the logs into a table. But when logs contain nested objects, they must be transformed into a flat data structure before they can be tabularized. This process is known as JSON flattening.

JSON flattening transforms JSON logs with nested objects and arrays into a tabular format, but it also results in explosive data growth and unwieldy queries that make it difficult to analyze the data.

To illustrate why, we’ll show you a series of examples detailing exactly what JSON flattening is doing with nested JSON objects and arrays.

Consider the following simple, inherently flat JSON object structure representing employees:

 

[Image: JSON objects for two employees, each with first_name and last_name fields]

 

Here’s what it would look like if we took the JSON example shown above and indexed it into a tabular format:

 

| first_name | last_name |
|------------|-----------|
| John       | Smith     |
| Sally      | Walker    |

 

The flat data structure yields a simple employee index where the field names from JSON objects are conceptually translated into columns of a table and the field values are translated into row data.
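As a minimal sketch of this translation, the Python below turns flat JSON objects into column names and row data. The input is a hypothetical reconstruction from the table above, and `tabularize` is an illustrative helper name, not part of any product.

```python
import json

# Hypothetical input, reconstructed from the table above.
raw = """
[
  {"first_name": "John",  "last_name": "Smith"},
  {"first_name": "Sally", "last_name": "Walker"}
]
"""

def tabularize(records):
    """Return (columns, rows) for a list of flat JSON objects."""
    columns = []
    for record in records:
        for key in record:
            if key not in columns:  # keep first-seen column order
                columns.append(key)
    # Field values become row data; missing fields become None.
    rows = [[record.get(column) for column in columns] for record in records]
    return columns, rows

columns, rows = tabularize(json.loads(raw))
print(columns)
for row in rows:
    print(row)
```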

Now let’s see what happens when we encounter nested fields. Consider the following example of employee records with multiple layers of nested fields:

 

[Image: JSON employee records with nested employee and address objects]

 

When we encounter nested JSON objects in logs, we can use JSON flattening techniques to convert the data into a tabular format. Nested fields will be represented as columns within the same physical row, and will be prefixed with the parent JSON object to clarify their identity.

Here’s how the data in the above example would look after JSON flattening:

 

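| employee.first_name | employee.last_name | employee.address.city | employee.address.street |
|---------------------|--------------------|-----------------------|-------------------------|
| John                | Smith              | Chicago               | 2 Pine St.              |
| Sally               | Walker             | Seattle               | 1 Oak St.               |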

 

Because a nested object contributes exactly one value per field, every nested field can be flattened horizontally into its own column. We still have a manageable number of columns and rows in our data representation relative to the raw data size.
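The prefixing rule can be sketched as a short recursive function. This is an illustrative sketch that handles nested objects only (no arrays). The record is a hypothetical reconstruction from the table above, and `flatten_objects` is an assumed helper name.

```python
def flatten_objects(obj, prefix=""):
    """Flatten nested JSON objects into one row keyed by dotted paths."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Nested object: recurse, carrying the parent path as a prefix.
            flat.update(flatten_objects(value, path))
        else:
            flat[path] = value
    return flat

# Hypothetical record, reconstructed from the table above.
record = {
    "employee": {
        "first_name": "John",
        "last_name": "Smith",
        "address": {"city": "Chicago", "street": "2 Pine St."},
    }
}

flat = flatten_objects(record)
for column, value in flat.items():
    print(column, "=", value)
```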

Now let’s see what happens when we add an array of phone numbers to our JSON log and try to flatten the data into a table. Consider the following example:

 

[Image: JSON employee record with a phone_numbers array]

 

In this example, John’s employment record has three phone numbers associated with it. If we want to associate each of John’s phone numbers with all other columns in the flattened row representation, we must choose between two approaches: vertical flattening and horizontal flattening.

If we choose vertical flattening, each element in the phone_numbers array will be represented as a separate row:

 

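| employee.first_name | employee.last_name | employee.address.city | employee.address.street | employee.phone_numbers |
|---------------------|--------------------|-----------------------|-------------------------|------------------------|
| John                | Smith              | Chicago               | 2 Pine St.              | 5551112222             |
| John                | Smith              | Chicago               | 2 Pine St.              | 5553334444             |
| John                | Smith              | Chicago               | 2 Pine St.              | 5556667777             |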

 

If we choose horizontal flattening, our data representation will include an additional column for each value in the phone_numbers array:

 

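| employee.first_name | employee.last_name | employee.address.city | employee.address.street | employee.phone_numbers.0 | employee.phone_numbers.1 | employee.phone_numbers.2 |
|---------------------|--------------------|-----------------------|-------------------------|--------------------------|--------------------------|--------------------------|
| John                | Smith              | Chicago               | 2 Pine St.              | 5551112222               | 5553334444               | 5556667777               |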

 

Vertical and horizontal JSON flattening each have their own advantages and disadvantages at the time of indexing and at query-time.

Vertical flattening is necessary to perform aggregations over values in an array, but also means that individual JSON objects will be seen as multiple rows. Row explosion increases the size of the data, increasing index time and multiplying data storage costs.

Queries are simpler when vertical flattening is used, but the row expansion that occurs when flattening sibling arrays can produce misleading aggregation results, because each array value is repeated once for every combination of its siblings.

Horizontal flattening results in both faster indexing and faster queries, but leads to column explosion that makes it increasingly difficult to write clear and constructive queries without using wildcard naming.
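To make the two approaches concrete, here is a small sketch that applies each one to a single array column. The function names are illustrative assumptions, and the row is taken from the phone number example above.

```python
def flatten_vertical(row, array_column):
    """One output row per array element; every other column is repeated."""
    return [{**row, array_column: element} for element in row[array_column]]

def flatten_horizontal(row, array_column):
    """One output row; each array element gets its own indexed column."""
    out = {key: value for key, value in row.items() if key != array_column}
    for index, element in enumerate(row[array_column]):
        out[f"{array_column}.{index}"] = element
    return out

john = {
    "employee.first_name": "John",
    "employee.last_name": "Smith",
    "employee.address.city": "Chicago",
    "employee.address.street": "2 Pine St.",
    "employee.phone_numbers": ["5551112222", "5553334444", "5556667777"],
}

vertical = flatten_vertical(john, "employee.phone_numbers")
horizontal = flatten_horizontal(john, "employee.phone_numbers")

print(len(vertical), "rows of", len(vertical[0]), "columns")
print(1, "row of", len(horizontal), "columns")
```

Vertical flattening keeps the column count fixed and grows the row count with the array length. Horizontal flattening does the opposite.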

As JSON logs get more complicated, the negative effects of JSON flattening also increase. To illustrate how, let’s extend our JSON flattening example to include a sibling array:


[Image: JSON employee record with sibling addresses and phone_numbers arrays]

 

Now, our JSON example has an array of phone numbers and an array of addresses at the same nesting level. How will these sibling arrays look when we flatten them? Again, we’ll need to choose between the vertical and horizontal JSON flattening approaches.

Here’s what a vertical flattening approach would look like:

 

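| employee.first_name | employee.last_name | employee.addresses.city | employee.addresses.street | employee.phone_numbers |
|---------------------|--------------------|-------------------------|---------------------------|------------------------|
| John                | Smith              | Chicago                 | 2 Pine St.                | 5551112222             |
| John                | Smith              | Chicago                 | 2 Pine St.                | 5553334444             |
| John                | Smith              | Chicago                 | 2 Pine St.                | 5556667777             |
| John                | Smith              | Boston                  | 3 Willow St.              | 5551112222             |
| John                | Smith              | Boston                  | 3 Willow St.              | 5553334444             |
| John                | Smith              | Boston                  | 3 Willow St.              | 5556667777             |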

 

We now begin to see the concerns of representing data with a vertical flattening approach. As sibling arrays are encountered in the data, we will end up increasing the number of rows so that all possible combinations can later be associated and queried.

What started as a single JSON object representing a single employee has turned into 6 rows of data. Furthermore, adding a 4th phone number and a 3rd address would expand this out to 12 rows (4 × 3). Adding another sibling array to represent employee children would further expand the row count, and so on.
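The row counts follow from a cross product: vertical flattening emits one row per combination of sibling-array elements. The sketch below reproduces the 6-row result and shows how a naive count over the flattened rows overstates the number of phone numbers. The record is a hypothetical reconstruction from the table above, and `vertical_rows` is an illustrative name.

```python
from itertools import product
from math import prod

# Hypothetical record, reconstructed from the table above.
employee = {
    "first_name": "John",
    "last_name": "Smith",
    "addresses": [
        {"city": "Chicago", "street": "2 Pine St."},
        {"city": "Boston", "street": "3 Willow St."},
    ],
    "phone_numbers": ["5551112222", "5553334444", "5556667777"],
}

def vertical_rows(record, array_fields):
    """Vertically flatten: one row per combination of sibling-array elements."""
    scalars = {k: v for k, v in record.items() if k not in array_fields}
    combos = product(*(record[field] for field in array_fields))
    return [{**scalars, **dict(zip(array_fields, combo))} for combo in combos]

rows = vertical_rows(employee, ["addresses", "phone_numbers"])
row_total = len(rows)                                          # 2 addresses x 3 phones
naive_phone_count = len([row["phone_numbers"] for row in rows])
distinct_phones = len({row["phone_numbers"] for row in rows})
projected = prod([3, 4])                                       # 3 addresses x 4 phones

print(row_total, naive_phone_count, distinct_phones, projected)
```

Counting rows reports 6 phone numbers where John only has 3, which is the kind of misleading aggregate that sibling-array expansion produces unless the query deduplicates.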

To avoid row explosion, we might try a horizontal flattening approach instead:

 

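| employee.first_name | employee.last_name | employee.addresses.0.city | employee.addresses.0.street | employee.addresses.1.city | employee.addresses.1.street | employee.phone_numbers.0 | employee.phone_numbers.1 | employee.phone_numbers.2 |
|---------------------|--------------------|---------------------------|-----------------------------|---------------------------|-----------------------------|--------------------------|--------------------------|--------------------------|
| John                | Smith              | Chicago                   | 2 Pine St.                  | Boston                    | 3 Willow St.                | 5551112222               | 5553334444               | 5556667777               |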

 

Horizontal flattening has a distinct advantage in that we only ever see John as a single row in our dataset, but we’re still running into column explosion that results in unwieldy queries as we try to find all addresses and/or phone numbers for John.
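To see why such queries get unwieldy, consider collecting all of John's phone numbers or cities from the wide row. Because the array index is baked into each column name, the query has to enumerate every column or fall back on a wildcard. The sketch below uses Python's standard `fnmatch` module for the wildcard match. The `select` helper is an illustrative assumption, not a query language feature.

```python
from fnmatch import fnmatchcase

# The horizontally flattened row from the table above.
row = {
    "employee.first_name": "John",
    "employee.last_name": "Smith",
    "employee.addresses.0.city": "Chicago",
    "employee.addresses.0.street": "2 Pine St.",
    "employee.addresses.1.city": "Boston",
    "employee.addresses.1.street": "3 Willow St.",
    "employee.phone_numbers.0": "5551112222",
    "employee.phone_numbers.1": "5553334444",
    "employee.phone_numbers.2": "5556667777",
}

def select(row, pattern):
    """Return the values of every column whose name matches the wildcard."""
    return [value for column, value in row.items() if fnmatchcase(column, pattern)]

phones = select(row, "employee.phone_numbers.*")
cities = select(row, "employee.addresses.*.city")

print(phones)
print(cities)
```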

As a final example of how JSON flattening leads to trade-offs between explosive data growth and complex querying, let’s see what happens when we use JSON flattening to tabularize an array nested inside another array:

 

[Image: JSON employee record with a children array, each child holding a phone_numbers array]

 

Each employee now has an array of children, who in turn each have an array of phone numbers. As we can see, our simple employee record is now becoming very difficult to represent in tabular format. Here’s what happens when we use a vertical JSON flattening approach:

 

| employee.first_name | employee.last_name | employee.addresses.city | employee.addresses.street | employee.phone_numbers | employee.children.name | employee.children.phone_numbers |
|---------------------|--------------------|-------------------------|---------------------------|------------------------|------------------------|---------------------------------|
| John                | Smith              | Chicago                 | 2 Pine St.                | 5551112222             | Ted                    | 5557777777                      |
| John                | Smith              | Chicago                 | 2 Pine St.                | 5551112222             | Ted                    | 5558888888                      |
| John                | Smith              | Chicago                 | 2 Pine St.                | 5551112222             | Marie                  | 5559999999                      |
| John                | Smith              | Chicago                 | 2 Pine St.                | 5553334444             | Ted                    | 5557777777                      |
| John                | Smith              | Chicago                 | 2 Pine St.                | 5553334444             | Ted                    | 5558888888                      |
| John                | Smith              | Chicago                 | 2 Pine St.                | 5553334444             | Marie                  | 5559999999                      |
| John                | Smith              | Chicago                 | 2 Pine St.                | 5555556666             | Ted                    | 5557777777                      |
| John                | Smith              | Chicago                 | 2 Pine St.                | 5555556666             | Ted                    | 5558888888                      |
| John                | Smith              | Chicago                 | 2 Pine St.                | 5555556666             | Marie                  | 5559999999                      |
| John                | Smith              | Boston                  | 3 Willow St.              | 5551112222             | Ted                    | 5557777777                      |
| John                | Smith              | Boston                  | 3 Willow St.              | 5551112222             | Ted                    | 5558888888                      |
| John                | Smith              | Boston                  | 3 Willow St.              | 5551112222             | Marie                  | 5559999999                      |
| John                | Smith              | Boston                  | 3 Willow St.              | 5553334444             | Ted                    | 5557777777                      |
| John                | Smith              | Boston                  | 3 Willow St.              | 5553334444             | Ted                    | 5558888888                      |
| John                | Smith              | Boston                  | 3 Willow St.              | 5553334444             | Marie                  | 5559999999                      |
| John                | Smith              | Boston                  | 3 Willow St.              | 5555556666             | Ted                    | 5557777777                      |
| John                | Smith              | Boston                  | 3 Willow St.              | 5555556666             | Ted                    | 5558888888                      |
| John                | Smith              | Boston                  | 3 Willow St.              | 5555556666             | Marie                  | 5559999999                      |

 

As our JSON example increases in complexity, a vertical JSON flattening approach begins to rapidly increase the number of rows in our data table.

Here’s what happens when we use horizontal JSON flattening instead:

 

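| employee.first_name | employee.last_name | employee.addresses.0.city | employee.addresses.0.street | employee.addresses.1.city | employee.addresses.1.street | employee.phone_numbers.0 | employee.phone_numbers.1 | employee.phone_numbers.2 | employee.children.0.name | employee.children.0.phone_numbers.0 | employee.children.0.phone_numbers.1 | employee.children.1.name | employee.children.1.phone_numbers.0 |
|---------------------|--------------------|---------------------------|-----------------------------|---------------------------|-----------------------------|--------------------------|--------------------------|--------------------------|--------------------------|-------------------------------------|-------------------------------------|--------------------------|-------------------------------------|
| John                | Smith              | Chicago                   | 2 Pine St.                  | Boston                    | 3 Willow St.                | 5551112222               | 5553334444               | 5555556666               | Ted                      | 5557777777                          | 5558888888                          | Marie                    | 5559999999                          |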

 

As our JSON example increases in complexity, a horizontal flattening approach begins to widen the table, resulting in column explosion and awkward field names that lead to confusing queries.
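Both growth patterns can be computed directly from the shape of the record. In the sketch below, sibling fields multiply the vertical row count, array elements add to it, and every scalar leaf adds one horizontal column. The record is a hypothetical reconstruction from the two tables above, and both function names are illustrative.

```python
from math import prod

def vertical_row_count(value):
    """Rows produced by fully vertical flattening of a JSON value."""
    if isinstance(value, dict):
        # Sibling fields combine with each other: counts multiply.
        return prod(vertical_row_count(v) for v in value.values())
    if isinstance(value, list):
        # Array elements are alternatives: counts add (empty array -> one row).
        return sum(vertical_row_count(v) for v in value) or 1
    return 1

def horizontal_column_count(value):
    """Columns produced by fully horizontal flattening of a JSON value."""
    if isinstance(value, dict):
        return sum(horizontal_column_count(v) for v in value.values())
    if isinstance(value, list):
        return sum(horizontal_column_count(v) for v in value)
    return 1  # each scalar leaf becomes one column

# Hypothetical record, reconstructed from the tables above.
employee = {
    "first_name": "John",
    "last_name": "Smith",
    "addresses": [
        {"city": "Chicago", "street": "2 Pine St."},
        {"city": "Boston", "street": "3 Willow St."},
    ],
    "phone_numbers": ["5551112222", "5553334444", "5555556666"],
    "children": [
        {"name": "Ted", "phone_numbers": ["5557777777", "5558888888"]},
        {"name": "Marie", "phone_numbers": ["5559999999"]},
    ],
}

row_count = vertical_row_count(employee)          # 2 x 3 x (2 + 1)
column_count = horizontal_column_count(employee)  # 2 + 4 + 3 + 5

print(row_count, "rows when flattened vertically")
print(column_count, "columns when flattened horizontally")
```

For this one employee, the counts match the 18-row and 14-column tables above.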

 

Summarizing the Nested JSON Analysis Challenge

Indexing JSON logs with nested objects and arrays requires data engineers to flatten the JSON files. But JSON flattening results in either an explosion in database size or the necessity of writing complex queries to get value from the data. The alternative is to treat the JSON objects as strings and miss out on valuable insights.

 

How Do Enterprises Analyze Nested JSON Today?

Enterprise DevOps, SecOps, and CloudOps teams that utilize JSON logging regularly encounter the nested JSON analysis problem when dealing with complicated JSON logs.

Here’s an example of a typical AWS CloudTrail log that a SecOps team might want to analyze:

 

[Image: AWS CloudTrail log]

 

AWS CloudTrail monitors account activity on AWS infrastructure and writes logs in structured JSON format. These logs contain a wealth of information that can be analyzed to gain insights into system security and user behavior.

SecOps teams might want to query AWS CloudTrail logs in order to:

  1. Check the source IP address of authenticated users with root privileges vs. blacklisted IPs (1, 3, 6).
  2. Check what that user did (5, 7, 8, 9).
  3. Check any other anomalous behavior in the system around the same time (4).
  4. Check what other things the root user and the new user accessed (2, 8).
  5. Check what other things the same user accessed around the same timeframe (2, 4).

The ideal approach would be to fully index these logs, but AWS CloudTrail logs are complex, with nested objects and arrays in multiple levels. Indexing these logs will require JSON flattening, which means either blowing up the size of the data (row explosion) or making the index harder to query (column explosion).

To avoid explosive data growth and complex queries, enterprises instead choose one of two alternative approaches to analyzing nested JSON:

  1. Data Engineering Approach: Instead of fully indexing complex JSON logs, data engineers can create dedicated data pipelines that pre-process the log data and store only the structured data that is relevant for known analyses.

    To prepare a JSON pipeline, data engineers must know the relevant fields for analytics in advance, and anything non-essential will be discarded. In the above example, each question the SecOps team wants to ask requires a dedicated pipeline. That’s a lot of time and work required of DataOps teams. And the need to pre-configure data for analytics means that valuable insights are lost.
  2. Point Search Approach: Point searches allow SecOps teams to search for specific information in a specific time period within JSON logs, but they’re only useful if you know what you’re looking for and where to look for it.

    A point search approach might answer the immediate query, but wouldn’t be able to answer the 3rd question above and look for other anomalous behavior in the system around the same time. Just like the data engineering approach, point search reduces the volume of data analyzed and results in lost insights.
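The cost of the data engineering approach can be illustrated with a toy pipeline that keeps only a pre-selected set of fields. The field names are standard CloudTrail record fields, but the event values, the `KEEP` selection, and the `pipeline` function are hypothetical and chosen only for illustration.

```python
# Fields chosen in advance for one known analysis (hypothetical selection).
KEEP = ("eventTime", "eventName", "sourceIPAddress")

def pipeline(event):
    """Pre-process an event: store only the fields the analysis is known to need."""
    return {field: event.get(field) for field in KEEP}

# Made-up sample event; the values are placeholders, not real log data.
event = {
    "eventTime": "2023-01-01T00:00:00Z",
    "eventName": "ConsoleLogin",
    "sourceIPAddress": "203.0.113.10",
    "userIdentity": {"type": "Root"},
    "requestParameters": {"userName": "new-user"},
}

stored = pipeline(event)
lost = sorted(set(event) - set(stored))

print("stored:", list(stored))
print("discarded:", lost)
```

Any later question that depends on a discarded field, such as who the user was, cannot be answered without building another pipeline and re-processing the raw logs.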

Whether choosing the data engineering approach or a point search approach to analyze nested JSON, the end result is the same: lost insights that prevent organizations from extracting the full value of structured JSON logs.

Thankfully, there is now a new approach.

 

ChaosSearch Solves the Nested JSON Analysis Challenge

Introducing JSON Flex, a ChaosSearch proprietary technology that solves the Nested JSON Analysis Problem.

JSON Flex allows customers to store raw JSON and analyze it as if it were structured at different nested levels — with no data explosion, no complex and unwieldy queries, and no lost insights. Our approach is to maintain the smallest possible data representation at index time, while allowing our users full customization at query time to materialize the data.

 

Watch: Unlock JSON Files for Analytics at Scale in ChaosSearch

 

Here’s how it works:

Chaos Index® detects and indexes JSON automatically, without any configuration from the user. Our proprietary index format supports multi-model data access (search, SQL, and ML) in one representation with unparalleled compression ratios of up to 95%, while maintaining performance.

As a result, our users can store all of their JSON logs in full native format with no costly data explosion and without losing fidelity of insights.

For users who may not want to index every field in their JSON logs, we’ve introduced two new functions that make it easy to customize what gets indexed:

  1. JSON Include/Exclude - Users can specify a blacklist and/or a whitelist to easily configure which fields will be indexed and which will be excluded.
  2. JSON Nesting Levels - Users can reduce the combination of nested array expansions to omit irrelevant data and optimize their JSON index for downstream querying.
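As a conceptual illustration only, include/exclude filtering can be pictured as pattern matching over flattened field paths. The sketch below is not the ChaosSearch API or configuration syntax. The `select_fields` function, its parameters, and the wildcard patterns are all assumptions made for the example.

```python
from fnmatch import fnmatchcase

def select_fields(flat_record, include=("*",), exclude=()):
    """Keep paths matching any include pattern and no exclude pattern."""
    def wanted(path):
        included = any(fnmatchcase(path, pattern) for pattern in include)
        excluded = any(fnmatchcase(path, pattern) for pattern in exclude)
        return included and not excluded
    return {path: value for path, value in flat_record.items() if wanted(path)}

flat = {
    "employee.first_name": "John",
    "employee.last_name": "Smith",
    "employee.address.city": "Chicago",
    "employee.address.street": "2 Pine St.",
}

indexed = select_fields(
    flat,
    include=("employee.*",),          # hypothetical whitelist
    exclude=("employee.address.*",),  # hypothetical blacklist
)

print(indexed)
```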

After indexing JSON logs with Chaos Index, users can explore their log data by creating dynamic virtual views in Chaos Refinery®.

With Chaos Refinery, we give our users complete flexibility to easily customize index views in whatever representation they choose, even switching between vertically and horizontally flattened views depending on what makes sense for each query.

ChaosSearch Array Shaping even allows users in Chaos Refinery to choose vertical flattening for arrays that are relevant to a specific query while keeping the rest in horizontal format. These decisions are maintained in ChaosSearch’s lightweight views representation and materialized at query-time with no need for re-indexing.
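The idea of deciding the shape per query can be sketched generically: keep the raw record as stored, then expand only the arrays a given query cares about into rows and leave the rest as indexed columns. This is a conceptual illustration, not how Chaos Refinery is implemented. The `shape` function and the simplified record are assumptions.

```python
from itertools import product

def shape(record, vertical=()):
    """Materialize rows from a raw record at query time.

    Arrays named in `vertical` become one row per element; every other
    array becomes a set of indexed columns in each row.
    """
    base = {}
    for key, value in record.items():
        if key in vertical:
            continue
        if isinstance(value, list):
            for index, element in enumerate(value):
                base[f"{key}.{index}"] = element
        else:
            base[key] = value
    combos = product(*(record[key] for key in vertical))
    return [{**base, **dict(zip(vertical, combo))} for combo in combos]

# Simplified, hypothetical record stored once in raw form.
raw = {
    "first_name": "John",
    "cities": ["Chicago", "Boston"],
    "phone_numbers": ["5551112222", "5553334444", "5556667777"],
}

by_phone = shape(raw, vertical=("phone_numbers",))  # aggregate over phone numbers
wide = shape(raw)                                   # everything horizontal

print(len(by_phone), "rows,", len(by_phone[0]), "columns")
print(len(wide), "row,", len(wide[0]), "columns")
```

The stored record never changes. Only the view computed from it does, which is why switching shapes in this sketch needs no re-indexing step.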

 

[Image: JSON array transformation with ChaosSearch Array Shaping in Chaos Refinery]

 

With JSON Flex, ChaosSearch users can fully index and analyze even the most complex JSON logs with no explosive data growth, no unwieldy queries, no trade-offs, and no lost insights.

 

Unlock the Full Potential of JSON Logging

You’re just minutes away from experiencing the seamless flexibility of JSON Flex and unlocking the full potential of structured JSON logging. Here’s what to do next:

  1. Register for our free trial experience
  2. Start landing JSON structured logs in your own cloud storage buckets
  3. Use ChaosSearch to index and analyze structured JSON logs at scale with no data explosions, no data movement, no re-indexing, no trade-offs, and no compromises.

Ready to get started?

Start My Free Trial

 

Additional Resources

Watch the Video: Why and How Log Analytics Makes Cloud Operations Smarter

Read the Blog: Troubleshooting Cloud Services and Infrastructure with Log Analytics

Download the White Paper: Ultimate Guide to Log Analytics: 5 Criteria to Evaluate Tools

About the Author, George Hamilton

As the director of product marketing, George leads product positioning, messaging, and go-to-market strategy for new and existing ChaosSearch offerings. Prior to ChaosSearch, George led product marketing for CloudHealth by VMware’s cloud management platform. George has also worked at several Boston-area startups, led product marketing for Dell EMC’s object storage, and was an industry analyst focused on cloud computing and IT management software.