In the ChaosSearch era we solved the problem in 3 minutes

July 21, 2020


In the late afternoon of a beautiful June day, a last-minute customer error was threatening to derail the end of my work day. We received a message from the customer simply titled, “There’s a weird error message happening.”

 

The customer included a screenshot of the interface with a nasty SQL error that had propagated to the surface. This was concerning for two reasons: we specifically design our systems to hide errors like that from users, and it wasn’t one that we had seen before.

 

At Transeo, we are focused on helping schools track and report community service and workplace learning experiences without the paper trail. Our users are non-technical, and the user experience is a central part of our mission. A raw error like the one this user saw on that June day is something we strive to avoid. We now know that our choice to use ChaosSearch to make all of the log data in our S3 buckets hot, searchable data was the right one.

 

Thinking back to what our workflow would have looked like prior to ChaosSearch, it’s hard to imagine how we would have gone about diagnosing this issue, much less solving it. Most likely I would have asked the customer to repeat the action they took, streamed the logs live, and hoped that the bug happened again and that I could parse out the full error message from the firehose of data whizzing by.

 

In the ChaosSearch era of Transeo we solved the problem in three minutes. 

 

After receiving the request from the customer, our team jumped into Bugsnag, where we quickly found the details of the HTTP request along with a crucial component: the request ID. This customer issue was shaping up to be a perfect test of the reference-based logging system we had recently built. Just days prior to this specific error, we deployed a new logging pipeline that not only logged our standard request and error messages, but also dumped the database queries for a particular request when there was an error. Alongside those queries was the unique request ID sent to Bugsnag for future analysis.
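To make the idea of reference-based logging concrete, here is a minimal Python sketch of the pattern described above. It is not Transeo's actual pipeline: the class `RequestLogContext`, the function `handle_request`, and the JSON field names (`request_id`, `error`, `queries`) are all hypothetical, chosen only for illustration. The sketch collects the SQL issued during one request and writes it out, tagged with the request ID, only when the request fails.

```python
import json
import logging
import uuid


class RequestLogContext:
    """Collects the SQL statements issued while handling one request.

    Hypothetical helper for illustration; not an API of any real library.
    """

    def __init__(self, request_id=None):
        # Reuse an upstream request ID if one exists, otherwise mint one.
        self.request_id = request_id or uuid.uuid4().hex
        self.queries = []

    def record_query(self, sql):
        # Called by the data layer each time a query is executed.
        self.queries.append(sql)

    def error_record(self, exc):
        # One structured record tying the error to its queries and request ID.
        return {
            "request_id": self.request_id,
            "level": "error",
            "error": str(exc),
            "queries": list(self.queries),
        }


def handle_request(ctx, handler, logger):
    """Run a handler; on failure, dump the buffered queries, then re-raise."""
    try:
        return handler(ctx)
    except Exception as exc:
        logger.error(json.dumps(ctx.error_record(exc)))
        raise
```

In a setup like this, the same `request_id` would also be attached to the error report sent to the error tracker, so the ID found there can be used to look up the matching log record later.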

 

Once we had the request ID, it was as simple as running a pre-built query in the ChaosSearch dashboard. Within seconds we had the exact breadcrumb trail of which actions the user took and which database queries they ran. The issue was simple to fix from there, and we had a new deploy up and running in 20 minutes.

 

Not only were we able to query our logs against the request ID, but because the user-facing error message had the SQL syntax in it, we were able to add an additional filter to search just for logs in that particular sequence. This simply wouldn’t have been possible using standard grepping or simple query tools.
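The logic of that combined filter can be sketched in a few lines of Python. This is an illustration of the filtering idea only, run over a small in-memory list of JSON log lines; it is not ChaosSearch query syntax, and the function name `find_logs` and the fields `request_id` and `queries` are assumptions made for the example.

```python
import json


def find_logs(lines, request_id, sql_fragment=None):
    """Return parsed log records matching a request ID and, optionally,
    containing a given SQL fragment in one of their recorded queries.

    Hypothetical helper; field names are assumed for illustration.
    """
    matches = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not structured JSON
        if not isinstance(record, dict):
            continue
        if record.get("request_id") != request_id:
            continue
        if sql_fragment is not None:
            queries = record.get("queries", [])
            if not any(sql_fragment in query for query in queries):
                continue
        matches.append(record)
    return matches
```

Calling `find_logs(lines, "req-123")` returns every structured record for that request, and passing a fragment copied from the user-facing error as `sql_fragment` narrows the result to the records whose queries contain it.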

 

I struggle to think what this process would have looked like if we didn’t have ChaosSearch in place; although it’s hard to know for sure, I’m positive it would have taken more than three minutes of my time.