How Blackboard Pivoted Their Log Analytics Approach When the World Went Virtual


Log Analytics at Blackboard

Committed to supporting learners throughout their lifelong journey, Blackboard Inc. is a world leader in education technology, serving higher education, K-12, business and government clients around the world with learning management systems (LMS) and other digital education products.

The world has changed dramatically recently — particularly since 2020 — and Blackboard continues pushing the boundaries of digital learning by investing in innovations that will propel them, and the education industry at large, forward faster. For the purposes of this case study, we will focus on how the Blackboard Product Development team is driving innovation in their approach to log analytics.

Blackboard’s flagship product, Blackboard Learn, is a web-based LMS that can be installed on local servers, hosted by Blackboard application service provider (ASP) solutions, or provided as software-as-a-service (SaaS). Acting as both an ASP and a SaaS provider, Blackboard uses log analytics to monitor cloud deployments, troubleshoot application issues, maximize uptime, and deliver on data integrity and reporting obligations for customers.

When COVID-19 hit, millions of students switched to online learning models, and Blackboard saw daily log ingest volumes skyrocket overnight. By migrating an overworked ELK stack to ChaosSearch, Blackboard was able to manage a 3,000%+ increase in product usage while growing their business amidst the pandemic.

 

The Challenge: Make ELK Work at Scale

Before ChaosSearch, Blackboard’s Product, SRE, DevOps and Support teams depended on a combination of custom-managed ELK (Elasticsearch, Logstash, Kibana) stacks and managed Elasticsearch service offerings for centralized log management (CLM). But growing daily log volumes and unpredictable spikes were causing pain: the spikes would take the ELK stack down, making it unusable at times, while management and data storage costs grew.

 

Goal: Reduce Management Overhead and Complexity

The Blackboard Learn team in particular had run their own ELK clusters and relied on commercially managed services, but each came with its own set of management challenges. As the product’s customer base and daily log ingest grew, the environment became unstable and difficult to scale. Blackboard found that even with a third party managing the environment, the same challenges remained as a result of running Elasticsearch under the hood:

  • Elasticsearch inherently presents architectural challenges at scale, potentially leading to yellow or red status clusters requiring support. Regardless of who is providing that support (Blackboard vs. the service provider), it’s a headache to manage.
  • Elasticsearch requires the customer to handle capacity planning, which is hard to do for log data because spikes are hard to predict. Elasticsearch customers are stuck either over-paying to ensure uptime, or dealing with cluster constraints in an attempt to keep costs in line.

Blackboard wanted staff to focus on the core product rather than on managing infrastructure. When demand increased by up to 3,000% across their portfolio and an avalanche of data came streaming in, they struggled to keep the proper balance. Simple changes and basic maintenance tasks to ensure availability and performance were eating up 10 to 15 hours of valuable SRE time every week.

 

Goal: Cut Costs & Un-limit Data Retention

Blackboard needed to analyze short-term data for troubleshooting, real-time alerts, customer support, etc., but they also needed access to long-term data for deeper analyses and compliance purposes.

Because Elasticsearch costs balloon at scale and the environment is so complex to manage, Blackboard opted to reduce log data retention in Elasticsearch to 30 days. Longer-term log data was shipped to Amazon S3 cloud object storage. As a result, they had to deal with:

  • Limited insights — Only the most recent time slice of log data was available in Elasticsearch for log analytics.
  • Unnecessary data duplication — Short-term data was captured in two places, which is messy and costs extra.
  • Data integrity risk — With data captured in two places, they had to be careful to make sure logs were reaching each destination in their entirety.
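
To illustrate the data integrity risk in the last point, here is a minimal Python sketch of the kind of reconciliation check a dual-destination pipeline calls for. It is not Blackboard's tooling: the function name, the batch identifiers, and the approach of comparing per-batch record counts are all assumptions made for illustration.

```python
def find_delivery_gaps(
    s3_counts: dict[str, int], es_counts: dict[str, int]
) -> dict[str, tuple[int, int]]:
    """Return batches whose record counts differ between the two destinations.

    Both arguments map a (hypothetical) batch id to the number of log records
    that arrived at that destination. A batch missing from one side counts as 0.
    The result maps batch id -> (count in S3, count in Elasticsearch).
    """
    gaps = {}
    for batch_id in sorted(set(s3_counts) | set(es_counts)):
        pair = (s3_counts.get(batch_id, 0), es_counts.get(batch_id, 0))
        if pair[0] != pair[1]:
            gaps[batch_id] = pair
    return gaps


# Hypothetical counts: one batch arrived short in Elasticsearch, one never arrived.
s3 = {"batch-001": 500, "batch-002": 500, "batch-003": 250}
es = {"batch-001": 500, "batch-002": 480}
gaps = find_delivery_gaps(s3, es)
```

With a single destination there is nothing to reconcile, so a check of this kind is no longer needed.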

“We would typically send logs to two outputs,” explained Joel Snook, Blackboard’s senior director of DevOps Engineering. “We'd send them to S3, and then also Elasticsearch. And Elasticsearch would expire data out because it's expensive to hold it on that tiered data set. We were storing the data twice.”

 

The Tipping Point

Blackboard needed a new approach to log analytics that would meet specific criteria:

  • Alleviate the pain of constantly requiring valuable resources to manage ELK cluster deployments and address outages.
  • Expand data retention limits for longer-term and more comprehensive insights, while staying within budget.
  • Keep end users happy with a seamless transition experience that allows users to keep working in the tools they know and trust (e.g. Kibana).

The team had been contemplating building an in-house solution when Snook discovered ChaosSearch at AWS re:Invent in 2019. He and his colleagues appreciated ChaosSearch’s ability to index and search log data directly in Amazon S3 buckets with no data movement or duplicate storage.

“We had to dismantle some of our logging just because of the volume. And then when COVID hit, it just became untenable,” said Snook. “It was awakening. We realized we really needed to change our strategy.”

 

Solution: Replace Elasticsearch with ChaosSearch

Snook explained, “We sought out ChaosSearch for a specific need — a place to aggregate our logs into a way that we could consume them and get information out of them.”

Delivered as a fully managed, highly available service, the ChaosSearch Data Lake Platform promised not to go down, and could take advantage of all the durability, scalability, cost-effectiveness, and governance controls that inherently come with running directly on S3.

The transition to ChaosSearch was quick and easy. The platform acts as a drop-in replacement for ELK, utilizing the same Elastic API and familiar Kibana interface for visualization and dashboarding. In just a short time, Blackboard was able to deploy ChaosSearch on top of its global Amazon S3 data lake and start using the platform for log analytics at unlimited scale. And the platform is GDPR compliant, which is critically important in the data-sensitive education industry.
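
Because the platform exposes the same Elastic API, an existing search request body should in principle carry over unchanged. The Python sketch below builds such a body in standard Elasticsearch query DSL for a 180-day window, well beyond the earlier 30-day retention limit. The function name, the field names (`service`, `level`, `@timestamp`), and the service name `learn-web` are hypothetical and not taken from Blackboard's deployment.

```python
from datetime import datetime, timedelta, timezone


def build_error_search(service: str, days_back: int, now: datetime) -> dict:
    """Build an Elasticsearch query DSL body for one service's error logs."""
    start = now - timedelta(days=days_back)
    return {
        "size": 100,
        "sort": [{"@timestamp": {"order": "desc"}}],
        "query": {
            "bool": {
                # Filter clauses match exactly and do not affect scoring.
                "filter": [
                    {"term": {"service": service}},  # hypothetical field
                    {"term": {"level": "ERROR"}},    # hypothetical field
                    {
                        "range": {
                            "@timestamp": {
                                "gte": start.isoformat(),
                                "lte": now.isoformat(),
                            }
                        }
                    },
                ]
            }
        },
    }


body = build_error_search(
    "learn-web", days_back=180, now=datetime(2021, 1, 1, tzinfo=timezone.utc)
)
```

A body like this would be sent to an index's `_search` endpoint by whatever Elasticsearch-compatible client or Kibana dashboard the team already uses. The endpoint URL and index name depend on the deployment and are not shown here.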

Blackboard is now using ChaosSearch to support use cases like:

  • Day-to-day visibility of cloud computing environments at scale
  • App troubleshooting and alerting over long periods of time
  • Root cause analysis without data retention limits or trade-offs
  • Resolution of application performance issues

 

The Impact: Cost-Performant Log Analytics at Scale

Adopting ChaosSearch helped the Blackboard Learn team replace their ELK stack with a more cost-efficient log analytics solution that performs at scale. Blackboard successfully reduced the total cost of ownership (TCO) of its log analytics environment, eliminated management complexity associated with maintaining Elasticsearch clusters, and overcame data retention limits that previously restricted the volume of logs available for analysis in the legacy ELK stack.

Long term, Blackboard plans to leverage ChaosSearch as a single pane of glass that sits on top of all application monitoring and observability tools used across the organization.

“We're moving into an innovation phase that's hopefully going to propel us,” said Snook. “And having a platform like ChaosSearch allows us to do that faster.”

 

Activating S3 for Analytics

With ChaosSearch, Blackboard uses Amazon S3 far more efficiently: storage that was previously used as a backup/archive now serves as a functional data lake that delivers cost-performant log analytics at scale. By combining the scalability of S3 with ChaosSearch, Blackboard can query and analyze significantly more log data than before, with long-term, flexible data retention.

“With ChaosSearch, we're storing data once, and we can index all of it and search all of it, whereas before we were limited,” explained Snook. “From our perspective, that's been a big selling point internally because you're going to get more for potentially the same amount of cost.”

Blackboard now indexes log files directly in S3 with Chaos Index® — a novel data representation that delivers both small data size and fast time-to-insights. By leveraging the data compression capabilities of Chaos Index and cost-optimized S3 storage, Blackboard can store, index, search and analyze unlimited volumes of log data.

 

A Future-Ready Platform That’s Agile & Cost-Effective

Blackboard is saving — in terms of both hosting resources and people costs — and staying agile thanks to the on-demand, cloud-native nature of ChaosSearch.

With Elasticsearch, storage and compute resources are coupled, and instances are billed for the time their compute resources run. If Blackboard wanted to do more querying or analyze longer timelines, it needed to reconfigure the cluster with additional resources and pay the associated costs. If log ingestion or querying activity spiked, Blackboard had to make trade-offs between service availability, data retention, and cost.

With ChaosSearch, Blackboard’s log analytics costs are based on the daily volume of logs they ingest. There are no extra costs for querying and no clusters to reconfigure when Blackboard’s querying needs change — even from day to day. Now, Blackboard can scale its querying activities on-demand to manage periods of high volume without any manual reconfiguration process, unpleasant trade-offs, or cost increases.

 

Simplicity Without Duplication

In the past, Blackboard pushed logs to both Amazon S3 (for cost-effective long-term storage) and Elasticsearch (for log analytics).

Now, Blackboard satisfies both its long-term data retention and short-term log analytics needs by pushing data directly into Amazon S3, where it can be indexed, queried, and analyzed at scale using ChaosSearch. Data duplication and unnecessary data storage costs have been eliminated.

“The value of ChaosSearch for us has been that we don't have to retain data. We've already retained it. It's just searching through it now,” said Snook. “And that's one of the big value-adds that we've seen from ChaosSearch vs. the traditional ELK stack.”

 

Management Headaches Resolved

Before, some engineers at Blackboard were spending as many as 10 to 15 hours per week on maintaining the performance and availability of Elasticsearch clusters.

ChaosSearch is delivered as a fully managed service with 99.999% uptime on the customer’s cloud storage environment. Site reliability engineers (SREs) have re-allocated that cluster management time toward value-add projects that impact the business, and the team no longer worries about unplanned downtime due to failed Elasticsearch clusters.

“With ChaosSearch, I’ve never called in and said, ‘Hey. Our cluster is red,’ because there's no cluster,” concluded Snook.


INDUSTRY

Education

LOCATION

Global

USE CASE(S)

  • Day-to-day visibility of cloud computing environments at scale
  • App troubleshooting and trend analysis over long periods of time
  • Root cause analysis without data retention limits or trade-offs
  • Resolution of application performance issues

IMPACT

  • Single pane of glass across all log data
  • Better cost-performance of log analytics at scale
  • No data movement or duplication 
  • SREs previously spending 10-15 hours per week managing ELK get that time back for value-add projects
  • No clusters to manage

SCALE

  • Capacity to manage 3,000%+ volume increase within first year
  • Data retention increased from 30 days to unlimited 

USERS

  • SREs & managers
  • DevOps engineers & managers
  • Support teams      

DATA MANAGEMENT ENVIRONMENT

  • Data Lake Platform: ChaosSearch
  • Cloud Object Storage: Amazon S3
  • Analytic Tool(s): Kibana

ADDITIONAL RESOURCES

  • Case study PDF
  • Press release: ChaosSearch Unveils Industry's First Multi-model, Multi-cloud Data Lake Platform for Cost-effective Analytics & BI at Scale