Committed to supporting learners throughout their lifelong journey, Blackboard Inc. is a world leader in education technology, serving higher education, K-12, business and government clients around the world with learning management systems (LMS) and other digital education products.
The world has changed dramatically in recent years, particularly since 2020, and Blackboard continues to push the boundaries of digital learning by investing in innovations that will propel the company, and the education industry at large, forward faster. For the purposes of this case study, we will focus on how the Blackboard Product Development team is driving innovation in its approach to log analytics.
Blackboard’s flagship product, Blackboard Learn, is a web-based LMS that can be installed on local servers, hosted by Blackboard application service provider (ASP) solutions, or provided as software-as-a-service (SaaS). As both an ASP and a SaaS provider, Blackboard uses log analytics to monitor cloud deployments, troubleshoot application issues, maximize uptime, and deliver on data integrity and reporting obligations for customers.
When COVID-19 hit, millions of students switched to online learning models, and Blackboard saw daily log ingest volumes skyrocket overnight. By migrating an overworked ELK stack to ChaosSearch, Blackboard was able to manage a 3,000%+ increase in product usage while growing their business amidst the pandemic.
Before ChaosSearch, Blackboard’s Product, SRE, DevOps and Support teams depended on a combination of custom-managed ELK (Elasticsearch, Logstash, Kibana) stacks and managed Elasticsearch service offerings for centralized log management (CLM). But steadily growing daily log volumes, combined with unpredictable spikes, were causing pain. Those spikes would take the ELK stack down, making it unusable at times, while management and data storage costs kept growing.
The Blackboard Learn team in particular had run their own ELK clusters and relied on commercially managed services, but each came with its own set of management challenges. As the product’s customer base and daily log ingest grew, the logging environment became unstable and difficult to scale. Blackboard found that even with a third party managing the environment, the same challenges remained as a result of running Elasticsearch under the hood:
Blackboard needed to analyze short-term data for troubleshooting, real-time alerts, customer support, etc., but they also needed access to long-term data for deeper analyses and compliance purposes.
Because Elasticsearch costs balloon at scale and the environment is so complex to manage, Blackboard opted to reduce log data retention in Elasticsearch to 30 days. Longer term log data was shipped to Amazon S3 cloud object storage. As a result, they had to deal with:
“We would typically send logs to two outputs,” explained Joel Snook, Blackboard’s senior director of DevOps Engineering. “We'd send them to S3, and then also Elasticsearch. And Elasticsearch would expire data out because it's expensive to hold it on that tiered data set. We were storing the data twice.”
Blackboard needed a new approach to log analytics that would meet specific criteria:
The team had been contemplating building an in-house solution when Snook discovered ChaosSearch at AWS re:Invent in 2019. He and his colleagues appreciated ChaosSearch’s ability to index and search log data directly in Amazon S3 buckets with no data movement or duplicate storage.
“We had to dismantle some of our logging just because of the volume. And then when COVID hit, it just became untenable,” said Snook. “It was awakening. We realized we really needed to change our strategy.”
Snook explained, “We sought out ChaosSearch for a specific need — a place to aggregate our logs into a way that we could consume them and get information out of them.”
Delivered as a fully managed, highly available service, the ChaosSearch Data Lake Platform promised not to go down, and could take advantage of all the durability, scalability, cost-effectiveness, and governance controls that inherently come with running directly on S3.
The transition to ChaosSearch was quick and easy. The platform acts as a drop-in replacement for ELK, utilizing the same Elastic API and familiar Kibana interface for visualization and dashboarding. In just a short time, Blackboard was able to deploy ChaosSearch on top of its global Amazon S3 data lake and start using the platform for log analytics at unlimited scale. And the platform is GDPR compliant, which is critically important in the data-sensitive education industry.
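To illustrate what "the same Elastic API" can mean in practice, here is a minimal sketch of an Elasticsearch-style search request body of the kind Kibana and other ELK clients typically send. It only builds and prints the JSON. The field names `@timestamp` and `level`, the helper name `build_log_query`, and the 90-day window are assumptions for illustration, not details taken from Blackboard's deployment, and no endpoint or client library is shown.

```python
import json


def build_log_query(level, lookback_days, size=50):
    """Build an Elasticsearch-style search body for recent logs at one level.

    Field names ("@timestamp", "level") are hypothetical; adjust them to
    match the fields present in your own indexed log data.
    """
    return {
        "size": size,
        # Newest events first.
        "sort": [{"@timestamp": {"order": "desc"}}],
        "query": {
            "bool": {
                # Filters do not affect scoring; they only narrow the result set.
                "filter": [
                    {"term": {"level": level}},
                    {"range": {"@timestamp": {"gte": f"now-{lookback_days}d"}}},
                ]
            }
        },
    }


# Look back 90 days, beyond the 30-day window the legacy ELK stack retained.
body = build_log_query("ERROR", lookback_days=90)
print(json.dumps(body, indent=2))
```

Because the body is plain query DSL, the same request shape could in principle be sent by any client that already speaks the Elasticsearch search API.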
Blackboard is now using ChaosSearch to support use cases like:
Adopting ChaosSearch helped the Blackboard Learn team replace their ELK stack with a more cost-efficient log analytics solution that performs at scale. Blackboard successfully reduced the total cost of ownership (TCO) of its log analytics environment, eliminated management complexity associated with maintaining Elasticsearch clusters, and overcame data retention limits that previously restricted the volume of logs available for analysis in the legacy ELK stack.
Long term, Blackboard plans to leverage ChaosSearch as a single pane of glass that sits on top of all application monitoring and observability tools used across the organization.
“We're moving into an innovation phase that's hopefully going to propel us,” said Snook. “And having a platform like ChaosSearch allows us to do that faster.”
With ChaosSearch, Blackboard now uses Amazon S3, previously just a backup and archive tier, as a functional data lake that delivers cost-performant log analytics at scale. By leveraging the scalability of S3 and ChaosSearch, Blackboard can query and analyze significantly more log data than before, with long-term, flexible data retention.
“With ChaosSearch, we're storing data once, and we can index all of it and search all of it, whereas before we were limited,” explained Snook. “From our perspective, that's been a big selling point internally because you're going to get more for potentially the same amount of cost.”
Blackboard now indexes log files directly in S3 with Chaos Index® — a novel data representation that delivers both small data size and fast time-to-insights. By leveraging the data compression capabilities of Chaos Index and cost-optimized S3 storage, Blackboard can store, index, search and analyze unlimited volumes of log data.
Blackboard is saving on both hosting resources and people costs, and staying agile thanks to the on-demand, cloud-native nature of ChaosSearch.
With Elasticsearch, storage and compute resources are coupled, and instances are billed for the time their compute resources are running. If Blackboard wanted to do more querying or analyze longer timelines, it needed to reconfigure the cluster with additional resources and pay the associated costs. If log ingestion or querying activity spiked, Blackboard had to make trade-offs between service availability, data retention, and cost.
With ChaosSearch, Blackboard’s log analytics costs are based on the daily volume of logs they ingest. There are no extra costs for querying and no clusters to reconfigure when Blackboard’s querying needs change — even from day to day. Now, Blackboard can scale its querying activities on-demand to manage periods of high volume without any manual reconfiguration process, unpleasant trade-offs, or cost increases.
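The difference between the two billing models can be sketched with a toy calculation. This is an illustrative model only, not either vendor's actual pricing: every rate, capacity figure, and function name below is a hypothetical assumption chosen to show how a cluster-based bill reacts to query load while an ingest-based bill does not.

```python
import math


def cluster_cost(daily_ingest_gb, retention_days, queries_per_day,
                 gb_per_node, queries_per_node, node_hourly_rate, days=30):
    """Toy coupled model: nodes are sized for whichever is larger,
    storage demand or query demand, and billed per node-hour."""
    storage_nodes = math.ceil(daily_ingest_gb * retention_days / gb_per_node)
    query_nodes = math.ceil(queries_per_day / queries_per_node)
    nodes = max(storage_nodes, query_nodes, 1)
    return nodes * node_hourly_rate * 24 * days


def ingest_cost(daily_ingest_gb, price_per_gb, days=30):
    """Toy ingest-based model: cost depends only on the volume ingested."""
    return daily_ingest_gb * price_per_gb * days


# Hypothetical inputs: 100 GB/day, 30-day retention, 1 TB and 1,000 queries per node.
normal = cluster_cost(100, 30, 5_000, 1_000, 1_000, node_hourly_rate=0.5)
spike = cluster_cost(100, 30, 10_000, 1_000, 1_000, node_hourly_rate=0.5)
flat = ingest_cost(100, price_per_gb=0.75)

print(f"cluster, normal query load:  {normal:.2f}")
print(f"cluster, doubled query load: {spike:.2f}")
print(f"ingest-based, any query load: {flat:.2f}")
```

In this sketch, doubling the query load doubles the cluster bill because more nodes are needed, while the ingest-based figure stays the same as long as daily log volume is unchanged.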
In the past, Blackboard pushed logs to both Amazon S3 (for cost-effective long-term storage) and Elasticsearch (for log analytics).
Now, Blackboard satisfies both its long-term data retention and short-term log analytics needs by pushing data directly into Amazon S3, where it can be indexed, queried, and analyzed at scale using ChaosSearch. Data duplication and unnecessary data storage costs have been eliminated.
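The storage effect of dropping the second output can be worked through with simple arithmetic. The sketch below assumes a steady daily ingest and ignores compression, replicas, and the size of any index; the 100 GB/day and 365-day figures are hypothetical, while the 30-day Elasticsearch retention comes from the case study.

```python
def stored_gb(daily_ingest_gb, s3_retention_days, es_retention_days=0):
    """Return steady-state raw log volume held in each store, in GB."""
    s3 = daily_ingest_gb * s3_retention_days
    es = daily_ingest_gb * es_retention_days
    return {"s3": s3, "elasticsearch": es, "total": s3 + es}


# Before: logs shipped to both S3 (long term) and Elasticsearch (30 days).
before = stored_gb(daily_ingest_gb=100, s3_retention_days=365, es_retention_days=30)

# After: logs shipped to S3 only and searched in place.
after = stored_gb(daily_ingest_gb=100, s3_retention_days=365)

duplicated = before["total"] - after["total"]
print(f"before: {before}")
print(f"after:  {after}")
print(f"duplicate copy removed: {duplicated} GB")
```

Under these assumptions, the duplicated volume equals the daily ingest multiplied by the Elasticsearch retention window, and the searchable window grows from 30 days to the full S3 retention period.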
“The value of ChaosSearch for us has been that we don't have to retain data. We've already retained it. It's just searching through it now,” said Snook. “And that's one of the big value-adds that we've seen from ChaosSearch vs. the traditional ELK stack.”
Previously, some engineers at Blackboard were spending as much as 10 to 15 hours per week maintaining the performance and availability of Elasticsearch clusters.
ChaosSearch is delivered as a fully managed service with 99.999% uptime on the customer’s cloud storage environment. Site reliability engineers (SREs) have re-allocated that cluster management time toward value-add projects that impact the business, and the team no longer worries about unplanned downtime due to failed Elasticsearch clusters.
“With ChaosSearch, I’ve never called in and said, ‘Hey. Our cluster is red,’ because there's no cluster,” concluded Snook.