Since the data lake concept emerged more than a decade ago, data lakes have been pitched as the solution to many of the woes surrounding traditional data management solutions, like databases and data warehouses. Data lakes, we have been told, are more scalable, better able to accommodate widely varying types of data, cheaper to build and so on.
Much of that is true, at least theoretically. In reality, the way many organizations go about building data lakes leads to a variety of problems that undercut the value their data lakes are supposed to deliver.
That’s one of the key takeaways from a recent ChaosSearch webinar, “Turn Your Amazon S3 into a Hot, Searchable Analytic Data Lake,” in which Mike Leone, Senior Analyst for Data Platforms, Analytics and AI at ESG Global, and Thomas Hazel, ChaosSearch’s CTO and Founder, discussed the challenges of conventional data lake technologies and strategies.
Those challenges don’t mean that data lakes are a fundamentally bad solution. You can and should use data lakes. But if you want to reap all the rewards that data lakes have to offer, you should rethink the way you build and manage your data lake.
Here’s a recap of data lake challenges according to Mike and Thomas, along with tips on how to solve them by revamping your data lake management.
What Is a Data Lake?
If you help manage or analyze data, you probably already know that a data lake is a repository that houses data in its original, raw form.
Data inside data lakes can vary widely in size and structure, and it does not have to be organized in any particular way. A data lake can also accommodate as little or as much data as you need, and it can grow quickly in size as you add more data.
Thanks to these features, data lakes offer key advantages over data warehouses, data marts and traditional databases. All three of those technologies require data to be structured and organized in particular ways, which in turn makes them less flexible and scalable.
When you have lots of data of differing types to manage or analyze, data lakes make it easy to do so -- at least in theory.
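The "store it raw, structure it later" idea above can be sketched in a few lines. This is a conceptual illustration only: a plain Python dict stands in for a real object store such as Amazon S3, and the keys and payloads are made up for the example.

```python
# Minimal sketch of the data lake idea: an object store keeps raw bytes
# under arbitrary keys, with no schema imposed at write time.
# A plain dict stands in for a real object store such as Amazon S3.
import json

lake = {}

def put(key: str, raw: bytes) -> None:
    """Store data exactly as received -- no parsing, no schema."""
    lake[key] = raw

# Wildly different formats can land side by side in the same lake.
put("logs/2021/app.log", b"ERROR timeout after 30s\n")
put("events/click.json", b'{"user": 42, "page": "/pricing"}')
put("exports/sales.csv", b"region,total\nEMEA,10500\n")

# Structure is applied only when the data is read ("schema on read").
event = json.loads(lake["events/click.json"])
print(event["user"])  # -> 42
```

The key point the sketch captures is that nothing about the store forces the log line, the JSON event and the CSV export into a common shape up front -- which is exactly the flexibility that warehouses and traditional databases lack.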
7 Data Lake Challenges
In reality, data lakes often fall short of delivering fully on their theoretical promises, for several reasons.
1. High Cost
Data lakes can be very expensive to implement and maintain. Although some data lake platforms, like Hadoop, are open source and free to use if you build and manage them yourself, doing so often takes months and requires expert (read: expensive) staff.
Managed data lake platforms like those that run in the cloud may be easier to deploy, but they are still difficult to manage -- and they come with steep fees.
2. Management Difficulty
Even for skilled engineers, data lakes are hard to manage. Ensuring that your host infrastructure has the capacity for the data lake to keep growing, dealing with redundant data, securing all of the data and so on are all complex tasks, whether you’re using a vanilla open source data lake platform or a managed service.
3. Long Time-to-Value
Even after you spend months setting up your data lake, it will often be years before it grows large enough and becomes sufficiently well integrated with your data analytics tools and workflows to deliver real value.
4. Immature Data Security and Governance
Conventional data lake platforms are good at storing data, but not so good at securing it and enforcing data governance rules. You’ll need to graft on security and governance. That translates to even more time, money and management headache.
5. Lack of Skills
Engineers with real expertise in setting up and managing data lakes aren’t exactly a dime a dozen. In fact, there is an ongoing skills shortage for both data scientists and data engineers. Things may get better over time as more people specialize in these fields, but don’t expect rapid change on this front anytime soon.
6. Data is Outpacing Moore’s Law
Data is growing faster than compute power. This means that data lakes are getting bigger and bigger, but the computers required to host and manage them aren’t getting more powerful at the same rate.
The result: Without an efficient way to manage data within data lakes, businesses will end up paying more and more for the compute resources they need to handle their data.
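The squeeze described above can be made concrete with a toy compounding model. The growth rates below are illustrative assumptions for the sake of the example, not figures from the webinar: suppose data volume grows 40% per year while compute capability per dollar grows only 20% per year.

```python
# Toy model: if data volume compounds faster than compute per dollar,
# the relative cost of processing the data lake widens every year.
# Both growth rates are illustrative assumptions, not measured figures.

DATA_GROWTH = 1.40     # hypothetical: data volume grows 40% per year
COMPUTE_GROWTH = 1.20  # hypothetical: compute per dollar grows 20% per year

data, compute = 1.0, 1.0
for year in range(1, 6):
    data *= DATA_GROWTH
    compute *= COMPUTE_GROWTH
    # Relative cost: data to process divided by compute per dollar.
    print(f"Year {year}: relative cost {data / compute:.2f}x")
```

Under these assumed rates, the cost of keeping up more than doubles within five years -- which is why efficiency in how the data is indexed and managed matters as much as raw storage scale.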
7. Open Source is Not Going to Save Us
For a while, open source data platforms like Hadoop, Spark and the ELK stack were imagined as solutions to the data lake and data management challenges businesses face today. By providing open source solutions for building data lakes, they were going to “democratize data” and allow every business -- not just the hyperscalers -- to build massive data lakes.
But these platforms have failed to solve most of the core problems surrounding data lakes: they remain difficult and costly to manage, there is still a serious skills gap surrounding them, and they require massive (and hence expensive) computing power.
Building a Better Data Lake
All of the above adds up to a world in which businesses are spending more and more to build and manage their data lakes, while receiving less and less value from them. Data lakes can do a good job of storing data at scale in a flexible way, but that on its own is not enough to turn data into value -- far from it.
So, that’s the problem. What’s the solution?
It involves modernizing the way businesses build and manage data lakes.
It requires taking full advantage of the cloud, as opposed to trying to build unwieldy data lakes on your own infrastructure.
It entails removing data silos and building data lakes that cater to any and all use cases, rather than designing them to fit a certain range of needs only to find out later that your needs have changed.
For more on these solutions to data lake challenges, and how ChaosSearch is rethinking data lake management, read “The ChaosSearch Approach to Data Analytics Optimization,” which recaps the rest of Mike and Thomas’s webinar on data lake challenges and solutions.
Read the Series
Part 1A: Data Lake Challenges: Or, Why Your Data Lake Isn’t Working Out