The Data Lake Awakening

November 15, 2017

There is a great disturbance in the Force. I have felt it...

OK, this is a bit tongue in cheek, but during the Big Data Strata Conference in New York this past September, the energy around data lakes was electric. The news coming from the event was the strongest indicator yet that the future is NOW. Though analysts have been slow to cover this paradigm shift in data management, the tremors are echoing and accelerating. The pattern that runs counter to the Data Warehouse, building data lake solutions as the core engine for business insights, will in my opinion foster the next generation of dynamic and dominant companies.

I’m a big fan of and advocate for data lakes. In some ways, I have been promoting and selling the data lake idea for years, hoping someone would take notice. About three years ago, I began writing a series of articles on the topic, starting with Big Data Doctrine: Warehouse vs. Data Lakes, and I recently wrote a blog on Data Lakes Reimagined. When colleagues asked what the future might hold or what I planned to do next, the conversation would quickly jump to data analytics, the data lake philosophy of ‘schema on read’, and the huge potential it would unleash: the Next Big Data Revolution.

Now, I am sure you are asking, how is this different than the big data movement we’ve been hearing so much about since 2000? The answer is simple: the first Big Data revolution was based on scaling database capacity. In other words, taking the new information age data and “shoving” it into structured and semi-structured databases for analysis. And when a database was at capacity, doing it again.

The issue with the first revolution was that not all company data could be stored, and when it was, this shoving was time-consuming and costly. In the world of data analytics, shoving refers to the process of transforming data to fit a database structure (schema on write). However, transformation often causes loss of data. Each transformation moves further from the raw truth, ultimately resulting in analytical errors. And with today’s ever more diverse data, this problem has gotten worse. As a result, the average company has been left out of the big data analytics race, and the data engineer has become one of the most sought-after and valuable assets within an organization.
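To make the "shoving" concrete, here is a minimal Python sketch. The sample events, the field names, and the `to_warehouse_row` and `dropped_fields` helpers are all hypothetical, invented purely for illustration. It only shows how fitting records to a schema fixed at write time quietly discards whatever the schema did not anticipate.

```python
# Hypothetical raw events; the field names are invented for illustration.
raw_events = [
    {"user": "alice", "amount": "19.99", "coupon": "FALL17", "device": "ios"},
    {"user": "bob", "amount": "5.00", "referrer": "newsletter"},
]

# A fixed schema decided up front (schema on write).
WAREHOUSE_COLUMNS = ("user", "amount")


def to_warehouse_row(event):
    """Keep only the columns the schema anticipated; everything else is dropped."""
    return {col: event.get(col) for col in WAREHOUSE_COLUMNS}


def dropped_fields(event):
    """List the fields that did not survive the transformation."""
    return sorted(set(event) - set(WAREHOUSE_COLUMNS))


warehouse_rows = [to_warehouse_row(e) for e in raw_events]
lost = [dropped_fields(e) for e in raw_events]

for row, gone in zip(warehouse_rows, lost):
    print(row, "lost:", gone)
```

Once only `warehouse_rows` is kept, a later question about coupons, devices, or referrers can no longer be answered from the stored data.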

The answer to this data problem and the path to the next big data revolution is to remove the time, cost, and complexity of yesterday's solutions, allowing everyone within an organization to perform their own data analytics. The premise is actually quite simple: with the advent of cheap, elastic cloud object storage, companies can save every bit of data they generate. Storing large amounts of diverse data is now simple and fast, and it is the first step in building a data lake architecture. The next step, really the only important step, is to perform the actual analysis of the raw data. But there is a catch: raw data is not ready for self-service analytics...

So what was the awakening last month? From the looks of it, the idea of storing everything and post-processing it with a ‘schema on read’ design does not look any different from upfront ‘schema on write’ preprocessing. And the answer is, you are correct.

There is no difference conceptually. However, the magic is in the details. The value of storing everything is knowing that you have not thrown anything away. Typically, ‘schema on write’ is a design that decides up front what the future questions will be and which data they will need. Data deemed irrelevant is purposely thrown away. But what we have learned over the decades is that all data is relevant, and that new business demands and insights require it, particularly over time. The ability to go back to the raw data source ensures future ‘unknown’ questions will have future ‘valuable’ answers.
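As a rough illustration of ‘schema on read’, the Python sketch below keeps every record verbatim and applies a schema only when a question is asked. Everything in it is an assumption made for the example: the `raw_store` list stands in for cloud object storage, and the records, field names, and the `read_with_schema` and `revenue_by_coupon` helpers are hypothetical.

```python
import json
from decimal import Decimal

# Hypothetical raw store: every event kept verbatim as one JSON line.
# A plain list stands in for cloud object storage here.
raw_store = [
    '{"user": "alice", "amount": "19.99", "coupon": "FALL17", "device": "ios"}',
    '{"user": "bob", "amount": "5.00", "referrer": "newsletter"}',
    '{"user": "carol", "amount": "42.50", "coupon": "FALL17"}',
]


def read_with_schema(lines, fields):
    """Apply a schema at read time: project each raw record onto the requested fields."""
    for line in lines:
        record = json.loads(line)
        yield {name: record.get(name) for name in fields}


def revenue_by_coupon(lines):
    """A question nobody planned for at write time: revenue per coupon code."""
    totals = {}
    for row in read_with_schema(lines, ("coupon", "amount")):
        if row["coupon"] is not None:
            previous = totals.get(row["coupon"], Decimal("0"))
            totals[row["coupon"]] = previous + Decimal(row["amount"])
    return totals


coupon_revenue = revenue_by_coupon(raw_store)
print(coupon_revenue)
```

Because nothing was discarded when the data was stored, a different question tomorrow only needs a different set of `fields`, not a new ingestion pipeline.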

However, the awakening is not in the data lake idea per se, though the recent press coverage will definitely raise awareness. It is in the idea that the “individual” is in control of their analytical data destiny. In other words, the trend is toward analytic solutions that focus on the business analyst and data scientist and “less” on data engineering tooling. The premise is that a successful self-service product begins with simplification and automation, where the complexity of building and maintaining a data lake solution, such as Hadoop or any variant thereof, is removed.

In my next blog, I will talk about how self-service data analytics cannot truly be achieved when using yesterday's database technology under the data lake hood.