
Data is Cheap, Information is Expensive – Part 1

To say the volume of data is exploding is an understatement. According to a recent IDC and Seagate report, the world's data is projected to grow from 33 zettabytes in 2018 to 175 zettabytes by 2025. The reasoning is simple: cheap compute makes data easy to generate, and cheap storage makes it easy to keep, so the two grow hand in hand. The saying that "data is the new oil", fueling today's information economy once refined, is not just a trend but a truism. Any business not partaking in this revolution, let alone fully immersing itself, is at a severe disadvantage.

The Issue

With the advent of the internet, the cloud, and all things connected, data has become the lifeblood of companies' external communication, internal operations, and overall business value. These veins of data stream throughout a company, and each intersection can change or add to the flow. To keep everything running smoothly, this data is stored and analyzed to promote the good health of the business. A portion of it is also retained because it relates directly to the value the business produces.

However, capitalizing on this growth is not as easy as simply wanting to. The issue is not the ability to create mountains of data, nor the ability to stockpile it. The problem is transforming raw data into valuable information. Data becomes information when questions can be asked of it and decisions or insights can be derived from it. And herein lies the dilemma: the more data there is, the harder and more expensive it is to refine.

The Reason

But why is this? What makes it so expensive? Here again, the reasoning is simple. It is far cheaper to generate and store data than it is to transform it into accessible information. Refining data demands far more compute than generating it did in the first place. As data increases in volume, variety, and velocity (the 3Vs of Big Data), so does the amount of processing it requires. And when the refining process itself consumes more storage than the original data source, the need for additional compute always seems to pop up when you least expect it.

However, there is some salvation with respect to how much additional compute is required to derive information from data. Instead of brute-force parsing through every aspect of the raw data, computer science has given us algorithms and data structures that underpin advanced database solutions. Each solution has different benefits and requirements, but in the end, all do essentially the same thing: store data in a representation such that intelligent access can be performed far more efficiently than manually analyzing the raw source.
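To make that concrete, here is a minimal sketch in Python (the log lines and query term are invented sample data) contrasting a brute-force scan of raw records with a lookup against a precomputed inverted index, the kind of representation a database builds so that asking a question no longer means re-reading the source:

```python
# Minimal sketch: brute-force scan vs. a precomputed inverted index.
# The log lines and the query term are invented sample data.

logs = [
    "ERROR disk full on node-7",
    "INFO checkpoint complete",
    "ERROR timeout contacting node-3",
    "INFO user login succeeded",
]

# Brute force: every question re-reads every raw record -- O(records) per query.
def scan(term):
    return [line for line in logs if term in line.split()]

# Refined: pay compute once up front to build an inverted index
# (term -> list of record ids), then answer each query with one lookup.
index = {}
for i, line in enumerate(logs):
    for term in line.split():
        index.setdefault(term, []).append(i)

def lookup(term):
    return [logs[i] for i in index.get(term, [])]

print(scan("ERROR"))    # re-reads all 4 lines
print(lookup("ERROR"))  # touches only the 2 matching lines
```

The trade-off in this sketch is the point of the paragraph: the index costs compute and storage to build, but once built, each question gets dramatically cheaper than parsing the raw source again.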

Yet the compute associated with databases can still be intense, though far less than brute force. These solutions combine compute and storage: data is moved into them to be algorithmically refined for intelligent, optimized access. And depending on the type of information needed, specific types of database solutions are used. For instance, when hunting (i.e., searching) for a needle in a haystack, text-search databases are utilized; when correlating data relationships (e.g., joins), relational databases are employed. These are just two of many use-case-specific solutions. Often a company needs several of these databases, with a variety of solutions working in concert, compounding the need for additional compute, as the sketch below illustrates.
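Here is a small sketch (again in Python, with invented sample data) of the same records served by two use-case-specific structures: a hash index on a key for relational-style joins, and an inverted index on text for search-style hunting. Each structure must be built and kept in sync separately, which is exactly how running several database styles in concert compounds compute and storage:

```python
# Sketch (invented sample data): one set of records, two specialized structures.

users  = [(1, "ada"), (2, "lin")]
orders = [(101, 1, "disk"), (102, 1, "gpu"), (103, 2, "ram")]

# Relational-style: a hash index on the join key (user id) makes
# correlating orders with users cheap.
users_by_id = {uid: name for uid, name in users}
joined = [(users_by_id[uid], item) for _, uid, item in orders]
print(joined)  # [('ada', 'disk'), ('ada', 'gpu'), ('lin', 'ram')]

# Search-style: an inverted index on item text makes
# "which orders mention 'gpu'?" cheap to hunt for.
by_item = {}
for oid, _, item in orders:
    by_item.setdefault(item, []).append(oid)
print(by_item.get("gpu"))  # [102]

# Two kinds of questions, two derived copies of the data: each copy
# costs extra compute and storage to build and keep consistent.
```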

At first glance, one would think today's technology and its associated databases would address the cost of translating data into information. And for decades they did. But with the growth in the 3Vs, these solutions are teetering. New styles of databases have been introduced to alleviate the cost, but the philosophy of refining data into information has not changed, nor has the underlying science. And if it's not obvious yet, the amount of compute available to generate data will always outpace the capacity to analyze it. In other words, the "cost of a question" will always go up as data grows. To truly wrangle the cost of information, innovation is needed.

The Answer

In part 2 of "Data is Cheap, Information is Expensive", I will endeavor to describe an alternative to today's database technology and its associated solutions. A new outlook on how data should be stored and accessed. A philosophy that accessing information should be as simple as storing data, without breaking the bank. A viewpoint that the cost of information is directly tied to its life cycle and the science behind it. What I will be describing is ChaosSearch and the patent-pending technology and architecture it employs to make information inexpensive too.

About the Author, Thomas Hazel

Thomas Hazel is Founder, CTO, and Chief Scientist of ChaosSearch. He is a serial entrepreneur at the forefront of communication, virtualization, and database technology, and the inventor of ChaosSearch's patented IP. Thomas has also patented several other technologies in the areas of distributed algorithms, virtualization, and database science. He holds a Bachelor of Science in Computer Science from the University of New Hampshire, where he is a Hall of Fame Alumni Inductee, and he founded both student and professional chapters of the Association for Computing Machinery (ACM).