Store & Access Information at Scale: How Drawbacks Lead to Innovation
Ever since there was a need to both store and access information, there have been both physical and logical means to achieve it: everything from stone tablets to paper to a proliferation of technologies in the digital age. As information became easier to create, databases were built to give it structure and simplify its access, along with characteristics to improve performance and scale. And as data has exploded in volume and importance, analysis of information has been a major driver of storage and access innovation.
In Kevin Petrie's recent article, “Business Intelligence on the Cloud Data Platform: Approaches to Schema,” he outlines the evolution of information processing within analytics and describes the “schema on write” and “schema on read” methods. Kevin goes on to describe not only the techniques designed to store this information (e.g. data warehouses, data lakes), but also their inherent drawbacks.
In this blog, we’ll explore a different approach to storing and accessing information at scale, one that derives the benefits of both data warehouses and data lakes: a paradigm where information has the performance and structure of “schema on write,” but the flexibility and scale of “schema on read.” Much of what has been called “state-of-the-art” in software and architecture has been created and modified over decades of development, and a bit of that will be described here. However, at the core of these drawbacks is “information” itself. In other words, how data is actually stored forces many technical and architectural decisions. How data is accessed, and the cost of that access, all come into play.
The following are just a few examples of storage mediums:
- Network Drive (Block Storage) - Expensive, scalable-ish and fast
- Hadoop Distributed File System (HDFS) - Expensive, scalable-ish and fast
- Cloud Storage (Object Storage) - Inexpensive, very scalable and slow
There are many other technologies and solutions, but the above is a good categorization of how information is stored today. Most databases from the 1980s until now were designed around block storage (e.g. POSIX-compliant file systems); HDFS is only partially POSIX compliant. And until recently, these systems worked well. However, with the complexity and scale of today’s data, databases have been segregated into specialized silos. This type of specialization often (if not always) leverages the “schema on write” paradigm, requiring a great deal of up-front design, build and maintenance. At modest scale, techniques like this have kept the time, cost, and complexity of information storage manageable. When it comes to big data analysis, however, they have had the opposite effect.
The need to analyze the explosion of information resulted in storage becoming a significant architectural decision and design point. This led to the development of HDFS and the popularity of cloud object storage. However, 1980s database thinking and structure have not kept up. As a result, the “schema on read” paradigm became popular as the Hadoop open source project came into existence in the mid-2000s. Streaming your data (unchanged) into distributed storage and building out a distributed, code-based query engine to analyze it seemed like a good idea. In practice, however, this was unmanageable, helped coin the term “data swamp,” and for the most part ushered out the Hadoop paradigm.
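To make the two paradigms concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the sample events, the field names and the helper functions are all hypothetical, and it is not how any particular warehouse, Hadoop distribution or lake engine is implemented. It shows where the cost lands in each case: "schema on write" validates and shapes data at ingest, while "schema on read" ingests untouched data and defers interpretation (and bad-row handling) to query time.

```python
import json

# Hypothetical raw events, as they might arrive from an application log.
RAW_EVENTS = [
    '{"user": "ana", "latency_ms": "120", "region": "eu"}',
    '{"user": "bo", "latency_ms": "95"}',
    '{"user": "cy", "latency_ms": "oops", "region": "us"}',
]


def parse_latency(record):
    """Apply the (assumed) schema rule: latency_ms must be an integer."""
    return int(record["latency_ms"])


def schema_on_write(lines):
    """Validate and shape every record at ingest; reject what does not fit."""
    table, rejected = [], []
    for line in lines:
        record = json.loads(line)
        try:
            table.append(
                {"user": record["user"], "latency_ms": parse_latency(record)}
            )
        except (KeyError, ValueError):
            rejected.append(line)  # the design cost is paid up front
    return table, rejected


def schema_on_read(lines):
    """Store lines untouched; interpret them only when a query runs."""
    stored = list(lines)  # ingest is a plain copy, no schema involved

    def query_avg_latency():
        values = []
        for line in stored:
            try:
                values.append(parse_latency(json.loads(line)))
            except (KeyError, ValueError):
                continue  # malformed rows only surface at query time
        return sum(values) / len(values)

    return stored, query_avg_latency


if __name__ == "__main__":
    table, rejected = schema_on_write(RAW_EVENTS)
    print(len(table), "rows loaded,", len(rejected), "rejected at write time")

    stored, query_avg_latency = schema_on_read(RAW_EVENTS)
    print(len(stored), "lines stored, average latency:", query_avg_latency())
```

In the first function every query benefits from clean, typed rows, but the schema has to be designed before any data lands. In the second, ingest is trivial, but every query repeats the parsing work and has to decide what to do with records that do not fit.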
What was left from all this were databases, data warehouses and data lakes (i.e. cloud storage), where data lakes continued to receive the tsunami of data, propped up by cloud providers, with that data then ETL-ed or ELT-ed into data warehouses. This has not reduced the time, cost and complexity of data at scale, but it is the “state of the art” when it comes to data analysis. A recent variant is the arrival of federated query engines. Data is still “moved and transformed” into siloed solutions, but these query engines partially help with analysis across them. This approach is not intended for performant queries or to address the silos described above, but it is another tool in the toolshed.
As stated, the problem is not necessarily with the storage per se. It’s actually not related to any particular resource: storage, compute, or network. There is a viewpoint that utilizing the latest physical hardware is the answer; often it's a race to use the most powerful computer to execute code, with lots of memory, as well as fancy processors like GPUs. Information does require a physical component, but chasing resource innovation typically results in increased cost and time. Logical innovation is what we’ll explore in this blog.
For OLTP systems, databases make use of row-based relational storage and indexing schemes like trees. For OLAP systems, databases commonly leverage column-based relational storage, typically using tree indexing as well. Both relational storage and indexing force “schema and structure” up front, and time, cost, and complexity increase in direct relation to them. And this is just for transactional-type systems.
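The row versus column distinction can be sketched in a few lines of Python. This is a toy illustration only: the sample table and function names are hypothetical, and real engines add pages, compression and tree indexes on top. The point is that the same data laid out by row favors fetching one whole record (the OLTP pattern), while the column layout favors scanning a single attribute across all records (the OLAP pattern).

```python
# Hypothetical sample table, stored row by row (OLTP-style layout).
ROWS = [
    {"id": 1, "region": "eu", "amount": 10.0},
    {"id": 2, "region": "us", "amount": 25.5},
    {"id": 3, "region": "eu", "amount": 4.5},
]


def to_columns(rows):
    """Pivot row-oriented records into one list per column (OLAP-style layout)."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


COLUMNS = to_columns(ROWS)


def lookup_row(rows, row_id):
    """Point lookup: the row layout returns the whole record in one step."""
    for row in rows:
        if row["id"] == row_id:
            return row
    return None


def total_amount(columns):
    """Aggregate: the column layout touches only the 'amount' values."""
    return sum(columns["amount"])


if __name__ == "__main__":
    print(lookup_row(ROWS, 2))
    print(total_amount(COLUMNS))
```

Either layout has to be chosen, and its schema declared, before the data is written, which is exactly the up-front structure described above.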
NoSQL, graph, columnar and even text search (i.e. inverted-index) databases all succumb to this design and architecture paradigm (i.e. schema and index on write). Shoot, even file systems use B+ tree designs to store and structure information. And when it comes to relational analytics at scale, such schema and structure require top-of-the-line computation with a deluge of memory. As the 5 Vs of Big Data get more challenging, there needs to be a better way.
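As a small illustration of "index on write," here is a toy inverted index in Python. It is a sketch under simplifying assumptions (whitespace tokenization, lowercase terms, a hypothetical set of log lines), not the design of any specific search engine. Note that all of the indexing work happens when documents are written, before a single query has been asked.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Index on write: map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():  # naive tokenizer, for illustration
            index[term].add(doc_id)
    return index


def search(index, term):
    """Query time is a dictionary lookup because the work was done at ingest."""
    return sorted(index.get(term.lower(), set()))


if __name__ == "__main__":
    # Hypothetical log lines keyed by document id.
    docs = {1: "disk error on node", 2: "node restarted", 3: "disk full"}
    index = build_inverted_index(docs)
    print(search(index, "disk"))
    print(search(index, "node"))
```

Queries are fast, but every new document pays the indexing cost on the way in, and changing how terms are tokenized means rebuilding the index from the original data.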
So where does this leave us? The takeaways are that cloud storage is a great storage platform, but it is slow and requires data movement and transformation (into a database) to do any kind of analysis. And as previously described, these segregated, specialized and siloed databases and warehouses add time, cost and complexity. What is needed is “schema on write” read performance with “schema on read” write performance. In other words, a solution that does not require schema and structuring up front (simple in), while supporting the performance and standardization of a database engine (value out): an engine that can virtually support any specialized database model without physically re-indexing or ETL / ELT.
And here is where ChaosSearch started: with the idea that information is at the heart of the problem stated. We purposely took a pure in situ approach from top to bottom, or better said, from the bottom up. The first step was to rethink the format of information, where information is not stored in a database, but has database capabilities. The idea to build in properties such as compression, transformation, and querying. The idea that information is not indexed, but self-describing for fast analysis. The idea that if schema and structure are removed, so is the entropy associated with such constructs. The idea that schema/model is applied upon request, virtually and instantly. And finally, the idea that the right information representation could transform cloud object storage into a new analytic database, without movement.
Yes. ChaosSearch took a different approach, centralizing everything around a unique data format that, on one hand, looks like a compression algorithm, on the other hand looks like a database index, all the while supporting distributed in-situ concepts. We call this format Chaos Index®, which has enabled us to create a Chaos Fabric® in association with a Chaos Refinery®. Each aspect follows a holistic in situ paradigm, reducing overall database resources: storage, network and compute.
Oh, and how did we address cloud object storage performance? It’s simple: change how we store and access information (see Chaos Index). Actually, cloud object storage (e.g. Amazon S3) is fast. What is not fast is the type of structure databases use to store and access information. And since the Chaos Index is the same in memory as on disk, there is no impedance mismatch (in situ approach).