In the tsunami of experimentation, investment, and deployment of systems that analyze big data, vendors appear to have gravitated toward two extremes: either embracing the Hadoop ecosystem or building increasingly sophisticated query capabilities into database management system (DBMS) engines.
At one end of the spectrum, the scale-out Hadoop Distributed File System (HDFS) has become a way to collect, on commodity servers and storage, volumes and types of data that would otherwise overwhelm traditional enterprise data warehouses (EDWs). The Hadoop ecosystem offers a variety of ways to query data in HDFS, and its SQL-based approaches are growing in both number and maturity.
At the other end of the spectrum are both traditional and NewSQL DBMS vendors, with IBM, Microsoft, and Oracle among the former and Greenplum, Vertica, Teradata Aster, and many others emerging among the latter. These companies share unprecedented innovation and growth in analytic query sophistication. Using SQL to access tables stored on disk in row-organized form is no longer enough. Vendors have been adding the equivalent of new DBMS engine plug-ins, including in-memory caching for performance, column storage for data compression and faster queries, advanced statistical analysis, and even machine learning technology.
While the NewSQL vendors have introduced much lower price points than the traditional vendors, as well as greater flexibility in using commodity storage, they haven’t made quite as much progress in slowing the growth of required storage hardware relative to the growth in data volumes.
For some use cases, there appears to be room for a third approach that lies between the extremes and borrows from the best of each. RainStor in particular, and the databases focusing on column storage more generally, have targeted a sizable set of data storage and analytics scenarios that had been mostly ignored. Much of the data that needs to be analyzed doesn’t need to be updated; it can instead be stored as an archive in a deeply compressed format while remaining online for query and analysis. Databases with column store technology, such as Vertica and Greenplum, have taken important steps in this direction, and the incumbent vendors are also making progress in offering this as an option.
Organizing data storage in columns makes it easier to compress, because values of the same type sit next to one another. Column stores can also accelerate queries by scanning just the relevant, and now smaller, columns in parallel on multiple CPU cores. But the storage and database engine overhead of mediating potentially simultaneous updates to and reads from the data still remains. In other words, column stores make a better data warehouse, but they are not optimized to serve as archives. An online archive can compress its data by a factor of 30 to 40 because the data will never have to be decompressed for updates. New data only gets appended. Without the need to support updates, it’s much easier to ingest new data at very high speed, and without the need to mediate updates, it’s much easier to distribute the data on clusters of low-cost storage.
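To make the compression argument concrete, here is a minimal sketch in Python. It uses run-length encoding, one common column-compression technique chosen purely for illustration; it is not a description of how RainStor, Vertica, or Greenplum actually encode data. The sample rows and the helper names (`rle_encode`, `append_value`, `scan_count`) are hypothetical.

```python
from itertools import groupby


def rle_encode(column):
    """Run-length encode a list of values as (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original list."""
    return [value for value, count in runs for _ in range(count)]


def append_value(runs, value):
    """Append one value without decompressing: extend the last run or start a new one."""
    if runs and runs[-1][0] == value:
        runs[-1] = (value, runs[-1][1] + 1)
    else:
        runs.append((value, 1))


def scan_count(runs, wanted):
    """Count matching rows by scanning only the compressed runs of one column."""
    return sum(count for value, count in runs if value == wanted)


# Hypothetical sample rows: (region, status, amount).
rows = [
    ("EU", "paid", 10),
    ("EU", "paid", 12),
    ("EU", "open", 7),
    ("US", "open", 7),
    ("US", "paid", 7),
    ("US", "paid", 30),
]

# Pivot the rows into columns so that similar values sit next to each other.
region, status, amount = (list(col) for col in zip(*rows))

region_runs = rle_encode(region)  # six values collapse into two runs
status_runs = rle_encode(status)  # six values collapse into three runs

# A query on status touches only the status column, and only its runs.
paid_rows = scan_count(status_runs, "paid")

# Append-only ingest: new data extends the encoded column in place.
append_value(region_runs, "US")
```

In this sketch, appending is cheap because it touches only the final run. Updating a value in the middle of a column would mean splitting and rewriting runs, which is the kind of overhead an append-only archive avoids.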
This paper is written for two audiences.
- One is the business buyer who is evaluating databases and trying to reconcile data volumes growing at 50 percent to 100 percent per annum with an IT budget growing in single digits. Of particular value to this audience are the generic use cases and the customer case studies. Also relevant is the price comparison with Oracle Exadata, which shows not just the capital cost of a traditional data warehouse solution but also the hidden running costs.
- The other audience is the IT infrastructure technologist who is tasked with evaluating the proliferation of database technologies. For this audience, the more technical sections of the paper will be valuable. These sections focus on the different technology approaches to creating online analytic databases. The paper will compare mainstream data warehouse technologies, and column stores in particular, with a database that focuses more narrowly on serving as an online analytic archive. To provide a concrete example of an existing analytic archive, the paper will explain how RainStor’s database works.
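The budget squeeze facing the business buyer can be made concrete with simple compound-growth arithmetic. The sketch below is illustrative only: the 50 percent and 100 percent data growth rates come from the text, while the 5 percent budget growth rate (one possible reading of "single digits"), the five-year horizon, and the function names are assumptions.

```python
def compound(initial, annual_rate, years):
    """Value after compounding `initial` at `annual_rate` for `years` years."""
    return initial * (1 + annual_rate) ** years


def gap_after(data_growth, budget_growth, years):
    """Ratio of data growth to budget growth, both starting from 1.0."""
    return compound(1.0, data_growth, years) / compound(1.0, budget_growth, years)


YEARS = 5             # assumed planning horizon
BUDGET_GROWTH = 0.05  # assumed single-digit budget growth

for data_growth in (0.5, 1.0):  # 50 percent and 100 percent per annum, from the text
    data = compound(1.0, data_growth, YEARS)
    budget = compound(1.0, BUDGET_GROWTH, YEARS)
    gap = gap_after(data_growth, BUDGET_GROWTH, YEARS)
    print(f"data x{data:.2f}, budget x{budget:.2f}, gap x{gap:.2f}")
```

Under these assumptions, the gap is the factor by which cost per unit of data stored would have to fall over the period for the budget to keep pace with the data.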