Twitter has scaled back its plans to store billions of tweets using Cassandra, a non-relational database project that Facebook created and open sourced. Friday night, Twitter said that it will still use Cassandra in a new real-time analytics project it is building, but the decision to move away from plans to migrate tweets from its current MySQL database to Cassandra is seen by some as a blow to startups and open-source projects that are attempting to move beyond relational databases.
But in reality, the interest in which database architecture a popular startup uses goes beyond Twitter and Cassandra; it touches on the changing nature of both the web and the software that underlies it. In short, the story here isn’t about Cassandra or databases themselves, but about groups of pioneering programmers reacting to the new ways they can build software in a world where computing is cheap.
When Twitter or Digg decides to use Cassandra, or LinkedIn uses Voldemort as a key-value store, it is the equivalent of pioneers traveling along the Oregon Trail as the U.S. expanded. The excitement around NoSQL data-store projects is a modern-day manifest destiny as programmers and companies lay claim to all the power that cheap computing (through the cloud or on their own hardware) has enabled.
The bottleneck is no longer around performance or the cost of computing — it’s about quickly getting the information to thousands, or hundreds of thousands, of nodes trying to act as one computer delivering a service. Google and IBM both have written about the data center as a computer, and Facebook says it thinks of adding hardware at the rack level rather than at the server level. But the current means of storing and accessing data have not made this leap from a single server to a rack — let alone an entire data center.
As programmers attempt this leap, they face several difficulties, which include working with existing software and programming languages and figuring out what problems and bottlenecks the new services built on these monolithic computer platforms will encounter. Plus, the IT world doesn’t all move at once, which means plenty of jobs and workloads will continue with the old way of doing things — that is, relational databases such as Oracle’s offerings and the open source MySQL, which Oracle now has a stake in thanks to its purchase of Sun.
The result is not a steady movement to non-relational databases or other methods of storing data, but a back-and-forth as programmers and businesses figure out what kind of architecture they need and what problems they want to solve. For a closer look at the issue and a bunch of charts detailing how the landscape is currently laid out, analyst Matt Sarrel has penned a report over at GigaOM Pro (sub. req’d.) on the NoSQL movement called “NoSQL Databases – Providing Extreme Scale and Flexibility.” He writes:
Discussions with NoSQL vendors, project leads and enterprise customers yield estimates that between 15 and 40 percent of all RDBMS implementations would be better suited to non-relational platforms. Gartner and IDC have stated that only 15 percent of business data currently resides in an RDBMS. And according to Damien Katz, CEO and co-founder of CouchIO, developers of CouchDB, “At least one-third of projects using relational databases should be developed using non-relational technology. However, that doesn’t mean that 30 percent of [RDBMS] installations should be ripped out and replaced. It means 30 percent of projects moving forward should consider using NoSQL.”
As that quote and Twitter’s own flip-flopping on its decision to use Cassandra illustrate, the efforts to take advantage of the processing power now available aren’t as simple as the westward push by pioneers in the 1800s. However, as a movement, NoSQL adherents are blazing a path that is similar in both its importance and its opportunity for economic gain. Below, I’ve included one of many charts in the report, which offers a lay of the land for NoSQL projects. For more, read the full report.
| Data Store Type | Use Cases | Advantages | Disadvantages | Key Products |
|---|---|---|---|---|
| Key-Value | In-memory cache, web-site analytics, log file analysis | Simple, replication, versioning, locking, transactions, sorting, web-accessible, schema-less, distributed | Simple, small set of data types, limited transaction support | Redis, Scalaris, Tokyo Cabinet |
| Tabular or Columnar | Data mining, analytics | Rapid data aggregation, scalable, versioning, locking, web-accessible, schema-less, distributed | Limited transaction support | Google BigTable, HBase or Hypertable, Cassandra |
| Document Store | Document management, CRM, business continuity | Stores and retrieves unstructured documents, map/reduce, web-accessible, schema-less, distributed | Limited transaction support | CouchDB, MongoDB, Riak |
| Traditional Relational | Transaction processing, typical corporate workloads | Well documented and supported, mature code, widely implemented in production | Cost, vertical scaling, increased complexity | Oracle, Microsoft SQL Server, MySQL Cluster |