
Summary:

Twitter has scaled back its plans to store billions of tweets using Cassandra, but the interest in this news and NoSQL data stores in general goes beyond one company’s decision. It touches on the changing nature of the web and the software that underlies it.

Twitter has scaled back its plans to store billions of tweets using Cassandra, a non-relational database project that Facebook created and open sourced. Friday night, Twitter said that it will still use Cassandra in a new real-time analytics project it is building, but the decision to move away from plans to migrate tweets from its current MySQL database to Cassandra is seen by some as a blow to startups and open-source projects that are attempting to move beyond relational databases.

But in reality, the interest in which database architecture a popular startup is using goes beyond Twitter and Cassandra, and touches on the changing nature of both the web and the software that underlies it. In short, the story here isn’t about Cassandra or databases themselves, but about groups of pioneering programmers reacting to the new ways they can build software in a world where computing is cheap.

Twitter or Digg deciding to use Cassandra, or LinkedIn using Voldemort as a key-value store, are the equivalent of pioneers traveling along the Oregon Trail as the U.S. expanded. The excitement around NoSQL data-store projects is a modern-day manifest destiny as programmers and companies lay claim to all the power that cheap computing (through the cloud or on their own hardware) has enabled.

The bottleneck is no longer around performance or the cost of computing — it’s about quickly getting the information to thousands, or hundreds of thousands, of nodes trying to act as one computer delivering a service. Google and IBM both have written about the data center as a computer, and Facebook says it thinks of adding hardware at the rack level rather than at the server level. But the current means of storing and accessing data have not made this leap from a single server to a rack — let alone an entire data center.

As programmers attempt this leap, they face several difficulties, which include working with existing software and programming languages and figuring out what problems and bottlenecks the new services built on these monolithic computer platforms will encounter. Plus, the IT world doesn’t all move at once, which means plenty of jobs and workloads will continue with the old way of doing things — that is, relational databases such as Oracle’s offerings and the open source MySQL, which Oracle now has a stake in thanks to its purchase of Sun.

The result is not a steady movement to non-relational databases or other methods of storing data, but a back-and-forth as programmers and businesses figure out what kind of architecture they need and what problems they want to solve. For a closer look at the issue and a set of charts detailing how the landscape is currently laid out, analyst Matt Sarrel has penned a report over at GigaOM Pro (sub. req’d.) on the NoSQL movement called “NoSQL Databases – Providing Extreme Scale and Flexibility.” He writes:

Discussions with NoSQL vendors, project leads and enterprise customers yield estimates that between 15 and 40 percent of all RDBMS implementations would be better suited to non-relational platforms. Gartner and IDC have stated that only 15 percent of business data currently resides in an RDBMS. And according to Damien Katz, CEO and co-founder of CouchIO, developers of CouchDB, “At least one-third of projects using relational databases should be developed using non-relational technology. However, that doesn’t mean that 30 percent of [RDBMS] installations should be ripped out and replaced. It means 30 percent of projects moving forward should consider using NoSQL.”

As that quote and Twitter’s own flip-flopping on its decision to use Cassandra illustrate, the efforts to take advantage of the processing power now available aren’t as simple as the westward push by pioneers in the 1800s. However, as a movement, NoSQL adherents are blazing a path that is similar in both its importance and its opportunity for economic gain. Below, I’ve included one of many charts in the report, which offers a lay of the land for NoSQL projects. For more, read the full report.

| Data Store Type | Use Cases | Advantages | Disadvantages | Key Products |
| --- | --- | --- | --- | --- |
| Key-Value | In-memory cache, web-site analytics, log file analysis | Simple, replication, versioning, locking, transactions, and sorting, web-accessible, schema-less, distributed | Simple, small set of data types, limited transaction support | Redis, Scalaris, Tokyo Cabinet |
| Tabular or Columnar | Data mining, analytics | Rapid data aggregation, scalable, versioning, locking, web-accessible, schema-less, distributed | Limited transaction support | Google BigTable, HBase or Hypertable, Cassandra |
| Document Store | Document management, CRM, business continuity | Stores and retrieves unstructured documents, map/reduce, web-accessible, schema-less, distributed | Limited transaction support | CouchDB, MongoDB, Riak |
| Traditional Relational | Transaction processing, typical corporate workloads | Well documented and supported, mature code, widely implemented in production | Cost, vertical scaling, increased complexity | Oracle, Microsoft SQL Server, MySQL Cluster |

  1. J. Andrew Rogers Monday, July 12, 2010

    The advantage/disadvantage chart does not represent just how limited NoSQL options actually are because of their reliance on hash partitioning. Several of the “advantages” are mutually exclusive in implementation, severely restricting the types of data models that can be implemented in practice. Many described features of NoSQL databases disappear or become less useful once you scale beyond a single machine. Analytics is particularly limited in most NoSQL architectures at scale.

    The real advantage of NoSQL is a developer interface that is better tuned for specific applications than the somewhat crusty SQL stack. You can get these features in a SQL system, it is just very painful to use.

    The good idea of NoSQL is a second-generation interface to data management systems — it was sorely needed. However, this is not nearly as compelling as it sounds on the surface unless it can be mated with a new type of database that has much more capable support for diverse data models and analytics. Big data does not do anyone much good unless it can support rich data analytics, which is currently lacking.
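
To make the hash-partitioning point in the comment above concrete, here is a minimal sketch. It assumes a hypothetical four-node cluster, with plain Python dicts standing in for the nodes; the class name, key format and node count are illustrative and not taken from any NoSQL product. A lookup by exact key touches one node, while a scan over a key range has to visit every node, because hashing scatters neighboring keys.

```python
import hashlib

NUM_NODES = 4  # hypothetical cluster size, chosen only for illustration


def node_for(key):
    """Pick the owning node by hashing the key (simple modulo partitioning)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_NODES


class HashPartitionedStore:
    """Toy key-value store; each dict stands in for one machine."""

    def __init__(self):
        self.nodes = [dict() for _ in range(NUM_NODES)]

    def put(self, key, value):
        self.nodes[node_for(key)][key] = value

    def get(self, key):
        # An exact-key lookup only needs the single node that owns the key.
        return self.nodes[node_for(key)].get(key)

    def range_scan(self, start, end):
        # Hashing destroys key order, so there is no way to know which nodes
        # hold keys in [start, end]; every node has to be visited.
        nodes_visited = 0
        results = {}
        for node in self.nodes:
            nodes_visited += 1
            for key, value in node.items():
                if start <= key <= end:
                    results[key] = value
        return dict(sorted(results.items())), nodes_visited
```

In this sketch the cost of `range_scan` grows with the number of nodes regardless of how few keys match, which is one way to read the comment's remark that some features become less useful beyond a single machine.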

  2. Here’s the problem with the “NoSQL” craze… it solves a problem that 99% of us will never have. When a traditional RDBMS fails to scale, more often than not it just isn’t being used right. I work on a Web site that attracts tens of millions of users every month and runs on a SQL database, and our scale is still more of a fringe case. Most systems never need more.

    NoSQL has a cost in terms of maintenance and data integrity because your app has to enforce logic that a relational database typically would. Most of us never have the scalability issues to worry about that could be mitigated by a non-relational store.
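
As a rough illustration of what "your app has to enforce logic" can mean, the sketch below uses plain dicts as a stand-in for a non-relational store. The `OrderStore` class and its fields are hypothetical. In a relational database a foreign-key constraint would reject an order that points at a missing user; here the application code has to make that check itself.

```python
class OrderStore:
    """Toy store; two dicts stand in for two collections in a key-value store."""

    def __init__(self):
        self.users = {}
        self.orders = {}

    def add_user(self, user_id, name):
        self.users[user_id] = {"name": name}

    def add_order(self, order_id, user_id, total):
        # The store itself will accept any value, so the application must
        # verify that the referenced user exists before writing the order.
        if user_id not in self.users:
            raise ValueError(f"unknown user: {user_id}")
        self.orders[order_id] = {"user_id": user_id, "total": total}
```

Every code path that writes orders has to repeat or share this check, which is the maintenance cost the comment describes.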

    Most of us are not, and never will be, Facebook, Digg or Twitter. Let’s keep that in perspective.

  3. The NoSQL family also includes graph databases. In your table the columns could be:
    use cases: transactional information for highly interrelated data, social network analysis, geographic information systems
    advantages: high-level/semantic information representation, easily extensible, flexible schema or schema-less
    disadvantages: less suitable for traditional, tabular business information
    key products: InfoGrid, Neo4j, Sones, HyperGraphDB, InfiniteGraph

    Disclosure: involved with InfoGrid

    1. J. Andrew Rogers Monday, July 12, 2010

      While graph databases are one of the purest non-relational database types, I can see why they are excluded. NoSQL has traditionally been about inexpensive scalability above all else. The obvious problem is that graph databases have great features and dreadful scalability such that they are only useful for applications with trivial scaling requirements. You see something similar with geospatial databases; a pure geospatial implementation does not scale so there is a proliferation of narrow solutions that scale by dropping most of the capabilities that matter.

      It is all the same problem: many data models cannot be usefully implemented as a distributed hash table for well-understood theoretical reasons. Generalized graphs and spatial are two important examples.

      1. Well, that “dreadful scalability” is good enough for Twitter … http://engineering.twitter.com/2010/05/introducing-flockdb.html … perhaps not so dreadful after all.

      2. J. Andrew Rogers Monday, July 12, 2010

        Scalable graph traversal (“transitive closures” for the geeks) requires index-less massively parallel joins to approximate. There is no such algorithm in literature but you will need it if you want to scale a general graph database past a billion or so edges. Any claim to scalability must solve this problem.

        There is a similar type of algorithm limitation for spatial data models; while it does not seem obvious, these problems are closely related theoretically.

        A billion edges and a billion polygons are “dreadful scalability” in that most non-trivial use cases currently exceed that scale limit by 2-3 orders of magnitude.
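
For readers unfamiliar with the term, the sketch below shows the "transitive closure" style of traversal on a single machine: a breadth-first search that collects every node reachable from a starting node. The adjacency-dict representation and the `reachable` function name are assumptions made for illustration. On one machine this is straightforward; the difficulty described in the comment arises when the edges are spread across many machines, since each hop may then need data held on a different node.

```python
from collections import deque


def reachable(graph, start):
    """Return the set of nodes reachable from `start` by following edges.

    `graph` maps each node to an iterable of its direct successors.
    """
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen
```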

  4. Matt Sarrel Monday, July 12, 2010

    J. Andrew – Yes, it’s true that NoSQL solutions are limited and are certainly not for everyone. There are significant differences between the different types of NoSQL solution. Plus, the code base for each project changes very quickly. Definitely, trying to run multiple distributed clusters of key-value tables would be difficult if not impossible. But on the other hand, document stores like CouchDB and MongoDB have built-in features to automate things like cluster management that would otherwise have to be done manually.

    So I see a lot of trade-offs between NoSQL projects and also traditional RDBMS. In the report referenced by Stacey, I outline 5 use cases where NoSQL would make sense. And yes, the most appropriate uses are for projects involving enormous scale. I have to agree with Jeff that these products solve problems 99% of us will never have. But for the 1% who have hit a wall (and the rest of us who like to tinker with new tech), there is some value here.

    I do not believe that people should throw out their tried and true RDBMS and jump on the NoSQL bandwagon. We need to pick and choose our projects and like everything else, use NoSQL where appropriate.

    On the graph database question – it is possible to argue in favor of their inclusion or exclusion in this study. I agree with J. Andrew regarding the limited utility of graph databases. This is the reason I omitted them from the report. They are highly focused solutions – even more focused than most of the other NoSQL projects. The other side of the coin is that they are more NoSQL than they are something else. And there’s also the reality that I had to actually stop adding to the report or I would never finish it…

    I believe that we will see a lot of development around the analysis of data in NoSQL systems where appropriate. It’s sort of parallel to the traditional database world. In the beginning it’s about gathering, storing, managing, and retrieving. Soon after we evolve into requiring analytics.

    Thank you, everyone, for reading and contributing!

  5. NoSQL sounds like a good idea but you’re right that it’s gonna take time for developers to start adopting it

  6. [...] Twitter has scaled back its plans to store billions of tweets using Cassandra, a non-relational database project that Facebook created and open sourced. Friday night, Twitter said that it will still use Cassandra in a new real-time analytics project it is building, but the decision to move away from plans to migrate tweets from its current MySQL database to Cassandra is seen by some as a blow to startups and open-source projects that are attempti … Read More [...]

  7. “[...] just how limited NoSQL options actually are because of their reliance on hash partitioning.”
    It turns out that several technologies listed in the comparison support multiple partitioning schemes. MongoDB supports both range based and key based partitioning. I believe Cassandra supports range based partitioning. And of course relational databases support many forms of partitioning, where n=1.
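
A minimal sketch of range-based partitioning, for contrast with hashing. The split points and three-node layout are hypothetical, and this is not a description of how MongoDB or Cassandra implement it. Because keys stay in sorted order across nodes, a range query only needs the nodes whose key ranges overlap the query.

```python
import bisect

# Hypothetical split points: node 0 owns keys < "h",
# node 1 owns "h" <= key < "p", and node 2 owns keys >= "p".
SPLIT_POINTS = ["h", "p"]


def node_for(key):
    """Find the node whose key range contains `key`."""
    return bisect.bisect_right(SPLIT_POINTS, key)


def nodes_for_range(start, end):
    """List the nodes that can hold keys in [start, end]."""
    return list(range(node_for(start), node_for(end) + 1))
```

The trade-off, under these assumptions, is that a burst of writes to adjacent keys lands on one node, whereas hashing would spread it out.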

  8. There is another benefit to NoSQL other than high scale, heavy concurrency and fast performance – extensible schema.
    While many sites and systems may not see a need to scale, there may be many systems that would benefit from a document store that does not have the rigid schema that a relational database imposes. In some situations that is a great feature, and in other situations this is to be avoided – but it’s a new offering and a new capability that programmers should be able to use when appropriate and there are no alternatives within the RDBMS world.
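
A small sketch of what an extensible schema looks like in practice, using a list of Python dicts as a stand-in for a document store. The documents, field names and `find` helper are all made up for illustration. Each document carries its own set of fields, and a new field can appear on one document without altering the others.

```python
# Two documents in the same collection with different fields.
documents = [
    {"_id": 1, "type": "article", "title": "NoSQL overview", "tags": ["db"]},
    {"_id": 2, "type": "comment", "author": "reader", "reply_to": 1},
]


def find(docs, **criteria):
    """Return the documents whose fields match every given criterion.

    Documents that lack a requested field simply do not match.
    """
    return [
        doc
        for doc in docs
        if all(key in doc and doc[key] == value for key, value in criteria.items())
    ]
```

For example, `find(documents, type="comment")` returns only the second document, and querying on a field that no document has returns an empty list rather than an error.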

  9. YeSQL
    I think that the main reason so many people have come to see SQL as the source of all evil is the fact that, traditionally, the query language was burned into the database implementation. So by saying NoSQL you basically say “No” to the traditional non-scalable RDBMS implementations.

    SQL is actually a fairly good query language and will continue to serve a major role in the post only-SQL world. However, the concept of one size fits all doesn’t hold up.

    NoSQL implementations such as Hive/HBase as well as JPA/BigTable can be a good example of how next-generation databases can support both linear scaling and a SQL API.
    The key is the decoupling of the query semantics from the underlying data-store.

    We’ve already seen a similar trend with dynamic languages. In the past, a language had to come with a full stack of its own (compiler, libraries, and development tools), making the selection of a particular language quite strategic. Today, the JVM in the Java world or the CLR in .NET provides a common substrate that can support a large variety of dynamic languages on top of the same runtime. Good examples are Groovy and JRuby, which run alongside Java on the JVM.

    I’ve written in more detail about how the two SQL/NoSQL worlds can live happily together in my recent post:

    YeSQL: An Overview of the Various Query Semantics in the Post Only-SQL World

    Nati S.
    GigaSpaces

  10. [...] want to use Cassandra for and how he plans to make services business-scale. I even asked him about Twitter’s decision not to store tweets in [...]
