
Summary:

SimpleDB is hugely disruptive. Sure, it will take some time to evolve the new thinking patterns and new design disciplines that this technology forces us to consider. To do so, consider this breakdown of the similarities and differences between SimpleDB and conventional relational databases.

Amazon continues to amaze us with its Amazon Web Services series of offerings. The latest is SimpleDB, which will be available in limited beta in a few weeks. And it is bound to have a major impact on web infrastructure. As Amazon says in its email to existing developers:

This service works in close conjunction with Amazon Simple Storage Service (Amazon S3) and Amazon Elastic Compute Cloud (Amazon EC2), collectively providing the ability to store, process and query data sets in the cloud.

As we’ve already noted,

…the center of gravity is shifting away from monolithic centralized data management to massively parallel distributed data management.

If you are in the business of managing massive amounts of distributed data, you cannot gloss over the Amazon WS trifecta: data-in-the-cloud is the future, and with WS, Amazon is way ahead of the pack.

What about the offerings of other vendors? Google, for example, has BigTable, and truth be told, SimpleDB has a distinctly BigTable-ish feel to it. But a side-by-side comparison makes it clear that Amazon WS in general, and SimpleDB in particular, is superior, for the following reasons:

  • Google’s offerings (not only BigTable but GoogleBase, Gdisk, etc.) all have an ad hoc, grab-bag-of-tools feeling to them, devoid of any integrated strategy. Or if there is one, it is well-hidden.
  • Amazon WS clearly involves a well-designed master plan aimed at changing the face of software as a service, each new offering akin to a chess piece in a game focused on creating strategic long-term value. And with SimpleDB, the queen has moved to the center.
  • Amazon WS is based on the YOYODA principle — You Own Your Own Data, Always. Along with Amazon S3, SimpleDB is a sharp arrow in the quiver of open data proponents.
  • Amazon WS includes a built-in, flexible payment system so users are neither forced to offer their app for free nor have an “ad-supported” model forced upon them. Now you can build a data-based web app on SimpleDB and seamlessly charge for it.

Tersely put, SimpleDB is hugely disruptive. It will take some time to evolve the new thinking patterns and new design disciplines that this technology forces us to consider. To do so, consider this breakdown of the similarities and differences between SimpleDB and conventional relational databases.

Very, very simplistically speaking, domains are like tables, with items like rows and attributes like columns. A query cannot cross domains, so in this analogy you can’t “join” domains. But that sort of thinking is a holdover from the relational database normalized model.

In reality a domain is much more like a database, so we have to stop thinking in terms of tables and joins.
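To make the analogy concrete, here is a minimal in-memory sketch in Python. It is not the SimpleDB API: the `Domain` class and its `put`, `get` and `query` methods are hypothetical names made up for illustration. It models only what the paragraph above describes (a domain holds items, an item holds attributes), and it assumes that items in the same domain need not share the same attributes and that one attribute may hold several values.

```python
class Domain:
    """In-memory stand-in for one domain (hypothetical, not the real service).

    Layout: item name -> attribute name -> set of string values.
    """

    def __init__(self, name):
        self.name = name
        self.items = {}

    def put(self, item_name, **attributes):
        """Add attribute values to an item, creating the item if needed."""
        item = self.items.setdefault(item_name, {})
        for attr, value in attributes.items():
            values = value if isinstance(value, (list, set, tuple)) else [value]
            item.setdefault(attr, set()).update(str(v) for v in values)

    def get(self, item_name):
        """Return the item's attributes with values as sorted lists."""
        item = self.items.get(item_name, {})
        return {attr: sorted(values) for attr, values in item.items()}

    def query(self, attr, value):
        """Return the names of all items that have `value` under `attr`."""
        return {
            name
            for name, attrs in self.items.items()
            if str(value) in attrs.get(attr, set())
        }


catalog = Domain("catalog")
# The two items carry different attributes: there is no fixed schema here.
catalog.put("item1", color=["red", "blue"], size="M")
catalog.put("item2", color="red", weight="2kg")

red_items = catalog.query("color", "red")
item1 = catalog.get("item1")
```

Note that nothing in this sketch can reach across two `Domain` objects in one query, which is the constraint the table analogy runs into.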

Say we had an SQL database, with tables for “Company,” “Departments” and “Employees.” In SimpleDB, the items (rows) for all three could go in one domain (database). You can then run queries on this domain and, using operators like UNION and INTERSECT, do the equivalent of joins.

Existing web technologies such as Ruby on Rails, Django and Hibernate all have an Object Relational Mapper (ORM), which maps language objects to relational database tables.
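The single-domain idea with companies, departments and employees can be sketched with plain Python sets. This is an illustration under stated assumptions, not SimpleDB query syntax: the data, the `type` attribute used to tell the three kinds of items apart, and the helper names `match`, `employees_of` and `staff_and_departments` are all hypothetical. Each single-attribute match returns a set of item names, and set intersection and union stand in for the INTERSECT and UNION operators mentioned above.

```python
# Hypothetical data: one domain holding three kinds of items.
# Layout: item name -> attribute name -> set of values.
domain = {
    "acme": {"type": {"company"}, "name": {"Acme Corp"}},
    "acme-eng": {"type": {"department"}, "company": {"acme"}, "name": {"Engineering"}},
    "acme-ops": {"type": {"department"}, "company": {"acme"}, "name": {"Operations"}},
    "alice": {"type": {"employee"}, "company": {"acme"}, "department": {"acme-eng"}},
    "bob": {"type": {"employee"}, "company": {"acme"}, "department": {"acme-ops"}},
    "carol": {"type": {"employee"}, "company": {"globex"}, "department": {"globex-eng"}},
}


def match(items, attr, value):
    """Names of items whose attribute `attr` contains `value`."""
    return {name for name, attrs in items.items() if value in attrs.get(attr, set())}


def employees_of(items, company, department):
    """Join-like lookup: intersect three single-attribute matches."""
    return (
        match(items, "type", "employee")
        & match(items, "company", company)
        & match(items, "department", department)
    )


def staff_and_departments(items, company):
    """Union of two item kinds, narrowed to one company by intersection."""
    kinds = match(items, "type", "employee") | match(items, "type", "department")
    return kinds & match(items, "company", company)


acme_engineers = employees_of(domain, "acme", "acme-eng")
acme_everything = staff_and_departments(domain, "acme")
```

The lookup works only because each employee item carries its own `company` and `department` values. In this sketch the "join" is paid for at write time by denormalizing, rather than at query time.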

If designers of these ORMs want to stay in the scalable apps game, they should take a serious look at using SimpleDB as a data store. Better yet, they should build ORMs from the ground up to integrate with SimpleDB.

More than two years ago I wrote that Web 2.0 needs Data 2.0. The combination of EC2, S3 and SimpleDB is a toolkit for assembling massively scalable, REST-addressable web databases. Data 2.0 is now officially here. May the fun and games begin.

Nitin Borwankar is a database guru based in the San Francisco Bay Area. You can find his writings on his blog, TagSchema.

  1. Except it doesn’t. The issue with AWS comes down to cost and performance. For what you spend on AWS, it will always cost you LESS to get better performance doing it yourself. As you grow, the disparity becomes larger. AWS is great for a startup that wants to demo something. The minute it starts pulling any real traffic, it is time to pull it in house. The costs of any AWS service have never scaled with performance. Quite the opposite: the more you use, the more expensive it becomes.
    Nitin, I really question your judgement here as a “database” expert. Please explain to me how it will be possible to maximize performance on this service. Since you have no granular control of how the DB app itself is working, nor of any of the hardware (and god forbid it is running on RAID5 SATA S3 storage), how exactly are you supposed to make sure that the platform performs? If the answer is to federate your data set, you are only making the problem worse and paying Amazon more.

  2. True. You can do it yourself cheaper and with better performance given that you have the following:

    1. The expertise.
    2. The time.
    3. The money.

    I use S3 for a fairly large project. It is cheaper and better than building it myself.

    EC2, not so much. People tend to confuse it with having a real rack of dedicated servers. Which is not its target market.

    The cost aspect is only an issue for companies and startups that are missing a crucial aspect of their business – a revenue stream.

  3. The real advantage is for services that experience variable load. If you tend to operate based on spikes, this architecture is a huge shift. Rather than build for the spikes, you build for the baseline and use these services to be able to scale up and down in real time.

  4. Hey Nick,

    Please consider that there is a massive need for scalable lightweight structured data management where the overhead of a relational database is not needed and massive scalability is the big requirement. Massive throughput on a single query is not required but the ability to support hundreds of thousands of concurrent small reads and writes all over your data set is.

    Yes if your app has the need for full control over every aspect of performance then by all means retain full control of the hardware and software. For instance one application area that SimpleDB isn’t cost effective for is data warehousing and business intelligence. High performance data loads and queries require that the disk be physically as close to the CPU as possible. Moreover storing and moving terabytes on S3 becomes prohibitive quickly.

    So, IMHO, there is the proverbial 80/20 split where the 80% of web apps do not need high performance high overhead database software and hardware. Consider all those LAMP stack webapps being built with MySQL (or Postgres or Oracle or ..) where the database is the point of contention. That architecture has problems scaling horizontally. Most of the time it is only managing small chunks of text, per user and per request, with some key-value attributes. The logical data model has large numbers of many-many relationships between users, items, tags, friends etc. – something that is not well suited to implementing in SQL databases.

    But we have seen a whole generation of web apps that have “grown up” assuming that’s the only way to go. It’s not. And control over RAID configurations that then support unnecessary overhead doesn’t help either.

    On the other hand, if you are managing billing systems and transactions that require vertical scaling, i.e. sophisticated features such as business logic in Java stored procedures, segmenting data physically by month over multiple spindles, etc., SimpleDB is the wrong solution. No one suggests this is a one-size-fits-all solution. But it certainly fits a large number of until-now-poorly-clothed web apps.

    The subtext of this post, and of the other posts it refers to, is that there is a large unserved market for lightweight data management that absolutely must have massive horizontal scaling, where performance means supporting tens of millions of users without periodic outages rather than tens of millions of records per second on a single query. Amazon is addressing this market way ahead of the rest of the pack.

    I did gloss over one point consciously for now and that is the pricing – S3 pricing is 10c a Gig and SimpleDB is $1.50 a Gig. There will need to be a noticeable difference in latency for this price difference to be justifiable. We will need to wait and see.

  5. Paragraph breaks!

  6. Hi Nick,

    Another aspect that is often forgotten is how long it takes to provision racks of servers and to repartition your data for the new sets of servers. What happens when your app is past the bend in the hockey stick curve and it’s going to take at least a week to get your hardware into your CoLo?

    Yes it’s not just good for demo’ing stuff it’s good for a startup to go from zero to the point where it’s funded with xMill$ rather than x100K$ and can hire 2 full time sysadmins. What’s the option till now? Start paying 2000k$ a month for two dedicated servers when you’re just using 5% of the capacity?

    Also remember that there will be a memcached layer in front of all this which has to be managed by sysadmin staff – until of course Amazon offers a memcached service ;-)

  7. Hi Nitin, long time no see :) You really got it backwards though! Google’s public-facing products/API might not be as well packaged, but its underlying technology, namely Bigtable (and friends), is much more flexible (in terms of workloads) than Amazon’s Dynamo, which is a glorified DHT. SimpleDB is probably built on Dynamo. It’s good for a bunch of small structured data (only 10GB/domain). This means that horizontal scaling will still be ad hoc and you’ll have to twist your arms if you want to store tera/petabyte-sized tables for large-scale logging, crawling, indexing and analytics.

    If we’re talking about disruptive technologies, an open-source, high-quality implementation of a Bigtable-like stack would really change the landscape: every ISP could easily offer services that compete with the current AWS offerings.

  8. Nitin,

    I am unclear how you can use SimpleDB to do the equivalent of a join. If I have a database with bands, artists, and songs, and I want to see all of the songs written by a member of the Beatles, I am not sure how SimpleDB gets you there in one domain. I am clear how a relational DB gets you there, but without some system of pointers, it seems to me you are in trouble. And the problem with rolling your own join in this situation (i.e. grabbing data and then doing the join in memory) is that you can only grab 250 items at a time, so depending on the dataset it appears there will be lots of queries that, in practical terms, are just not possible. This is a great step, and I was with you until you said that you don’t need joins or some equivalent. Unless you can explain how, I think that is a major mistake.

  9. [...] Techcrunch thinks you should fire all your DBAs. Nitin is impressed too. [...]

  10. Let the epic battle begin, there is a lot at stake. Simple DB is but one shot fired, but it is an important step:

    http://www.techcrunch.com/2007/12/14/amazon-takes-on-oracle-and-ibm-with-simple-db-beta/trackback/

