Amazon SimpleDB 101 & Why It Matters


Amazon continues to amaze us with its Amazon Web Services series of offerings. The latest is SimpleDB, which will be available in limited beta in a few weeks. And it is bound to have a major impact on web infrastructure. As Amazon says in its email to existing developers:

This service works in close conjunction with Amazon Simple Storage Service (Amazon S3) and Amazon Elastic Compute Cloud (Amazon EC2), collectively providing the ability to store, process and query data sets in the cloud.

As we’ve already noted,

…the center of gravity is shifting away from monolithic centralized data management to massively parallel distributed data management.

If you are in the business of managing massive amounts of distributed data, you cannot gloss over the Amazon WS trifecta — data-in-the-cloud is the future, and with WS, Amazon is way ahead of the pack. What about the offerings of other vendors? Google, for example, has BigTable, and truth be told, SimpleDB has a distinctly BigTable-ish feel to it. But a side-by-side comparison makes it clear that Amazon WS in general — and SimpleDB in particular — is superior, for the following reasons:

  • Google’s offerings – not only BigTable but GoogleBase, Gdisk, etc. — all have an ad hoc, grab-bag-of-tools feeling to them, devoid of any integrated strategy. Or if there is one, it is well-hidden.
  • Amazon WS clearly involves a well-designed master plan aimed at changing the face of software as a service, each new offering akin to a chess piece in a game focused on creating strategic long-term value. And with SimpleDB, the queen has moved to the center.
  • Amazon WS is based on the YOYODA principle — You Own Your Own Data, Always. Along with Amazon S3, SimpleDB is a sharp arrow in the quiver of open data proponents.
  • Amazon WS includes a built-in, flexible payment system so users are neither forced to offer their app for free nor have an “ad-supported” model forced upon them. Now you can build a data-based web app on SimpleDB and seamlessly charge for it.

Tersely put, SimpleDB is hugely disruptive. It will take some time to evolve the new thinking patterns and new design disciplines that this technology demands. As a starting point, consider this breakdown of the similarities and differences between SimpleDB and conventional relational databases.

Very, very simplistically speaking, domains are like tables, with items like rows and attributes like columns. A query cannot cross domains, so in this analogy you can’t “join” domains. But that sort of thinking is a holdover from the relational database normalized model. In reality a domain is much more like a database, so we have to stop thinking in terms of tables and joins.

Say we had a SQL database with tables for “Company,” “Departments” and “Employees.” In SimpleDB, the items (rows) for all three could go in one domain (database); you can then run queries on this domain, and using operators like UNION and INTERSECT you can do the equivalent of joins. Existing web frameworks such as Ruby on Rails, Django and Hibernate all have an Object Relational Mapper (ORM), which maps language objects to relational database tables.
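That single-domain idea can be sketched in a few lines of Python. Here a plain dict stands in for a SimpleDB domain, and a tiny `query` helper plays the role of the query operators; the item names, attribute names and the helper itself are illustrative assumptions, not the real SimpleDB API:

```python
# A domain is a bag of items, each a set of attribute/value pairs.
# A "type" attribute marks what used to be the table.
domain = {
    "acme":     {"type": "company", "name": "Acme Corp"},
    "acme-eng": {"type": "department", "name": "Engineering", "company": "acme"},
    "emp-1":    {"type": "employee", "name": "Ada", "department": "acme-eng"},
    "emp-2":    {"type": "employee", "name": "Grace", "department": "acme-eng"},
}

def query(domain, **criteria):
    """Return the set of item names whose attributes match every criterion."""
    return {name for name, attrs in domain.items()
            if all(attrs.get(k) == v for k, v in criteria.items())}

# The equivalent of a join: intersect two attribute predicates.
employees = query(domain, type="employee")
in_eng = query(domain, department="acme-eng")
print(sorted(employees & in_eng))   # set intersection, like INTERSECT
```

Set intersection and union on the returned item names give the effect of INTERSECT and UNION across predicates, with no join anywhere.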

If designers of these ORMs want to stay in the scalable-apps game, they should take a serious look at using SimpleDB as a data store. Better yet, they should build ORMs from the ground up to integrate with SimpleDB. More than two years ago I wrote that Web 2.0 needs Data 2.0. The combination of EC2, S3 and SimpleDB is a toolkit for assembling massively scalable, REST-addressable web databases. Data 2.0 is now officially here. May the fun and games begin.

Nitin Borwankar is a database guru based in the San Francisco Bay Area. You can find his writings on his blog, TagSchema.

59 Responses to “Amazon SimpleDB 101 & Why It Matters”

  1. Gert Schmeltz Pedersen

    Database design, a perspective

    Hi SimpleDB users,

    I am an old-timer in databases who happened to come across the SimpleDB pages and, out of curiosity, started to read about it. It looked good … at first. Then I realized some implications of the simplicity, looked at some of the threads here and elsewhere dealing with the relationship to relational databases and with how to solve more complicated database problems, and realized how history repeats itself: 38 years of accumulated wisdom in database modelling and design have been disregarded, overlooked, misunderstood or neglected.

    In the summer of 1975 I read the first edition of Chris Date’s textbook on database technology; then I studied E.F. Codd’s papers on the relational model, starting with the June 1970 paper for which he was given the Turing Award in 1981. These two guys are behind the tremendous success of relational database technology. Date’s book in its many editions, together with other textbooks on database technology and various forms of the Entity-Relationship Model, originally created by Peter Pin-Shan Chen in 1976, has educated countless computer science students in database design. All computer science students, all developers, all programmers should get familiar with relational database design: it is simple, it is powerful, it can be taught in a one-semester course, and then you can disregard it with open eyes. It is a scandal if you were not taught relational database design alongside basic programming skills.

    What is it that you do with SimpleDB? You put your application logic into procedural code, where you query each domain and combine the resulting items by coding loops, comparisons and what have you; this is how you implement the equivalent of joins in SQL queries. This is the old procedural-versus-non-procedural debate, and your procedural version is much, much costlier to maintain and much, much harder for others to understand: you bury the application logic in tons of hopeless code. If you want to avoid joins, you probably put everything into one domain, or as few domains as possible; if you know a bit about normalization in database design, you know about the costly anomalies you have thereby introduced. The claimed advantage of SimpleDB, that you may have more than one value in the same field, is one aspect of an unnormalized database and is therefore harmful. And by the way, it is not true that union and intersect can do joins for you.

    Once we had semantic nets and logic databases; they included the attribute-value pairs of SimpleDB, did not require schemas or predefined fields, and in addition provided powerful non-procedural queries. But where are they today? Maybe in student projects, maybe in research projects, but not in serious, important applications.

    If I were paying your salary, I would never allow you to implement my important applications in SimpleDB or the like. You should instead use MySQL or another free, powerful, yet simple to use, RDBMS. Use it with EC2 and S3, that is fine.

    You have other options, though: XML databases in the first place, where you also have non-procedural queries available. As a database developer, you should be able to judge when and how to use XML databases for a given purpose. If your application needs full-text indexing or integrated storage of all types of files and documents, then take a look at things like the object-based, web-service-based Fedora repository system, which also has RDF triple storage with a non-procedural query language.

    In conclusion, use SimpleDB only if your application’s database needs amount to one normalized relational table, else invest your time in today’s powerful technologies.

    PS! What if SimpleDB implemented joins, that is, “field1” = “field2”? That would be a great step forward. Then you would need indexing behind the scenes, so that the performance of joins could be as good as in an RDBMS.

    • That was a nice comment, exactly expressing my thoughts and feelings. It feels like relational databases 30 years ago, and the whole thing like reinventing the wheel.

      • It’s reinventing the wheel (or better, going back to pulling sledges) because it turns out there’s a scale where the wheel stops scaling. And the pain of scaling is then harder than the pain of not having the relational stuff. You give up one for the other.

        Most sites will never hit that pain point in scaling, but plenty of sites do. And many sites don’t need the relational stuff. So it’s a trade-off.

  2. Nitin

    First of all thanks for the great writeup.
    Looking at this thread, I think that the announcement of SimpleDB has provoked an interesting discussion about the role of databases in Web 2.0 architecture. While I think this discussion is very interesting and very relevant, I also think it is important to emphasize that SimpleDB is not yet another database and shouldn’t be measured as such.

    I wrote a summary (Amazon SimpleDB is not a database!) that aims to clarify my point on this matter – I’ll appreciate your comments on this regard.

  3. Nitin Borwankar

    Hello all,

    Sorry but GigaOm comments section is probably not the best place for ongoing database design discussions.

    We had to have a balance between informing non-technical users and technical users.
    I aim to have an ongoing series of posts on my blog, TagSchema, about using SimpleDB vs. an RDBMS: where it makes sense and where it does not. These will appear over the next few weeks as SimpleDB is made available. Stay tuned, go subscribe, etc.

  4. Nitin Borwankar

    Hi Ericson Smith,

    While SimpleDB logically looks like a memcached service, note the following comment in the Query section of the developer API docs:

    “Because Amazon SimpleDB makes multiple copies of your data and uses an eventual consistency update model, an immediate GetAttributes or Query request (read) immediately after a DeleteAttributes or PutAttributes request (write) might not return the updated data.”

    In other words, there is a delay before your changes are propagated and made available.
    Not exactly what you want from a cache.

    My suggestion for application architecture would be to front SimpleDB with memcached. That way recent updates are served from memcached rather than being lost in the “SimpleDB ether” for a while.
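     One way to picture that architecture is a write-through cache in front of an eventually consistent store. The sketch below simulates both with plain Python classes — the delayed-visibility store is a toy model of the behaviour the docs describe, not the real service, and all names are hypothetical:

```python
import collections

class EventuallyConsistentStore:
    """Toy stand-in for SimpleDB: writes become readable only after a delay."""
    def __init__(self, lag=1):
        self.visible = {}
        self.pending = collections.deque()  # [countdown, key, value]
        self.lag = lag

    def put(self, key, value):
        self.pending.append([self.lag, key, value])

    def tick(self):
        """Time passes; replicas catch up and pending writes become visible."""
        for entry in list(self.pending):
            entry[0] -= 1
            if entry[0] <= 0:
                self.visible[entry[1]] = entry[2]
                self.pending.remove(entry)

    def get(self, key):
        return self.visible.get(key)

class CachedStore:
    """Write-through cache in front of the store; the dict plays memcached."""
    def __init__(self, store):
        self.store = store
        self.cache = {}

    def put(self, key, value):
        self.store.put(key, value)   # durable, eventually visible
        self.cache[key] = value      # immediately visible via the cache

    def get(self, key):
        if key in self.cache:
            return self.cache[key]
        return self.store.get(key)

store = CachedStore(EventuallyConsistentStore())
store.put("item1", "fresh value")
print(store.get("item1"))   # served from the cache before replication completes
```

     Reads of a freshly written key hit the cache; only keys that have aged out of the cache fall through to the eventually consistent store, by which time the write has usually propagated.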

  5. If you wanted to get all songs written by any musician who had been with the Beatles, it looks like you’d have to either store multiple values for artists in the song items [like “(artist, Paul McCartney) (artist, Beatles)”] or do two passes manually — one to get the artists associated with the Beatles, and then create the predicates for the query.
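     A sketch of that two-pass approach, again with a dict standing in for the domain; the items, attribute names and sample data are hypothetical, and multi-valued attributes (which SimpleDB allows) are stored as lists:

```python
domain = {
    "song-1": {"type": "song", "title": "Yesterday", "artist": ["Paul McCartney"]},
    "song-2": {"type": "song", "title": "Imagine", "artist": ["John Lennon"]},
    "song-3": {"type": "song", "title": "My Way", "artist": ["Frank Sinatra"]},
    "m-1": {"type": "membership", "artist": ["Paul McCartney"], "band": "Beatles"},
    "m-2": {"type": "membership", "artist": ["John Lennon"], "band": "Beatles"},
}

# Pass 1: collect every artist associated with the Beatles.
beatles = {a for attrs in domain.values()
           if attrs.get("type") == "membership" and attrs.get("band") == "Beatles"
           for a in attrs["artist"]}

# Pass 2: use the result of pass 1 as the predicate and select matching songs.
songs = sorted(name for name, attrs in domain.items()
               if attrs.get("type") == "song"
               and set(attrs["artist"]) & beatles)
print(songs)
```

     The "join" happens in application code: the first query's result set becomes the second query's predicate.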

  6. hankunloaded asks, “If I have a database with bands, artists, and songs, and I want to see all of the songs written by a member of the Beatles, I am not sure how SimpleDB gets you there in one domain.”

    It looks like you’d have multiple traditional tables in one domain with one of the attributes for the items (say “type”) specifying the “table name.” Then you’d run a query like:

    [“type” = “song”] intersection [“artist” = “Beatles”]

    The query would return all the item names for the Beatles songs.

  7. hankunloaded


    This is in the API spec under Query:

    The Query operation returns a list of ItemNames that match the query expression. The maximum number that can be returned by one query is determined by MaxNumberOfItems, which can be set to a number between 1 and 250, inclusive. The default value for MaxNumberOfItems is 100. If more than MaxNumberOfItems items match the query expression, a NextToken is also returned. Submitting the query again with the NextToken will return the next set of items. To obtain all items matching the query expression, repeat until no NextToken is returned.
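     The repeat-until-no-NextToken loop the docs describe might look like this in outline. The `paged_query` function below fakes the service with a list slice — a real client would make a network call there — and the function names are illustrative:

```python
MAX_ITEMS = 250  # upper bound on MaxNumberOfItems per the docs

ALL_MATCHES = [f"item-{i}" for i in range(600)]  # pretend query result set

def paged_query(max_items=100, next_token=None):
    """Fake one Query round trip: a page of item names plus a NextToken.

    Here the token is just the next list offset; the real service returns
    an opaque token you pass back unchanged.
    """
    start = next_token or 0
    page = ALL_MATCHES[start:start + max_items]
    token = start + max_items if start + max_items < len(ALL_MATCHES) else None
    return page, token

def fetch_all(max_items=MAX_ITEMS):
    """Accumulate pages until the service stops returning a NextToken."""
    items, token = [], None
    while True:
        page, token = paged_query(max_items, token)
        items.extend(page)
        if token is None:
            return items

print(len(fetch_all()))  # 600
```

     So the 250-item cap is a page size, not a hard ceiling on result sets — but every extra page is another round trip, which is exactly why client-side joins over large sets get expensive.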

  8. Nitin,

    The limit of 250 items is on the “Query” web service method. See here.

    I have the same concerns about JOIN as hankunloaded has. The only way around it that I see is to put on every song not only the name of the band but also all the artists as well, so no JOIN would be required. However, this would lead to massive data duplication, which is unacceptable in the relational model, though I’m not sure whether it’s a problem in the SimpleDB model.

  9. Nitin Borwankar


    I must be missing something: where do you see the limit of grabbing 250 items at a time?
    I see the docs saying

    “Each item can have up to 256 attribute values. Each attribute value can range from 1 to 1,024 bytes.”
    I could be wrong, but I don’t see a limit on the number of items per result set.

    Did I overlook some piece of the doc?

  10. hankunloaded


    I am unclear how you can use SimpleDB to do the equivalent of a join. If I have a database with bands, artists, and songs, and I want to see all of the songs written by a member of the Beatles, I am not sure how SimpleDB gets you there in one domain. I am clear how a relational DB gets you there, but without some system of pointers, it seems to me you are in trouble. And the problem with rolling your own join in this situation (i.e. grabbing data and then doing the join in memory) is that you can only grab 250 items at a time, so depending on the dataset it appears there will be lots of queries that just, in practical terms, are not possible. This is a great step, and I was with you until you said that you don’t need joins or some equivalent. Unless you can explain how, I think that is a major mistake.

  11. Hi Nitin, long time no see :) You really got it backwards though! Google’s public-facing products/APIs might not be as well packaged, but its underlying technology, namely Bigtable (and friends), is much more flexible (in terms of workloads) than Amazon’s Dynamo, which is a glorified DHT. SimpleDB is probably built on Dynamo. It’s good for a bunch of small structured data (only 10GB/domain). This means that horizontal scaling will still be ad hoc, and you’ll have to twist your arm if you want to store tera/petabyte-sized tables for large-scale logging, crawling, indexing and analytics.

    If we’re talking about disruptive technologies, an open-source, high-quality implementation of a Bigtable-like stack would really change the landscape, letting every ISP easily offer services more competitive with the current AWS offerings.

  12. Hi Nick,

    Another aspect that is often forgotten is how long it takes to provision racks of servers and to repartition your data for the new sets of servers. What happens when your app is past the bend in the hockey-stick curve and it’s going to take at least a week to get your hardware into your colo?

    Yes, it’s not just good for demo’ing stuff; it’s good for a startup to go from zero to the point where it’s funded with xMill$ rather than x100K$ and can hire two full-time sysadmins. What’s the option till now? Start paying $2,000 a month for two dedicated servers when you’re just using 5% of the capacity?

    Also remember that there will be a memcached layer in front of all this which has to be managed by sysadmin staff – until of course Amazon offers a memcached service ;-)

  13. Hey Nick,

    Please consider that there is a massive need for scalable, lightweight structured data management, where the overhead of a relational database is not needed and massive scalability is the big requirement. Massive throughput on a single query is not required, but the ability to support hundreds of thousands of concurrent small reads and writes all over your data set is.

    Yes if your app has the need for full control over every aspect of performance then by all means retain full control of the hardware and software. For instance one application area that SimpleDB isn’t cost effective for is data warehousing and business intelligence. High performance data loads and queries require that the disk be physically as close to the CPU as possible. Moreover storing and moving terabytes on S3 becomes prohibitive quickly.

    So, IMHO, there is the proverbial 80/20 split, where 80% of web apps do not need high-performance, high-overhead database software and hardware. Consider all those LAMP-stack web apps being built with MySQL (or Postgres or Oracle or …) where the database is the point of contention. That architecture has problems scaling horizontally. Most of the time it is only managing small chunks of text, per user and per request, with some key-value attributes. The logical data model has large numbers of many-to-many relationships between users, items, tags, friends, etc. — something that is not well suited to implementation in SQL databases.

    But we have seen a whole generation of web apps that have “grown up” assuming that’s the only way to go. It’s not. And control over RAID configurations that then support unnecessary overhead doesn’t help either.

    On the other hand, if you are managing billing systems and transactions that require vertical scaling (i.e. sophisticated features such as business logic in Java stored procedures, segmenting data physically by month over multiple spindles, etc.), SimpleDB is the wrong solution. No one suggests this is a one-size-fits-all solution. But it certainly fits a large number of until-now-poorly-served web apps.

    The subtext of this post, and of the other posts referred to in it, is that there is a large unserved market for lightweight data management that absolutely must have massive horizontal scaling, where performance means supporting tens of millions of users without periodic outages rather than tens of millions of records per second on a single query. Amazon is addressing this market way ahead of the rest of the pack.

    I did gloss over one point consciously for now, and that is pricing: S3 is 10c a gig and SimpleDB is $1.50 a gig. There will need to be a noticeable difference in latency for this price difference to be justifiable. We will have to wait and see.

  14. The real advantage is for services that experience variable load. If you tend to operate based on spikes, this architecture is a huge shift. Rather than build for the spikes, you build for the baseline and use these services to be able to scale up and down in real time.

  15. Lon Baker

    True. You can do it yourself cheaper and with better performance given that you have the following:

    1. The expertise.
    2. The time.
    3. The money.

    I use S3 for a fairly large project. It is cheaper and better than building it myself.

    EC2, not so much. People tend to confuse it with having a real rack of dedicated servers. Which is not its target market.

    The cost aspect is only an issue for companies and startups that are missing a crucial aspect of their business – a revenue stream.

  16. Except it doesn’t. The issue with AWS comes down to cost and performance. For what you pay AWS, it will always cost you LESS to get better performance doing it yourself. As you grow, the disparity becomes larger. AWS is great for a startup that wants to demo something. The minute it starts pulling any real traffic, it is time to pull it in house. The costs of any AWS service have never scaled with performance. Quite the opposite: the more you use, the more expensive it becomes.
    Nitin, I really question your judgement here as a “database” expert. Please explain to me how it will be possible to maximize performance on this service. Since you have no granular control over how the DB app itself is working, nor any of the hardware (and god forbid it is running on RAID5 SATA S3 storage), how exactly are you supposed to make sure that the platform performs? If the answer is to federate your data set, you are only making the problem worse and paying Amazon more.