
Data 2.0: How the Web disrupts our relational database world


The great online shift is creating massive amounts of data – whether it is videos on YouTube or social networking profiles on MySpace. And that data is stored in databases, making them the key component of the new web infrastructure. But managing that information isn’t easy, and there are signs that database management will be vastly different in the future.

By Nitin Borwankar

Relational databases are to software what mainframes are to networked hardware: the monolithic beast at the core that needs magic incantations from high priests to run, and consumes unsuspecting junior engineers for breakfast.

We love to hate them, but we can’t do without them. As much as anyone predicts or wishes they would go away, databases just grow more and more indispensable as Internet users create more and more unstructured data on an unprecedented scale, and at an unprecedented rate.

The good news is that database management will be vastly different in the future. In fact, change has already begun; it just isn’t (cliché alert!) “evenly distributed” yet.

The demands of data management in the Internet era have already spawned new ways of handling large volumes of “old-school,” or structured, data. Yahoo (YHOO) long ago created its own user management software, based on bsd-db (Berkeley DB), to help organize “name:value” data pairs, like those created by “user name and password” profiles.
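
The flavor of that approach can be sketched with Python's standard-library dbm module, which exposes the same style of schema-less key/value storage; the file path and values below are purely illustrative:

```python
import dbm
import os
import tempfile

# A minimal "name:value" profile store in the Berkeley DB mold,
# using Python's stdlib dbm module (path and values are illustrative).
path = os.path.join(tempfile.mkdtemp(), "profiles")

with dbm.open(path, "c") as db:    # "c": create the store if missing
    db["alice"] = "pw-hash-1"      # plain key/value pairs -- no schema, no SQL
    db["bob"] = "pw-hash-2"

with dbm.open(path, "r") as db:    # reopen read-only; lookup is by key alone
    alice_hash = db["alice"].decode()

print(alice_hash)
```

Lookups never touch a query planner; the trade-off is giving up joins and ad-hoc queries entirely.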

More recently, Google (GOOG) engineers published a paper on “MapReduce,” essentially a parallel computing framework designed to aggregate and process large data sets generated by Web apps and searching.
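
A toy word count conveys the shape of the model. The map, shuffle, and reduce phases below run serially in one process, where a real MapReduce cluster would spread them across many machines; the documents and function names are illustrative:

```python
from collections import defaultdict
from itertools import chain

def map_phase(doc):
    # Map: emit a (word, 1) pair for every word in a document.
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    # Shuffle: group all values emitted under the same key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: collapse each key's values into a single result.
    return key, sum(values)

docs = ["the web disrupts", "the database world"]
pairs = chain.from_iterable(map_phase(d) for d in docs)
counts = dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())
print(counts["the"])  # → 2
```

Because each map call sees only one document and each reduce call only one key, every phase parallelizes trivially — which is the whole point of the framework.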

At OSCON 2007, Yahoo engineer Doug Cutting (also the author of Lucene and Nutch) talked about how the company is using and backing “Hadoop,” an open-source implementation of MapReduce, for massive data mining of web server logs.

Danga Interactive recently released “memcached,” an open-source software layer that redistributes traditional database caching so that more data requests can be fulfilled without ever hitting the database at all.
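
The pattern memcached enables is often called cache-aside: check the cache first, and go to the database only on a miss. In this sketch a plain dict stands in for the memcached cluster and slow_query simulates the database, purely for illustration:

```python
# Cache-aside sketch: a dict stands in for a memcached cluster and
# slow_query() for the database (both are illustrative stand-ins).
cache = {}
db_hits = 0

def slow_query(user_id):
    # Simulated database read; counts how often the database is touched.
    global db_hits
    db_hits += 1
    return {"id": user_id, "name": f"user-{user_id}"}

def get_user(user_id):
    key = f"user:{user_id}"
    if key in cache:               # served from memory; the database never sees it
        return cache[key]
    row = slow_query(user_id)      # cache miss: hit the database once
    cache[key] = row               # ...and remember the answer
    return row

get_user(42)
get_user(42)
print(db_hits)  # → 1
```

Two reads, one database hit: repeated traffic for hot keys is absorbed entirely by the cache tier.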

Amazon S3 and SQS, meanwhile, allow operators to externalize the storage and workload queuing typical of massive database systems – two things conventional databases could never previously offload to the Web.

Of course, most unstructured data is created by Internet search. Lucene is the standard bearer for managing such massive data hauls, followed by descendants Nutch and Solr. All three manage to externalize the indexing function necessary for managing unstructured data, something an old relational database cannot do well. This is not your father’s database.
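
The core structure these engines externalize is the inverted index, which maps each term to the set of documents containing it — the inverse of how a relational row stores its fields. A toy version, with illustrative documents:

```python
from collections import defaultdict

# Toy inverted index: term -> set of document ids containing it.
# This is the structure Lucene maintains outside the database
# (the documents below are illustrative).
docs = {
    1: "relational databases scale up",
    2: "search engines scale out",
    3: "databases index structured data",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.split():      # real engines also tokenize, stem, rank
        index[word].add(doc_id)

print(sorted(index["scale"]))      # → [1, 2]
print(sorted(index["databases"]))  # → [1, 3]
```

A term lookup is a single dictionary probe rather than a full-table scan, which is why search workloads moved out of the relational engine in the first place.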

Social networks create reams of data, too. Relational databases in the ’80s and ’90s had trouble managing hierarchical data sets, or “nested parts sub-assemblies” – a.k.a. “tree structures.” This was called the “parts explosion” problem.

But with Web 2.0 we have a new data structure: the social network. These data sets, which are more complex by several orders of magnitude, are prone to what I call the “friends explosion” problem. This arises from attempting to capture and manage the numerous individual data sets/networks — the forest of tree structures, if you will — created by all the “friends” that are intrinsic to social networks. Traditional relational databases are brought to their knees by this problem.
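
The explosion is easy to see in a sketch: each hop of “friends of friends” multiplies the rows a relational self-join would have to touch. Below, a breadth-first walk over a tiny hypothetical friend graph:

```python
from collections import deque

# Hypothetical friend lists; each extra hop fans out across more rows,
# which is what a relational self-join per hop has to pay for.
friends = {
    "ann": ["bob", "cat"],
    "bob": ["ann", "dan", "eve"],
    "cat": ["ann", "eve"],
    "dan": ["bob"],
    "eve": ["bob", "cat"],
}

def network(start, depth):
    """Everyone reachable within `depth` hops -- the 'friends explosion'."""
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        person, hops = frontier.popleft()
        if hops == depth:          # stop expanding at the hop limit
            continue
        for friend in friends[person]:
            if friend not in seen:
                seen.add(friend)
                frontier.append((friend, hops + 1))
    return seen - {start}

print(sorted(network("ann", 2)))
```

At one hop Ann reaches two people; at two hops, four — and on a real network each added hop can multiply the working set by the average friend count.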

All these are signs, for the perceptive, that as far as database functionality is concerned, the center of gravity is shifting away from monolithic, centralized data management toward massively parallel, distributed data management. The days of Data 1.0 are past. The days of Data 2.0 are dawning, and they promise to be very disruptive for mainstream database architectures on the Web.

Nitin Borwankar is a database guru based in the San Francisco Bay Area. You can find his writings on his blog, TagSchema.

15 Responses to “Data 2.0: How the Web disrupts our relational database world”

  1. You might be interested in a software product that turns this relationship upside down. A recent presentation showed a Web 2.0 software platform that used social bookmarks to tag structured and unstructured data using a unique knowledge capture process. You and your readers might be interested.

  2. I believe that databases are not going to go away anytime soon; however, their role as a solution for everything is going to change quite drastically due to some of the limitations mentioned in this article. You can read more about it in one of my earlier posts:

    Putting the Database Where It Belongs

    There are a number of alternative technologies mentioned in this article as ones that could fill the gap, such as Lucene (search engine), memcached (in-memory caching), and MapReduce (parallel processing). The question is which one to use, and when, and whether they can be integrated together in our architecture.

    One combination is integration between Lucene and distributed caching technologies, which enables the scaling-out of Lucene. Compass is an open-source project that does just that by integrating Lucene and the GigaSpaces In-Memory Data Grid – see more details here

    Another approach is to put an in-memory cloud acting as the front end to the database, while enabling transparent and reliable synchronisation of the data in the cloud with the data that lives in the database. This has lots of benefits, since it fits very nicely with a distributed data model – you can read more about it here: Persistency As A Service

    Nati S.

  3. That’s all well and good, but I see these things (MapReduce, Lucene) more as tools which supplement a relational data model, and are used for more specialised search and computation tasks (unstructured text indexing; massively-scalable data mining).

    They are not, and will never be, a replacement for the relational model.

    By the way, the relative inefficiency of current RDBMS offerings at recursive, graph-based problems is not a flaw in the relational model itself – it’s a flaw in SQL and its current implementations. Graph-based search actually fits very well into the conceptual framework of the relational model, and I expect to see more work on graph support amongst current relational database implementations.
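
    As a sketch of that point, recursive common table expressions (now part of standard SQL) can express graph reachability purely relationally. The example below runs one in SQLite via Python, over an illustrative friend table:

```python
import sqlite3

# Graph reachability expressed relationally via a recursive CTE,
# using SQLite in memory (table and rows are illustrative).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE friend (a TEXT, b TEXT)")
con.executemany("INSERT INTO friend VALUES (?, ?)",
                [("ann", "bob"), ("bob", "cat"), ("cat", "dan")])

rows = con.execute("""
    WITH RECURSIVE reach(name) AS (
        VALUES ('ann')                               -- base case: start node
        UNION                                        -- UNION dedups, so cycles terminate
        SELECT friend.b FROM friend
        JOIN reach ON friend.a = reach.name          -- recursive step: one more hop
    )
    SELECT name FROM reach WHERE name != 'ann' ORDER BY name
""").fetchall()

print([r[0] for r in rows])  # → ['bob', 'cat', 'dan']
```

    The transitive closure falls out of a few lines of SQL — the model handles it; it's the older implementations (and pre-CTE SQL) that didn't.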

  4. Bill Riski

    […] the center of gravity is shifting away from monolithic centralized data management to massively parallel distributed data management.[…]

    Good article, and I like the way you’ve characterized the problem. But it seems like an awfully big leap of faith to state that this is the solution. The article doesn’t provide much rationale for this statement. Just because Google is doing it doesn’t mean it’s the only solution.

    Don’t you think?