The great online shift is creating massive amounts of data – whether it is videos on YouTube or social networking profiles on MySpace. And that data is stored in databases, making them the key component of the new web infrastructure. But managing that information isn’t easy, and there are signs that database management will be vastly different in the future.

By Nitin Borwankar

Relational databases are to software what mainframes are to networked hardware: the monolithic beast at the core that needs magic incantations from high priests to run, and consumes unsuspecting junior engineers for breakfast.

We love to hate them, but we can’t do without them. As much as anyone predicts or wishes they would go away, databases just grow more and more indispensable as Internet users create more and more unstructured data on an unprecedented scale, and at an unprecedented rate.

The good news is that database management will be vastly different in the future. In fact, change has already begun; it just isn’t (cliché alert!) “evenly distributed” yet.

The demands of data management in the Internet era have already spawned new ways of handling large volumes of “old-school,” or structured, data. Yahoo (YHOO) long ago created its own user management software, based on bsd-db (Berkeley DB), to help organize “name:value” data pairs, like those created by “user name and password” profiles.
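To make the “name:value” idea concrete, here is a minimal sketch of a key-value profile store. It uses Python's standard-library dbm.dumb module as a stand-in for Berkeley DB, and the function names, file name and sample values are assumptions made up for illustration; nothing here reflects Yahoo's actual software.

```python
import dbm.dumb
import os
import tempfile

def save_profile(path, user_name, password_hash):
    """Store one name:value pair; the key is the user name."""
    with dbm.dumb.open(path, "c") as db:
        db[user_name.encode()] = password_hash.encode()

def load_profile(path, user_name):
    """Look a value up by key. No SQL, no joins, no schema."""
    with dbm.dumb.open(path, "c") as db:
        value = db.get(user_name.encode())
    return value.decode() if value is not None else None

# Hypothetical sample data in a throwaway directory.
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "profiles")
save_profile(path, "nitin", "hash-of-secret")
found = load_profile(path, "nitin")
missing = load_profile(path, "nobody")
```

The whole interface is "put a value under a key, get it back by key," which is why this kind of store can be simpler to operate at scale than a general-purpose relational engine.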

More recently, Google (GOOG) engineers published a paper on “MapReduce,” essentially a parallel computing framework designed to aggregate and process large data sets generated by Web apps and searching.
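For readers who want to see the shape of the idea, here is a single-process sketch of the map, shuffle and reduce steps, counting tokens in a couple of made-up log lines. The function names and the sample data are assumptions for illustration only; the real framework distributes these same phases across many machines and handles failures, which this sketch does not attempt.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Map step: emit a (token, 1) pair for every token in one document."""
    return [(token.lower(), 1) for token in document.split()]

def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single total."""
    return key, sum(values)

def word_count(documents):
    # Each map_phase call is independent, so in a real cluster the
    # documents could be mapped on different machines in parallel.
    mapped = chain.from_iterable(map_phase(d) for d in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

# Hypothetical log lines.
logs = ["GET /index GET /about", "GET /index POST /login"]
counts = word_count(logs)
```

The point of the design is that neither the map step nor the reduce step needs to know about any other document or key, so the work can be split up without a central database coordinating it.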

At OSCON 2007, Yahoo engineer Doug Cutting (also the author of Lucene and Nutch) talked about how the company is using and backing “Hadoop,” an open-source implementation of MapReduce, for the massive data mining of Webserver logs.

Danga Interactive recently released “memcached,” an open-source software layer that actually redistributes traditional database caching so that more “data requests” can be fulfilled without ever hitting the database at all.
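The usual way an application puts such a layer in front of a database is to check the cache first and only fall through to the database on a miss. The sketch below shows that pattern with a plain Python dict standing in for the cache; the class names are hypothetical, and this is not the memcached client API. A real deployment would also need expiry, eviction and a way to spread keys across several cache servers.

```python
class SlowDatabase:
    """Hypothetical stand-in for a relational database; counts its queries."""

    def __init__(self, rows):
        self.rows = rows
        self.queries = 0

    def fetch(self, key):
        self.queries += 1
        return self.rows.get(key)

class CachedStore:
    """Check the cache first; hit the database only on a miss."""

    def __init__(self, database):
        self.database = database
        self.cache = {}  # assumption: a dict in place of a cache cluster

    def get(self, key):
        if key in self.cache:
            return self.cache[key]
        value = self.database.fetch(key)
        if value is not None:
            self.cache[key] = value  # remember it for the next request
        return value

db = SlowDatabase({"user:1": "alice"})
store = CachedStore(db)
first = store.get("user:1")   # miss: goes to the database
second = store.get("user:1")  # hit: served from the cache
```

After the two requests above, the database has been queried once, which is the effect the paragraph describes.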

Amazon S3 and SQS, meanwhile, allow operators to externalize the storage and workload queuing that is typical of massive database systems, two things conventional databases could never have previously done on the Web.

Of course, most unstructured data is created by Internet search. Lucene is the standard bearer for managing such massive data hauls, followed by descendants Nutch and Solr. All three manage to externalize the indexing function necessary for managing unstructured data, something an old relational database cannot do well. This is not your father’s database.
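The core data structure behind this kind of externalized indexing is the inverted index: a map from each term to the documents that contain it. Here is a toy version, with made-up documents and function names; it is not Lucene's API, and it leaves out everything that makes a real search engine useful (tokenizing rules, ranking, on-disk segments).

```python
from collections import defaultdict

def build_index(documents):
    """Map every term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical documents.
docs = {
    1: "relational database tuning",
    2: "distributed database caching",
    3: "distributed search index",
}
index = build_index(docs)
hits = search(index, "distributed database")
```

A query becomes a few set intersections over precomputed term lists instead of a scan over every row of text.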

Social networks create reams of data, too. Relational databases in the ’80s and ’90s had trouble managing hierarchical data sets or “nested parts subassemblies” – a.k.a. “tree structures.” This was called the “parts explosion” problem.

But with Web 2.0 we have a new data structure called the social network. These data sets, which are more complex by several orders of magnitude, are prone to what I call the “friends explosion” problem. This arises from attempting to capture and manage the numerous individual data sets/networks (the forest of tree structures, if you will) created by all the “friends” that are intrinsic to social networks. Traditional relational databases are brought to their knees by this problem.
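A small example may help show why this gets expensive. The sketch below walks a tiny, made-up friendship graph breadth-first and records how many hops away each person is. The names and the function are hypothetical. In a relational schema, each extra hop would typically mean another self-join on a friendship table, and the number of people reached can grow quickly with each hop.

```python
from collections import deque

def friends_within(graph, start, max_hops):
    """Breadth-first walk: map each reachable person to their hop distance."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        if seen[person] == max_hops:
            continue  # do not expand past the hop limit
        for friend in graph.get(person, ()):
            if friend not in seen:
                seen[friend] = seen[person] + 1
                queue.append(friend)
    del seen[start]  # the starting person is not their own friend
    return seen

# Hypothetical friendship graph (each edge listed in both directions).
graph = {
    "ann": ["bob", "cat"],
    "bob": ["ann", "dan"],
    "cat": ["ann", "dan", "eve"],
    "dan": ["bob", "cat", "fay"],
    "eve": ["cat"],
    "fay": ["dan"],
}
reach = friends_within(graph, "ann", 2)
```

Here ann has two direct friends and two friends-of-friends; fay is three hops away and is left out. With six people this is trivial, but with millions of users and hundreds of friends each, every additional hop multiplies the work.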

All these are signs for the perceptive: as far as database functionality is concerned, the center of gravity is shifting away from monolithic centralized data management to massively parallel distributed data management. The days of Data 1.0 are past. The days of Data 2.0 are dawning, and Data 2.0 promises to be very disruptive for mainstream database architectures on the Web.

Nitin Borwankar is a database guru based in the San Francisco Bay Area. You can find his writings on his blog, TagSchema.

  1. GigaOM The New Data Management, Green Linux & The Flipper « Friday, August 10, 2007

    [...] isn’t easy, and there are signs that database management will be vastly different in the future. Read more [...]

  2. Louisville real estate Saturday, August 11, 2007

    Great post, I especially liked the term, “friends explosion” problem. Thanks for making me smile!

  3. GigaOM The Future Of Software: The Series So Far « Sunday, August 12, 2007

    [...] Data 2.0: How the Web disrupts our relational database world [...]

  4. [...] the center of gravity is shifting away from monolithic centralized data management to massively parallel distributed data management.[...]

    Good article and I like the way you’ve characterized the problem. But it seems like an awfully big leap of faith to state that this is the solution. The article doesn’t provide much rationale for this statement. Just because Google is doing it doesn’t mean it’s the only solution.

    Don’t you think?

  5. That’s all well and good, but I see these things (MapReduce, Lucene) more as tools which supplement a relational data model, and are used for more specialised search and computation tasks (unstructured text indexing; massively-scalable data mining).

    They are not, and will never be, a replacement for the relational model.

    By the way, the relative inefficiency of current RDBMS offerings at recursive, graph-based problems is not a flaw in the relational model itself – it’s a flaw in SQL and its current implementations. Graph-based search actually fits very well into the conceptual framework of the relational model, but I expect to see more work on support for it amongst current relational database implementations.

  6. Data 2.0 – web databases of the future « The Wow! Report Tuesday, August 28, 2007

    [...] Read the article here [...]

  7. Data Digga » Data 2.0 – web databases of the future Tuesday, August 28, 2007

    [...] Read the article here Published in: Uncategorized [...]

  8. eliteab » Blog Archive » Data 2.0: How the Web disrupts our relational database world Monday, October 22, 2007

    [...] check the full story here [...]

  9. Amazon SimpleDB 101 & Why It Matters – GigaOM Friday, December 14, 2007

    [...] we’ve already noted, …the center of gravity is shifting away from monolithic centralized data management to [...]

  10. Another article that I’d like to see with examples/scenarios. Do you realize that even with 10 years of IT experience, I don’t understand what you just noted?
