The great online shift is creating massive amounts of data – whether it is videos on YouTube or social networking profiles on MySpace. And that data is stored in databases, making them the key component of the new web infrastructure. But managing that information isn’t easy, and there are signs that database management will be vastly different in the future.
By Nitin Borwankar
Relational databases are to software what mainframes are to networked hardware: the monolithic beast at the core that needs magic incantations from high priests to run, and consumes unsuspecting junior engineers for breakfast.
We love to hate them, but we can’t do without them. As much as anyone predicts or wishes they would go away, databases just grow more and more indispensable as Internet users create more and more unstructured data on an unprecedented scale, and at an unprecedented rate.
The good news is that database management will be vastly different in the future. In fact, change has already begun; it just isn’t (cliché alert!) “evenly distributed” yet.
The demands of data management in the Internet era have already spawned new ways of handling large volumes of “old-school,” or structured, data. Yahoo (YHOO) long ago created its own user management software, based on bsd-db (Berkeley DB), to help organize “name:value” data pairs, like those created by “user name and password” profiles.
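The appeal of a Berkeley DB-style store is that a profile collapses into flat name:value pairs looked up directly by key, with no schema or query planner in between. A minimal sketch using Python's standard-library `dbm.dumb` as a stand-in for Berkeley DB (the keys and values here are illustrative):

```python
import dbm.dumb  # pure-Python key-value store from the standard library
import os
import tempfile

# A user profile reduced to flat name:value pairs, the shape that
# Berkeley DB-style stores handle natively (no SQL, no schema).
path = os.path.join(tempfile.mkdtemp(), "users")
db = dbm.dumb.open(path, "c")
db[b"user:alice:password_hash"] = b"5f4dcc3b"
db[b"user:alice:email"] = b"alice@example.com"

# Direct lookup by key; no joins, no query planning.
print(db[b"user:alice:email"])  # b'alice@example.com'
db.close()
```

The trade-off is deliberate: you give up ad hoc queries in exchange for constant-time lookups that scale to enormous numbers of users.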
More recently, Google (GOOG) engineers published a paper on “MapReduce,” essentially a parallel computing framework designed to aggregate and process large data sets generated by Web apps and searching.
At OSCON 2007, Yahoo engineer Doug Cutting (also the author of Lucene and Nutch) talked about how the company is using and backing "Hadoop," an open-source implementation of MapReduce, for massive data mining of Web server logs.
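The map/shuffle/reduce cycle these frameworks distribute across clusters can be sketched in a single process. Word count is the canonical example; the function and variable names here are illustrative, not from any real framework:

```python
from collections import defaultdict

def map_phase(doc):
    # Emit intermediate (key, value) pairs; here, one count per word.
    for word in doc.split():
        yield word.lower(), 1

def shuffle(pairs):
    # Group values by key, as the framework does between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # Combine all values seen for one key into a final result.
    return key, sum(values)

docs = ["the quick brown fox", "the lazy dog"]
pairs = (p for doc in docs for p in map_phase(doc))
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["the"])  # 2
```

Because map and reduce are pure functions over key/value pairs, the framework is free to run them on thousands of machines at once, which is exactly what makes the model suited to Web-scale log mining.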
Danga Interactive recently released "memcached," an open-source software layer that redistributes traditional database caching so that more data requests can be fulfilled without ever hitting the database at all.
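The pattern memcached enables is cache-aside: check the cache first, and only fall through to the database on a miss. A minimal in-process sketch, with a plain dict standing in for a memcached cluster (real clients hash keys across many cache nodes; all names here are illustrative):

```python
import time

cache = {}  # stand-in for a memcached cluster
TTL = 30.0  # seconds before a cached entry goes stale

def expensive_db_query(user_id):
    # Placeholder for a real SELECT against the database.
    return {"id": user_id, "name": f"user-{user_id}"}

def get_user(user_id):
    key = f"user:{user_id}"
    hit = cache.get(key)
    if hit and hit[0] > time.time():
        return hit[1]                       # served from cache; no DB hit
    value = expensive_db_query(user_id)     # cache miss: go to the database
    cache[key] = (time.time() + TTL, value) # populate for later requests
    return value
```

With most reads absorbed by cheap cache lookups, the database only sees misses and writes, which is how sites like LiveJournal kept a handful of database servers behind millions of users.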
Amazon S3 and SQS, meanwhile, let operators externalize the storage and workload queuing typical of massive database systems, two functions conventional databases could never previously offload to the Web.
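What a hosted queue like SQS buys you is decoupling: producers drop messages into the queue and workers drain it at their own pace, so neither side ever blocks on the database or on each other. A single-process sketch using Python's standard-library queue as a stand-in for the hosted service (the job strings are illustrative):

```python
import queue
import threading

jobs = queue.Queue()  # stand-in for a hosted queue such as SQS
results = []

def worker():
    # Consumers pull work at their own pace, independent of producers.
    while True:
        msg = jobs.get()
        if msg is None:   # sentinel: no more work
            break
        results.append(msg.upper())  # placeholder for real processing
        jobs.task_done()

t = threading.Thread(target=worker)
t.start()
for msg in ["resize image", "send email"]:
    jobs.put(msg)  # producers never talk to the worker directly
jobs.put(None)
t.join()
print(results)
```

Swap the in-process queue for a network service and the producers and consumers no longer even need to run in the same data center, which is the externalization the article describes.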
Of course, the largest hauls of unstructured data come from Internet search. Lucene is the standard-bearer for managing them, followed by its descendants Nutch and Solr. All three externalize the indexing function needed to manage unstructured data, something an old relational database cannot do well. This is not your father's database.
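The structure these engines externalize is the inverted index: a map from each term to the documents that contain it, so a query becomes a set intersection rather than a table scan. A toy sketch of the idea (not Lucene's actual API; the documents are made up):

```python
from collections import defaultdict

def build_index(docs):
    # Map each term to the set of document ids containing it --
    # the core structure behind Lucene-style full-text search.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, *terms):
    # AND query: intersect the posting lists of all query terms.
    sets = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*sets) if sets else set()

docs = {1: "relational databases scale up",
        2: "search engines scale out",
        3: "databases and search both index data"}
index = build_index(docs)
print(search(index, "databases", "search"))  # {3}
```

A relational engine indexes structured columns; an inverted index treats the free text itself as the thing to be indexed, which is why search was pulled out of the database in the first place.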
Social networks create reams of data, too. Relational databases in the '80s and '90s had trouble managing hierarchical data sets, or "nested parts subassemblies," a.k.a. "tree structures." This was called the "parts explosion" problem.
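The parts-explosion problem is easy to state: expanding a top-level assembly into its leaf parts requires walking a tree of unknown depth, which a pre-recursive-SQL database could only do with repeated self-joins. A sketch of the traversal itself, with a made-up bill of materials:

```python
# Bill of materials as parent -> children, the adjacency-list shape
# that forced repeated self-joins in early relational databases.
assembly = {
    "bike":  ["frame", "wheel", "wheel"],
    "wheel": ["rim", "spoke", "tire"],
    "frame": [],
    "rim": [], "spoke": [], "tire": [],
}

def explode(part):
    # Count every leaf part a top-level assembly expands into.
    children = assembly[part]
    if not children:
        return 1
    return sum(explode(c) for c in children)

print(explode("bike"))  # 1 frame + 2 wheels * 3 parts each = 7
```

Each extra level of nesting adds another join in SQL, but only another recursive call here; that mismatch between trees and flat tables is the heart of the problem.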
But with Web 2.0 we have a new data structure: the social network. These data sets, which are more complex by several orders of magnitude, are prone to what I call the "friends explosion" problem. This arises from attempting to capture and manage the numerous individual data sets/networks — the forest of tree structures, if you will — created by all the "friends" that are intrinsic to social networks. Traditional relational databases are brought to their knees by this problem.
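Why a friends explosion is worse than a parts explosion: a friend graph is not a tree, so each hop outward can multiply the set of people reached, and in SQL every hop is another self-join over an ever-larger intermediate result. A small sketch of the traversal, with an invented friend graph:

```python
from collections import deque

friends = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "dan", "eve"},
    "cat": {"ann", "eve"},
    "dan": {"bob"},
    "eve": {"bob", "cat"},
}

def within_hops(person, hops):
    # Breadth-first search over the friend graph. Reach can grow
    # combinatorially with depth, which is what repeated relational
    # self-joins choke on.
    seen, frontier = {person}, deque([(person, 0)])
    while frontier:
        who, dist = frontier.popleft()
        if dist == hops:
            continue
        for f in friends[who]:
            if f not in seen:
                seen.add(f)
                frontier.append((f, dist + 1))
    return seen - {person}

print(sorted(within_hops("ann", 2)))  # ['bob', 'cat', 'dan', 'eve']
```

Two hops from "ann" already reaches nearly the whole network; on a site with millions of users and hundreds of friends each, the same query is exactly the workload that brings a traditional relational database to its knees.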
All these are signs for the perceptive: as far as database functionality is concerned, the center of gravity is shifting away from monolithic, centralized data management toward massively parallel, distributed data management. The days of Data 1.0 are past. The days of Data 2.0 are dawning, and they promise to be very disruptive for mainstream database architectures on the Web.
Nitin Borwankar is a database guru based in the San Francisco Bay Area. You can find his writings on his blog, TagSchema.