Webscale Databases: Is Open Source Really Necessary?


When it comes to deploying databases — or any infrastructural pieces, really — at web scale, many large sites opt to “go cheap, go custom or go home.” Given their unique needs, this credo makes sense, but I wonder if the companies following it aren’t making more work for themselves than is necessary. Might the resources spent developing open-source projects or building tools from scratch not become extraneous if companies could buy solutions that would work just fine?

Isn’t it plausible that a proprietary vendor (Oracle, let’s say) could launch a webscale database or analytics solution that would do the trick for a company like Facebook? If there’s one thing Larry Ellison knows better than relational databases, it’s how to make a buck. Hypothetically speaking, Oracle could offer database and data-analysis solutions that could save a company like Facebook from having to act like a software company itself. It certainly hasn’t hesitated to buy its way into alternative markets in the past.

Another consideration is where web companies draw the line regarding commercial solutions: Is an open-source but subscription-based vendor like Red Hat out of the question? What about any of the emerging startups tackling file systems, memcached and other issues?

I’m not suggesting that Facebook et al are heading down the garden path with their current approaches, or that there’s a glut of proprietary products on the market, only that it’s not inconceivable that commercial vendors could meet the needs of these companies. You can read my full column over at GigaOM Pro (subscription required). What do you think? Are open-source and DIY solutions really the best bet for webscale companies?


Jürgen Messing

Many associate Oracle with RDBMS. And I have seen many business solutions, esp. in telecoms, that do not ask which persistence technology best meets the application’s requirements; they use Oracle because they always have, and the customer requires it because it always has.
Not long ago Oracle bought good old key-value veteran Berkeley DB from Sleepycat, and it markets it with a dual license: open source and commercial. Yes, we are talking about Oracle. Many if not most of the newer kind of custom DB solutions are based on nothing else than Berkeley DB, for example, LinkedIn’s Voldemort backing store.
Writing something like Berkeley DB on your own is simply stupid. It will take you years to get it done right. So, use it. Ha, it’s open source. Weaving it into your application is not that difficult or cumbersome, but it will give you a great opportunity to improve application performance. Social media platforms and highly scalable cloud apps especially will benefit from that approach.
From my personal experience, open source projects that come paired with a commercial license have the best coding quality and documentation. And getting valuable and immediate feedback from developers who are actually using it in real world no-nonsense products is something even the biggest companies cannot leverage.
And if you get huge, buy the commercial license. Even if you don’t need it. Be kind.

Derrick Harris

@ everyone, really

If you haven’t read the full post on GigaOM Pro, I think it adds some important points to my argument (if you want to call it that). I’m not advocating for this model, just suggesting that with a set of hypothetical circumstances (like Oracle buying its way into this market) and the growing numbers of products designed to address webscale needs, it’s not inconceivable that COTS could do the trick.

It’s the same principle that guides cloud computing — spend your resources on your core competency. eBay, for example, is an auction platform and developed backend tools only because it couldn’t buy them. These types of needs are not so unique any more, and IT vendors are trying, at least, to build products that address them.


I don’t see a clear line between building it yourself and tailoring open source software to your needs. In fact I think the approaches are not mutually exclusive; they are complementary.

The options should actually be Build-your-own versus buy packaged or something in between. Many products allow you to extend them to your needs. And if the vendor wants to charge you millions to do that, then you’re not using the right product.


Either this article is a clever click-bait, or a bit naive.

Of course, as with any software development, there are 3 options, viz. a) COTS b) DIY c) OSS.

COTS (Commercial, Off-The-Shelf) database software is for companies that use computers but are primarily in other businesses. Think of Nike or Walmart. It makes perfect sense that these companies use Oracle or SAP or whatever else to manage their businesses.

Open Source Software (inc. crowd sourcing..) competes with COTS (like Linux and Windows) but more often than not, requires value-added-services to qualify for mass adoption.

DIY software is the route you’d go when you are in a niche / highly specialized domain. Google database software runs only within Google. And even if it were available for use by others, no more than a handful would want (and afford) to use it. Ditto with the innards of Facebook and Twitter.

For a COTS company to make a specialist solution is no mean task. Firstly, it must unlearn the jack-of-all-trades concept and learn the agility that a web company fundamentally needs. Next, it needs to build evolving systems that scale exponentially (and not linearly), and this involves thinking in a fundamentally different way. Finally, it will need to accept that a webscale database is more a project than a product – meaning that it will have no more than a grand dozen customers, each making outrageous demands – instead of thousands of companies accepting standard shrink-wrapped software.

One can equally well argue the case (or lack of it) for open source software for web databases. It is not impossible to do so, but as long as there is business intelligence embedded in here, is it practical to expect credible open source software for this? Essentially, OSS works best when there is a standard – explicit or de facto. Think of TCP/IP or Linux, Android. But Bigtable? Fat chance.

If I were running a webco, I wouldn’t ask a Larry to cook my DB. I would probably invest in a research project that can isolate my business intelligence from a “normal” web-scale DB, and then try to get industry traction for an OSS DB that allows me to plug-in my business intelligence. If such a solution is already available (or in the making), I would gladly join in. But Oracle or SAP, I wouldn’t bother looking them up.

Darwin Ling

Cost is the top reason why these companies used open source software to start with. They were all once start-ups with shoestring budgets. Oracle’s cost structure simply would not work for them. And as these companies amassed users and volume, switching to a software stack like Oracle’s doesn’t help much. The fundamental problems are not about how fast a DB can perform a lookup. It is all about the DB design, or how I can partition my DB to support the next 10 million users without some major surgery on the existing infrastructure. Both the LAMP and Oracle stacks will face the same challenges. I am not aware of any NoSQL-type solutions offered by Oracle. But if it offered one, with an appropriate pricing structure, Ellison might be able to package this up as a compelling solution for start-ups. In my opinion, it will be even more compelling if the package includes an analytics solution.
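To make the partitioning point above concrete, here is a minimal sketch in Python of why the placement scheme matters when a shard is added. It is not taken from any of the systems named in this thread; the shard names (`db0` to `db4`), the `user:N` keys and the `HashRing` class are all hypothetical. It compares naive modulo placement with a consistent-hash ring and measures what fraction of keys would have to move when going from four shards to five.

```python
import bisect
import hashlib


def stable_hash(value: str) -> int:
    """Deterministic 64-bit hash (Python's built-in hash() is salted per process)."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def modulo_shard(key: str, shard_count: int) -> int:
    """Naive placement: shard index is the key hash modulo the shard count."""
    return stable_hash(key) % shard_count


class HashRing:
    """Hypothetical consistent-hash ring with several virtual nodes per shard."""

    def __init__(self, shards, vnodes=100):
        # Each shard owns `vnodes` points on the ring, sorted by hash value.
        self._points = sorted(
            (stable_hash(f"{shard}#{i}"), shard)
            for shard in shards
            for i in range(vnodes)
        )
        self._hashes = [h for h, _ in self._points]

    def shard_for(self, key: str) -> str:
        # The first ring point clockwise from the key's hash owns the key.
        idx = bisect.bisect_right(self._hashes, stable_hash(key)) % len(self._points)
        return self._points[idx][1]


def moved_fraction(before, after, keys):
    """Fraction of keys whose shard differs between two placement functions."""
    moved = sum(1 for k in keys if before(k) != after(k))
    return moved / len(keys)


keys = [f"user:{n}" for n in range(10_000)]

# Going from 4 to 5 shards with modulo placement reshuffles most keys.
modulo_moved = moved_fraction(
    lambda k: modulo_shard(k, 4), lambda k: modulo_shard(k, 5), keys
)

# With the ring, only keys claimed by the new shard change owner.
old_ring = HashRing(["db0", "db1", "db2", "db3"])
new_ring = HashRing(["db0", "db1", "db2", "db3", "db4"])
ring_moved = moved_fraction(old_ring.shard_for, new_ring.shard_for, keys)

print(f"modulo placement moved {modulo_moved:.0%} of keys")
print(f"hash ring moved {ring_moved:.0%} of keys")
```

With the ring, a key only changes owner if the new shard claims it, which is roughly the "no major surgery" property the comment is asking for. Real systems add replication, rebalancing and failure handling on top of this.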


Had Google waited for commercial systems, we would never have had search services that didn’t take forever to return results. Facebook, Yahoo, Google etc. know better when it comes to deploying webscale services. Heck, Oracle can’t even compete with Teradata, Netezza or Greenplum data warehouse appliances. Commercial software vendors are way behind the curve when it comes to webscale architectures and the unique challenges they throw up. Besides, it is impossible for them to predict and develop systems that would earn big bucks in the future; thus, they stick with already deployed systems and milk clients by providing incremental upgrades.


I don’t see how buying this software could ever make sense, considering that there are only a handful of potential customers who would use it at scale, and they’re all software development houses. Heck, something like Bigtable probably only needs a dozen or so full-time engineers to write and maintain the code–call it $3-5m/year. Just the test lab that Oracle (or whoever) would need to be able to run full-scale tests would cost more than that annually. Plus, they’d really need to be able to replicate the performance bugaboos of the internal systems at Google, Yahoo, Facebook, MS, etc. Do they have 10gigE to each machine, or 1 gigE, or Infiniband? Are they sharing uplink bandwidth with other users? How much latency do you see across a 10,000+ machine cluster? What sort of drives are involved? Which OS/kernel is in use? How do you integrate into each company’s in-house monitoring framework?
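As a quick check on the staffing estimate above, the arithmetic can be written out in a few lines of Python. The only assumption added here is reading "a dozen or so" as exactly 12 engineers; the $3-5m range is the comment's own figure.

```python
# Assumption for illustration: "a dozen or so" is taken as exactly 12.
engineers = 12

# Annual budget range quoted in the comment, in US dollars.
annual_budget_low = 3_000_000
annual_budget_high = 5_000_000

# Implied fully loaded cost per engineer per year.
cost_per_engineer_low = annual_budget_low / engineers
cost_per_engineer_high = annual_budget_high / engineers

print(
    f"${cost_per_engineer_low:,.0f} to ${cost_per_engineer_high:,.0f} "
    "per engineer per year"
)
```

That works out to roughly $250,000 to $417,000 per engineer per year, so the estimate implies a fully loaded cost per head rather than salary alone.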

What happens when an unexpected power outage in the middle of the night exposes a kernel flaw that has left slightly corrupt data on 1,000 drives? Waking up employees with a P0 bug is relatively easy, but what does it cost to get a vendor to respond quickly enough?

If Facebook suddenly launched an internal initiative to change their on-disk storage format within the next 6 months, because it’d save $50m company-wide, could Oracle actually respond quickly enough?

At this scale, building it yourself or hiring a boatload of open-source developers is really the only thing that makes sense. The math is more or less the same for a lot of datacenter infrastructure–see http://netseminar.stanford.edu/seminars/10_25_07.ppt, where Microsoft explains why it’s stupid to buy large-scale Ethernet switches when you can build your own.

Richard Cunningham

Facebook were using NetApp for storage of photos and they moved away to write their own because it was too expensive. Essentially you pay again and again for the same software with something like NetApp or Oracle; on a big scale this is a massive cost, unless the vendor gives it away as a special. Writing their own is actually cheaper at that scale.

It’s small companies that shouldn’t write their own, though often the best solutions are open source; Oracle, for example, is very complex and requires expensive dedicated staff to run it in house.

Barry Kelly

It would be better for your credibility if you ran your ideas past a competent engineer. I’m guessing you didn’t learn about RDBMSes at law school, and how a distributed database like Cassandra isn’t particularly similar to them.

Oracle sells very expensive database solutions to non-software companies, for them to run in centralized clusters, “optimized” for general-purpose querying. I say “optimized” because the jack of all trades is the master of none; so it has to work very hard to become somewhat decent at anything. Cassandra is a distributed data store that runs in multiple different data centres and handles the consistency issues that arise from that infrastructure. But retrieval works on the basis of a single key rather than generalized SQL queries, which is far simpler and much easier to optimize.
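The difference between single-key retrieval and generalized queries can be sketched with a toy partitioned store in Python. This is not Cassandra’s API or implementation; the `PartitionedStore` class, its methods and the sample rows are hypothetical and exist only to count how many partitions each kind of read has to touch.

```python
import hashlib


class PartitionedStore:
    """Toy key-value store split across partitions (illustrative only)."""

    def __init__(self, partition_count: int):
        self.partitions = [{} for _ in range(partition_count)]
        self.partitions_touched = 0  # crude stand-in for network round trips

    def _index(self, key: str) -> int:
        # The key alone determines which partition holds the row.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.partitions)

    def put(self, key: str, row: dict) -> None:
        self.partitions[self._index(key)][key] = row

    def get(self, key: str):
        # Single-key read: exactly one partition is consulted.
        self.partitions_touched += 1
        return self.partitions[self._index(key)].get(key)

    def scan(self, predicate):
        # Ad hoc query on a non-key attribute: every partition is consulted.
        matches = []
        for partition in self.partitions:
            self.partitions_touched += 1
            matches.extend(row for row in partition.values() if predicate(row))
        return matches


store = PartitionedStore(partition_count=8)
for n in range(100):
    city = "Berlin" if n % 10 == 0 else "Oslo"
    store.put(f"user:{n}", {"id": n, "city": city})

store.partitions_touched = 0
row = store.get("user:42")
key_cost = store.partitions_touched

store.partitions_touched = 0
berliners = store.scan(lambda r: r["city"] == "Berlin")
scan_cost = store.partitions_touched

print(f"key lookup touched {key_cost} partition(s)")
print(f"attribute query touched {scan_cost} partition(s)")
```

The key lookup goes to one partition no matter how many exist, while the attribute query fans out to all of them. That asymmetry is one reason a store built around key access is simpler to scale than a general-purpose query engine.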

Bill Hurst

You really don’t want an Oracle DB for something like Facebook or most other typical web apps. And you really don’t need a bunch of Formula 1 engineers to build a faster and better-suited solution based on open source tools.

Grzegorz Daniluk

Oracle DB is a general solution, which works very well for many applications. FB is building its own solution for a very specific application. However, this makes sense only for the biggest players.

BMW M series seem to be fast but F1 guys are building their own cars.
