Collectively, Yahoo, Facebook, Amazon and Google are rewriting the handbook for big data. Startups intending to reach these proportions must also change their thinking about data, and enterprises need this model for internal deployments as a way to retain an economic edge. The four leading web giants have designed systems from scratch, evidence that workloads have altered, business models are different, and economies have changed, all demanding a new approach.

Yahoo revealed a few weeks ago how it approaches unstructured data on an Internet scale with MObStor, the technology that “grew out of Yahoo Photos” but now serves the unstructured storage needs across the company. Earlier this year, Facebook unveiled Haystack, its solution to managing its growing photo collection (which could reach 100 billion photos in 2009 if it continues with current growth rates). In 2007, Amazon outlined Dynamo, an “incrementally scalable, highly available key-value storage system.” All of these were predated by The Google File System, presented as a research paper in October 2003.

While no two of these systems are exactly alike, together they represent a complete change from traditional file systems and data stores. The GFS authors note that their design “reflects a marked departure from some earlier file system assumptions,” causing them to “re-examine traditional choices and explore radically different design points.” These are not the systems we once knew.

Since MObStor is the most recently disclosed of the four, let’s take a look at some of its standout characteristics:

  • Designed for petabyte-scale content that is site-generated, partner-generated, or user-generated
  • Handles tens of thousands of page views every second
  • Stores unstructured objects that are mostly images, videos, CSS, and JavaScript libraries
  • Serves a workload in which reads dominate writes (most data is WORM: write-once, read-many)
  • Requires only a low level of consistency
  • Designed to scale quickly and efficiently

These capabilities help Yahoo store and monetize content effectively, and they are a far cry from solutions developed just five to ten years ago. The scale, load, file types, read/write pattern, and consistency requirements represent another world compared with conventional enterprise solutions.
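To make the write-once, read-many pattern concrete, here is a minimal Python sketch of a toy in-memory object store. It is not how MObStor or any of the other systems is implemented. The `WormStore` class, its method names, and the choice to derive object IDs from a SHA-256 hash of the content are all assumptions made purely for illustration.

```python
import hashlib


class WormStore:
    """Toy write-once read-many (WORM) object store, kept in memory.

    Hypothetical illustration only; not based on any vendor's design.
    """

    def __init__(self):
        self._objects = {}  # object ID -> immutable bytes
        self.reads = 0
        self.writes = 0

    def put(self, data: bytes) -> str:
        """Store data once and return its content-derived ID (an assumption)."""
        object_id = hashlib.sha256(data).hexdigest()
        if object_id not in self._objects:
            # First and only write for this content; it is never updated.
            self._objects[object_id] = bytes(data)
            self.writes += 1
        return object_id

    def get(self, object_id: str) -> bytes:
        """Return the stored bytes; objects are never modified in place."""
        self.reads += 1
        return self._objects[object_id]


store = WormStore()
photo_id = store.put(b"fake image bytes")  # one upload
for _ in range(1000):                      # many views
    store.get(photo_id)
print(store.writes, store.reads)
```

Because stored objects never change after the first write, any copy of an object is as good as any other. That is one reason a read-heavy workload like this can tolerate a low level of consistency.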

Perhaps as part of a migration effort, Yahoo’s MObStor incorporates existing storage systems, like NAS filers. This makes sense for Yahoo, which over the years has been one of NetApp’s largest customers. Facebook has jettisoned any attachment to storage devices other than commodity servers with internal drives, at least in Yahoo’s description of Haystack and the Facebook engineering blog post. And Amazon and Google appear to have made this all-commodity move long ago.

The telling shift is the overwhelming focus on smart software on inexpensive servers. This is not how storage industry giants like EMC, IBM, HDS and NetApp were born. But if the advance of Internet computing continues, the Goliath web properties will provide a crystal ball for how we will more broadly handle unstructured data on an Internet scale. Startups reliant on big data for their business have little choice but to innovate as well, finding ways to accelerate time to market and maintain outstanding service. Enterprises handling big data will need to modify their approach, too; otherwise they leave the door open to competitors that will take advantage of these cloud infrastructure economics.

  2. Great article! This applies to our business at http://www.binfire.com perfectly. We have already passed a few terabytes of storage and our storage needs are growing rapidly. We have decided to look into Amazon EC2 and S3 for future expansion.
  3. I was quite impressed when I took a look at Amazon’s Web Services (which implements Google’s MapReduce for some of its operations) and Haystack. The companies above also do a great job sharing the constraints and requirements they address and how they go about solving them. Great brief on the topic, Gary.