
Etsy
Summary:

E-commerce site Etsy has grown to 25 million unique visitors and 1.1 billion page views per month, and it’s generating the data volumes to match. Using tools such as Hadoop and Splunk, Etsy is turning terabytes of data per day into a better product.

Etsy, the e-commerce site specializing in homemade and vintage goods, has grown to more than 11 million users, resulting in 25 million unique visitors and 1.1 billion page views per month, and it’s generating the data volumes to match. Today, for example, Etsy detailed some of its work with Splunk to manage and analyze up to a terabyte of machine data per day.

This is a huge increase — about 200x — since 2007, when Etsy signed on with Splunk, and was capturing a mere 5GB of data per day. Etsy’s usage of Splunk has probably evolved, too, from a focus on troubleshooting (i.e., noticing a problem and tracking down the cause) to a focus on what Splunk calls “operational intelligence.” Because users can search and analyze server logs and other machine-generated data pretty much as it streams in, they can, for example, monitor traffic patterns in real time to uncover ongoing issues that might be causing visitors to drop off pages or leave the site.
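The article doesn't show what that kind of real-time monitoring looks like under the hood, and Splunk's own query language is beyond the scope here. As a toy illustration of the underlying idea only, here is a short Python sketch that scans log lines as they arrive and raises an alert when the server-error rate in a sliding window spikes; the log format and thresholds are invented for the example:

```python
from collections import deque

def error_rate_monitor(lines, window=100, threshold=0.2):
    """Scan log lines as they arrive; yield an alert whenever the
    fraction of 5xx responses in the last `window` lines exceeds
    `threshold`. Each line is assumed to end with an HTTP status code."""
    recent = deque(maxlen=window)
    for line in lines:
        status = line.rsplit(None, 1)[-1]   # last whitespace-separated field
        recent.append(status.startswith("5"))
        if len(recent) == window and sum(recent) / window > threshold:
            yield f"alert: {sum(recent)} server errors in last {window} requests"

# Simulated stream: 90 successful requests, then a burst of errors.
log = ["GET /listing/123 200"] * 90 + ["GET /listing/123 500"] * 30
alerts = list(error_rate_monitor(log, window=100, threshold=0.2))
```

The point is the streaming shape of the computation: nothing is batched up for later, so the alert fires while the problem is still happening.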

Splunk isn’t Etsy’s only big data solution — it’s also a big Hadoop user. Etsy runs dozens of Hadoop workflows each night on Amazon’s cloud-based Elastic MapReduce Hadoop service. According to a very detailed (and technical) presentation (PDF here, video here) explaining Etsy’s Hadoop usage, it ran nearly 5,000 Hadoop jobs in May 2011 to analyze both internal operational data and external activity such as customer behavior. Etsy actually uses MATLAB within its Elastic MapReduce clusters to analyze the data and perform predictive analytics. The presentation also highlights Etsy’s experimentation with Tableau to visualize its internal data after Hadoop has cleaned it up.

At the product level, Hadoop powers Etsy’s Taste Test feature that helps the site determine what products best suit a particular customer’s tastes. It also helps with a feature that analyzes Facebook profile information in order to let visitors shop for their friends. At Hadoop World next week, an Etsy engineer will discuss how Etsy uses Hadoop to improve its search recommendation engine.
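The article doesn’t describe how Taste Test works internally. One common technique behind this kind of feature (not necessarily Etsy’s) is item co-occurrence: items that many of the same people have favorited are probably related. A minimal sketch, with entirely hypothetical data:

```python
from collections import Counter

def cooccurrence_recommend(favorites_by_user, liked_item, top_n=3):
    """Recommend items most often favorited alongside `liked_item`.
    `favorites_by_user` maps a user id to the set of items they favorited."""
    counts = Counter()
    for items in favorites_by_user.values():
        if liked_item in items:
            for other in items - {liked_item}:
                counts[other] += 1
    return [item for item, _ in counts.most_common(top_n)]

# Hypothetical favorites data, standing in for millions of real users.
favorites = {
    "u1": {"wool scarf", "ceramic mug", "tote bag"},
    "u2": {"wool scarf", "ceramic mug"},
    "u3": {"wool scarf", "letterpress card"},
    "u4": {"ceramic mug", "tote bag"},
}
recs = cooccurrence_recommend(favorites, "wool scarf")
```

At Etsy’s scale the counting step is exactly the sort of work that gets pushed into nightly Hadoop jobs rather than computed on demand.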

Operationally, Hadoop helps Etsy analyze server logs to figure out what customers are doing on the site and how they’re accessing it.
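The presentation linked above has the real details; as a hedged illustration of the general pattern, here is what a Hadoop Streaming-style log analysis looks like — a mapper that emits (path, 1) pairs and a reducer that sums them — simulated locally in plain Python, with a made-up log format:

```python
from collections import defaultdict
from operator import itemgetter

def mapper(lines):
    """Map phase: emit (path, 1) for each request line in the access log."""
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:          # expect at least "METHOD /path ..."
            yield parts[1], 1

def reducer(pairs):
    """Reduce phase: sum the counts for each path, busiest first."""
    totals = defaultdict(int)
    for path, count in pairs:
        totals[path] += count
    return sorted(totals.items(), key=itemgetter(1), reverse=True)

# Locally simulate the map -> shuffle -> reduce pipeline that Hadoop
# Streaming would distribute across a cluster.
log = [
    "GET /listing/123 200",
    "GET /listing/123 200",
    "GET /shop/handmade 200",
]
hits = reducer(mapper(log))
```

The same two functions, run as separate scripts reading stdin and writing stdout, are all Hadoop Streaming needs to fan the work out over terabytes of logs.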

Etsy, like so many other companies — especially on the web — is both drowning in data and trying to leverage it. That’s why we’re seeing such a huge focus on Hadoop among all varieties of enterprise data-management vendors, and why you can’t escape the omnipresent references to “big data.” But as Etsy proves, dealing with big data requires a multi-pronged approach that goes well beyond simply deploying a Hadoop cluster and watching the insights pour in.

  1. Emilia Palaveeva Wednesday, November 2, 2011

    Very interesting case study from an unexpected big data user. Underscores the pervasiveness of big data…

  2. Is there anywhere I can get a big data solution like what etsy has built for my company?

    1. @ Big Data: There are many Hadoop alternatives out there that may be suitable for your company. The HPCC Systems platform is among them for tackling Big Data problems. Unlike Hadoop distributions which have only been available since 2009, HPCC is a mature platform, and provides for a data delivery engine together with a data transformation and linking system equivalent to Hadoop. The main advantages over other alternatives are the real-time delivery of data queries and the extremely powerful ECL language programming model. More at http://hpccsystems.com

  3. @hideh Etsy comes to mind first – perhaps they’d be interested in hosting a big-data-focused gathering. See here: http://t.co/9sqsl4lt

  4. How Etsy handcrafted a big data strategy http://t.co/hQOUv3aO

