Blog Post

Why a DIY Big Data Stack Is a Better Option

Today, many conversations within the big data community are centered around the rise of the standard big data stack, which includes utilities like HDFS, HBase, and other increasingly popular applications. While settling on a standard big data stack is deeply important to the big data industry as a whole, I’m nonetheless questioning the operational and competitive consequences for companies that choose to buy into this standard without first considering the value of building their own proprietary solution.

Where We Are Today: Limited Choice

First, let me say that the Hadoop big data stack is an impressive achievement and something many respectable big data players have been rooting for. Players like Infochimps, Cloudera and Riptano have shown significant support and found success. With vendors like these as evidence, it’s clear the Hadoop stack is great for optimal data processing in many scenarios.

Before people get up in arms, I’ll say that there really is no perfect data solution that works for every situation. This realization led us to build our own solution.

The Road Less Traveled

At 80legs, we chose the proprietary solution (a far more appealing option than most would acknowledge today) for many reasons. For one, we have a unique need to collect large volumes of data when crawling millions of websites, rather than simply process it. Our stack has dramatically lower costs as a result of our unique situation, and unlike the standard big data stack, ours doesn’t need to support storing large volumes of data. Instead, the nodes store data in the same place it is processed.

      1. A Unique Need for Collecting, Rather than Processing Big Data: 80legs is a web crawling service. While most of the big data world is focused on how to process data, we (and our customers) worry about how to get that data in the first place. Across all 80legs users, we’re crawling between 15 and 30 million web pages per day. That requires downloading 10 TB of data every month. We want to be doing at least 5x that by the end of the year. The standard big data stack isn’t ready to scale in this manner. Rather than spending a fixed amount on bandwidth for this data influx, our stack shifts the bandwidth work to the nodes. We have about 50,000 computers using their excess bandwidth. You can forget AWS if you want to do this much crawling – the bandwidth cost you’ll incur will be too much.
      2. Dramatically Lower Costs: Shifting the cost of bandwidth to the nodes is a bigger deal than you might think. The standard big data stack would have you pay for bandwidth in terms of, well, bandwidth. Our system can scale to more CPU time and bandwidth than most clouds, instantly – that’s the big payoff. Instead of paying for volume of data collected, we pay for the time spent getting that data. In other words, we’ve set up a nice little arbitrage. While the standard big data stack has made huge strides in making big data more accessible to everyone, it will always fall short against our stack when it comes to the cost of collecting data.
      3. No Need to Store Big Data: We actually don’t store that much data. Because 80legs users can filter their data on the nodes, they’re able to return the minimum amount of data in their result sets.  The processing (or reduction, pardon the pun) is done on the nodes.  Actual result sets are very small relative to the size of the input set.
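The node-side filtering described in the third point can be sketched roughly as follows. This is a minimal illustration in Python, not 80legs’ actual code: the function names, the regex-based filter, and the sample pages are all assumptions made for the example. The point is only that each node reduces a full page to a tiny record before anything is sent back.

```python
import re
from typing import Dict, List, Tuple


def filter_page_on_node(url: str, html: str, pattern: str) -> Dict[str, object]:
    """Hypothetical node-side filter: return a small summary, never the page itself."""
    matches = re.findall(pattern, html, flags=re.IGNORECASE)
    return {"url": url, "match_count": len(matches)}


def run_on_nodes(pages: List[Tuple[str, str]], pattern: str) -> List[Dict[str, object]]:
    """Stand-in for the grid: apply the filter where each page was downloaded."""
    return [filter_page_on_node(url, html, pattern) for url, html in pages]


def payload_sizes(pages: List[Tuple[str, str]],
                  results: List[Dict[str, object]]) -> Tuple[int, int]:
    """Compare bytes downloaded by the nodes with bytes returned to the user."""
    downloaded = sum(len(html.encode("utf-8")) for _, html in pages)
    returned = sum(len(repr(r).encode("utf-8")) for r in results)
    return downloaded, returned


# Made-up sample pages, purely for illustration.
SAMPLE_PAGES = [
    ("http://example.com/a", "<html>" + "big data " * 1000 + "</html>"),
    ("http://example.com/b", "<html>" + "nothing here " * 1000 + "</html>"),
]

results = run_on_nodes(SAMPLE_PAGES, r"big data")
downloaded, returned = payload_sizes(SAMPLE_PAGES, results)
print(results)
print(f"downloaded on nodes: {downloaded} bytes, returned to user: {returned} bytes")
```

In this toy setup the records returned are a small fraction of the bytes downloaded, which is the property the article relies on to avoid storing large volumes of data centrally.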

For all of you evaluating whether to do it yourself, keep in mind, sometimes what’s available on the market isn’t going to solve every one of your problems. Especially as you start to throw around petabytes of data.
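For a sense of scale, the figures quoted in the list above (15 to 30 million pages per day, 10 TB per month, about 50,000 computers) can be turned into rough per-page and per-node averages. The sketch below is back-of-the-envelope arithmetic only. It assumes decimal terabytes, a 30-day month, and an even spread of work across nodes, none of which the article states, and the function name is hypothetical.

```python
SECONDS_PER_DAY = 86_400


def crawl_averages(pages_per_day: int, bytes_per_month: float, nodes: int,
                   days_per_month: int = 30) -> dict:
    """Turn headline crawl figures into per-page and per-node averages."""
    pages_per_month = pages_per_day * days_per_month
    bytes_per_node_month = bytes_per_month / nodes
    return {
        "bytes_per_page": bytes_per_month / pages_per_month,
        "bytes_per_node_month": bytes_per_node_month,
        "bytes_per_node_second": bytes_per_node_month / (days_per_month * SECONDS_PER_DAY),
    }


TEN_TB = 10 * 10**12  # assumption: decimal terabytes
NODES = 50_000        # "about 50,000 computers" from the article

for pages_per_day in (15_000_000, 30_000_000):
    stats = crawl_averages(pages_per_day, TEN_TB, NODES)
    print(f"{pages_per_day:,} pages/day: "
          f"{stats['bytes_per_page'] / 1000:.1f} KB/page, "
          f"{stats['bytes_per_node_month'] / 10**6:.0f} MB/node/month, "
          f"{stats['bytes_per_node_second']:.0f} B/s/node")
```

Under these assumptions the quoted figures work out to roughly 11 to 22 KB per page and about 200 MB per node per month, which gives an idea of how thinly the download load is spread over spare bandwidth.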

Realistic Roadblocks

Now, we wouldn’t be telling the whole truth if we neglected to point out two big potential pitfalls in building your own big data stack.

First, with a custom-built stack, there is almost no support. Bugs and problems tend to be unique to your system.  At 80legs, the only experts we can call upon are ourselves, and as a business, we have to be ready to pivot into problem solving mode because we can’t rely upon anyone else. We’re on our own.

We’ve spent a lot of time debugging things for which there are no support forums, and it can be quite frustrating when you see your open source friends resolve their problem in minutes by logging on to IRC.

Second, building features into your system over time can create a lot of moving parts: lots more than even the most decked-out big data stack. As part of this, you won’t have access to technologies that would handle multiple components, such as Cloudera’s CDH.

Significant Competitive Advantage

We had strong reasons to go our own road, and the downsides can be maddening at times, but there are still key advantages you get from doing it yourself that will make a significant impact on the success of your business. One advantage is optimization — an “off-the-shelf” system is going to have some generalities built into it that can’t be optimized to fit your needs. The opportunity cost of going “standard” is a slew of competitive advantages.

At 80legs, our back-end is optimized to the byte-level, by hand, literally. This is no small thing. Squeaking out a 2-4 percent improvement in your data/processing throughput is important at this scale. If you don’t believe us, then look to Facebook, Twitter and Google, all of whom have built their own systems, even if some eventually spun out as open source solutions (e.g. Cassandra). Facebook chose to build its own solutions in certain areas because it gives it a technology advantage, which in turn bolsters its dominance in the market.
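To put the 2-4 percent figure in context, the sketch below applies it to the crawl volumes quoted earlier (15 to 30 million pages per day). It assumes, purely for illustration, that a throughput gain translates one-for-one into extra pages crawled on the same hardware; the function name is hypothetical.

```python
def extra_pages_per_day(pages_per_day: int, gain_percent: int) -> int:
    """Extra pages per day from a throughput gain, using integer arithmetic."""
    return pages_per_day * gain_percent // 100


for pages_per_day in (15_000_000, 30_000_000):
    for gain_percent in (2, 4):
        extra = extra_pages_per_day(pages_per_day, gain_percent)
        print(f"{gain_percent}% of {pages_per_day:,} pages/day = {extra:,} extra pages/day")
```

On these assumptions, a gain of a few percent is worth somewhere between 300,000 and 1.2 million additional pages per day.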

The other benefit to going our own way is a sustainable competitive advantage over time. An important lesson I’ve learned is that most true competitive advantages are operational and cultural ones, contrary to popular thinking. If you drive the same, standard vehicle as everyone else, you’re not going to get the same performance as a custom-fitted ride. In the world of big data, your stack is the most important component of your operational  strategy.  So do you want to be just as good as everyone else, or better?

Here’s an important side note: If you’re looking at this from a startup’s perspective, there’s another consideration. Which company is more likely to be acquired? One with a unique, well-performing back-end “secret sauce”, or one with a standard, run-of-the-mill stack?  Where is your IP?

People have often asked me whether I would have used the standard big data stack if 80legs were being built today, and the answer is still a resounding “No.” That said, if you’re not trying to compete on data processing throughput or other performance metrics, I recommend you use the standard big data stack. You’ll save time and money and have more support resources available in the community.

However, if you want to compete on operational performance in your market, you need to seriously consider building a proprietary solution in-house. Frankly, provided you have the chops to outperform the standardized stack, why wouldn’t you pursue a technology advantage that provides sustained business value?

As CEO, Shion Deysarkar is responsible for the overall business and development of 80legs. In a previous life, he ran a predictive modeling firm. He enjoys playing poker and soccer, but is only good at one of them.

Image courtesy flickr user Joost J. Bakker

13 Responses to “Why a DIY Big Data Stack Is a Better Option”

  1. I’m not trying to bash EC2 here, but I think it’s important for anyone to consider that if their needs don’t fit what EC2 is designed for, then perhaps there are better alternatives for them.

    Our current infrastructure suits us quite well. If there are changes to EC2 that make it more suitable for us, then we’ll look at how we can take advantage of those changes. It’s as simple as that.

  2. Funny all this talk about optimizing byte code and building out “big data” infrastructure, when everybody knows that 80legs is just piggybacking on Plura Computing & tens of thousands of unsuspecting Digsby IM chat users. They admitted as much in a GigaOM writeup from May of last year.

    With all the EC2 bashing, being utterly dependent upon Plura doesn’t seem much different, does it? Not to mention Digsby (which has earned a Malware/Spyware label from many due to their working with Plura/80legs and renting out unsuspecting Internet users’ PCs/bandwidth).

    One must wonder if 80legs customers realize their crawls are running on unsuspecting users’ machines, many of which are inside corporate networks w/ usage policies that explicitly forbid this sort of stuff (Is the crawl data even legal to use, at that point??)

    Also, the idea of submitting ‘processing / reduce’ code to 80legs for ‘execution on their nodes’ just seems crazy. You’re subsequently uploading your code to tens of thousands of unprotected computers on the Internet. IP concerns? You betcha!

  3. @Akshay Bhat

    Yes, Infochimps is like an eBay/Amazon for data.

    We also deal with big data ourselves. Some of the data we host (such as the Twitter data) we develop and analyze ourselves. We have collected around 2 billion tweets and have the infrastructure in place to store and analyze this data. When we analyze it, it translates into things such as our Trstrank or Influence metrics. We are often dealing with data at the gigabyte/terabyte scale.

    To give you an idea of some of the tools we use, here’s a partial list:
    – Amazon EC2
    – Amazon S3
    – Pig
    – Hadoop
    – Tokyo Tyrant
    – Wukong

  4. Akshay Bhat

    An advantage of 80legs over EC2 is that the range of IP addresses from which requests are made is substantially larger, so you don’t get blocked as easily.

  5. The basic principle behind 80legs seems to be a webcrawler-for-hire. The implementation seems to be some kind of SETI@home-style distributed grid, which is why they need 50,000 computers to process such a relatively small amount of data (10TB/month).

    The problem the company seems to face is that their business model, technology, and reason for existence don’t make much sense when anyone and their brother can log in to ec2, start up a hadoop/nutch/lucene cluster and crawl their brains out for next to no money. Certainly less than $99/341 gig. Maybe they have some value add in their user apps and analytical layer? Who knows.

  6. I could not understand several points made in this article:

    1: He mentions infochimps, but to my knowledge it’s more of an ebay for datasets than a support/provider for the Big Data stack. Also, is it successful? I am unsure how infochimps is related to the Big Data stack.

    2: From what I remember reading about 80legs, it uses distributed grid computing to run the crawlers (something like SETI @ Home). I doubt Hadoop was ever designed for such applications, so this surely isn’t a Hadoop use case.

    3: Quoting:

       While the standard big data stack has made huge strides in making big data more accessible to everyone, it will always fall short against our stack when it comes to the cost of collecting data.  We actually don’t store that much data. Because 80legs users can filter their data on the nodes, they’re able to return the minimum amount of data in their result sets.  The processing (or reduction, pardon the pun) is done on the nodes.  Actual result sets are very small relative to the size of the input set.

    Again, I am unsure how it is different from Hadoop. First, Hadoop uses the same principle, “to move computation closer to data”, hence a crawler implemented using Hadoop (something Hadoop is not intended to do) will also store data locally and not on some other node.

    Also he mentions “We have about 50,000 computers using their excess bandwidth.” 50,000?? The biggest Hadoop cluster that I know of (Yahoo) has ~10-20k nodes, and Hadoop was never meant to be used at 50K scale for crawling. So they had no option other than building their own system, even if they had to build it today.

    4: Quoting

      One advantage is optimization — an “off-the-shelf” system is going to have some generalities built into it that can’t be optimized to fit your needs. The opportunity cost of going “standard” is a slew of competitive advantages.

    The only issue I can think of regarding Hadoop is that it’s written in Java; otherwise it’s an extremely extensible piece of software. Unless you are designing a real-time messaging system or a distributed system for high-frequency trading, Hadoop is good enough for most applications. Also, what about the cost of finding programmers good enough to build such a system? Another advantage of Hadoop is that under low load the idle nodes can be used to do something else, maybe processing some data; with your own solution that would be harder to do. Also, your IP and your secret sauce aren’t of much use if you don’t have solid patents for them; otherwise they will mostly end up becoming a maintenance nightmare after the original engineers cash out. And what if the big company already has a Hadoop cluster? It would be even more difficult for them to integrate with your computing power.

    While I kind of agree with the author’s conclusion that a highly focused startup should make its own proprietary solution, I cannot agree with his evidence behind that argument. A grid-based crawler with 50K machines isn’t something that Hadoop was ever designed to support.

  7. I’ve heard this song before

    1: The volume you are talking about collecting is actually not very large and well within reach of the hadoop stack. Take a look at quantcast as an example. They do something like 10x the volume you are discussing, also crawl based

    2: Data bandwidth costs IN to AWS are currently free. If they charge, which they will likely start doing soon, it would probably run you around $10K / month to source your 50TB. How much are you spending on engineering time to optimize your own code base?

    Your IP has no value if all it is is “something like hadoop only not as good”. You will not keep up. You are already behind.

    • Based on my estimations, using an EC2 equivalent to our back-end server setup would cost at least $10,000 per month for the compute cost alone. It’s probably more like $20,000, but I’d need to play with their calculator some more to make sure. Our monthly cost for our compute requirements is on the order of $2,000.

      Adding the cost of bandwidth from AWS, which will no longer be free after October, adds at least a few thousand on top of that. As a CEO, I’m not going to depend on a “limited time offer” from another company to maintain my finances. Other folks understand this:

      So while we did pay an upfront fee to buy our back-end servers, we’re saving a lot of money over AWS each month. Actually, I guess we come out positive after 1-2 months by using our own system.

      As for why someone would pay for using us.. maybe you should ask one of our several customers:

      • You will be better off buying your own hardware if you keep it pretty busy all the time, ec2 might make sense if your workload is spikey. You can also always use a combination, use ec2 for overflow. Either way doesn’t speak to whether hadoop does or doesn’t make sense for you, since there is no law that you cannot run hadoop on your own hardware.

        I re-read your article and it still doesn’t make much sense to me, honestly. Most of your points seem to be founded on some pretty shaky cost accounting or some misconceptions about what hadoop can and cannot do.

        The segue into culture is odd. If you follow the reasoning too far, you end up building your own OS or burning your own chips. Good engineers need to be innovative, sure, but they also need to have the sense to use the right tool for the job and not roll their own if an adequate tool already exists.

        The IP argument only holds if you truly derive a competitive advantage, which means you can do something no one else can do, or at least do it affordably. Crawling and processing 10 TB/month or even 50 TB/month does not meet these criteria. It’s well within everyone’s reach on a reasonable budget. It barely qualifies as “big data”. You should not have to be hand optimizing byte code to process that volume.

        The only thing I’ve read so far that made any sense was Akshay Bhat’s comment about the IP ranges

        To answer my own question about why people would use you: it’s because you make it easy, not because you crawl the web scalably. The value I see on your website is in the tools and groupings you provide for your end users. Those end users don’t care if those tools run on top of hadoop or not; they just want them easy to use and cheap.

        If you used hadoop as your backend stack, that would free more of your engineering time up to deliver value to your end users. If you continue to develop your own then you have to pass those costs on to your users.

      • Just a few thoughts:

        1. Our system is busy all the time.
        2. My costing has been vetted by VCs and HPC veterans several times over. I just don’t find it shaky anymore.
        3. The performance of our crawler has been benchmarked by our customers, most of whom are engineers and CTOs of technology firms. For broad crawls or crawls of high-traffic domains, our crawler has been shown to be superior.
  8. disagree - closed will hurt you.

    @Shion I agree with your idea, that companies/people that need data storage should consider rolling their own, but I do NOT agree that proprietary is the solution.

    Specifically, creating a proprietary server solution may be cheaper, fit your needs and ultimately boost the $$$ (on paper) value of the company. However, building a closed source system (proprietary) also has significant drawbacks that you haven’t even begun to address. Since this list is so long, I will only mention 1.

    1. hiring a developer who needs little / no training ramp up time to maintain, or expand, your closed source system – it’s impossible. Even companies who use only open source software cannot find and train developers fast enough.

    I’m not suggesting people shouldn’t roll their own, in fact, I am suggesting that more people should build their own, then open source that code and share it. After all the entire internet was built using open source (TCP/IP) everything else is just a ‘layer’.

    Your example companies of ‘closed’ systems are laughable. Facebook, while closed, open sources large portions of its code base. If anything, these companies are just infants. Additionally, you are ignoring the 100s (if not 1000s) of companies that rolled their own and failed.

    What you are attempting to do is paint a technical justification for a bad business decision. The more open your company, the better.

    I’ll just go ahead and point to SQLite (public domain software), Ubuntu, MongoDB, MySQL, CouchDB. All these projects are used and successful (many probably in your ‘closed’ system) and are successful because they are open and people can go and make them better. It’s the same reason why wikipedia works.

    Regardless, you don’t actually care, since you are a CEO and your goal is to sell the company, not to enrich the world we all live in. So take your closed source ideas elsewhere.