Social data analytics company DataSift left beta this week. One of only two companies licensed to repurpose the “fire hose” of data pouring out of Twitter, DataSift processes over 250 million fresh tweets every day. The company consumes data exclusively from the cloud and provides services to customers via the cloud, which would seem to make the cloud a natural home for its own infrastructure. But DataSift actually runs its core operation from a physical data center, highlighting areas in which mainstream cloud offerings remain lacking.
Founder Nick Halstead recognizes that public cloud infrastructure is of value both to DataSift as a company and to its customers. DataSift makes extensive use of Amazon’s public cloud for development and testing, and for processing large historical data sets. But for its customer-facing service, and for working in real time with the fire hose itself, Halstead remains convinced that the public cloud is not ready.
He stresses that there are parts of DataSift’s business that require it to invest significantly in its own infrastructure. The biggest choke point Halstead sees is around bandwidth. An average of 250 million tweets per day is more than 10 million per hour, with peak loads that are far higher. Each of those tweets carries more data than the 140 visible characters and includes over 40 elements of metadata about tweet and tweeter.
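To put those figures in perspective, the short Python sketch below converts the 250 million tweets per day quoted above into average hourly and per-second rates, then into a rough bandwidth estimate. The function names are illustrative, and the 2,500-byte payload size is purely an assumption made for the sake of the arithmetic; the article gives no per-tweet size, only that each tweet carries more than its 140 visible characters.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_rates(tweets_per_day):
    """Return (tweets per hour, tweets per second) averaged over a day."""
    per_hour = tweets_per_day / 24
    per_second = tweets_per_day / SECONDS_PER_DAY
    return per_hour, per_second


def average_bandwidth_mbps(tweets_per_day, bytes_per_tweet):
    """Average inbound bandwidth in megabits per second."""
    per_second = tweets_per_day / SECONDS_PER_DAY
    return per_second * bytes_per_tweet * 8 / 1_000_000


# 250 million tweets per day is the figure quoted in the article.
TWEETS_PER_DAY = 250_000_000

# ASSUMPTION: hypothetical payload size (text plus metadata) per tweet.
# The article does not state a size; substitute a measured value.
ASSUMED_BYTES_PER_TWEET = 2_500

per_hour, per_second = average_rates(TWEETS_PER_DAY)
print(f"{per_hour:,.0f} tweets/hour, {per_second:,.0f} tweets/second")

mbps = average_bandwidth_mbps(TWEETS_PER_DAY, ASSUMED_BYTES_PER_TWEET)
print(f"{mbps:.1f} Mbit/s average inbound, before any peak load")
```

These are averages only. As Halstead notes, peak loads run far higher, so the network has to be sized for the peaks rather than for the mean.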
Traditional public cloud solutions such as those from Amazon and Rackspace, Halstead argues, simply cannot guarantee sufficient network bandwidth for his business to function. He goes so far as to suggest that “if you are trying to sell yourself as a cloud platform, you cannot put yourself on someone else’s cloud platform,” because you will be unable to control your customers’ experience. While companies with significant bandwidth requirements of their own, like Netflix, appear content to rely on the public cloud, there is a broader trend in the real-time social media space toward moving off the cloud and into dedicated data centers. Facebook, of course, has even designed its data centers and servers from scratch.
DataSift receives its data from Twitter in a traditional data center, connected to the Internet via a heavily customized set of dedicated Cisco routers and switches. Redundancy in network connectivity seeks to ensure that DataSift receives data quickly and reliably and gets it back to customers just as fast. DataSift’s customers today are paying for real-time perspectives drawn from the entire Twitter fire hose, and they are not prepared to accept unpredictable delays or data loss just because DataSift was unable to quickly pull data over the network.
Twitter itself recognizes the issues that Halstead raises, investing heavily in data centers that can cope with the type of load the site experiences. The company has struggled in the past to handle demand, and it allegedly experienced difficulty finding data center providers that could meet its needs. A constant flow of the relatively small chunks of data that make up tweets, Facebook wall posts and other social media activities challenges traditional data centers designed to process the less time-sensitive movement of larger digital files. New server designs from companies such as SeaMicro are rising to the challenge, but the network still lags behind.
For DataSift’s use case, Halstead may be correct about the limitations of the public cloud. As there are few instances in which so much data needs to be transferred so quickly so often, it is unlikely that public cloud providers will directly invest in providing the necessary infrastructure. Bandwidth to the cloud’s data centers will increase, but so will the number of competing demands on it. Most applications may be capable of waiting an extra second or two for a data transfer to complete, but for DataSift, every second’s delay is potentially 3,000 lost tweets and a corresponding drop in the company’s credibility.
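The cost of a stall can be estimated with a few lines of Python. The sketch below counts how many tweets arrive while a transfer is delayed, starting from the 250 million tweets per day that DataSift processes. The function name and the `peak_factor` parameter are hypothetical, introduced only to show how the number grows when traffic runs above the daily average.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def tweets_during_delay(tweets_per_day, delay_seconds, peak_factor=1.0):
    """Estimate how many tweets arrive while a transfer is stalled.

    peak_factor is a hypothetical multiplier for traffic above the
    daily average (1.0 means average load).
    """
    per_second = tweets_per_day / SECONDS_PER_DAY
    return per_second * peak_factor * delay_seconds


# At average load, a one-second stall covers just under 3,000 tweets.
print(round(tweets_during_delay(250_000_000, 1)))

# ASSUMPTION: an illustrative peak at three times the average rate,
# with a two-second stall.
print(round(tweets_during_delay(250_000_000, 2, peak_factor=3)))
```

At average load the result is a little under 2,900 tweets per second, which rounds to the 3,000 cited above. Any sustained peak pushes the figure higher.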