How Twitter processes tons of mobile application data each day


It’s only been seven months since Twitter released its Answers tool, which was designed to provide users with mobile application analytics. But since that time, Twitter now sees roughly five billion daily sessions in which “hundreds of millions of devices send millions of events every second to the Answers endpoint,” the company explained in a blog post on Tuesday. Clearly, that’s a lot of data that needs to get processed and in the blog post, Twitter detailed how it configured its architecture to handle the task.

The backbone of Answers was created to handle how the mobile application data is received, archived, processed in real time and processed in chunks (otherwise known as batch processing).

Each time an organization uses the Answer tool to learn more how his or her mobile app is functioning, Twitter logs and compresses all that data (which gets set in batches) in order to conserve the device’s battery power while also not putting too much unnecessary strain on the network that routes the data from the device to Twitter’s servers.

The information flows into a Kafka queue, which Twitter said can be used as a temporary place to store data. The data then gets passed into Amazon Simple Storage Service (Amazon S3) where Twitter retains the data in a more permanent location as opposed to Kafka. Twitter uses Storm to process the data that flows into Kafka and also uses it to write the information stored in Kafka to [company]Amazon[/company] S3.

Data pipeline

Data pipeline

With the data stored in Amazon S3, Twitter than uses Amazon Elastic MapReduce for batch processing.

From the blog post:
[blockquote person=”Twitter” attribution=”Twitter”]We write our MapReduce in Cascading and run them via Amazon EMR. Amazon EMR reads the data that we’ve archived in Amazon S3 as input and writes the results back out to Amazon S3 once processing is complete. We detect the jobs’ completion via a scheduler topology running in Storm and pump the output from Amazon S3 into a Cassandra cluster in order to make it available for sub-second API querying.[/blockquote]

At the same time as this batch processing is going on, Twitter is also processing data in real time because “some computations run hourly, while others require a full day’s of data as input,” it said. In order to address the computations that need to be performed more quickly and require less data that the bigger batch processing jobs, Twitter uses a instance of Storm that processes the data that’s sitting in Kafka, the results of which get funneled into an independent Cassandra cluster for real-time querying.

From the blog post:
[blockquote person=”Twitter” attribution=”Twitter”]To compensate for the fact that we have less time, and potentially fewer resources, in the speed layer than the batch, we use probabilistic algorithms like Bloom Filters and HyperLogLog (as well as a few home grown ones). These algorithms enable us to make order-of-magnitude gains in space and time complexity over their brute force alternatives, at the price of a negligible loss of accuracy.[/blockquote]

The complete data-processing system looks like this, and it’s tethered together with Twitter’s APIs:

Twitter Answers architecture

Twitter Answers architecture

Because of the way the system is architected and the fact that the data that needs to be analyzed in real time is separated from the historical data, Twitter said that no data will be lost if something goes wrong during the real-time processing. All that data is stored where Twitter does its batch processing.

If there are problems affecting batch processing, Twitter said its APIs “will seamlessly query for more data from the speed layer” and can essentially configure the the system to take in “two or three days of data” instead of just one day; this should give Twitter engineers enough time to take a look at what went wrong while still providing users with the type of analytics derived from batch processing.

Comments are closed.