Twitter now lets users search every public tweet since the service launched in 2006, the company announced in a blog post on Tuesday. The capability was a long time coming, and was no doubt made more difficult as Twitter's usage has grown over the years.
The post goes into a lot of detail about how Twitter built its new historical search index, but the main challenges are obvious to anyone who has followed the evolution of Twitter’s infrastructure (or Facebook’s, Google’s or any other large web service’s infrastructure) over the years — speed, scale and cost. According to the post, the full search index now includes “roughly half a trillion documents” and “is more than 100 times larger than our real-time index and grows by several billion Tweets a week.”
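The figures quoted above can be sanity-checked with simple arithmetic. This sketch takes only the two numbers from the post as inputs; the "about a week's worth" interpretation at the end is an inference, not a figure from the post.

```python
# Back-of-envelope check of the figures quoted from Twitter's post.
# Assumed inputs: "roughly half a trillion documents" and a full index
# "more than 100 times larger" than the real-time one.

FULL_INDEX_DOCS = 500e9   # ~half a trillion documents
RATIO = 100               # full index vs. real-time index

realtime_docs = FULL_INDEX_DOCS / RATIO
print(f"Implied real-time index size: ~{realtime_docs / 1e9:.0f} billion tweets")
```

A real-time index of roughly 5 billion tweets lines up with the stated growth of "several billion Tweets a week" and a real-time window of about a week.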
Speed matters, too, because users expect search results almost instantly, and speed is partly a function of how much money a company is willing to spend on its storage media. Twitter's real-time search index, which contains about a week's worth of tweets, is stored entirely in RAM, an option that would be far too costly for the full index. Early work with solid-state drives resulted in significant performance hits, but Twitter was able to tune the system to run on SSDs and still achieve acceptable latencies.
Twitter began the whole process in 2012, the post explains, by indexing 2 billion “top” historical tweets. In 2013, the company grew that project by an order of magnitude and began optimizing for SSDs.
Here’s how Twitter says the index will work for users:
For more on how Twitter has built its infrastructure over the years, check out this interview with Raffi Krikorian, the company’s former vice president of platform, from this year’s Structure conference.
Correction: This post was updated at 11:30 a.m. to note that Raffi Krikorian is no longer at Twitter.