Adventures in big data: How AddThis’ Hydra works

Clearspring, a social publishing platform, on Thursday changed its name to AddThis as part of an emphasis on the analytics side of its business. The company, which was formed in 2006 and provides those social sharing buttons that are ubiquitous across the web, also offered analytics services under the AddThis brand. But it realized that the data analytics had become the essential element for its customers and changed its name accordingly.

The company serves 14 million web sites and reaches 1.3 billion unique users a month. On any given day sites using the service see 3 billion views. Together, those visits create 10 terabytes of data each day, leading the folks at AddThis to realize their data was the asset. As Hooman Radfar, founder and executive chairman of AddThis, put it, “Every week, we see the same amount of information that’s stored in the Library of Congress.”
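To put those figures in proportion, here is a quick back-of-the-envelope calculation. It uses only the numbers quoted above and assumes decimal terabytes (10^12 bytes), which the company did not specify, so treat the result as a rough order of magnitude rather than a measured value.

```python
TB = 10**12  # decimal terabyte in bytes (assumed unit, not stated by AddThis)

daily_bytes = 10 * TB      # "10 terabytes of data each day"
daily_views = 3 * 10**9    # "3 billion views" on any given day

# Average amount of data generated per page view
bytes_per_view = daily_bytes / daily_views

# Volume collected over the week Radfar refers to
weekly_terabytes = 7 * daily_bytes / TB

print(f"{bytes_per_view:,.0f} bytes per view")   # 3,333 bytes per view
print(f"{weekly_terabytes:.0f} TB per week")     # 70 TB per week
```

In other words, each view leaves behind only a few kilobytes; it is the volume of views that turns that into roughly 70 terabytes a week.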

For now, the company is selling its customers the ability to tap into the social graph those terabytes of data can generate, so site owners can see relationships between their content and how people react to their content. The end result will be new products that AddThis will be able to offer, as well as a data set that tracks sharing and connections over a large portion of the Internet. Such social data is attracting a lot of interest at the moment.

The sheer size of the data set and the need to track shares in real time put heavy demands on the database, so the engineers at AddThis built something pretty cool: a data store that can handle both batch processing and real-time analytics. They call it Hydra, and it’s possible that in the near future they will open it up as a service, said Radfar.

Much like Google’s BigQuery or 1010data, Hydra offered as a service would allow other companies to load their own information into it and query it for analysis. As more of these on-demand and cloud-based analytics services become available, the question will soon be what kinds of data and queries fit best with each. Generally, moving terabytes of data from one cloud to another costs a lot and takes hours, if not days.

Clearly, social graph data is what Hydra does best at the moment, but I talked to Matt Abrams, an engineer at AddThis, to understand how Hydra works, how it scales and what else might mesh with it.

Hydra is a custom-built distributed file system and distributed database. It’s stored on four clusters of about 80 servers each in co-location facilities. When asked about the decision to go in-house, Abrams said that at a certain point it just became too expensive to keep everything on Amazon Web Services. The service needed fast IO, and ensuring that became difficult and costly when AddThis didn’t control the infrastructure.

The data is sharded in multiple ways to ensure that queries on different elements can be performed quickly. So a single piece of data is replicated across different servers to make it faster to find in specific queries, and is also replicated for backup in case the storage fails, something Abrams says happens to about 30 percent of his storage hardware right out of the box.
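For readers who want a concrete picture of sharding the same data in more than one way, here is a minimal Python sketch. It is not Hydra's code: the class name, the field names ("url", "user"), the shard count and the choice of hash function are all hypothetical, picked only to illustrate the idea of keeping one copy of each record per query dimension.

```python
import hashlib
from collections import defaultdict

NUM_SHARDS = 8  # hypothetical shard count, not AddThis' real figure


def shard_for(value: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a key value to a shard number using a stable hash."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


class MultiKeyStore:
    """Toy store that keeps one copy of each record per shard key."""

    def __init__(self, shard_keys, num_shards=NUM_SHARDS):
        self.shard_keys = list(shard_keys)
        self.num_shards = num_shards
        # layout[key_name][shard_id] -> list of records
        self.layout = {k: defaultdict(list) for k in self.shard_keys}

    def write(self, record: dict) -> None:
        """Store the record once for every shard key (write amplification)."""
        for key in self.shard_keys:
            shard = shard_for(str(record[key]), self.num_shards)
            self.layout[key][shard].append(record)

    def query(self, key: str, value: str) -> list:
        """Read only the single shard that can hold this key value."""
        shard = shard_for(value, self.num_shards)
        return [r for r in self.layout[key][shard] if r[key] == value]


# Hypothetical sharing events, sharded both by URL and by user
store = MultiKeyStore(["url", "user"])
store.write({"url": "example.com/a", "user": "u1", "action": "share"})
store.write({"url": "example.com/a", "user": "u2", "action": "share"})
store.write({"url": "example.com/b", "user": "u1", "action": "view"})

events_for_a = store.query("url", "example.com/a")  # two records
events_by_u1 = store.query("user", "u1")            # two records
```

The trade-off is the one described above: every record is written once per shard key, which costs storage, but a query on either field touches a single shard instead of scanning all of them. Backup replication for hardware failures would be a separate layer on top of this and is not shown.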

The Hydra platform obtains speed by having the right data ready in a variety of places for the query, much like you or I might keep a flashlight in every room of the house so we can get to it if the power goes out. To keep all these pieces of data stored in various hash sets requires 1.8 petabytes of storage.

For the deep dive, check out this three-part series Abrams wrote a year ago when AddThis had half the data it has now, or check out the more recent post on High Scalability.

But the result is a platform that AddThis uses to build out a social graph that rivals Facebook’s in breadth — and one it makes available to its customers so they can one day do cool things such as offer personalized web sites for users with little effort on their part.

Image courtesy of Flickr user tpholland.