“We want to apply data to every decision. We want to be a very data-driven company.” It’s a sentiment you hear echoed wherever you turn in Silicon Valley, at least since Google became one of the world’s most-powerful companies.
This particular quote comes from Airbnb Vice President of Engineering Mike Curtis, who joined the popular apartment-sharing startup almost six months ago after almost two years as an engineering director at Facebook. We spoke last week about how expansive Airbnb’s data-driven vision actually is, and how Curtis and his team of engineers can help make it a reality. Like most of his peers operating at web scale with web data, Curtis sees his work, the work of Airbnb’s data scientists and the work of the company’s strategic leaders as being intrinsically linked.
“We think we might be pushing data science in the field of travel more so than anyone has ever done before,” Curtis said. Doing this over the long run — and capitalizing on it — will require some cutting-edge tools.
The vision: Apartment sharing made even more personal
One of Airbnb’s biggest data problems right now is figuring out the best way to do personalized search, something the company is leaning toward implementing. “We want guests to find the best place for them close to what they’re searching for,” he said.
However, he added, figuring out how to redo search rankings for individual users presents a difficult algorithmic problem. We didn’t delve into the details, but the problems seem pretty clear. Ordering search results around community-wide rankings or geographical proximity is easy enough, but figuring out how to accurately factor in users’ own preferences, social connections, rental history, reviews and other data points adds in a whole other layer of complexity. (Airbnb’s data around specific cities, guest and host demographics, and other rental metadata might factor in, as well.)
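To make the ranking problem concrete, here is a minimal sketch of blending community-wide signals with a per-user signal. It is purely illustrative and not Airbnb’s method: the `Listing` fields, the weight names and the linear scoring formula are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Listing:
    name: str
    community_rating: float  # 0..1, community-wide ranking signal
    distance_km: float       # distance from the searched location
    preference_match: float  # 0..1, hypothetical per-user preference signal


def score(listing, weights, max_distance_km=10.0):
    """Blend the signals into one number (hypothetical linear model)."""
    # Convert distance into a 0..1 proximity value; farther means lower.
    proximity = max(0.0, 1.0 - listing.distance_km / max_distance_km)
    return (weights["rating"] * listing.community_rating
            + weights["proximity"] * proximity
            + weights["preference"] * listing.preference_match)


def rank(listings, weights):
    """Return listings ordered from highest to lowest score."""
    return sorted(listings, key=lambda l: score(l, weights), reverse=True)


listings = [
    Listing("Loft A", community_rating=0.9, distance_km=1.0, preference_match=0.1),
    Listing("Studio B", community_rating=0.7, distance_km=2.0, preference_match=0.9),
]

# Generic ranking ignores the per-user signal entirely.
generic = {"rating": 0.5, "proximity": 0.5, "preference": 0.0}
# Personalized ranking gives the per-user signal the most weight.
personalized = {"rating": 0.3, "proximity": 0.2, "preference": 0.5}

generic_order = [l.name for l in rank(listings, generic)]
personal_order = [l.name for l in rank(listings, personalized)]
```

With the generic weights the better-rated, closer listing comes first; once the per-user signal is weighted in, the order flips. The hard part Curtis alludes to is choosing those signals and weights accurately for each user, which a hand-tuned formula like this does not solve.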
The data science behind Twitter’s personalized search engine — which takes into account numerous factors in order to determine relevance — provides a good example of how difficult this can be.
Curtis said Airbnb also crunches numbers to help hosts figure out the best-possible rates for their rentals.
Internally, the company wants to take on the characteristics of Curtis’s previous employer, Facebook, which is renowned for how it has built tools that make Hadoop consumable by nearly everybody in the company. Facebook is great at letting employees “really get deeply in touch with the data and [figure] out what questions need to be asked,” Curtis said. “… That’s a thing I’d love to apply as we build things out at Airbnb.”
It’s all about Mesos
One of the most-strategic tools in Airbnb’s belt for making its data dreams come true is an open source cluster-management project called Mesos. The technology, which emerged from the University of California, Berkeley’s AMPLab, lets users run multiple types of computing frameworks (or maybe just multiple distinct Hadoop clusters) on a single set of resources. Twitter helped make Mesos famous in web circles, and the project became a top-level Apache project last week.
In the case of Airbnb, Mesos is the key to letting the company’s engineers make the most of its Amazon Web Services-based infrastructure beyond just Hadoop. Airbnb uses Hadoop heavily, but it wants to experiment with Storm for stream processing, Curtis explained, and it wants to do more work with Spark (also an AMPLab creation) to run Hive queries faster than Hadoop allows.
Actually, Spark could be particularly useful for things like search rankings, pricing and detecting “bad behavior” on the service. “A lot of those things involve machine learning models,” Curtis said, and Spark’s performance edge over Hadoop means it can run these models over and over again in much less time.
Chronos, a distributed job-scheduler that Airbnb created to account for the realities of running in the cloud, also runs on Mesos.
While resource management and efficiency are certainly big reasons for using Mesos, Curtis said it also helps advance Airbnb’s general engineering strategy of building small teams that can move fast. The better Airbnb can automate resource allocation, the more time its engineers can spend doing other things. “Ideally,” he said, “[the idea is to] make it so a smaller number of engineers can have higher impact through automation on Mesos.”
Cloud: Yay! Elastic MapReduce: Meh?
Although Airbnb still runs on the AWS cloud, Mesos has allowed it to migrate off the popular Elastic MapReduce Hadoop service. According to Curtis, there were several reasons for the move, although the primary ones were letting Mesos manage all the other frameworks Airbnb wants to run and gaining finer-grained control over its Hadoop environment. Elastic MapReduce, he said, is pretty much Amazon’s own distribution of Hadoop, which means users depend on AWS for patches and the like, and it’s only for Hadoop jobs.
Brenden Matthews, another Airbnb engineer, presented on the company’s migration to Mesos from Elastic MapReduce last week at Twitter’s headquarters. In his slide deck, he lays out some more reasons for the switch, as well as some of the technical challenges generally involved with running Hadoop in the cloud.
Still, AWS has been reliable overall, Curtis said, and the cloud’s flexibility — combined with Mesos — means Airbnb can do what it needs to when it needs to. Airbnb’s ad hoc analytic queries don’t interfere with its long-running batch workflows, and vice versa.
“The speed at which we can run jobs on the cluster is really a question of resource allocation,” Curtis said. “How many resources do we put behind the pool?”
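Curtis’s point about pools can be illustrated with a small back-of-the-envelope sketch. It assumes a simple weighted split of CPUs between workload pools and perfectly parallel work; the pool names, weights and numbers are hypothetical, and this is not how Mesos’s own allocator is implemented.

```python
def allocate(total_cpus, pool_weights):
    """Split a cluster's CPUs across pools in proportion to their weights."""
    total_weight = sum(pool_weights.values())
    if total_weight <= 0:
        raise ValueError("pool weights must sum to a positive number")
    return {pool: total_cpus * weight / total_weight
            for pool, weight in pool_weights.items()}


def estimated_runtime_hours(work_cpu_hours, cpus):
    """Idealized runtime, assuming the work parallelizes perfectly."""
    return work_cpu_hours / cpus


# Hypothetical cluster: 100 CPUs, with batch weighted three times as heavily as ad hoc.
shares = allocate(100, {"adhoc": 1, "batch": 3})

# A hypothetical 150 CPU-hour batch workflow running on the batch pool's share.
batch_hours = estimated_runtime_hours(150, shares["batch"])
```

Because each pool has its own share, a burst of ad hoc queries draws only on the ad hoc allocation and leaves the batch estimate unchanged. Shifting weight toward a pool shortens its jobs at the expense of the other.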
And generally speaking, Curtis, who cut his teeth at AltaVista in the late ’90s and has since spent time at AOL, Yahoo and Facebook, is all smiles about what cloud computing can let startups like Airbnb do with so little upfront investment in buying and managing servers. “To think today that so much of that is abstracted away,” he said, “… it is really a wonderful and amazing thing.”