Last night, Bundle, a Citibank spinout that uses data gleaned from the transactions of 20 million Citi customers, announced a few new apps on its website: restaurant recommendations for NYC and Los Angeles, and a calculator that shows people who are moving what residents spend in the city they are headed to. The idea behind Bundle is to use billions of Citi customer transactions to draw conclusions about spending habits, and to correlate those habits with what other people might enjoy or even shell out cash for. While the idea of using the data may sound simple, Bundle’s CTO Phil Kim notes that wrangling even highly structured data such as what Citibank captures takes a lot of organizing, a lot of computing and a lot of time.
Kim wrote me an email in response to my questions about how Bundle analyzes restaurants for its new Restaurant Recommender app. The first part of the process involves scrubbing the data of personally identifiable information and organizing it so that the algorithm can run against it and produce correct results. Kim explains the process:
Also, as with most intensely data driven analyses, some of the biggest challenges were not in the math itself, but in acquiring the raw data and structuring it in a form appropriate for mathematical analysis. … Starting with a large population of transaction records (sanitized of any customer personally identifiable information) in one hand, and a semi-complete population of merchant records in the other, the data team had to match transactions to merchants. For hundreds or even thousands of transactions and merchants, this could be done manually (albeit painfully).
For our many millions of transactions and tens of thousands of merchants (in each launch city, NYC and L.A.), this would have been impossible to do by hand, so the team created a collection of utilities and processes to automate this process. For the technically inclined, the team here uses a common toolkit for data analysis including Perl, Python, R, and a healthy dose of Excel, actually.
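To give a feel for what automating that matching step can involve, here is a minimal Python sketch. It is not Bundle's code: the descriptor strings, the merchant names, the `normalize` and `match_merchant` helpers and the 0.6 similarity threshold are all assumptions made for illustration. It cleans up a raw transaction descriptor and picks the most similar merchant name using the standard library's `difflib`.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, replace digits and punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^a-z ]", " ", text.lower())
    return " ".join(text.split())


def match_merchant(descriptor, merchants, threshold=0.6):
    """Return the merchant name most similar to the descriptor, or None.

    The threshold is an illustrative guess, not a tuned value.
    """
    cleaned = normalize(descriptor)
    best_name, best_score = None, 0.0
    for name in merchants:
        score = SequenceMatcher(None, cleaned, normalize(name)).ratio()
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None


# Hypothetical sample data, invented for this example.
merchants = ["Joe's Pizza", "Blue Ribbon Sushi", "Katz's Delicatessen"]
transactions = ["JOES PIZZA 0042", "BLUE RIBBON SUSHI #3", "PARKING GARAGE 17"]

matches = {t: match_merchant(t, merchants) for t in transactions}
for descriptor, merchant in matches.items():
    print(descriptor, "->", merchant)
```

Comparing every transaction against every merchant this way would be far too slow at the scale Kim describes, so a real pipeline would need to narrow the candidates first, for example by city or ZIP code.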
After that point, Kim says the data is run against algorithms that measure the strength of the associations between restaurants as defined by their shared customer base. The underlying mathematical approach is called “eigenvalue centrality scoring”, and is related to the algorithm that Google uses to rank the authority of web pages, according to Kim. For Restaurant Recommender, it’s used to rank the authority of restaurants with particular groups of people. However, running that analysis against files that are just under one terabyte can take days, according to Bundle’s CEO Jaidev Shergill. So don’t expect any real-time transaction analysis influencing results anytime soon; right now, it’s updated once a month. Kim provided more details, saying Bundle doesn’t rely on a public cloud to process its analysis in part because of Citi’s regulations about what it can do with the data. From Kim:
The hundreds of gigabytes of data (hundreds of millions of records) used to build the Restaurant Recommender in its current form is stored in a combination of flat files and relational database tables (using a mix of SQLServer and MySQL). The processing jobs, many of which take days to complete, are run on a growing cluster of data servers located in one of our secure data centers.
The size of the datasets will grow quickly as we expand to more cities…we’re just under a TB in what data is live now, but will be many TB’s once we support all of the top cities/markets.
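For the technically inclined, the centrality scoring Kim mentions (more commonly called eigenvector centrality) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Bundle's implementation: the customer IDs, the restaurant labels and the helper functions `shared_customer_weights` and `centrality` are invented. The sketch links two restaurants by the number of customers they share, then uses power iteration to score each restaurant by how strongly it is tied to other well-connected restaurants.

```python
from itertools import combinations


def shared_customer_weights(visits):
    """For each pair of restaurants, count the customers who visited both."""
    weights = {}
    for restaurants in visits.values():
        for a, b in combinations(sorted(set(restaurants)), 2):
            weights[(a, b)] = weights.get((a, b), 0) + 1
    return weights


def centrality(weights, iterations=100):
    """Approximate eigenvector centrality by power iteration."""
    nodes = sorted({n for pair in weights for n in pair})
    score = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Start from the current scores (iterating A + I rather than A),
        # which keeps the same eigenvectors but avoids oscillation.
        nxt = dict(score)
        for (a, b), w in weights.items():
            nxt[a] += w * score[b]
            nxt[b] += w * score[a]
        norm = sum(v * v for v in nxt.values()) ** 0.5
        score = {n: v / norm for n, v in nxt.items()}
    return score


# Hypothetical data: customer ID -> restaurants that customer paid at.
visits = {
    "c1": ["A", "B"],
    "c2": ["A", "B", "C"],
    "c3": ["A", "C"],
    "c4": ["A", "D"],
}

scores = centrality(shared_customer_weights(visits))
for name in sorted(scores, key=scores.get, reverse=True):
    print(name, round(scores[name], 3))
```

In this toy data, restaurant A shares customers with every other restaurant and so ranks first. At hundreds of millions of records, both the pair counting and the iteration become expensive, which squares with the multi-day processing jobs Kim describes.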
The lesson learned here is that even a simple application that recommends a restaurant requires the effort of mathematicians, a cluster of servers and layers of algorithms to provide you with a place to eat tonight. For startups looking to wrangle such data on their own, public clouds can help reduce the time spent on processing, and startups offering cleaned data sets such as Infochimps or Factual can also help. In the end, taking advantage of big data may be easier nowadays, but easy is relative. It’s not rocket science, but it’s still heavy on the stats.