
Summary:

Bundle uses the billions of Citi customer transactions to draw correlations between spending habits and what other people might enjoy or buy. CTO Phil Kim explains that wrangling even highly structured data takes a lot of organizing, a lot of computing and a lot of time.


Last night, Bundle, a spin-out of Citibank that uses data gleaned from 20 million of Citi’s customers’ transactions, announced a few new apps on its website: restaurant recommendations for NYC and Los Angeles, and a calculator that shows people who are moving what residents spend in the city they are transferring to. The idea behind Bundle is to use the billions of Citi customer transactions to draw conclusions about spending habits and correlate those habits with what other people might enjoy or even shell out cash for. While using the data may sound simple, Bundle’s CTO Phil Kim notes that wrangling even highly structured data such as what Citibank captures takes a lot of organizing, a lot of computing and a lot of time.
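To make that idea concrete, here is a minimal sketch, in Python (one of the languages the Bundle team says it uses), of the kind of co-occurrence counting behind “people who shop at X also shop at Y” correlations. The transactions and merchant names below are invented for illustration; this is not Bundle’s code.

    # Minimal sketch (illustrative data, not Bundle's code): count how
    # often anonymized customers who visit one merchant also visit another.
    from collections import defaultdict
    from itertools import combinations

    # Hypothetical (customer_id, merchant) pairs, already stripped of
    # personally identifiable information.
    transactions = [
        ("c1", "Joe's Diner"), ("c1", "Corner Sushi"),
        ("c2", "Joe's Diner"), ("c2", "Corner Sushi"),
        ("c3", "Joe's Diner"), ("c3", "Taco Spot"),
    ]

    # Group merchants by customer, then count merchant co-occurrences.
    merchants_by_customer = defaultdict(set)
    for customer, merchant in transactions:
        merchants_by_customer[customer].add(merchant)

    co_counts = defaultdict(int)
    for visited in merchants_by_customer.values():
        for a, b in combinations(sorted(visited), 2):
            co_counts[(a, b)] += 1

    # Strongest "also shop at" pairs first.
    for pair, count in sorted(co_counts.items(), key=lambda kv: -kv[1]):
        print(pair, count)

The real system, as Kim describes below, spends most of its effort getting the data into a shape where counting like this is even possible.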

Kim wrote me an email in response to my questions about how Bundle analyzes restaurants for its new Restaurant Recommender app. The first part of the process involves scrubbing the data of personally identifiable information and organizing it so the algorithm can run against it and produce the correct results. Kim explains the process:

Also, as with most intensely data-driven analyses, some of the biggest challenges were not in the math itself, but in acquiring the raw data and structuring it in a form appropriate for mathematical analysis. … Starting with a large population of transaction records (sanitized of any customer personally identifiable information) in one hand, and a semi-complete population of merchant records in the other, the data team had to match transactions to merchants. For hundreds or even thousands of transactions and merchants, this could be done manually (albeit painfully).

For our many millions of transactions and tens of thousands of merchants (in each launch city, NYC and L.A.), this would have been impossible to do by hand, so the team created a collection of utilities and processes to automate this process. For the technically inclined, the team here uses a common toolkit for data analysis including Perl, Python, R, and a healthy dose of Excel, actually.
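Kim’s description is a classic record-linkage problem: the raw descriptor on a card transaction rarely matches a merchant record exactly. A rough sketch of automated matching in Python follows; the normalization rules, similarity threshold and merchant names are my assumptions for illustration, not details of Bundle’s pipeline.

    # Rough sketch of transaction-to-merchant matching (assumed approach,
    # not Bundle's): normalize both sides, then fall back to fuzzy matching.
    import re
    from difflib import SequenceMatcher

    merchants = ["Joe's Diner", "Corner Sushi Bar", "Taco Spot"]

    def normalize(name):
        """Lowercase and strip punctuation and card-processor noise."""
        name = re.sub(r"[^a-z0-9 ]", " ", name.lower())
        return re.sub(r"\s+", " ", name).strip()

    def best_match(descriptor, candidates, threshold=0.5):
        """Return the closest merchant name, or None if nothing is close."""
        desc = normalize(descriptor)
        score, merchant = max(
            (SequenceMatcher(None, desc, normalize(m)).ratio(), m)
            for m in candidates)
        return merchant if score >= threshold else None

    # Raw descriptors as they might appear on a card statement.
    print(best_match("JOES DINER #204 NEW YORK", merchants))  # Joe's Diner
    print(best_match("CRNR SUSHI BAR NY", merchants))         # Corner Sushi Bar

A production pipeline would layer in other signals, such as location, merchant category codes and manual review of the leftovers, which is exactly the kind of utility-and-process plumbing Kim describes.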

After that point, Kim says the data is run against algorithms that measure the strength of the associations between restaurants as defined by their shared customer base. The underlying mathematical approach is called “eigenvalue centrality scoring,” and is related to the algorithm that Google uses to rank the authority of web pages, according to Kim. For Restaurant Recommender, it’s used to rank the authority of restaurants with particular groups of people. However, running that analysis against files that are just under one terabyte can take days, according to Bundle’s CEO Jaidev Shergill. So don’t expect any real-time transaction analysis influencing results anytime soon; right now, the results are updated once a month. Kim provided more details, saying Bundle doesn’t rely on a public cloud to process its analysis, in part because of Citi’s regulations about what it can do with the data. From Kim:

The hundreds of gigabytes of data (hundreds of millions of records) used to build the Restaurant Recommender in its current form is stored in a combination of flat files and relational database tables (using a mix of SQLServer and MySQL). The processing jobs, many of which take days to complete, are run on a growing cluster of data servers located in one of our secure data centers.

The size of the datasets will grow quickly as we expand to more cities…we’re just under a TB in what data is live now, but will be many TB’s once we support all of the top cities/markets.
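For the curious, what Kim calls “eigenvalue centrality scoring” is generally known as eigenvector centrality, and, like PageRank, it can be computed by power iteration: repeatedly multiply a score vector by the graph’s adjacency matrix and renormalize until it settles on the dominant eigenvector. Here is a toy sketch on an invented restaurant graph weighted by shared customers; Bundle’s actual implementation isn’t public.

    # Toy sketch of eigenvector centrality via power iteration
    # (illustrative numbers, not Bundle's data or code).
    restaurants = ["Joe's Diner", "Corner Sushi", "Taco Spot"]

    # shared[i][j] = number of customers who visited both restaurant i
    # and restaurant j (symmetric, zero diagonal).
    shared = [
        [0, 40, 10],
        [40, 0, 5],
        [10, 5, 0],
    ]

    def eigenvector_centrality(matrix, iterations=100, tol=1e-9):
        """Power iteration: multiply the score vector by the matrix and
        renormalize until the scores stop changing."""
        n = len(matrix)
        scores = [1.0 / n] * n
        for _ in range(iterations):
            nxt = [sum(matrix[i][j] * scores[j] for j in range(n))
                   for i in range(n)]
            norm = sum(nxt) or 1.0
            nxt = [x / norm for x in nxt]
            if max(abs(a - b) for a, b in zip(nxt, scores)) < tol:
                break
            scores = nxt
        return scores

    for name, score in zip(restaurants, eigenvector_centrality(shared)):
        print(name, round(score, 3))

At Bundle’s scale (tens of thousands of merchants and many millions of transactions per city), each iteration touches the whole matrix, which helps explain why the scores are recomputed in monthly batch jobs on a dedicated cluster rather than in real time.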

The lesson learned here is that even a simple application offering a restaurant recommendation requires the effort of mathematicians, clusters of servers and layers of algorithms to provide you with a place to eat tonight. For startups looking to wrangle such data on their own, public clouds can help reduce the time spent on processing, and startups offering cleaned data sets, such as Infochimps or Factual, can also help. In the end, taking advantage of big data may be easier nowadays, but easy is relative. It’s not rocket science, but it’s still heavy on the stats.



  1. “…wrangling even highly structured data such as what Citibank captures takes a lot of organizing, a lot of computing and a lot of time…running that analysis against files that are just under one terabyte can take days. So don’t expect any real-time transaction analysis influencing results anytime soon, right now it’s updated once a month. ”

    I think Jeff would beg to differ!

    http://gigaom.com/2010/10/11/jeff-jonas-big-data/

    :)

    1. Thanks, Todd. I absolutely agree with a lot of what Jeff has to say about big data, but I would assert that it takes a lot of up front processing, analysis, and development to build a system that can intelligently respond to real-time stimuli.

      1. You must not know Jeff. Otherwise you would know his distaste for statistics or “intelligent” data processing.
        Here’s a question:
        How “intelligent” does a point-analyzing system have to be to read this?
        ……..
        No voodoo (stats) required. Just put the data on a pane, use feedback loops based on speed differential (myelin sheath) and close relation (BAC, back-propagating activated calcium), to name a few, and it doesn’t matter owh ouy sepll orwd a. Correct spelling just makes it faster to read. But a single-core processor can read that text as fast as you can.
        Or:
        hear canyou me know (same problem, as are most recommendations)
        It’s called context processing or as Jeff calls it “data finds data”. Again no complex stats, just simple math. You guys must be pretty googly.

        *The above solution is mine, Jeff uses a different approach (hence he has out of sequence problems, just had to rub that in). But we agree on the underlying definitions, our implementations just differ (which is a good thing).

      2. Ronald — Fair enough… I am certainly a big fan of simple systems that exhibit “intelligent” (apologies for tossing around a loaded term like that lightly) behavior, by accident or by design. However, my experience tells me that when such systems are expected to solve specific problems, you have to invest some time in understanding the data you’re working with, even if only to frame the problem properly and understand how to connect the pipes together. Yes, in the case of this app, we employed a particular approach that involved some heavy number crunching, but to clarify my earlier point, much of the “up front processing, analysis, and development” went into simply connecting the pipes together.

        In any case, absolutely open to your thoughts… feel free to ping me directly at phil@bundle.com.

  2. It reminds me of the Netflix recommendation engine: people who like movie X also like movie Y. It works OK for movies. As for restaurants, I would prefer to see customer ratings based on actual customer loyalty. If people keep going to the same place over and over again, maybe I need to check it out myself? Of course, these ratings must be separate for lunch and dinner places.

    Another note on actual results: according to your tool, the average family with kids in NY spends $593 per month on groceries? That’s hard to believe.

  3. The Locker Project: Why Leave Data Tracking To Others? Do It Yourself: Tech News and Analysis (Friday, February 4, 2011)

    [...] spun out a project called Bundle that takes millions of anonymous user transactions and builds recommendations based off of that for customers. But the future lies in taking in personal data and crafting highly [...]
