
The web is becoming more dynamic, context-aware and personalized by the day, and the amount of information consumed by each person is increasing exponentially. But while hardware performance is still improving, software infrastructure is not keeping pace, except when it comes to the simplest of parallel programming tasks. We need to develop new data processing architectures, ones that go beyond technologies like memcached, MapReduce and NoSQL.

Think of this as a search problem. Traditionally, a search engine kept an index recording, for every word, each document in which it occurred. When a query was received, the engine could simply look up the precomputed answer to which documents contained which words. For a personalized search, an exponentially larger index is needed, one that includes not only factual data (words in a document, brands of cameras, etc.) but also taste and preference data (people who like this camera tend to live in cities, be under 40, love “Napoleon Dynamite,” etc.).
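
As a toy illustration of the gap (hypothetical data and names, not any real engine's implementation): answering a plain keyword query is a precomputed lookup, while a personalized answer also needs a per-user scoring pass over taste data.

```python
from collections import defaultdict

# Classic keyword search: build the inverted index once, offline.
documents = {
    1: "compact camera with great low light performance",
    2: "digital slr camera for professionals",
    3: "napoleon dynamite indie comedy dvd",
}

index = defaultdict(set)
for doc_id, text in documents.items():
    for word in text.split():
        index[word].add(doc_id)

print(index["camera"])  # which documents contain "camera"? -> {1, 2}

# Personalized search: the ranking depends on the querying user's taste
# vector, which is different for every user, so it can't all be precomputed.
user_taste = {"urban": 0.9, "under_40": 0.8, "indie_film": 0.7}
doc_taste = {
    1: {"urban": 0.6, "indie_film": 0.1},
    2: {"urban": 0.2, "indie_film": 0.0},
    3: {"urban": 0.5, "indie_film": 0.9},
}

def personalized_score(doc_id):
    # Dot product of the user's taste weights with the document's taste profile.
    return sum(w * doc_taste[doc_id].get(dim, 0.0) for dim, w in user_taste.items())

print(sorted(index["camera"], key=personalized_score, reverse=True))  # reordered per user
```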

Unfortunately, personalizing along 100 taste dimensions leads to nearly as many permutations of recommendation rankings as there are atoms in the universe! Obviously there isn’t enough space to precompute which recommendations to show every possible type of person who queries a site. Precomputed answers also go stale too quickly: people expect real-time results, not hours- or days-old precomputed ones. If I tell Amazon I don’t like a book, I want to immediately see that reflected in my recommendations.
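
To make the scale concrete (a rough back-of-the-envelope estimate, not a figure from the original article): even if each of the 100 taste dimensions were just a yes/no signal, precomputing one ranking per distinct taste profile would require on the order of

$$2^{100} \approx 1.3 \times 10^{30}$$

stored result sets, and with only ten possible values per dimension that grows to $10^{100}$, more than the roughly $10^{80}$ atoms usually estimated for the observable universe.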

We’re at a turning point in how we need to build web sites to handle these sorts of personalization problems. While first-generation distributed systems split the application into three tiers — web servers, application servers and databases — second-generation systems build large non-real-time back-end clusters to analyze huge amounts of sales data, index billions of web documents etc.

A third generation of systems is now emerging, with the computation shifting from those back-end clusters into front-end real-time clusters. After all, you just can’t build a back end that precomputes personalized results for millions of Internet users. You have to compute them in real time.

Adding complexity, many personalization problems are more difficult to parallelize than a lot of traditional back-end applications. Indexing the words in web pages is actually a lot easier to parallelize than the long sequences of matrix calculations required to optimize a user’s recommendations.

Matrix calculations tend to involve complicated data access patterns that make it hard to partition the calculations and their data across a cluster of computers. Instead there tends to be a lot of sharing among many different computers, each of which holds a piece of the problem and updates the others as data changes. This back-and-forth data sharing is incredibly hard for the programmer to keep track of, and it can significantly degrade application performance.
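
A heavily simplified sketch of why this is painful (hypothetical code, not Hunch's actual system): one update to a single user's latent taste vector needs the latent vectors of every item that user has rated, and those vectors are scattered across whichever nodes happen to own them.

```python
import numpy as np

K = 8  # number of latent taste factors

# Item latent vectors partitioned across a simulated three-node cluster.
item_partitions = [
    {item_id: np.random.rand(K) for item_id in range(0, 100)},
    {item_id: np.random.rand(K) for item_id in range(100, 200)},
    {item_id: np.random.rand(K) for item_id in range(200, 300)},
]

def fetch_item_vector(item_id):
    # In a real cluster this is a network round trip to whichever node
    # owns the item; here we just scan the simulated partitions.
    for partition in item_partitions:
        if item_id in partition:
            return partition[item_id]
    raise KeyError(item_id)

def update_user_vector(user_vector, rated_items, learning_rate=0.01):
    # The items one user rated are an arbitrary subset, so their vectors
    # live on arbitrary nodes and must be pulled (or the computation
    # shipped to them) before any arithmetic can happen.
    for item_id, rating in rated_items.items():
        item_vector = fetch_item_vector(item_id)        # remote read
        error = rating - user_vector.dot(item_vector)   # prediction error
        user_vector = user_vector + learning_rate * error * item_vector
    return user_vector

user_vector = np.random.rand(K)
ratings = {3: 5.0, 142: 1.0, 287: 4.0}   # touches all three nodes
user_vector = update_user_vector(user_vector, ratings)
print(user_vector)
```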

The systems we’ve built at Hunch to solve this started off using distributed caching with memcached, but they very quickly veered into something more akin to distributed shared memory (DSM) systems, complete with multiple levels of caching, coherency protocols with application-specific consistency guarantees, and data replication for performance. With an abundance of processing cores at our disposal, the real challenges tended to revolve around getting the right data to the right core.
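
The article doesn't spell out Hunch's internals, so this is only a minimal sketch of the general pattern under assumed names: a small per-process cache layered over a shared store (memcached in production, a plain dict here so the sketch runs standalone), with a version number used to detect stale local copies.

```python
import time

# Stand-in for the shared memcached tier; a dict keeps the sketch runnable.
shared_store = {}       # key -> (version, value)
shared_versions = {}    # key -> latest version number

class VersionedCache:
    """Per-process cache over the shared store with version-based coherence."""

    def __init__(self, ttl_seconds=5.0):
        self.local = {}          # key -> (version, value, cached_at)
        self.ttl = ttl_seconds   # how long we trust a local copy blindly

    def get(self, key):
        entry = self.local.get(key)
        if entry is not None:
            version, value, cached_at = entry
            if time.time() - cached_at < self.ttl:
                # Application-specific consistency: accept slightly stale
                # data within the TTL window without asking the shared tier.
                return value
            if shared_versions.get(key) == version:
                self.local[key] = (version, value, time.time())
                return value     # still current; just renew the timestamp
        # Cold or stale: fall back to the shared store and repopulate.
        version, value = shared_store[key]
        self.local[key] = (version, value, time.time())
        return value

    def put(self, key, value):
        # Writes bump the version so other processes' copies become
        # detectably stale (a very crude coherency protocol).
        version = shared_versions.get(key, 0) + 1
        shared_versions[key] = version
        shared_store[key] = (version, value)
        self.local[key] = (version, value, time.time())

cache_a, cache_b = VersionedCache(), VersionedCache()
cache_a.put("user:42:recs", ["book-1", "book-2"])
print(cache_b.get("user:42:recs"))  # cache_b pulls the latest version
```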

I think that in a few years we’ll look back at this time as an era in which a slew of new large-scale programming challenges and their solutions were born. Hopefully we’ll also see more open-source solutions along the lines of memcached and Hadoop, so that building personalized and real-time web applications is easy for everyone.

Tom Pinckney is the co-founder & VP of engineering of Hunch.com.


  1. Tom, great post. The problem is spot on. The solution makes one key assumption — that you can’t approach the problem with pre-computation.

    That’s true for a service like Hunch which sees a tiny amount of data about me, especially when I first show up. Given how little you know about me, you have two choices: either you pre-compute info in an impossibly large space, which is impractical, just as you describe, or you do the type of real-time processing which is much more effective. So far so good.

    But that’s not necessarily the best way to approach the problem from the standpoint of someone who has a lot of data about me, e.g., Google or Facebook or even Amazon. The set of Internet-connected humans is small from a computational standpoint, and the metadata trail we leave is growing at a much slower rate than compute/storage. Pre-computing starting with 100 dimensions doesn’t work. Pre-computing starting with a few billion humans works really well, if you have a lot of data on those humans (rough arithmetic at the end of this comment).

    This is one of the fundamental advantages Amazon, FB, GOOG and others have compared to point services such as Hunch. It’s not a fair fight. So you have to innovate like crazy to compensate. Rock on!
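
    (Rough arithmetic for that claim, with purely illustrative numbers: a precomputed top-N list of roughly 1 KB per person gives

    $$2 \times 10^{9}\ \text{people} \times 1\ \text{KB} \approx 2\ \text{TB},$$

    which a modest cluster can hold, whereas precomputing per taste profile explodes combinatorially as the article describes.)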

    1. More content on the topic on my blog.

    2. Hi Simeon, you raise a couple of interesting questions.

      By real-time, what I’m talking about is having new information immediately taken into account. When I say I don’t like a book, I want to see that immediately reflected. I didn’t get into this in the article, but there’s obviously a continuum between pre-computed and real-time. For example, it may be possible to pre-compute fragments of information that can then be efficiently combined in real time to offer personalized content (see the toy sketch at the end of this comment).

      As for services like Hunch vs. Amazon recommendations, I absolutely agree that whoever has the best information wins. That’s why I like the fact that Hunch can ask any question to get to know you, while Amazon is only looking at correlations between products. Presumably we can make better recommendations over time if we know things like whether you’re looking for light summer reading or not, what kind of dog you have (http://hunch.com/media/reports/dogs/images/FrenchBulldogBook.png), etc.
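
      A toy sketch of that hybrid idea (illustrative names and data only, not our actual pipeline): item-to-item similarities are precomputed offline, and the real-time layer just combines those fragments with the user’s latest feedback.

      ```python
      # Precomputed offline (e.g., a nightly batch job): item -> similar items.
      precomputed_similar = {
          "book-a": {"book-b": 0.9, "book-c": 0.4},
          "book-d": {"book-b": 0.2, "book-e": 0.8},
      }

      def recommend(liked, disliked, top_n=3):
          # Real-time step: combine the precomputed fragments with fresh feedback.
          scores = {}
          for item in liked:
              for similar, s in precomputed_similar.get(item, {}).items():
                  scores[similar] = scores.get(similar, 0.0) + s
          # A new dislike takes effect immediately, no batch rebuild required.
          for item in disliked:
              scores.pop(item, None)
              for similar, s in precomputed_similar.get(item, {}).items():
                  scores[similar] = scores.get(similar, 0.0) - s
          ranked = [i for i in sorted(scores, key=scores.get, reverse=True) if scores[i] > 0]
          return ranked[:top_n]

      print(recommend(liked={"book-a"}, disliked=set()))
      print(recommend(liked={"book-a"}, disliked={"book-d"}))  # "book-b" score drops right away
      ```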

  2. [...] ownership, Hunch, personal Web, personalization, startups trackback Tom Pinckney has written a great post on GigaOm about the new kinds of processing companies have to do to create a highly personalized [...]

  3. Hey Tom, love the space you’re thinking in. But I believe the reverse is the right solution. We need smaller more intimate virtual assistants that communicate with each other, other users, and potent interactive search clients (like the uber search you are thinking of, but not that complex).

    I’m founding a company based on the idea. I’m starting out simple with a web service that listens to users’ social data, compresses it into semantic entities, and finds matching information, people and business ads. Check out victusmedia.com if you’re curious.

  4. I think you missed one important point. Learning.
    Systems today don’t learn, in the sense of self-organizing data. The best they do is memorize something, which is very different and only a small part of learning. Hence we have to program ourselves to death, with no way to win that game.

    Context is organized data on which information is processed. Language is not static. At any moment while you read this, your brain augments my text and in parallel processes a lot of other data, from emotions, smell and hearing to ….

    Second, speech and reading are not processed sequentially. In a very simple way, one could say: sequential input, parallel augmentation, sequential output. In all of this we convert input into abstractions or symbols to be able to keep up. Ever wondered why you can read a book, stop for six months or so, pick it up again and recall the storyline but have forgotten most of the details?

    The point is that we need systems that build and learn their own abstractions to organize the data we throw at them; otherwise we will always be ten steps behind, no matter how powerful any given massively parallel machine (which, by the way, has to be lock-free; see augmentation). Look at any given context on the web: there are only so many new ideas. They are just worded differently, which your brain has no trouble organizing into the same thing, but which throws any indexing scheme into a near-endless loop. It’s not a matter of processing power anymore; it’s again about smart parallel software, and no form of automated normalization will cut it.

    Just search iPad on site:gigaom.com. How much new information is really revealed? That’s the problem we have to solve.

  5. Hi Tom, great post as always!

    I wonder if there’s also room to elegantly reduce the ambition of the problems we’re trying to solve. For example, instead of solving the full Bayes net, people often use naive Bayes which, well, is naive, but greatly simplifies the problem.

    In the matrix multiplication case, instead of (or in addition to) redefining the problem, there might also be elegant ways to approximate the solution. A quick search returns a paper called “Fast Monte-Carlo Algorithms for Approximate Matrix Multiplication” (a rough sketch of the sampling idea appears at the end of this comment). Of course I don’t know if that particular paper’s solution is parallelizable.

    The higher level community problem is that most people don’t understand the basics of a personalization system. E.g. I don’t understand where matrix multiplication comes into play / what the specifics of your functionality are.

    Anyway, hope there’s more where this came from. Would love to learn more specifics about the particular personalization algorithms.

    Adam
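
    Here is a rough sketch of the sampling idea (an illustration of the general technique only; I haven’t checked it against that particular paper):

    ```python
    import numpy as np

    def approx_matmul(A, B, num_samples, rng=np.random.default_rng(0)):
        # Sample column/row pairs with probability proportional to the product
        # of their norms, rescale so the estimate is unbiased, and sum the
        # outer products. More samples -> a closer approximation of A @ B.
        n = A.shape[1]
        weights = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
        probs = weights / weights.sum()
        result = np.zeros((A.shape[0], B.shape[1]))
        for _ in range(num_samples):
            i = rng.choice(n, p=probs)
            # Each sampled term is independent, which makes this easy to parallelize.
            result += np.outer(A[:, i], B[i, :]) / (num_samples * probs[i])
        return result

    A, B = np.random.rand(50, 200), np.random.rand(200, 30)
    exact = A @ B
    approx = approx_matmul(A, B, num_samples=100)
    print(np.linalg.norm(exact - approx) / np.linalg.norm(exact))  # relative error
    ```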

    1. Thanks Adam! I definitely agree that parallel programming is (really) hard to do once you move beyond embarrassingly parallel problems. I think the only way you solve that is by figuring out basic patterns that people can reuse without being parallel computing specialists themselves.

      MapReduce and Hadoop are two great examples of this for back-end batch processing. What I’d love is for us to figure out what additional tools are needed for the types of problems I’m talking about here and then, ideally, make them open source.

  6. People search for things for different reasons. People create context by what they do, watch, buy, read, share, attend and listen to. No wonder all search engines are in the media business. The “doing” comes before or after the searching. Coherency comes from studying what is ultimately done with the information delivered.

  7. You should probably check out DirectEdge, who are a YC start-up working on exactly this problem – and they “get” it in terms of the importance of it in the greater scheme of things.

    You can find the main founder on Twitter as @scotchi.

  8. Tom, very interesting post.

    It’s interesting that you’re using the Distributed Shared Memory approach, pulling the data to where the computation is being done. Did you also investigate mechanisms to package up the current state of the processing and send it over to where the data is? That way the data being worked on at the moment is always local.

    I suppose some of the tradeoff involves the size of the current state of processing, and also how “clumpy” the data is. If the size of the current state is enormous, sending it around to different nodes would not be practical. Likewise sending the processing around would work better if the data is relatively clumpy, allowing significant amounts of work to be done before having to move the processing to yet another node.

    1. This is a great question.

      In our particular case, as our system gets bigger and bigger, there is no one node that has all the data the computation needs. For a smaller problem or one where the data a computation needs could be more localized, this would be a great solution.

      Instead, each computation we do tends to need a different arbitrary subset of the data. So we can’t find a single way to partition the data such that node 1 has all the data that some computations will need while node 2 has all the data some other computation will need.

      Ideally I’d like a toolkit where shipping computations to the data was one option and pulling the data to the computation was another option.
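
      Something like this toy interface is what I have in mind (entirely hypothetical, not an existing library): the caller chooses per call whether to ship the function to the node that owns the data or pull the data back and compute locally.

      ```python
      NUM_NODES = 2
      nodes = [{} for _ in range(NUM_NODES)]   # each "node" is just a local dict here

      def owner_of(key):
          # Trivial ownership rule for the sketch: hash the key onto a node.
          return nodes[hash(key) % NUM_NODES]

      def put(key, value):
          owner_of(key)[key] = value

      def run_at_data(key, fn):
          # Ship the computation: run fn where the data lives; only the result moves.
          return fn(owner_of(key)[key])

      def pull_data(key):
          # Pull the data: copy the whole value back to the caller.
          return list(owner_of(key)[key])

      put("user:1:ratings", [5, 3, 4])
      put("user:2:ratings", [2, 2, 5])

      print(run_at_data("user:1:ratings", sum))   # small result crosses the "network"
      print(sum(pull_data("user:2:ratings")))     # whole value crosses the "network"
      ```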

  9. I believe the solution is personalized, distributed search.

    A couple of years ago, I suggested such a system. The idea is to have your own “search engine” (it could be as simple as a WordPress installation) propagate the query to your network (back then, this used to be one’s blogroll, but today it could be your social network as defined by Twitter or Facebook) and collect the answers (see the toy sketch at the end of this comment).

    I think it’s worth reconsidering the idea:
    http://vrypan.net/log/2004/10/25/how-about-a-distributed-query-system/
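
    A toy version of the fan-out-and-merge step (stub peers stand in for friends’ search endpoints; in practice they would be queried over HTTP):

    ```python
    from concurrent.futures import ThreadPoolExecutor

    # Stub peers standing in for friends' search endpoints.
    def peer_alice(query):
        return [("alice.example/posts/17", 0.9)] if "camera" in query else []

    def peer_bob(query):
        return [("bob.example/reviews/3", 0.7)] if "camera" in query else []

    peers = [peer_alice, peer_bob]

    def distributed_search(query):
        # Fan the query out to every peer in parallel, then merge and rank answers.
        with ThreadPoolExecutor() as pool:
            result_lists = list(pool.map(lambda peer: peer(query), peers))
        merged = [hit for results in result_lists for hit in results]
        return sorted(merged, key=lambda hit: hit[1], reverse=True)

    print(distributed_search("compact camera"))
    ```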

    1. I’m with you on the “personalized distributed search approach”. We have created some tools to do just that although our data source has been twitter instead of blogs so far.

  10. Tom,

    We’re shipping a solution that works with all web servers and all existing back-end infrastructure, and it allows you to personalize the web in real time. You can download the software and see how it works. It’s very simple. It consists of a mobile app that aggregates your Who, What and Where metadata and then adds that data to the HTTP request headers (as an X header) when you navigate to ANY web server. All you have to do is read the headers, and then you can personalize the response. We also built in the ability to use HTML to dynamically change the browser menus to add your services to the Menu.

    The result is a personalized, real-time web that works with all your current infrastructure and without the requirement to learn anything new (a toy example of reading such a header follows below).

    Cheers,

    Peter
    5o9 Inc
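
    A toy example of the server side of that pattern, using a placeholder header name rather than the real ones:

    ```python
    from wsgiref.simple_server import make_server

    def app(environ, start_response):
        # An "X-Context" request header (hypothetical name) added by a client
        # shows up in the WSGI environ as HTTP_X_CONTEXT; branch on it to
        # personalize the response.
        context = environ.get("HTTP_X_CONTEXT", "")
        body = ("Personalized for: " + (context or "unknown visitor")).encode("utf-8")
        start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])
        return [body]

    if __name__ == "__main__":
        make_server("", 8000, app).serve_forever()
    ```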

