Blog Post

How Network Statistics Can Make Search Better And More Relevant

This week I read a fascinating article by Joe Weinman that was published in Business Communications Review. In it, he proposes an innovative concept that could initiate a paradigm shift in Internet search, fixing what may be its biggest problem: too many results, many of which are of limited relevance.

He argues that the search engine portals have been ignoring a wealth of information that can be gathered from the network layer. Search engines, he says, could use deep packet inspection, sampling, and other techniques to collect network traffic statistics, analyze and parse the statistics to better understand usage and popularity of sites and the pages within those sites — then add this knowledge to their algorithms to return more relevant results.

As Weinman explains:

Network traffic statistics such as unique visitors, interval between visitor arrival at a page or site and departure from a page or site, packets transferred, subsequent clicks from a page vs. reloads of prior pages, clicks leading to other pages within a site, and similar types of measures could be an excellent indicator of average user interest in a page or site, which in turn is a proxy for relevance.
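To make the measures in that passage concrete, here is a minimal Python sketch of how a few of them (unique visitors, time between arrival and departure, and onward clicks vs. returns to a prior page) might be computed per page. The `Visit` record, its field names, and the `page_interest` function are hypothetical illustrations, not anything specified in Weinman's article, and the sketch assumes the traffic has already been reduced to one record per page view.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Visit:
    """One page view, as it might be reconstructed from sampled traffic (hypothetical)."""
    visitor: str         # opaque visitor token, not a real identity
    page: str            # page that was viewed
    arrival: float       # arrival time, in seconds
    departure: float     # departure time, in seconds
    followed_link: bool  # True if the next click went onward, False if the visitor went back


def page_interest(visits):
    """Group visits by page and compute simple per-page interest measures."""
    by_page = defaultdict(list)
    for v in visits:
        by_page[v.page].append(v)

    stats = {}
    for page, vs in by_page.items():
        stats[page] = {
            "unique_visitors": len({v.visitor for v in vs}),
            "mean_dwell_seconds": mean(v.departure - v.arrival for v in vs),
            "onward_click_rate": sum(v.followed_link for v in vs) / len(vs),
        }
    return stats


# Made-up sample data for illustration only.
visits = [
    Visit("a", "/menu", 0.0, 90.0, True),
    Visit("b", "/menu", 10.0, 40.0, True),
    Visit("a", "/menu", 200.0, 230.0, False),
    Visit("c", "/empty", 5.0, 8.0, False),
]
stats = page_interest(visits)
```

How a search engine would weight numbers like these against its existing ranking signals is a separate question that the sketch does not try to answer.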

Weinman provides a great analysis of today’s Internet search technologies and builds a very credible argument, citing numerous examples of ways in which network traffic statistics can be used to aid search relevancy. One example asks: If you were trying to select from among several restaurants, would you count the number of reviews written about each one (a simplified version of Google’s (GOOG) PageRank algorithm), or go look in each of their front windows to see which of them were empty and which of them were packed with happy diners?

Although refining search relevance is the main thrust of the article, it is clearly in the interest of large global telecommunications providers to find a way to add value for their customers beyond providing commodity bandwidth and connectivity services. One logical assumption is that telcos and ISPs would want to gather network traffic statistics and sell the results to Internet search companies for inclusion in their algorithms.

The technology involved sounds suspiciously like a domestic surveillance program that was exposed last year and, as such, could raise privacy concerns. But this may not be an issue if the data were gathered anonymously and aggregated to show generic network traffic flows rather than personal information. After all, you don’t need to know anything about the diners in a restaurant to evaluate its popularity.
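As a rough illustration of what "gathered anonymously and aggregated" could mean in practice, the Python sketch below counts unique visitors per site and reports only the counts, dropping any site with too few visitors. The `popularity_counts` function, the `min_visitors` threshold, and the sample records are all assumptions made for this example. The threshold is a simple suppression heuristic, not a formal privacy guarantee.

```python
from collections import defaultdict


def popularity_counts(records, min_visitors=3):
    """Return {site: unique visitor count} for sites with at least min_visitors.

    records is an iterable of (subscriber_id, site) pairs. Subscriber ids are
    used only to de-duplicate visits and are never included in the result.
    """
    seen = defaultdict(set)
    for subscriber_id, site in records:
        seen[site].add(subscriber_id)
    return {
        site: len(ids)
        for site, ids in seen.items()
        if len(ids) >= min_visitors  # suppress sites seen by too few people
    }


# Made-up sample data for illustration only.
records = [
    ("u1", "joes-diner.example"),
    ("u2", "joes-diner.example"),
    ("u3", "joes-diner.example"),
    ("u1", "joes-diner.example"),  # repeat visit, counted once
    ("u4", "quiet-cafe.example"),  # only one visitor, so suppressed
]
counts = popularity_counts(records)
```

Like the restaurant window, the output says how busy each site is without saying who was inside.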

After reading Weinman’s article, I believe that it is technically possible to make search results more relevant using network traffic statistics. Such an approach could give birth to a whole new search algorithm, driven by network traffic statistics, which could in turn shift the strategic balance between the companies that run search engines and the network service providers that provide commodity bandwidth. But it raises the question: Are you comfortable with your service provider helping to provide this relevancy by using your network traffic data?

15 Responses to “How Network Statistics Can Make Search Better And More Relevant”

  1. By the way…

    I believe that Compete has entered into arrangements with various ISPs/Service Providers to take advantage of just this kind of information. If I recall, the CEO of Compete has publicly stated that the data corps will sell this information for the right price.

    Of course, this appears to open up a Pandora’s Box regarding privacy issues, but I’m also pretty sure that the agreement the user signs to get access allows the ISP to do whatever they choose with the data…


  2. Allan Leinwand

    @John and Chris – Agreed that others are collecting this data (AKAM and others). The key difference here in my mind is the ubiquity of data that could be collected by a service provider. They could grab every bit off the wire for every user. That type of data collection is not an opt-in service like gmail, Google Desktop or browser cookies but would be done without your consent (or it might be put into a EULA). I’m truthfully not sure if this sparks privacy concerns – every bit you put on the Internet should be considered public domain anyway.

    The fascinating part here to me is that this algorithm could make search more relevant and would make search engines customers of service providers (and not just for bandwidth).

  3. John Gannon

    Seems like it would be a good fit not only for Google, but the CDNs of the world (Akamai, etc). They are serving a ton of traffic and have a good sense of who is doing what and from which geographies. If you could securely repackage that data in a way that didn’t compromise anonymity, you’d have a pretty compelling offering to shop to the search providers.

  4. It really is amazing how much progress has been made in search accuracy, as far as getting you closer to what you really want to see. Yet, at the same time, there are countless cases in which some topics are flooded by things you most certainly don’t wish to see.

    In any case, it’s obvious that there are improvements to be made, but it’s nice to see a little delineation of the topic.

    Also, I’m a big privacy buff myself, yet I don’t find any data collecting by Google or anyone else to be particularly bad. In using their service, I am consenting to provide my data to improve that service, and even beyond that, if giving up some personal data enables me to find what I want to find faster (or even find things I never would have otherwise found), I’m all for it.

  5. The concept does offer a new way to improve search. But, as Niraj pointed out, the concern is that it speaks more to popularity than to relevancy. If we assume that popular items have to be more relevant to a search, that may lead to problems when searches involve complex queries.

    One more thing that would need to be tackled is inaccuracy in the data provided.

  6. I think we’re missing the obvious here. Guess who has detailed stats on visitors, time per page, click path and so on for thousands of sites already?

    You got it – Google. Via Google Analytics, they’re collecting tons of this data every day. It seems natural and obvious that they will at some point start incorporating that knowledge into search.

  7. Allan,
    Cool idea – I think US legislation prevents the carriers from using this data in the way you describe.

    However, they may be able to give this data to a third party on a “no names basis”. This third party could then use the data for targeting. Think – “people in your zip code go to Joe’s plumbing site more than any other plumber… inferred local link love at last”. This search engine optimization, combined with yellow pages local advertising and an “AT&T” browser = a very interesting model for the carriers –

    IF they could ever pull the threads together….

  8. Sounds like an interesting idea, and I agree that there could be ways to avoid the privacy issue. But I’d be worried about how the information gets integrated because there’s the risk that it would end up making popular sites appear higher in results (and thus make them more popular), while failing to let smaller sites rise to the top.

    Basically I think it could potentially skew the search results from relevancy to popularity if it is given too much weight.

  9. Allan Leinwand

    sam – True, you don’t absolutely need a service provider to gather this information, but if you can gather traffic statistics in the network the data would be complete and potentially more relevant.

  10. hey allan….

    why the need for the isp to give this information…

    you can get most of what you’d need for this kind of analysis by creating a browser plugin that the user would manage for his/her own searching.

    the plugin would be able to track/capture various data that the user then chooses to make available as he/she sees fit. you’d need millions of users to use this kind of system to be able to have relevance, but the benefits of this kind of approach are many:

    1) it’s above board, no spammy/slimy apps
    2) user in complete control of the data and where it goes/who uses it
    3) a solid core of users who’ve downloaded/installed the plugin.

    the downside, you need to give/provide the user community with something of value in return for being able to access the information… might be better search results, might be coupons, might be who knows…