Summary:

Skype, the Internet telephony service, went on the blink last week, stranding millions who rely on it to communicate. In conversation, CEO Tony Bates revealed that the problem might be some errant Windows clients, but that only hinted at the true cause, which I explain here.

Skype, the Internet telephony service, went on the blink last week, stranding millions who use it for their communication needs. It took more than a day for the service to be restored. In conversation, CEO Tony Bates told me that the problem might lie with some errant Windows clients. Well, make that many errant Windows clients! Today Skype’s Chief Information Officer, Lars Rabbe, offers more details in a blog post.

In a nutshell, Skype says a bug in its Windows client software led to the overloading of certain supernodes, which crashed and set off a chain reaction of problems.

On Wednesday, December 22, a cluster of support servers responsible for offline instant messaging became overloaded. As a result of this overload, some Skype clients received delayed responses from the overloaded servers. Because of a bug identified in a version of the Skype for Windows client (version 5.0.0.152), the delayed responses from the overloaded servers were not properly processed, causing Windows clients running the affected version to crash.

Around 50 percent of all Skype users globally were running the 5.0.0.152 version of Skype for Windows, and the crashes caused approximately 40 percent of those clients to fail. These clients included 25–30 percent of the publicly available supernodes, which also failed as a result of this problem.
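To put those two percentages together, here is a rough back-of-the-envelope sketch in Python. It assumes the figures simply multiply, that is, that the 40 percent failure rate applies uniformly to the half of users on the affected version; Skype's post does not state a combined number, and the function name is purely illustrative.

```python
# Back-of-the-envelope estimate based on the two figures quoted above.
# Assumption (not from Skype's post): the failure rate applies uniformly
# to everyone running the affected version, so the shares multiply.

def crashed_share_of_all_clients(version_share: float, crash_rate: float) -> float:
    """Return the fraction of ALL clients that crashed."""
    return version_share * crash_rate


share_on_affected_version = 0.50  # users running Skype for Windows 5.0.0.152
crash_rate_on_that_version = 0.40  # portion of those clients that failed

crashed = crashed_share_of_all_clients(
    share_on_affected_version, crash_rate_on_that_version
)
print(f"Roughly {crashed:.0%} of all Skype clients crashed")
```

Under that assumption, about one in five Skype clients worldwide went down at once.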

I wonder if some of these problems were brought on by recently introduced aggressive “forced updates” which have not gone down well with some users. Voxeo CEO Jonathan Taylor offered up the theory that buggy software that was pushed on to Windows users was to blame.

If you had the latest Skype for Windows (version 5.0.0.156), older versions of Skype for Windows (4.0 versions), Skype for Mac, Skype for iPhone, Skype on your TV, or Skype Connect or Skype Manager for enterprises, you were not initially affected by this problem. However, with 25–30 percent of Skype’s supernodes going down, it quickly became a network-wide problem.

A supernode is important to the P2P network because it takes on additional responsibilities compared to regular nodes, acting like a directory, supporting other Skype clients and establishing connections between them by creating local clusters of several hundred peer nodes per each supernode.

Once a supernode has failed, even when restarted, it takes some time to become available as a resource to the P2P network again. As a result, the P2P network was left with 25–30 percent fewer supernodes than normal. This caused a disproportionate load on the remaining available supernodes. A significant proportion of users were also restarting crashed Windows clients at this time. This massively increased the load as they reconnected to the peer-to-peer cloud.
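The "disproportionate load" is easy to see with a little arithmetic. The sketch below is only a simplified model: it assumes total demand stays constant and is spread evenly across the surviving supernodes, which Skype's post does not claim, and it ignores the extra surge from restarting clients. The function name is hypothetical.

```python
# Simplified model of load redistribution after supernode failures.
# Assumptions (not from Skype's post): total demand is unchanged and is
# shared evenly by the supernodes that remain. The reconnect surge from
# restarted clients is NOT modelled and would push the real load higher.

def load_multiplier(failed_fraction: float) -> float:
    """Relative load on each surviving supernode (1.0 means normal load)."""
    if not 0.0 <= failed_fraction < 1.0:
        raise ValueError("failed_fraction must be in the range [0, 1)")
    return 1.0 / (1.0 - failed_fraction)


for failed in (0.25, 0.30):
    print(
        f"{failed:.0%} of supernodes down -> "
        f"{load_multiplier(failed):.2f}x load per surviving supernode"
    )
```

Even in this optimistic model, losing 25–30 percent of supernodes means each survivor carries roughly a third to two-fifths more traffic than usual, before counting the wave of clients trying to reconnect.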

To deal with the problem, Skype essentially introduced “thousands of instances” of the Skype software into its P2P network and created temporary supernodes. The biggest lessons learned from this, Rabbe writes, are:

  1. More investment in infrastructure so that the system becomes and stays reliable.
  2. More rigorous testing procedures that don’t let buggy software out into the market.

This is not the first time Skype’s systems have come under pressure because of software bugs. In August 2007, Skype had software problems as well, which in turn caused a flood of log-in requests and crashed the network.

  1. One more reason NOT to use Windows.

  2. Oh come on, you can’t blame Windows for this one. It was a Skype problem with some badly written Skype code for the Windows problem.
    So put down your Apple / Nix soap box and shut up.

    1. Hmmm…. how am I on my Apple soapbox. It seems you want to read what you want to read. The article clearly states that it is a problem with Skype.

      1. Do you really not understand he was responding to the post above?

      2. And they both are rude.

        aep528: muppets should have clicked the Reply button – IF that was the comment he meant to shut up (rude).

        fwiw: Apple was not referenced in my comment, was it?

  3. No, it’s a Skype problem. They rely on their end users to provide the computing power necessary (instead of running centralized servers), and bill you for the privilege when you want to use premium services. A brilliant business model.

    1. +1 to that. And there is a reason why they don’t want to draw attention to the forced updates that seem to have pushed out buggy software so quickly.

      1. True but their architecture is arguably more resilient and redundant than any major SP (though maybe their software dev/QA/update processes need some tuning).

        Question: do we know if the temporary supernodes were regular, user-owned computers, just promoted, or if they were in fact Skype owned/rented nodes?

  4. [...] propel mobile video conferencing into the mainstream. The communications platform, still fresh off its 24-hour outage, is expected to unveil a mobile video offering next week at CES and has been recently teasing what [...]

  5. A buggy Windows client may be the instigator, and the forced upgrade could have exacerbated the problem. But the lack of some operational procedures is more glaring:
    1. not ensuring that the population of supernodes is diverse (not same OS, not same version of app)
    2. not protecting overloaded supernodes from new additions to the network
    3. not preventing new nodes from being added, which would increase signalling traffic between the supernodes

  6. [...] which was hit by more than 24 hours of downtime just a few weeks ago. That outage was the result of buggy Windows clients overloading certain supernodes that are part of the Skype network. In essence, after the holidays and the Consumer Electronics [...]
