
Summary:

Last week, Orange France’s mobile network tanked, knocking out the mobile phones of millions of subscribers. This week the same thing happened to O2 in the U.K. U.S. carriers like Verizon and T-Mobile aren’t immune either. Global networks have developed a big signaling problem.


Updated. Last week, Orange France’s mobile network tanked, knocking out the mobile phones of millions of subscribers. This week the same thing happened to O2 in the U.K. The U.S. isn’t immune either. Just last week T-Mobile suffered from a smaller glitch, but the granddaddy of all network failures hit Verizon Wireless in December when its LTE network went down on three separate occasions in a single month.

Why are networks suddenly conking out all over the world? It looks like global networks are developing a signaling problem – more specifically a signaling overload problem.

Details are starting to emerge about just what caused the Orange and O2 outages. Computerworld UK and Information Age separately reported that the network element at fault in both cases was the home location register, or HLR. It’s not exactly the most commonly known piece of gear, but in brief the HLR acts as an anchor point to which we remain tethered as we move about the network. It stores our subscriber identities and knows what services we can access, but most importantly, it tracks each device’s present location so the network knows where to direct inbound and outbound traffic.
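
For the technically inclined, here is a purely illustrative sketch in Python of what the HLR conceptually does (the class names, fields and identifiers are invented for this example, not any vendor's schema): it keeps a record per subscriber, accepts location updates as the device moves, and answers the "where do I deliver this?" question when traffic arrives.

    from dataclasses import dataclass, field

    @dataclass
    class SubscriberRecord:
        """Illustrative stand-in for an HLR entry; not a real vendor schema."""
        imsi: str                                   # subscriber identity
        services: set = field(default_factory=set)  # e.g. {"voice", "sms", "data"}
        serving_area: str | None = None             # last reported location (e.g. a VLR id)

    class ToyHLR:
        """Toy home location register: who you are, what you may use,
        and, crucially, where the network last saw you."""

        def __init__(self):
            self._db: dict[str, SubscriberRecord] = {}

        def provision(self, imsi: str, services: set) -> None:
            self._db[imsi] = SubscriberRecord(imsi, services)

        def location_update(self, imsi: str, serving_area: str) -> None:
            # Devices report these constantly as they move between serving areas.
            self._db[imsi].serving_area = serving_area

        def route_inbound(self, imsi: str) -> str:
            # Called when traffic arrives for a subscriber: which area do we page?
            record = self._db[imsi]
            if record.serving_area is None:
                raise LookupError("subscriber location unknown; cannot deliver traffic")
            return record.serving_area

    # A call or packet can only be delivered because the HLR knows the area.
    hlr = ToyHLR()
    hlr.provision("208-01-123456789", {"voice", "data"})
    hlr.location_update("208-01-123456789", "VLR-Paris-07")
    print(hlr.route_inbound("208-01-123456789"))   # -> VLR-Paris-07

When a node like this goes down, it isn't the airwaves that stop working; it's that lookup step that has no answer, which is exactly the failure mode described below.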

The HLR plays its dispatch role by receiving a constant stream of signals from devices updating the database on their current locations and activities. According to Computerworld, a data glitch in an Orange HLR node generated error messages, which then multiplied as they got knocked back and forth around the network. The fact that the HLR was failing didn’t stop devices from sending out their updates. Like a million kids screaming “look at me!” from the backseat while you’re trying to deal with the coffee you just spilled in your lap, smartphones kept pinging the suffering HLR, creating a huge bottleneck. The end result: the whole system fails, leaving millions of handsets without their lifelines to the network core.
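
To see why a struggling HLR gets buried rather than merely slowing down, it helps to model the retry behavior: every update that goes unanswered comes back again on top of the new traffic. The toy simulation below uses completely made-up numbers and is only meant to show the shape of the feedback loop, not how any carrier's core actually behaves.

    # Toy model of signaling overload: a node serves `capacity` requests per tick,
    # fed by devices that retry whenever they go unanswered.
    # All numbers are invented for illustration; real cores are far more complex.

    def simulate(ticks=10, devices=1_000_000, update_rate=0.01,
                 healthy_capacity=20_000, degraded_capacity=2_000, failure_tick=3):
        backlog = 0
        for t in range(ticks):
            capacity = healthy_capacity if t < failure_tick else degraded_capacity
            new_updates = int(devices * update_rate)  # routine location updates
            offered = backlog + new_updates           # retries pile on top of new traffic
            served = min(offered, capacity)
            backlog = offered - served                # everything unserved will retry
            print(f"tick {t}: offered={offered:>9,} served={served:>7,} backlog={backlog:>9,}")

    simulate()

Once the degraded node's capacity drops below the steady stream of updates, the backlog grows every tick and the retries themselves become the dominant load, which is the bottleneck described above.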

If the Orange and O2 failures sound familiar, it’s because the exact same thing happened to Verizon in December. Since Verizon’s network is an LTE system, not an HSPA one, its core architecture is a bit different, but the basic problem seems to be the same. A software bug generated error messages that backed up its core elements, causing them to be oversaturated by signals and ultimately forcing the whole core to crash.

A whole lot of bandwidth but nowhere to go

In all three situations, the radio networks weren’t the problem. The networks still had plenty of capacity, and all devices were capable of connecting to their towers to send and receive data. But with a broken core, the networks had no idea where, or to whom, to send that data. Imagine playing Where’s Waldo? with 10 million people in a single storybook frame.

In Verizon’s case you could chalk it up to the relative newness of both the network and the LTE standard, but in the case of Orange and O2, their UMTS networks have been up and running for nearly a decade. For their HLRs to now start developing random terminal bugs seems rather odd. The problem doesn’t appear to be inherent in the equipment itself but in the sheer volume of signaling traffic traversing mobile networks driven by the smartphone boom.

That constant network chatter from smartphones and their applications is overwhelming network cores. On normal days the cores can handle the traffic, but even a small glitch throws everything out of whack. Smartphone use is only increasing, so this problem is only going to get worse.

What’s to be done?

If you talk to the signaling system vendors such as Tekelec, Acme Packet, Traffix Systems, Intellinet and Openet, you’ll get a single resounding answer: Diameter! Diameter is the signaling protocol used in LTE core networks, and those vendors claim that more robust and flexible routers using the protocol will nip the signaling problem in the bud. Diameter’s load balancing techniques would allow the network to shift the signaling load away from elements experiencing problems – isolating failures rather than allowing them to infect everything around them.
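
Stripped of the marketing, the pitch is a routing layer that watches peer health and steers signaling away from a node that has started throwing errors, so one sick element doesn't soak up everyone's retries. The sketch below shows only that load-shedding idea; it is not Diameter itself or any vendor's product, and the peer names and thresholds are invented.

    import random

    class SignalingRouter:
        """Minimal illustration of the load-shedding idea behind Diameter-style
        routing agents: track peer health and stop offering traffic to sick peers."""

        def __init__(self, peers, error_threshold=5):
            self.errors = {p: 0 for p in peers}   # peer -> consecutive error count
            self.error_threshold = error_threshold

        def healthy_peers(self):
            return [p for p, e in self.errors.items() if e < self.error_threshold]

        def route(self, request, send):
            """Pick a healthy peer; on failure, mark it and fail over to another."""
            candidates = self.healthy_peers()
            random.shuffle(candidates)
            for peer in candidates:
                try:
                    answer = send(peer, request)
                    self.errors[peer] = 0         # success resets the error count
                    return answer
                except ConnectionError:
                    self.errors[peer] += 1        # failures accumulate; peer gets quarantined
            raise RuntimeError("no healthy signaling peers available")

    # Hypothetical usage: hlr-2 is misbehaving, so traffic drains to hlr-1 and hlr-3
    # instead of every retry landing on the broken node.
    def flaky_send(peer, request):
        if peer == "hlr-2":
            raise ConnectionError("hlr-2 returning errors")
        return f"{peer} handled {request}"

    router = SignalingRouter(["hlr-1", "hlr-2", "hlr-3"])
    for i in range(10):
        print(router.route(f"update-{i}", flaky_send))

Real Diameter agents do this with standardized result codes, watchdog messages and weighted load distribution, but the principle the vendors are selling is the same: isolate the failing element instead of letting its errors ricochet around the core.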

Given O2 and Orange’s failures, those vendors are jumping at the chance to claim Diameter routers are now necessary for 3G networks as well, and they’re probably right. The vast majority of smartphone traffic currently runs through 3G towers, and it’s going to remain that way for a while. But Diameter is by no means a cure-all.

Verizon has experienced a record number of network failures even though it uses the next-generation signaling protocol and despite the fact that it implemented Tekelec’s Diameter platform last year. Tekelec certainly isn’t to blame for the outages – they were caused by software bugs in other elements – yet its Diameter routers weren’t able to contain the problem, either, when the network started going haywire. Update: While Tekelec in August revealed that Verizon was a customer for its Diameter signaling router, Tekelec officials told me that Verizon hadn’t actually deployed its equipment by the time of the December outages.

Whatever the eventual cure, the wireless industry had better find it quick. O2’s London outage was particularly embarrassing because of the upcoming Olympics. But other operators should be just as worried. A network that needs to be shut down and rebooted every few months isn’t much of a network at all. 

Tower Image courtesy of Flickr user Nikhil Verma; Compass photo courtesy of Shutterstock user Sashkin

  1. I was just going to ask if there isn’t some sort of anycast equivalent for cellular – sounds like diameter may be it.

  2. Telco Engineer Friday, July 13, 2012

    You clearly don’t understand what you are talking about and are just regurgitating some sales spin sold to you by a bunch of telco vendors looking to push their kit.

    In these failures it generally was one of the HLRs (most networks have multiple, and users are stored on a specific node); this is why the effect was seen across the whole network but only for a subset of customers.

    However, any claim that it is related to the amount of smartphone traffic and apps is just jumping on the latest trend in order to blame that.
    The HLR is involved in authentication and mobility management but not in traffic handling. A massive increase in data traffic through the packet core would cause very little increase in load on the HLR; the only thing that would increase HLR signalling would be users moving between VLRs, which roughly cover the area of a city.

    In addition, the newer HLRs are orders of magnitude larger than their predecessors and therefore scaled to handle the subscriber volumes of modern networks. The latest generation of systems from the likes of Nokia & Ericsson are in fact clustered systems, with multiple front ends to handle the signalling from the network and multiple back-end databases to hold the records. These are usually then geographically distributed across the operator’s sites. They are extremely resilient, but the downside is that if the entire cluster fails it can take longer to restore; they also hold a lot more users in a single cluster than the traditional single-platform solutions, so an outage of one node is rarer but affects a bigger section of your customer base.

    1. You are right, he doesn’t know what he is talking about. The HSPA networks use HLR/VLRs, but the LTE networks use HSS systems as specified in IMS. In fact the VZ outage was caused by Diameter signalling overload in the IMS PCRF, and Tekelec was almost certainly partially to blame along with the vendor of the CSCF proxy.

  3. Tsahi Levent-Levi Saturday, July 14, 2012

    It is interesting to see how these networks cope better than their cloud counterparts: all major cloud providers had an outage in the past year or so – Amazon had one just last week.
    The complexity of both is rather similar, but we tend to forget it and think of carrier networks as things that must never fail – a bit more than we do for cloud providers.
    This will probably change as we start relying on cloud providers more with each passing day.

  4. So here’s the thing. Networks run and then they fall over, which suggests that something went wrong, whether that is a software upgrade to a critical node, a new technology being introduced that isn’t quite as well understood at scale as it might be, or a load threshold being surpassed. The thing about networks is that they are pretty good at resisting the minor issues – network nodes get deployed with redundancy built into componentry and, when a node is critical, with geo-redundancy and warm standby. So when something breaks, the network generally adapts and no one notices unless you happen to be sat in the NOC.

    That equally means that when something goes wrong that people do notice, it is usually something huge. Kevin might have got it wrong by blaming Diameter for everything, but he is right on the money for Verizon. Diameter is a protocol for AAA, but it is being used in LTE for something slightly skewed from that, and Verizon’s LTE network is the biggest, shiniest LTE network there is right now. If anyone was going to find the bugs that come with stress testing Diameter as a protocol and the scalability of Diameter interfaces, it was them. They were bitten by it, and they have deployed a solution. No one thinks Verizon’s LTE subscriber numbers have flatlined, so we can only assume that whatever they have done to rectify the issue has solved the problem.

    For other outages, there is no Diameter in play. That means something else has gone wrong. Because the issues have been non-geographic, it suggests they are not related to access networks, so they are more likely core nodes that have fallen over. A duff HLR upgrade has been reported as the culprit at Orange, which would explain why it impacted pretty much everyone. O2’s outage affected only some customers, so it could be something involving core network signalling interfaces. The problem is that when a core node dies in a big and spectacular way, there can be a domino effect where other nodes try to take up the traffic load but get swamped by signalling, which then causes those nodes to either back traffic off or close up shop themselves.

    However, I do think Kevin’s point on traffic load is a valid one. Networks are still engineered in many cases on the basis of some old-world thinking – phones make phone calls, send and receive text messages and attach to the internet when the customer wants them to. This isn’t true anymore, and whilst all of these use models still exist, network signalling has massively increased because smartphones have ‘a mind of their own’, attach and detach from networks at the behest of applications, and do this often multiple times for each individual application the device has running. It is not the VLR that is creating the signalling load but the SGSN and GGSN, both between themselves and towards the HLR. There has been work done to try and offset some of this signalling, but that doesn’t change the fact that more devices are being used to attach to networks to do more things more often.

    Whether it is Diameter, MAP, GTP-C or, in the future in all likelihood, SIP, there is going to be more signalling traffic, and networks need to adjust their engineering principles to account for that. Many have – the famous outages of AT&T in Manhattan and O2 in London of a few years back have not been repeated for a little while – but throw in LTE, plus the potential step-change in connections that M2M is suggested to create, and it will be signalling as well as data that needs to be considered when designing a network that scales to support all traffic from all sources.

    1. I completely agree with those who say more background is required to write such an article; otherwise nothing but random pieces of information are put together, being of little use, since technically the whole thing doesn’t make any sense!
      What you’re saying here is that there was an increase in signalling messages that led to a network crash. Hmm, sounds familiar to me, for in 2001 Vodafone Spain’s network crashed for a similar reason, and tadaaa! There were no smartphones on the scene.
      Whenever a new architectural design takes place for core nodes, the whole thing is at stake. HLR architecture has remained the same for decades. Even though they’ve always been cared for, they were old friends to network designers and engineering. Now, when the different subscriber profiles (LTE, IMS, GSM, FNR…) are bound to come together in a single database (the so-called next-generation HLR, splitting the FE from the DDBB), the unstable behaviours of new designs, long ago forgotten for the HLR, have arisen.
      Now, resilience has to be strengthened in all vendors’ next-generation HLRs. Meanwhile, let’s hope no other similar case arises.
      If it were only a matter of increasing signalling capacity, I must say all operators are already skilled at that and used to it.
      Future evolution to IP is a different topic to discuss, but it has nothing to do with the above.

  5. Kevin,
    It would be interesting to see which of the carriers that have deployed Diameter core networks have tested their network architecture in the lab and which have not.

