Summary:

CloudFlare’s web security service suffered an hour-long outage after the company tried to respond to a DDoS attack on one of its customers.

CloudFlare’s web security service went down for about an hour starting at 2:47 a.m. PDT Sunday, taking its customers down with it. The service was back up at 3:49 a.m. PDT, according to the company’s post-mortem. CloudFlare attributed the outage to a system-wide failure of its Juniper edge routers that began after the company tried to block a DDoS attack on one of its customers.

Affected sites included WikiLeaks, 4chan and others, according to a TechCrunch report.

One reason CloudFlare opts for Juniper is the latter’s support for the Flowspec protocol, which lets customers propagate router rules across a large number of routers quickly, according to the company post. This comes in handy because CloudFlare is constantly updating rules to combat ever-changing attacks and to re-route traffic as needed to optimize performance.

This morning CloudFlare detected a DDoS attack on one of its customers, and its attack profiler reported that the offending packets were between 99,971 and 99,985 bytes long.

A rule based on that profile was then distributed via Flowspec to stop the attack. From the post-mortem:

“Flowspec accepted the rule and relayed it to our edge network. What should have happened is that no packet should have matched that rule because no packet was actually that large. What happened instead is that the routers encountered the rule and then proceeded to consume all their RAM until they crashed.”
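
To illustrate why no packet should have matched: the IPv4 header stores a packet's total length in a 16-bit field, so no IPv4 packet can be larger than 65,535 bytes. The sketch below shows one way a rule like this could be sanity-checked before being distributed. It is only an illustration; `LengthRule` and `can_match_any_packet` are hypothetical names and are not taken from CloudFlare's profiler or from any Flowspec implementation.

```python
from dataclasses import dataclass

# Bounds on the IPv4 Total Length field: a bare header is 20 bytes,
# and the 16-bit field tops out at 65,535 bytes.
MIN_IPV4_PACKET_BYTES = 20
MAX_IPV4_PACKET_BYTES = 65_535


@dataclass(frozen=True)
class LengthRule:
    """Hypothetical stand-in for a 'match packets in this size range' rule."""
    min_bytes: int
    max_bytes: int


def can_match_any_packet(rule: LengthRule) -> bool:
    """Return True if some valid IPv4 packet length falls inside the rule's range."""
    if rule.min_bytes > rule.max_bytes:
        return False  # empty range
    # Two closed intervals overlap when each one starts before the other ends.
    return (rule.min_bytes <= MAX_IPV4_PACKET_BYTES
            and rule.max_bytes >= MIN_IPV4_PACKET_BYTES)


if __name__ == "__main__":
    reported = LengthRule(min_bytes=99_971, max_bytes=99_985)
    print(can_match_any_packet(reported))
```

Under these assumptions, the reported range of 99,971 to 99,985 bytes fails the check, while a range such as 500 to 600 bytes passes.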

Service was restored after about an hour, although CloudFlare said it is still examining the issue and has contacted Juniper to find out whether a known bug is involved or whether the problem is unique to CloudFlare’s implementation.

Update: On Monday, Juniper said via email that it is looking into the reported network outage. “While we have not completed our investigation, we believe this incident was triggered by a product issue that Juniper identified last October, when a patch was also made available. Our customer support team is actively supporting Cloudflare in its efforts to resolve the issue and we are not aware of any other customers experiencing similar issues.”

Cedexis' Radar view of CloudFlare outage.

Given that the number of DDoS attacks is on the rise, web sites had better gird themselves and hope their security vendors are taking proactive steps to keep ahead of the problem.

This story was updated at 12:25 p.m. PDT with Juniper’s comment.

  1. “Given that the number of DDoS attacks is on the rise, web sites had better gird themselves and hope their security vendors are taking proactive steps to keep ahead of the problem.”

    If anybody was using our proactive nameservers ( https://web.easydns.com/proactive.nameservers/ ) then they wouldn’t have felt a thing, as our system would have swapped in their hot spares while the outage lasted and then put their regular nameservers back once CloudFlare was back up again.

    Given that any DNS provider is vulnerable to these types of episodes, using multiple providers with hot backups, some active monitoring and on-the-fly switching makes sense.

  2. Blame a DDoS attack??? Give me a break! This is another example of network engineering MIA. Various tests are part of the design process prior to a release being pushed to the production environment. I wouldn’t be pointing fingers at the Juniper SE. No, whoever is responsible for network engineering at CloudFlare needs to have their a!! handed to them. I’d put them in “Herding Cats” in the article referenced below:

    http://www.networkperformanceinnovations.com/blog/where-are-you-along-the-network-service-design-continuum/
