What happens if you can’t believe your own dashboard? Whether it’s for your car, your plane or your computing cloud, it’s not a good thing if the console that’s supposed to tell you what’s really going on just isn’t doing so.
That’s why the recent Heroku-Rap Genius dustup is important. To recap: About two years ago Rap Genius, which runs its Ruby-based application on Heroku’s platform as a service, started noticing performance issues. As traffic grew, it dutifully added more Heroku resources, or “dynos” in Heroku parlance. But performance still lagged. Rap Genius fielded lots of customer complaints even though its Heroku log files and the related New Relic dashboard said nothing was amiss.
Customer mandate: transparency and trust
It turns out that Heroku, the PaaS company acquired by Salesforce.com in 2010, had tinkered with the routing underpinnings of its platform in such a way that incoming requests were no longer being distributed optimally across dynos. This move from “intelligent load distribution” to “random load distribution,” plus the fact that the change was not documented, let alone publicized, to customers, was the issue.
In a February 13 Rap Genius blog post detailing the issue, the company said:
“A Rails dyno isn’t what it used to be. In mid-2010, Heroku quietly redesigned its routing system, and the change — nowhere documented, nowhere instrumented — radically degraded throughput on the platform. Dollar for dollar a dyno became worth a fraction of its former self.”
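To see why a switch from “intelligent” to “random” load distribution could make a dyno worth less, here is a toy queuing simulation. It is only a sketch under stated assumptions, not Heroku’s actual router: it assumes each dyno handles one request at a time, that requests arrive at random intervals with random service times, and that “intelligent” means sending each request to the dyno that frees up soonest. The function name, parameters and default values are all hypothetical.

```python
import random


def simulate(policy, n_dynos=10, n_requests=20000, utilization=0.7, seed=1):
    """Return the mean time a request spends waiting for its dyno.

    Toy model, not Heroku's router. Assumptions: each dyno serves one
    request at a time; arrivals and service times are exponentially
    distributed; mean service time is 1.0 (arbitrary time units).
    """
    workload = random.Random(seed)      # drives arrivals and service times
    router = random.Random(seed + 1)    # drives random routing choices only
    arrival_rate = utilization * n_dynos  # requests per unit of time
    free_at = [0.0] * n_dynos           # when each dyno next becomes idle
    now = 0.0
    total_wait = 0.0

    for _ in range(n_requests):
        now += workload.expovariate(arrival_rate)
        service_time = workload.expovariate(1.0)

        if policy == "random":
            # Pick any dyno, even one that is already busy.
            dyno = router.randrange(n_dynos)
        else:
            # "Intelligent" stand-in: pick the dyno that frees up soonest.
            dyno = min(range(n_dynos), key=free_at.__getitem__)

        start = max(now, free_at[dyno])
        total_wait += start - now
        free_at[dyno] = start + service_time

    return total_wait / n_requests


if __name__ == "__main__":
    for policy in ("random", "least_busy"):
        print(policy, round(simulate(policy), 3))
```

In this model the same number of dynos at the same traffic level produces a noticeably longer average queue under random routing, because a request can land behind a busy dyno while others sit idle. That is the kind of queuing delay Rap Genius says its dashboards were not reporting.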
Rap Genius co-founder Tom Lehman described what happened in a recent phone interview: “We had been running 90 dynos at $20,000 a month, which we thought was sufficient based on the incorrect data we were getting, but it turned out that 90 dynos was woefully insufficient. So we upgraded to 300 dynos at $40,000 per month and performance is still bad. We can’t pay $40,000 a month for this.”
On February 16, Heroku issued a more detailed apology and outlined a plan of action including:
- Improving our documentation so that it accurately reflects how our service works across both Bamboo and Cedar stacks
- Removing incorrect and confusing metrics reported by Heroku or partner services like New Relic
- Adding metrics that let customers determine queuing impact on application response times
- Providing additional tools that developers can use to augment our latency and queuing metrics
- Working to better support concurrent-request Rails apps on Cedar
When asked for comment, Heroku referred back to its blog post.
Lehman said his company is in a tight spot. It can’t sustain payments of $40,000 per month. “Unless something changes we have to move.”
The likely destination? Amazon Web Services, a transition he would not take lightly because Heroku does much that AWS cannot. On the other hand, many of Rap Genius’ third-party providers are already on AWS. “I still have love for Heroku. Without it we couldn’t get to where we are today, but they have not been 100 percent upfront with customers.”
In his view, this should not be the end of the story. “We feel Heroku (and therefore Salesforce.com) overcharged and misled a bunch of small (and big!) start-ups and if they indeed did something wrong they should be held accountable.”
As if on cue, Kristensen Law Group started soliciting plaintiffs for a class action suit against Heroku.
The bigger picture
I’ve asked Lehman if he is party to the class action suit and will update when he responds, but let’s get back to the broader issue. (Update: Lehman said he is not part of the lawsuit.)
Companies already get the heebie-jeebies over the perception that moving to the cloud involves a “loss of control” over their IT. Imagine the impact if they think they can’t trust or believe in the metrics they’re given by their providers.
This is about way more than Heroku and Rap Genius. It’s about customer trust, and a lack of that trust is a real danger to cloud adoption.
Photo courtesy of Shutterstock user 3Art
This story was updated at 8:42 a.m. PST to reflect Lehman’s position on the class action suit.