When community-friendly web services get big enough, they all hit the same problem: spammy comments and other kinds of suspicious behavior. The bigger you are, the more of a target you become – and the more difficult it is to manage abusive visitors manually.
SoundCloud, with its 200 million monthly visitors, is no exception, so it built itself a tool for combating such problems. The software is called Sketchy, and the Berlin-based audio platform company has just open-sourced it for the benefit of other up-and-coming web outfits.
“Every email service has its own spam-fighting system, and every web company that reaches the scale of SoundCloud has to solve this problem one way or another,” SoundCloud developer evangelist Erik Michaels-Ober told me. “What we’re trying to do by open-sourcing it is to make it free and customizable so that everyone can use it to solve this problem for once and for all, so companies don’t have to build their own.”
The name “Sketchy” points to the fact that this isn’t just about spammy comments, as Michaels-Ober explained: “If you do anything over a certain threshold, it turns into spam. If you’re following people at a really high threshold, that can be spam. It’s just general sketchy behavior.”
Sketchy isn’t a straight-up blocking tool. Well-meaning users can sometimes give the appearance of spammy behavior, but that doesn’t mean they’re trying to promote dodgy services – they may just be really enthusiastic.
Instead, the tool detects what might be malicious behavior, then reports it to SoundCloud’s human community management team. That team will usually give the user a first warning, and if the user persists, Sketchy will “make it so those spammers are shouting into the wind,” as Michaels-Ober put it. The team also feeds back into Sketchy to help adjust the system’s thresholds.
“We can set thresholds and we also have machine learning algorithms that can detect behavior that’s out of the norm,” he said. “But again there’s a little bit of secret sauce and tuning. We obviously have users who exceed our expectations, so … we’re constantly tuning these things.”
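To make the threshold idea concrete, here is a minimal sketch in Python of how a rate-based check might flag a user for human review. It is not Sketchy's actual code: the `ActionRateFlagger` class, its parameters and the example numbers are all hypothetical, and the machine learning side Michaels-Ober mentions is not modeled at all.

```python
from collections import defaultdict, deque


class ActionRateFlagger:
    """Hypothetical helper: flags users who repeat an action too often."""

    def __init__(self, threshold, window_seconds):
        self.threshold = threshold            # max actions allowed per window
        self.window_seconds = window_seconds  # length of the sliding window
        self._events = defaultdict(deque)     # (user, action) -> timestamps

    def record(self, user_id, action, timestamp):
        """Record one action; return True if it should go to human review."""
        events = self._events[(user_id, action)]
        events.append(timestamp)
        # Forget events that have fallen out of the sliding window.
        while events and events[0] <= timestamp - self.window_seconds:
            events.popleft()
        return len(events) > self.threshold


# Example: more than 3 follows within 60 seconds is treated as sketchy.
flagger = ActionRateFlagger(threshold=3, window_seconds=60)
results = [flagger.record("user-1", "follow", t) for t in (0, 10, 20, 30, 100)]
```

In this sketch a flag only means "show this to a person," which matches the article's point that enthusiastic users can look like spammers; tuning would amount to changing `threshold` and `window_seconds` per action type.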
SoundCloud is no stranger to open-sourcing its internal tools – the company actually has more than 100 projects on GitHub. One of the most popular of those tools is the Large Hadron Migrator, the name of which is inspired by CERN’s particle smasher, the Large Hadron Collider.
The Large Hadron Migrator is a tool for performing large database migrations on the fly. When you’re at SoundCloud’s scale, your database may be too large to quickly add or rename columns, or add indexes, without briefly having to take the site offline. The Migrator is designed to fix that.
“It makes a copy of the database table it’s making a migration on, then migrates the copy, and then just instantaneously renames the copy to the original name, so that basically requires next to zero downtime,” Michaels-Ober said. “Then all of the data that had been inserted into the database since the copy was made gets back-filled to the original table and the copy. Once that’s done and we can verify the migration was successful, we can remove the original.”
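The copy, migrate, rename and back-fill sequence he describes can be sketched roughly as follows, using Python's built-in `sqlite3` module purely for illustration. This is not the Migrator's own code or API: the `migrate_by_copy` function, the `tracks` table and the assumption that every table has an `id` primary key are all hypothetical, and a single in-memory connection cannot reproduce the live traffic the real tool has to cope with.

```python
import sqlite3


def migrate_by_copy(conn, table, add_column_sql):
    """Hypothetical sketch: add a column by migrating a copy of the table."""
    copy, old = f"{table}_copy", f"{table}_old"
    columns = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    column_list = ", ".join(columns)

    # 1. Copy the table, then apply the schema change to the copy only.
    conn.execute(f"CREATE TABLE {copy} AS SELECT * FROM {table}")
    conn.execute(f"ALTER TABLE {copy} ADD COLUMN {add_column_sql}")

    # 2. Swap names so the migrated copy takes over the original name.
    conn.execute(f"ALTER TABLE {table} RENAME TO {old}")
    conn.execute(f"ALTER TABLE {copy} RENAME TO {table}")

    # 3. Back-fill rows that reached the old table after the copy was made
    #    (assumes an `id` column identifies each row).
    conn.execute(
        f"INSERT INTO {table} ({column_list}) "
        f"SELECT {column_list} FROM {old} "
        f"WHERE id NOT IN (SELECT id FROM {table})"
    )

    # 4. Remove the original once the migration has been checked.
    conn.execute(f"DROP TABLE {old}")
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tracks (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany("INSERT INTO tracks (title) VALUES (?)", [("a",), ("b",)])
migrate_by_copy(conn, "tracks", "plays INTEGER DEFAULT 0")
rows = conn.execute("SELECT id, title, plays FROM tracks ORDER BY id").fetchall()
```

The point of the pattern is that the slow work happens on a table nobody is reading from, so the only moment that touches the live table name is the rename. Note that `CREATE TABLE ... AS SELECT` in SQLite does not carry over constraints or indexes, which a production tool would have to handle.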
The Migrator has around 450 stars on GitHub and just over 40 forks. As Michaels-Ober noted, it’s not the most popular project, but a number of large companies are using it. “It’s not a problem that everybody has, but it’s a problem enough people have,” he said.
SoundCloud also intends to open-source its homegrown, Heroku-style platform-as-a-service. It’s called Bazooka, but that’s about all I can tell you – the company is quite tight-lipped about it for now, even though it wants to open it up to others in the long term.