As with most things Twitter, fighting spam messages is not as easy as it is in other parts of our online lives. The traditional machine learning models and other techniques used to learn what spam looks like and classify messages don’t always work when you’re dealing with content that users expect to see in real time. However, Twitter has developed a system called BotMaker to address its unique situation — a system the company claims has resulted in a 40 percent reduction in spam since it was rolled out.
Engineer Raghav Jeyaraman explained BotMaker in a blog post on Thursday. As with many other systems in place among web companies, Twitter included, the trick to BotMaker is breaking the work into real-time, near-real-time and batch jobs. Essentially, a tool called Scarecrow tries to stop spam messages before they’re written to Twitter, by spotting problem account names or URLs, for example. Next, a tool called Sniper constantly scours written messages looking for things Scarecrow missed, possibly because Scarecrow didn’t have enough time to analyze certain features. Finally, batch jobs periodically analyze large amounts of offline data in order to uncover long-term behavior patterns that can help make the online models smarter.
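To make the three-tier split concrete, here is a minimal Python sketch of how real-time, near-real-time and batch checks could fit together. It is an illustration only, not Twitter’s code: the class and method names (`SpamPipeline`, `write_path_check`, `near_real_time_sweep`, `batch_job`), the blocklist rules and the repeated-text threshold are all assumptions made up for this example.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Message:
    account: str
    text: str


@dataclass
class SpamPipeline:
    # Hypothetical rule state; the batch tier updates bad_accounts.
    bad_accounts: set = field(default_factory=set)
    bad_url_fragments: set = field(default_factory=set)
    written: list = field(default_factory=list)   # messages that were stored
    history: list = field(default_factory=list)   # offline log of every attempt

    def _looks_like_spam(self, msg):
        # Cheap checks only: known-bad account or known-bad URL fragment.
        if msg.account in self.bad_accounts:
            return True
        return any(frag in msg.text for frag in self.bad_url_fragments)

    def write_path_check(self, msg):
        """Real-time tier (Scarecrow-like): decide before the message is stored."""
        self.history.append(msg)
        if self._looks_like_spam(msg):
            return False
        self.written.append(msg)
        return True

    def near_real_time_sweep(self):
        """Near-real-time tier (Sniper-like): re-scan stored messages
        against the current rules and remove anything that now matches."""
        removed = [m for m in self.written if self._looks_like_spam(m)]
        self.written = [m for m in self.written if not self._looks_like_spam(m)]
        return removed

    def batch_job(self, threshold=3):
        """Batch tier: mine the offline log for a long-term pattern.
        Assumed heuristic: an account repeating one text `threshold` times."""
        counts = Counter((m.account, m.text) for m in self.history)
        for (account, _text), n in counts.items():
            if n >= threshold:
                self.bad_accounts.add(account)
```

In this sketch, an account that posts the same text three times passes the write-path check each time, because no rule matches yet. Once `batch_job()` has run, the next `near_real_time_sweep()` removes those stored messages, and later writes from that account are rejected up front.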
Aside from the 40 percent overall spam reduction Twitter has seen from BotMaker, Jeyaraman notes that the ability to detect spam in the write path, before a message is ever stored, has been particularly beneficial.
This is not the first attempt Twitter has made to combat spam using machine learning. Twitter teamed up with University of California, Berkeley, researchers in 2012 to develop a system that can detect spambots based on characteristics such as email addresses or the time it takes them to fill out a registration page, although it’s not clear whether BotMaker uses techniques from that research. One of the researchers, Chris Grier, told Gigaom last year that while the resulting algorithm had been used to periodically purge Twitter’s rolls of bots, it could also be turned into an online system that could spot spam accounts in real time.