Ever wonder how a site like Mozilla.org could display a constantly updating download counter when it released Firefox 4 in early 2011, showing exactly how many copies of the popular browser had been retrieved by users? SQLstream CEO Damian Black explained at GigaOM’s Structure conference in San Francisco Thursday that the secret is dataflow — the parallel processing of data.
Dataflow was invented in the ’70s, when computer scientists first made advances in parallel computing. “Parallel computing in the ’70s was ahead of its time,” said Black, but the technology has made huge advances since, and is now ready for real-world applications. “Dataflow has finally come of age,” said Black.
But what exactly is dataflow, and how does it differ from other types of data processing, such as Hadoop MapReduce? The latter breaks streams of data into large chunks and then processes one chunk after another. “That leads to quite a bit of latency,” Black explained. The relational streaming approach favored by Black’s company, SQLstream, instead processes data as it arrives. “As soon as data comes in, it will flow out with minimal latency,” he said.
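To make the latency difference concrete, here is a minimal Python sketch of the two styles applied to a download counter. It is an illustration only, not SQLstream's or Hadoop's actual API: the function names, the chunk size, and the sample data are all hypothetical.

```python
from typing import Iterable, Iterator, List


def batch_counts(events: Iterable[int], chunk_size: int) -> Iterator[int]:
    """Emit a running total only after each full chunk has been collected."""
    total = 0
    chunk: List[int] = []
    for event in events:
        chunk.append(event)
        if len(chunk) == chunk_size:
            total += sum(chunk)
            chunk.clear()
            yield total
    if chunk:  # flush the final, partial chunk
        total += sum(chunk)
        yield total


def streaming_counts(events: Iterable[int]) -> Iterator[int]:
    """Emit an updated running total as soon as each event arrives."""
    total = 0
    for event in events:
        total += event
        yield total


# Hypothetical input: one entry per completed download.
downloads = [1, 1, 1, 1, 1, 1, 1]

batch_result = list(batch_counts(downloads, chunk_size=3))
stream_result = list(streaming_counts(downloads))

print(batch_result)   # [3, 6, 7]
print(stream_result)  # [1, 2, 3, 4, 5, 6, 7]
```

In the batch version the counter only moves when a chunk is complete, so a viewer sees it jump from 3 to 6 to 7. In the streaming version every download is reflected immediately, which is the behavior a live counter needs.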
However, Black reminded his audience at Structure that the two approaches are not mutually exclusive. Mozilla’s Glow counter used Hadoop for batch-based processing of historical data and long-term analysis, and at the same time relied on real-time dataflow processing to deliver that neat, constantly updating counter.