It’s game on in the AI-on-a-chip race. Alongside Nvidia’s successes turning Graphics Processing Units into massively performant compute devices (culminating in last year’s release of the ‘Volta’ V100 GPU), we have ARM releasing its ‘Project Trillium’ machine learning processor on Valentine’s Day and Intel making noises around bringing the fruits of its Nervana acquisition to market, currently at sample stage. Microsoft with Catapult, Google with its TPU — if you haven’t got some silicon AI going on at the moment, you are missing out. So, what’s going on?
We have certainly come a long way since John von Neumann first introduced the register-based computer processing architecture. That simple design (here’s a nice picture from Princeton), which fires data and instructions into an Arithmetic Logic Unit (ALU), still sits at the heart of what we call CPU cores today, though with their pre-fetching and other layers of smartness, these are a souped-up, drag-racing version of von Neumann’s original.
Three things about this bling-covered workhorse of a design. First, it was planned to work really well for a certain type of maths (the clue’s in the ‘A’ of the ALU), which is great for arithmetic, but less great for other kinds of maths, such as that requiring floating-point operations.
This brings us to the second, based on a principle named after Alonzo Church and Alan Turing: that anything you can compute with one computer of any type, you can program with another. Don’t ask me to explain the proof, but the result is that you can do any kind of processing on an ALU; it just might take a bit longer than on a processor conceived for the purpose.
There’s a third point: it’s all maths, really. We talk about computing in terms of data, insight, algorithms, programming, processing or whatever are the latest buzzwords of the day, but behind it all, deep in the bowels of any piece of silicon, are chunks of electronics that allow us to do something mathematical. Like add, or subtract, or multiply. It’s a corollary of the Church-Turing principle that anything you want to program, you can convert into a series of mathematical formulae to be calculated.
Why (oh why) is this relevant? Because the name of the game is efficiency. Just because a CPU can process anything we throw at it, it won’t always be the best tool for the job. You can go to school on a tractor but it will be neither fast nor comfortable. We have CPUs as the de facto mechanism because the cost of fabrication has traditionally been so great that we have gone for a one-size-fits-all solution. The downside, however many bells and whistles we have attached to them, is that CPUs will have an overhead when they try to do things they weren’t designed for.
Traditionally, the answer has been to create different processors. In 1987, for example, Intel released the 80387 as a floating-point processing companion to its rather popular 386 CPU: while this was later incorporated on the same piece of silicon (and exists within today’s cores), it’s still a separate processing capability. It’s also fair to say that the Graphics Processing Unit, designed to display information on a screen and therefore geared up around the mathematics of symbol processing, was never conceived for use beyond graphics. But the fact was, and is, that it can do certain maths much more efficiently than a CPU. Ergo, Nvidia’s dramatic recent success.
It’s only more recently that anyone beyond a core (sorry) of organisations has had the ability to create silicon. The costs are astonishingly high, largely because the margins for error are astonishingly small: this meant that while engineers may have desired to create specialized hardware back in the 1980s, it was not possible to beat general-purpose machines at a workable cost (“most of the training times are actually slower or moderately faster than on a serial workstation,” says this architectural survey). But over the decades (and no doubt thanks to computers), high-tech manufacturing has become more affordable: the same logic that has given us personalised Coca-Cola and the orange Kit Kat also acts in the favour of those thinking that it would be nice to make their own computer chips.
It’s actually quite fascinating (to be fair, I once worked in chip design, but I’m sure it has a broader appeal) to peruse the paper released about Google’s AI-oriented Tensor Processing Unit (TPU). Here’s a key paragraph, which says, in other words, that the elements of the processor were ‘simply’ geared up around the specific needs of neural networks (NNs):
“The TPU succeeded because of the large—but not too large—matrix multiply unit; the substantial software-controlled on-chip memory; the ability to run whole inference models to reduce dependence on host CPU; a single-threaded, deterministic execution model that proved to be a good match to 99th-percentile response time limits; enough flexibility to match the NNs of 2017 as well as of 2013; the omission of general-purpose features that enabled a small and low power die despite the larger datapath and memory; the use of 8-bit integers by the quantized applications; and that applications were written using TensorFlow, which made it easy to port them to the TPU at high-performance rather than them having to be rewritten to run well on the very different TPU hardware.”
The result is that the chip can scale to workloads far beyond other types of hardware. In much the same way as the 80387, it’s designed to do one thing well, and it does it very well indeed. Back to the Church-Turing principle, general-purpose processors could do so as well, but will inevitably be slower. So, if you can afford to make your own, more specific chip, why wouldn’t you? Indeed, the paper suggests that the best is still to come: “We expect that many will build successors that will raise the bar even higher.”
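For a feel of what “the use of 8-bit integers by the quantized applications” means in practice, here is a rough Python sketch of a quantized matrix multiply. To be clear, this is not Google’s implementation: the symmetric scaling scheme and the function names (`quantize`, `int_matmul`, `quantized_matmul`) are assumptions made purely for illustration.

```python
def quantize(matrix, scale):
    """Map floats to signed 8-bit integers: round(x / scale), clamped to [-128, 127]."""
    return [[max(-128, min(127, round(x / scale))) for x in row] for row in matrix]


def int_matmul(a, b):
    """Matrix multiply using integer arithmetic only (sums are kept at full width)."""
    columns = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in columns] for row in a]


def quantized_matmul(a, b, scale_a, scale_b):
    """Quantize both inputs, multiply as integers, then rescale back to floats."""
    qa = quantize(a, scale_a)
    qb = quantize(b, scale_b)
    acc = int_matmul(qa, qb)
    return [[v * scale_a * scale_b for v in row] for row in acc]


weights = [[0.5, -1.0], [1.5, 2.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
print(quantized_matmul(weights, identity, 0.5, 1.0))
```

The heavy lifting in the middle is all small-integer multiply and add, which is exactly the kind of maths a dedicated matrix unit can be built around; the floating-point work is pushed to the edges, at the cost of some rounding error for values that do not sit neatly on the chosen scale.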
Ultimately, it’s easier to think of computer chips as combinations of task-specific modules, each designed around doing a certain kind of maths. We have now arrived at a point where the door is opening for those who want to design their own modules, or architect them into processors aimed at a specific purpose. So this isn’t about speed but architecture. In much the same way as home-grown apps transformed the data management industry, so we can do the same with chips.
We shouldn’t be surprised that so many players are getting in on the game, nor that significant performance hikes are being seen; more important is the likely impact of diversification, as the bar falls still further on chip design. In much the same way as apps transformed the mobile device industry, we might be about to see the same with computer chips, particularly for so-called ‘edge’ devices appearing in homes, offices and factories. This industry may be decades old but the real game may only be just starting, leading to advances we have not yet even considered.