GPUs are widely used to accelerate AI computing, but are the limitations of GPU technology slowing down innovation in the development of neural networks?
Graphcore (Bristol, UK) has developed a new type of processor for AI acceleration called the intelligence processing unit (IPU). Launched with VC backing in 2016, the company raised $200 million at its last funding round in December 2018, based on a company valuation of $1.7 billion, making Graphcore the only Western semiconductor “unicorn.” Investors include Dell, Bosch, BMW, Microsoft, and Samsung.
EETimes spoke to Nigel Toon, Graphcore CEO, about the company and its vision, the market for AI accelerators, and the future of AI.
EETimes: Is Graphcore selling IPU chips today? In what form?
We have a production product, we are shipping for revenue, and we’re working with a fairly limited number of early access customers at the moment.
Our main product today is a double-width, full-height 300 W PCI Express card that plugs into servers. There are connectors on top of the card that allow cards to be connected together. Each Graphcore C2 card has two of our Colossus IPU processor chips. The chip itself, the IPU processor, is the most complex processor chip that’s ever been built: it has just shy of 24 billion transistors on a single die, in 16 nm. Each chip delivers 125 teraFLOPS, and we can put eight of those cards into a standard 4U chassis and connect them together through IPU links. The processors can work together as a single processing element that delivers two petaFLOPS of compute, but in a different form from what exists in CPUs and GPUs, which provides a much more efficient processing platform for machine intelligence. These modules will go into servers for cloud computing, and potentially into autonomous vehicles as well.
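The two-petaFLOPS chassis figure follows from multiplying out the numbers quoted above. A minimal Python sketch of that arithmetic, using only the figures given in the interview (the variable names are illustrative, not Graphcore terminology):

```python
# Figures as quoted in the interview; names are illustrative only.
TFLOPS_PER_IPU = 125      # quoted per-chip throughput
IPUS_PER_CARD = 2         # each C2 card carries two Colossus IPUs
CARDS_PER_CHASSIS = 8     # eight cards in a standard 4U chassis

# Total IPU chips in one chassis.
ipus_per_chassis = IPUS_PER_CARD * CARDS_PER_CHASSIS

# Aggregate quoted throughput, in teraFLOPS and petaFLOPS.
chassis_tflops = ipus_per_chassis * TFLOPS_PER_IPU
chassis_pflops = chassis_tflops / 1000

print(f"{ipus_per_chassis} IPUs, {chassis_tflops} teraFLOPS, "
      f"{chassis_pflops} petaFLOPS per chassis")
```

This is a sum of quoted peak figures, so it says nothing about the throughput a real workload would sustain.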
EETimes: How does Graphcore address the challenges of running the deep-learning software stacks used in data centers?
Standard frameworks, like TensorFlow and PyTorch, have emerged over the last three or four years, alongside graph descriptors, like ONNX, that allow you to interchange between some of these frameworks. They allow developers to quickly design neural networks, but are fundamentally graph frameworks, that is, they describe a mathematical graph with operators and connections between the elements inside the graph.
We take the output from those high-level frameworks and feed that into a software layer that we call Poplar, our mapping and compiling tool, which takes high-level graphs and maps them to a full-compute graph that runs on the IPU processor. Each IPU processor has 1200 separate specialized cores, plus all the control operations and transcendental functions you need for machine learning. Each core runs up to six program threads. So, if you had 16 processors, that would be over 100,000 separate parallel programs running in one 4U box.
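The "over 100,000" figure can be checked from the quoted core and thread counts. A short Python sketch, assuming every core runs its maximum of six threads (the variable names are illustrative):

```python
# Figures as quoted in the interview; names are illustrative only.
CORES_PER_IPU = 1200      # separate specialized cores per IPU
THREADS_PER_CORE = 6      # "up to six" program threads; the maximum is assumed here
IPUS_PER_CHASSIS = 16     # processors in one 4U box

# Upper bound on concurrent program threads per IPU and per chassis.
threads_per_ipu = CORES_PER_IPU * THREADS_PER_CORE
threads_per_chassis = threads_per_ipu * IPUS_PER_CHASSIS

print(f"{threads_per_ipu} threads per IPU, "
      f"{threads_per_chassis} threads per chassis")
```

Because the thread count per core is quoted as a maximum, the result is an upper bound rather than a guaranteed level of parallelism.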
It’s that level of parallelization that allows you to manipulate models quickly and also do that manipulation in real time — which allows us to make significant progress on natural language processing, for example, or work on understanding video for autonomous vehicles. So, that much more parallel nature is very important.
[With Graphcore’s IPU], the whole machine learning model fits inside the processor. The processor has hundreds of megabytes of RAM that runs at the full speed of the processor, over 1.6 GHz, where the latency is hidden by the program threads. We’re able to manipulate the model much more quickly than with memory technologies like high bandwidth memory (HBM) in GPUs. They’ll give you 900 gigabytes per second of memory bandwidth; we have about 45 terabytes [per second] of memory bandwidth on a single IPU processor. So, with 16 of these in a 4U chassis, you’ve got massive amounts of memory bandwidth, all operating in parallel with thousands of program threads all operating on that, and that’s part of how we’re able to get the speedup for these kinds of machine intelligence jobs.
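To put the two bandwidth figures on the same scale, here is a rough Python sketch. It assumes the 45-terabyte figure is a per-second rate and uses decimal units (1 TB = 1,000 GB); both are assumptions, and the variable names are illustrative:

```python
# Figures as quoted in the interview; names are illustrative only.
HBM_GB_PER_S = 900        # quoted GPU high bandwidth memory rate
IPU_TB_PER_S = 45         # quoted on-chip rate, assumed to be per second
IPUS_PER_CHASSIS = 16
GB_PER_TB = 1000          # decimal units assumed

# Convert the IPU figure to GB/s so the two rates are comparable.
ipu_gb_per_s = IPU_TB_PER_S * GB_PER_TB
bandwidth_ratio = ipu_gb_per_s / HBM_GB_PER_S

# Sum of quoted on-chip bandwidth across a full chassis.
chassis_tb_per_s = IPU_TB_PER_S * IPUS_PER_CHASSIS

print(f"One IPU: about {bandwidth_ratio:.0f}x the quoted HBM bandwidth")
print(f"16 IPUs: {chassis_tb_per_s} TB/s aggregate on-chip bandwidth")
```

The comparison is between on-chip memory and external HBM, so it describes bandwidth only; capacity differs greatly between the two.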
EETimes: How does the performance of Graphcore’s IPU compare to leading GPUs on the market?
It depends on the task. If you’re doing feed-forward convolutional neural networks used for classification of static images, GPUs do that quite well. We would be able to offer a performance advantage of two or three, sometimes five times.
With much more complex models, those that have data passing through and then feeding back to try and understand context (conversations, for example), you’re passing the data through a number of times and you need to do that very quickly. Because all of the model is held inside our processor, on applications like that, we’re much faster than a GPU, maybe ten, twenty or fifty times faster.
EETimes: Is Graphcore planning to submit results to MLPerf or any other benchmark?
We will. At the moment, we’re focused on working with early access customers, helping them solve real problems, but we will go back and do some of the benchmarks.
The challenge with benchmarks is that they are backward-looking; they are typically focused on standard convolutional neural networks, and the industry has moved on quite a lot from that. Although benchmarks are a helpful relative measure, it’s also important to see real performance on real applications.
New innovations are happening so fast, it’s hard to be sure you’re not comparing apples and oranges. If you’re working with standard frameworks, it’s pretty easy to try on different systems [for the purposes of comparison].
EETimes: Is the Graphcore IPU chip suitable for inference as well as training?
Yes, you can use the same IPU chip for inference as well as training. That was very important to us, from an architectural point of view, since as machine learning evolves, systems will be able to learn from experience.
The keys to inference performance are as follows: low latency and being able to work with small models, small batches, and trained models where you might be trying to introduce sparsity into the model. We can do all these things efficiently on the IPU. So, in that 4U chassis, where you’ve got 16 IPUs all working together to do training, we could have each of those IPUs running a separate inference task, controlled by a virtual machine running on a CPU. What you end up with is a piece of hardware that you can use for training. Then, once you’ve trained the models, you can deploy them, and as the models evolve and we start to want to learn from experience, the same hardware can be used to do that.
EETimes: How will Graphcore cultivate a following of software developers to rival what NVIDIA has done with CUDA?
[Graphcore’s mapping and compiler tool] Poplar fits in at the same point as CUDA, but it’s really a programming language, not a framework, describing the graph at a lower level.
In Poplar you could describe a new type of convolutional function or a new type of recurrent neural network layer, then call that as a library element in your high-level framework. We provide a complete range of all the high-level operators and library elements. We also provide lots of low-level operators that you can easily connect together to make new library elements. Alternatively, if you’re doing something completely innovative, you can create your own using the Poplar C++ environment.
We hope that people will share some of their innovations and that others will pick them up. If you look inside the Google TPU or an NVIDIA GPU, a lot of the library elements are closed, they’re black boxes and you can’t see how they’ve been built. Ours are all open, so people can modify them and extend them. We’re hoping to build a community of people doing that.
EETimes: Graphcore is up against some big names in this space: Google, Baidu, NVIDIA, and Intel, as well as data center giants Facebook and Alibaba, which are rumored to be developing their own chips. How will Graphcore compete against these companies? Will there even be a data center market for AI accelerators if the data center companies build their own?
My view is that there will be three main markets.
There will be a market for fairly simple, small accelerators, typically delivered as an IP core that goes inside a mobile phone; we know that some of the big phone manufacturers are already working on that. We are not doing products for that market.
There will be a market for [devices analogous to] ASICs. Take, as an example, a company with a very specific workload that has a lot of users (perhaps they run a big social network): they have an opportunity to create a very specific function and build it into a chip, and deploy that in their data center to speed that function up. These ASIC-type solutions will be a big market, but again, we’re not doing that.
What we are doing is a general-purpose processor that you can program to do a lot of different things, incredibly efficiently. If that technology is available in a cloud computing environment, it solves problems very easily, it’s versatile, easy to program, and gives you very efficient results … we think that is the technology that will win.
The fact that people are doing dedicated ASIC-type chips is almost proof that GPUs are not the answer. People need a more efficient, easy-to-use processor actually designed for machine intelligence and that’s what we’re producing. We think there is an opportunity for general purpose IPUs that’s going to be, by far, the largest market segment. We think we can drive the industry standard in that segment by having a more efficient solution designed from the ground up for these types of problems.
EETimes: It’s interesting that GPUs have become the market leader for AI acceleration despite not being originally designed for that purpose. Will they continue to dominate?
A GPU is a pretty good solution if all you are doing is basic feed-forward convolutional neural networks, but as the networks become more complex, people need a new solution; that’s why they’re playing with ASICs and FPGAs. All the innovators we spoke to said [using GPUs] is holding them back from new innovations. If you look at the types of models that people are working on, they are primarily working on forms of convolutional neural networks because recurrent neural networks and other kinds of structures, [such as] reinforcement learning, don’t map well to GPUs. Areas of research are being held back because there isn’t a good enough hardware platform, and that’s why we’re trying to bring [IPUs] to market.
EETimes: Will Graphcore address enterprise markets, and if so, how will you differentiate yourselves from competitors in that space?
Enterprise is interesting, [particularly] companies trying to do real deep learning in the enterprise space — we’re interested in and focused on that. The problem is, how do you reach those customers? They are all over the world, in different vertical markets. It’s a difficult market to reach for a startup company. Our strategy has been pretty cynical there — we’ve built a close relationship with Dell. Dell is an investor in our company. By partnering with them, we get fabulous access to the market and can get our technology into the hands of those customers in a number of different forms — it might be a 4U all-in IPU server, or it could be a workstation with a single IPU PCI card inside, for example. There are lots of different options for how we can target that market and we’ve got a channel to do that as well.
EETimes: Congratulations on becoming the only Western semiconductor unicorn. With such a high valuation, how will Graphcore ensure investors get their ROI?
Having a high valuation for our company is great, because it’s a great validation of the business and it allows us to raise significant amounts of capital. We now have the firepower to grow incredibly quickly, which is important because this is an emerging market. It will play out in the next two or three years, and we’ll have to run incredibly fast to be the leading player in that time period.