Following a high-profile no-show at the AI Hardware Summit last month, AI accelerator startup Groq had some explaining to do. What caused them to pull out at such a late stage?

“We had a customer priority, and we're very customer-focused here,” said Jonathan Ross, co-founder and CEO of Groq.

Brushing off EETimes’ suggestion that sending someone to present the company pitch deck might have been less of a PR disaster, Ross was adamant that they made the right decision.

Jonathan Ross (Image: Groq)

“We have a saying: show, don't tell,” he said. “For the AI Hardware Summit we were working on a demo but we had to divert our resources to our customer, so we weren't going to be able to demo. We had the option of going up and talking about something that we weren't going to be able to demo in that time period, or withdraw. And so we decided to withdraw.”

“It worked out,” he insisted. “The customer was very happy.”

Ross was previously on the team that developed Google’s tensor processing unit (TPU), and many of Groq’s senior executives also have long histories at Google.

The secretive AI accelerator startup employs 70 people and has raised $67 million in funding to date, having recently closed a second round. As the company begins to emerge from stealth mode, EETimes spoke to its senior leadership team to find out more about the company’s offering.

Software-Defined Hardware
Groq’s unusual software-first approach started with building a prototype compiler, rather than with prototype hardware. The hardware architecture was then built around that compiler. The resulting TSP has a simplified hardware design, with all the execution planning happening in software. Software essentially orchestrates all the dataflow and timings required to make sure calculations happen without stalls, making latency and performance predictable.
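The scheme can be sketched in miniature: if every operation’s latency is fixed and known, a compiler can assign each one a start cycle, and the total run time falls out before anything executes. The op names and cycle counts below are invented for illustration; this is a toy, not Groq’s actual compiler.

```python
# Toy illustration of compile-time (static) scheduling: each op gets a fixed
# start cycle derived from known op latencies, so total latency is determined
# before anything runs. Op names and latencies are invented.

OP_LATENCY = {"load": 4, "matmul": 16, "add": 1, "store": 4}  # cycles (made up)

def schedule(ops):
    """Assign each op a start cycle in order; no runtime arbitration needed."""
    cycle = 0
    plan = []
    for op in ops:
        plan.append((cycle, op))
        cycle += OP_LATENCY[op]   # next op starts exactly when this one finishes
    return plan, cycle            # cycle == total latency, known at compile time

plan, total = schedule(["load", "matmul", "add", "store"])
for start, op in plan:
    print(f"cycle {start:3d}: {op}")
print(f"total latency: {total} cycles")  # 25 cycles, every run, no variance
```

Because nothing is arbitrated at run time, executing the same plan twice gives identical timing, which is the determinism the rest of the article keeps returning to.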

“We put an amount of control into the compiler’s hands, which allowed us to make some trade-offs at the hardware-software interface... to provide deterministic execution,” explained Dennis Abts, Groq’s Lead Architect.

Abts, a 12-year Google data centre veteran who also spent more than a decade as a hardware architect at Cray, explained that the compiler has control over both the execution and the power profile, so both the precise, repeatable execution time and the power consumption for running each model can be accurately predicted at compile time.

“We think this gives us a leg up on ease-of-use,” said Abts. Knowing the execution time and power envelope at compile time means "you can do rapid experimentation from the model development standpoint and then deploy it in a way that you know confidently what performance you're going to get.”
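The same idea extends to power: if each op has a fixed cycle and energy cost, a compiled model’s run time and energy envelope are just sums over its op list. A hedged sketch, with entirely invented cost figures:

```python
# Toy illustration: with fixed, known per-op costs, a compiled model's latency
# and energy can be summed up front. All figures here are hypothetical.

OP_COST = {  # op: (cycles, millijoules) -- invented numbers
    "load": (4, 0.2), "matmul": (16, 3.5), "add": (1, 0.1), "store": (4, 0.2),
}

def predict(ops, clock_ghz=1.0):
    cycles = sum(OP_COST[op][0] for op in ops)
    energy_mj = sum(OP_COST[op][1] for op in ops)
    latency_us = cycles / (clock_ghz * 1e3)  # cycles / (cycles per microsecond)
    return latency_us, energy_mj

lat, energy = predict(["load", "matmul", "matmul", "add", "store"])
print(f"predicted latency: {lat:.3f} us, energy: {energy:.1f} mJ")
```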

The compiler has complete control over the chip, both statically and dynamically.

“There’s no such thing as dynamically profiling your code because the static is exactly the same as the dynamic, and that has some very nice features to it,” he said.

Groq’s software-defined hardware approach provides deterministic operation and predictable latencies (Image: Groq)

The first of these features is that the synchronisation step most architectures require between computation and communication of results has been eliminated. Overhead-free synchronisation means models can be deployed at scale without incurring tail latency, which according to Abts is a major problem in data centres. Groq’s chip allows all latencies to be known up front, at compile time.
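Why tail latency matters at scale can be seen with a quick simulation (ours, not Groq’s): a request that fans out to N shards waits for the slowest one, so even small per-shard jitter inflates the 99th-percentile latency as N grows, while fixed-latency shards keep the tail flat. All numbers below are invented.

```python
# Toy tail-latency simulation: a fan-out request is only as fast as its
# slowest shard. Jitter is modelled as an exponential delay; all figures
# are invented for illustration.
import random

random.seed(0)
BASE_US = 100.0  # nominal shard latency in microseconds (invented)

def request_latency(n_shards, jitter_us):
    # the request completes when the slowest shard responds
    return max(
        BASE_US + random.expovariate(1.0 / jitter_us) if jitter_us else BASE_US
        for _ in range(n_shards)
    )

def p99(samples):
    return sorted(samples)[int(len(samples) * 0.99)]

for n in (1, 16, 256):
    jittery = p99([request_latency(n, 10.0) for _ in range(2000)])
    fixed = p99([request_latency(n, 0.0) for _ in range(2000)])
    print(f"{n:4d} shards: p99 with jitter {jittery:6.1f} us, deterministic {fixed:.1f} us")
```

With zero jitter the p99 stays pinned at the nominal latency no matter how wide the fan-out; with jitter it climbs steadily as more shards are involved.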

“We’ve also avoided a lot of complex hardware that would go into the front end — speculative execution, branch prediction — a lot of complicated control structures can simply be factored out,” he said. “We had multiple motivations for doing this, not least because aggressive speculation techniques can be weaponised, such as by [hardware security vulnerabilities] Spectre and Meltdown.”

It’s Not an FPGA
While the concept of software-defined hardware combined with deterministic operation might put one in mind of an FPGA, Ross stressed that the TSP is definitely not an FPGA. EETimes also wondered whether there may be some crossover between Groq’s approach and SambaNova’s “software defined hardware” concept. Full details of SambaNova’s offering have yet to emerge (the company is still in stealth mode), but it has said it is developing a reconfigurable dataflow architecture and working on a language for programming accelerators.

“This is totally new,” Ross said. “It’s programmable in a way that... imagine an FPGA that can be reconfigured every cycle — that would be similar to how our chip works. But it’s not an FPGA, there are no lookup tables... you can completely change what the chip is doing on a cycle by cycle basis. You know exactly what every part of the chip is doing at any one moment, you have that level of control, and it’s very fine-grained, but it’s not an FPGA. It’s not like what other people are doing.”
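Ross’s “reconfigured every cycle” description can be caricatured as a per-cycle instruction word that tells each functional unit exactly what to do on that cycle, with no lookup tables and no runtime arbitration. The two-unit machine below is purely illustrative and is not Groq’s ISA.

```python
# Toy model of per-cycle configuration: each cycle, an instruction word tells
# each functional unit (here just "mem" and "alu") what to do. Entirely
# illustrative -- not Groq's actual instruction set.
program = [
    {"mem": ("load", "a"),  "alu": None},                    # cycle 0
    {"mem": ("load", "b"),  "alu": None},                    # cycle 1
    {"mem": None,           "alu": ("add", "a", "b", "c")},  # cycle 2
    {"mem": ("store", "c"), "alu": None},                    # cycle 3
]

memory = {"a": 3, "b": 4}
regs, out = {}, {}

for cycle, word in enumerate(program):
    mem_op, alu_op = word["mem"], word["alu"]
    if mem_op:
        kind, name = mem_op
        if kind == "load":
            regs[name] = memory[name]
        else:  # store
            out[name] = regs[name]
    if alu_op:
        _, x, y, dst = alu_op
        regs[dst] = regs[x] + regs[y]

print(out)  # {'c': 7}
```

Since the program dictates every unit’s activity on every cycle, the state of the whole machine at any moment is knowable in advance, which is the fine-grained control Ross describes.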

“[Groq’s] approach does look very similar to regular FPGAs and to the SambaNova approach,” said Tirias Research Principal Analyst, Kevin Krewell.

Based on what Groq has shared so far, Krewell expressed a couple of concerns.

“Because the design appears to be very fine-grained, I'm still concerned about the efficiency of compute per square millimetre,” he said. “There are a number of challenges — the design is statically compiled, which means you are only processing one type of machine learning algorithm at a time. Some tasks require different machine learning models depending on the workload, such as recommendations, image processing, speech processing, etc. Groq doesn't say how long it takes to reconfigure the chip for a different algorithm.”

First Silicon
Groq’s TSP combines a large number of arithmetic logic units (ALUs) with a large amount of on-chip memory, with adequate bandwidth (>60 TB/s) to feed those ALUs.

Slides seen by EETimes (that the company later declined to share) showed a die photograph with three columns of ALUs interleaved with two large strips of memory (ALUs comprised approximately 40% of the chip area, memory closer to 50%). Figures on the Groq website reveal the TSP is capable of 400 TOPS, but the company would not specify the conditions under which this figure could be achieved, saying only that it was peak performance for INT8 calculations. Incidentally, the TSP can handle both integer and floating point arithmetic, but the company is firmly focused on inference for the time being.
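A back-of-envelope check (our arithmetic, not Groq’s) suggests why that bandwidth figure matters: 400 INT8 TOPS corresponds to 200 trillion multiply-accumulates per second, and feeding each MAC two fresh one-byte operands with no reuse would demand 400 TB/s, so the quoted >60 TB/s implies an operand reuse factor of roughly 7x within the ALU arrays.

```python
# Back-of-envelope check (our arithmetic, not Groq's): how much operand
# bandwidth would 400 INT8 TOPS need if no operand were ever reused?
tops = 400e12             # peak INT8 ops/s (Groq's quoted figure)
macs = tops / 2           # 1 MAC = 1 multiply + 1 add = 2 ops
bytes_per_mac = 2         # two fresh 1-byte operands per MAC, worst case
naive_bw = macs * bytes_per_mac   # bytes/s demanded with zero reuse
onchip_bw = 60e12                 # quoted on-chip bandwidth, bytes/s
print(f"no-reuse demand: {naive_bw / 1e12:.0f} TB/s")
print(f"implied operand reuse: ~{naive_bw / onchip_bw:.1f}x")
```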

Groq employs 70 people full-time and the company has raised $67 million in funding to date (Image: Groq)

“We have silicon back, and we had power-on on the first day,” said Groq’s VP Engineering, Michelle Tomasko. “We had programs running [on the chip] in the first week and we were sampling to customers six weeks later... now we’re achieving the Holy Grail, which is taking A0 silicon into production.”

Tomasko detailed how the TSP’s determinism will benefit customers’ system validation times, adding that the ability to deliver the compiler well before the silicon means customers can get ahead on targeting their models to the TSP’s architecture.

“By the time they get their hardware, they can already have the content ready to go,” she said. “The determinism allowed us to run [our own] verification tests pre-silicon... in a traditional architecture, there is a lot of complexity, a lot of different control systems, so there are race conditions, boundary conditions, things that you need to shake out. When we did the bring up, we knew that the deterministic cores were going to work, and they did.”

Tomasko spent 3 years at Google before joining Groq, and prior to that worked at Nvidia.

“Nvidia have so much man (and woman) power and they can brute force it, and overtake architectures pretty easily once they have the target they can go after,” she said. “But the fact that we can execute so quickly and nimbly with this kind of architecture is the key that will allow us to stay ahead of a beast like Nvidia.”

Groq is targeting inference applications in data centres and autonomous vehicles. Chief Operating Officer Adrian Mendes said that hyperscalers were attracted to the TSP’s lack of tail latency, which helped with scaling out across large data centres, while the ability to work on code up front was interesting to enterprise data centres and tier 1 OEMs. Low latencies were also an advantage in the financial sector for high-frequency trading applications.

Latencies “in the microseconds” combined with total determinism suits the TSP to safety critical applications such as autonomous vehicles, said Mendes. 

“We have shipped hardware to a handful of customers already, starting in August,” Mendes said. “Our hardware is in customer data centres right now... they are running programs on those boards and are getting very good results from it.”

Groq’s TSP is sampling now on a PCIe board.