Benchmarking systems that use different interfaces, encodings, technology, and approaches has emerged as a hot topic in neuromorphic engineering over the past few years.
Anyone building a new technology understands that success partly depends on showing value added—demonstrating that your technology is better than your competitors’. Only in this way can innovators attract investors and satisfy managers. When you are making a smaller, faster, lighter, more efficient replacement for something that already exists, this is easy—at least in principle—but it’s much more difficult when you are trying to create something genuinely new and different.
Neuromorphic computing is among the fields in which engineers are attempting something genuinely new, and the lack of easy comparisons between different systems—neuromorphic and otherwise—can be a problem.
Part of the issue has to do with the complexity of the field. Neuromorphic technology is brain-inspired, but as discussed in an earlier article,1 there are many ways to implement that inspiration at the hardware level: analog or digital, spikes or not, continuous or discrete time, virtual or direct connections between neurons.
There are also competing goals and emphases within groups. Some wish to simulate biology, some focus on energy efficiency, others want to simulate human-like intelligence, and still others simply seek practical solutions to everyday machine-learning problems.
How to benchmark systems that use different interfaces, encodings, technology, and approaches has emerged as a hot topic in neuromorphic engineering over the past few years. Since 2016, there have been many attempts to compare systems across different applications, or to run different algorithms, networks, or sets of both.
Three studies on the subject were published just this year.
It’s worth considering the larger picture to understand why benchmarking remains a hard problem.
Identifying the obstacles
Consider trying to run a learning or recognition task, then comparing how it performs on competing experimental systems. First, you have to choose a task that is at least achievable on all the systems you’re comparing — despite the fact they may not have been designed with that task in mind. You also have to consider whether all steps in the process — from loading up the task to running it and getting an output — have been fully optimized. If not — and you wouldn’t necessarily expect them to be in an emerging technology — you will have to break down your metrics so they measure only the relevant systems, not the infrastructure (temporarily) needed to support them.
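The bookkeeping this requires can be sketched in a few lines. The harness below is a hypothetical illustration, not any real platform's API: each stage is timed separately so that loading and compilation overhead is not charged against the system under test, and the three stage functions are placeholders for whatever an actual platform provides.

```python
import time
from contextlib import contextmanager

# Record wall-clock time per stage so infrastructure overhead can be
# reported separately from the core metric.
timings = {}

@contextmanager
def phase(name):
    start = time.perf_counter()
    yield
    timings[name] = time.perf_counter() - start

# Placeholder stages (assumptions, not a real platform's interface).
def load_model():
    time.sleep(0.01)       # stand-in for loading the task

def compile_for_target():
    time.sleep(0.02)       # stand-in for (possibly unoptimized) compilation

def run_inference():
    time.sleep(0.005)      # stand-in for the actual workload

with phase("load"):
    load_model()
with phase("compile"):
    compile_for_target()
with phase("run"):
    run_inference()

# Only "run" measures the system itself; "load" and "compile" measure the
# (temporarily) necessary supporting infrastructure.
print({k: round(v, 3) for k, v in timings.items()})
```

Separating the phases this way also makes it explicit when a platform simply cannot expose one of them, which, as noted below, is not always possible.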
Of course, to do this work at all, you have to acquire the systems you want to compare — not always easy when only a small number have been produced. You also have to hope that the hardware has been built in such a way that you can extract the information you need to complete your study, such as the speed or power consumption at different stages.
Unfortunately, this will not always be the case.
Much as it would make things easier, to date, there is no equivalent of the floating-point operations per second (flops) metric. Engineers have attempted to use multiply-and-accumulate operations, but MACs, while somewhat applicable to deep learning, do not reflect the complexity of neuromorphic engineering. Nor do synaptic operations. Why? Because there are too many ways to get the job done, too many learning rules that can be used, too many encoding methods, too many synapse and neuron functions.
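To see why a single operation count fails to transfer, consider a back-of-envelope sketch. The functions and numbers here are illustrative assumptions, not standard metrics: a dense layer performs a fixed number of MACs per forward pass, while an event-driven spiking layer's synaptic-operation count depends on spike rate and timesteps, quantities that vary with encoding, learning rule, and neuron model.

```python
def dense_macs(n_in, n_out):
    # Conventional ANN layer: every input feeds every output each pass,
    # so the MAC count is fixed by the layer shape alone.
    return n_in * n_out

def spiking_synops(n_in, n_out, spike_rate, timesteps):
    # Event-driven layer: only spikes trigger synaptic updates, so the
    # count scales with activity, not just shape. (Simplified model;
    # real encodings and learning rules complicate this further.)
    return int(n_in * spike_rate * timesteps) * n_out

macs = dense_macs(256, 128)                   # fixed cost per pass
synops = spiking_synops(256, 128, 0.05, 10)   # depends on sparsity
print(macs, synops)
```

Two systems running "the same" network can thus report wildly different operation counts, which is one reason neither MACs nor synaptic operations serve as a neuromorphic flops.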
Even if that weren’t the case, it’s well known that, all things being equal, a compiler must be tailored to the system and to the software it is meant to run. If it isn’t, then even the best system will perform poorly.
To get some clarity, it helps to look beyond neuromorphic technology. Robin Blume-Kohout of Sandia National Laboratories is interested in benchmarking quantum computers. In a 2020 talk2 titled “Not All Benchmarks Are Created Equal,” he discusses the difficulty of relying on benchmarks for any technology still at a very early development stage.
In a rehearsal of that argument that appears in a 2018 technical report,3 Blume-Kohout states: “Today’s most advanced quantum processors are like infants. Metrics and benchmarks that are useful for adult humans (e.g., IQ or SAT scores) are blatantly inapplicable to an infant [whose] whole purpose is to grow into an adult. Monitoring its progress requires skills and knowledge totally different from what’s needed to evaluate an adult. Children and immature technologies both progress counterintuitively and sometimes even appear to regress (losing baby teeth or entering adolescence).”
Neuromorphic engineering is at a more advanced stage than quantum computing; practical systems exist, albeit mostly on a small scale. But Blume-Kohout’s point remains valid for an adolescent technology. Just as over-testing children at school can make them proficient at passing tests but poor at independent study, using the wrong benchmarks at this formative stage can skew the development of neuromorphic engineering in the wrong direction.
Leveling the playing field
The report also points to another, much earlier, paper4 that also warned of the dangers of bad benchmarking. Grappling with the best ways to evaluate computers in the burgeoning digital computer industry of the 1980s, Jack Dongarra of Argonne National Laboratory and his co-authors write: “The value of a computer depends on the context in which it is used, and that context varies by application, by workload, and in terms of time. An evaluation that is valid for one site may not be good for another, and an evaluation that is valid at one time may not hold true just a short time later.”
Then there’s this warning from the same paper: “Although benchmarks are essential in performance evaluation, simple-minded application of them can produce misleading results. In fact, bad benchmarking can be worse than no benchmarking at all.”
In exploring why comparing “like with like” is often so hard, we’ve seen that, in practice, researchers tend to choose a benchmark metric that suits their particular technology, then treat the result as the only figure of merit that matters. Of course, in the absence of any alternative, it’s hard to criticize that approach.
There is another option, however, and it has become an increasing trend over the past few years: Enlist evaluators who are not directly involved in the technology development itself. Three papers published this year describe efforts to do just that. Although they have a lot to commend them, they also illustrate just how difficult it is to get this right.
Apples and oranges
In a paper5 issued by Oak Ridge National Laboratory, the authors selected different machine-learning tasks that neuromorphic simulators should be able to run. They then measured performance as well as how much power the tasks consumed. The chosen tasks were varied and therefore should have provided a well-rounded view of the systems. Tested were NEST, Brian, Nengo, and BindsNET, all of which are used to design and simulate different kinds of networks. They were run on a PC and accelerated using various methods, including GPUs (which one of the platforms supported) but not boards with neuromorphic hardware (which some of the others could have used). For practical reasons, runtime was limited to 15 minutes.
According to co-author Catherine Schuman, the hardware choice reflected the investigators’ desire to ensure the study was relevant to those without advanced equipment. That’s a reasonable goal, even if optimizing neuromorphic simulators on classical hardware could be seen as a bit of a contradiction. Completing the study in weeks rather than months (hence, the runtime limit) also seems like an obvious decision. However, the result was that only two-fifths of the machines completed some of the tasks, leaving big gaps in the data.
An experiment6 on robotic path planning from FZI Research Center for Information Technology in Karlsruhe, Germany, confronted a different problem. The SpiNNaker system from the University of Manchester was chosen as a representative neuromorphic technology, then compared with a system using Nvidia’s Jetson boards, designed to accelerate machine learning. SpiNNaker was originally designed more as a simulator than as actual neuromorphic hardware (in contrast to SpiNNaker 2) and so fared poorly in terms of power efficiency. Other low-power neuromorphic chips, such as Intel’s Loihi, were not tested.
Given that SpiNNaker is part of the Human Brain Project, in which FZI is a participant, it’s not surprising that the researchers used what was available. Indeed, these might well have been the right comparisons for their specific purposes. Whether the results really represent a useful benchmarking exercise is less clear.
Finally, a project7 at TU Dresden, in collaboration with the creators of Nengo and SpiNNaker, was much less ambitious in its goals: comparing SpiNNaker 2 with Loihi for keyword spotting and adaptive control tasks. (Spoiler alert: SpiNNaker 2 was more energy-efficient for the former and Loihi for the latter.) Comparing just two systems may seem to make this a less important benchmarking study (though it fulfilled some other important goals). But it may also have been the only way the researchers could generate a fair and useful comparison. That demonstrates the difficulty well.
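As a rough illustration of how such head-to-head results are often boiled down to a single figure, energy per inference can be computed as average power multiplied by latency. The numbers below are invented for illustration; they are not the published Loihi or SpiNNaker 2 measurements.

```python
def energy_per_inference_mj(avg_power_mw, latency_ms):
    # mW x ms = microjoules; divide by 1000 to express in millijoules.
    return avg_power_mw * latency_ms / 1000.0

# Two hypothetical chips on the same task: one faster but hungrier,
# one slower but frugal. Neither dominates until energy is computed.
a = energy_per_inference_mj(avg_power_mw=300.0, latency_ms=5.0)
b = energy_per_inference_mj(avg_power_mw=120.0, latency_ms=20.0)
print(a, b)
```

Even this simple reduction hides choices (idle power, batch size, what counts as an "inference") that can flip the ranking, which is why narrow two-system studies may be the only fair ones.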
The play’s the thing
In a 2019 commentary8 on neuromorphic benchmarking, Mike Davies, head of Intel’s Loihi project, suggests a suite of tasks and metrics that could be used to measure performance. These include everything from keyword spotting to classification of the Modified National Institute of Standards and Technology (MNIST) database digits, playing Sudoku, gesture recognition, and moving a robotic arm.
Perhaps Davies’ most compelling suggestion, however, is that we pursue the grander kind of challenge familiar from robotics and AI: creating contests in which machines compete directly against each other (RoboCup soccer) or even against humans (chess or Go). Foosball, too, has emerged as a potential interim challenge but seems unlikely, in the long run, to present sufficient complexity to demonstrate any advantages offered by neuromorphic engineering.
Among the advantages of competitions is that, rather than standardize in arbitrary ways, individual research groups can use their creativity to forge the best system, optimized for their hardware, encoding method, learning rules, network architecture, and neuron/synapse type. Where flexibility in the rules is needed, accommodations can be made or rejected in consultation with other players — who may themselves require restrictions to be lifted or relaxed.
Done well, that approach could provide a more creative and higher-level playing field that could help push the discipline forward.
3. Blume-Kohout, R., and Young, K. Metrics and Benchmarks for Quantum Processors: State of Play. 2018. bit.ly/3D0hRxT
4. Dongarra, J., Martin, J. L., and Worlton, J. Computer Benchmarking: Paths and Pitfalls. IEEE Spectrum 24, 38–43. 1987. bit.ly/3a5BFDg
5. Kulkarni, S. R., Parsa, M., Mitchell, J.P., and Schuman, C.D. Benchmarking the performance of neuromorphic and spiking neural network simulators. Neurocomputing (Amsterdam) 447, 145–160. 2021. bit.ly/3mucYGq
6. Steffen, L., et al. Benchmarking Highly Parallel Hardware for Spiking Neural Networks in Robotics. Frontiers in Neuroscience 15, 1–17. 2021. bit.ly/3Ae8tEQ
7. Yan, Y., et al. Comparing Loihi with a SpiNNaker 2 Prototype on Low-Latency Keyword Spotting and Adaptive Robotic Control. Neuromorphic Computing and Engineering. 2021. doi:10.1088/2634-4386/abf150. bit.ly/3a7xfMi
8. Davies, M. Benchmarks for progress in neuromorphic computing. Nature Machine Intelligence 1, 386–388. 2019. go.nature.com/3msgSzJ