As engineers we want to measure progress. But how?
Anyone building a new technology understands that success partly depends on adding value: that is, demonstrating that your technology is better than your competitors'. Only in this way can innovators attract investors and satisfy managers. Making a smaller, faster, lighter, more efficient replacement for something that already exists is relatively easy.
It’s much harder to create something genuinely new and different.
Neuromorphic computing is among the fields where engineers are attempting something genuinely new, and the lack of easy comparisons between different systems, both neuromorphic and otherwise, can be a problem.
Part of the issue has to do with the complexity of the field. Neuromorphic technology is brain-inspired. But as discussed previously, there are many ways that inspiration can be implemented at the hardware level: analog or digital; spikes or not; continuous or discrete time; virtual or direct connections between neurons.
There are also competing goals and emphases within different groups. Some wish to simulate biology, some focus on energy efficiency, others want to simulate human-like intelligence, and still others simply seek practical solutions to everyday machine learning problems.
How, then, can developers benchmark neuromorphic systems that use different interfaces, encodings, technologies and approaches? Benchmarking has emerged as a hot topic in neuromorphic engineering over the last few years. Going back to 2016, there have been many attempts to compare systems on different applications, or running different algorithms, networks or combinations of the two. Three studies on the subject were published just this year.
We’ll consider those studies in a future post, but for now it’s worth considering the larger picture to understand why benchmarking remains a hard problem.
Devil in the details
Consider trying to run a learning or recognition task and comparing how it performs on competing experimental systems. First, a task must be chosen that is at least achievable on all the systems being compared, despite the fact that they may not have been designed with it in mind. The entire process must also be considered, from loading the task to running it and generating an output, and whether each step has been optimized.
If not (and this wouldn’t be unexpected in an emerging technology), metrics must be broken down so they only measure the relevant systems, not the infrastructure (temporarily) needed to support those systems.
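As a concrete illustration of separating the system under test from its supporting infrastructure, here is a minimal sketch of per-stage measurement. The stage names and timings are invented placeholders, not any real benchmark harness: the point is simply that timing each step independently lets you report the core execution figure on its own, rather than folding host-side loading and encoding overhead into the result.

```python
import time

# Hypothetical stage functions standing in for a real benchmark pipeline;
# the names and sleep durations are illustrative assumptions only.
def load_model():    time.sleep(0.02)    # compile/load the network onto the device
def encode_input():  time.sleep(0.01)    # convert data to the device's input format (host-side)
def run_inference(): time.sleep(0.005)   # on-device execution: the part we actually want to measure
def decode_output(): time.sleep(0.01)    # read results back and convert to labels (host-side)

def timed(fn):
    """Return the wall-clock duration of one call to fn, in seconds."""
    t0 = time.perf_counter()
    fn()
    return time.perf_counter() - t0

stages = {f.__name__: timed(f)
          for f in (load_model, encode_input, run_inference, decode_output)}
total = sum(stages.values())
core = stages["run_inference"]

for name, t in stages.items():
    print(f"{name:14s} {t * 1e3:7.2f} ms ({100 * t / total:4.1f}% of total)")
print(f"core execution is only {100 * core / total:.1f}% of the end-to-end time")
```

With a breakdown like this, the (temporary) infrastructure cost can be reported separately from, or subtracted from, the figure used for comparison.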
Of course, a valid comparison means acquiring systems that are often in short supply. It also assumes the hardware has been built in a way that yields the information needed for analysis, such as performance or power consumption at each stage. Unfortunately, this will often not be the case.
Much as it would make things easier, to date there is no neuromorphic equivalent of FLOPS (floating-point operations per second). Engineers have tried multiply-and-accumulate (MAC) operations instead. While somewhat applicable to deep learning, MACs do not reflect the complexity of neuromorphic engineering.
Nor do synaptic operations. Why? Because there are too many ways to get the job done; too many learning rules that can be used; too many encoding methods; too many synapse and neuron functions.
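To make that contrast concrete, here is a back-of-envelope sketch. The layer sizes, spike rates and timestep count are made-up assumptions, not measurements of any real chip. A conventional dense layer costs a fixed number of MACs per inference, whereas an event-driven count of synaptic operations depends on how active the network is, which in turn depends on the encoding, the neuron model and the task.

```python
# Illustrative counts only; all numbers below are invented for the example.

def dense_macs(n_in, n_out):
    # A conventional dense layer always costs n_in * n_out MACs per inference.
    return n_in * n_out

def synaptic_ops(n_in, n_out, spike_rate, timesteps):
    # An event-driven layer only pays for synapses whose input neuron fires,
    # so the count depends on activity, encoding, and simulation length.
    return int(n_in * spike_rate * timesteps) * n_out

n_in, n_out = 784, 100
macs = dense_macs(n_in, n_out)                # fixed at 78,400 regardless of input
sparse = synaptic_ops(n_in, n_out, 0.05, 10)  # 5% of inputs spike per timestep
busy = synaptic_ops(n_in, n_out, 0.50, 10)    # 50% of inputs spike per timestep

print(f"MACs (fixed):        {macs:,}")
print(f"SOPs @ 5% activity:  {sparse:,}")
print(f"SOPs @ 50% activity: {busy:,}")
```

The same network can cost far fewer or far more synaptic operations than its MAC count, depending purely on activity, which is why neither number alone makes a fair cross-system yardstick.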
Even if that weren't the case, it's well known that, all things being equal, a compiler must be tailored to the system and the software it is to run. If not, then even the best system will perform poorly.
To get some clarity, it helps to look beyond neuromorphic technology. Robin Blume-Kohout of Sandia National Laboratories is interested in benchmarking quantum computers. In a talk titled "Not All Benchmarks Are Created Equal," Blume-Kohout discusses the difficulty of relying on benchmarks for any technology still at a very early development stage.
Honing that argument in a 2018 technical report, he declared: “Today’s most advanced quantum processors are like infants. Metrics and benchmarks that are useful for adult humans (e.g., IQ or SAT scores) are blatantly inapplicable to an infant [whose] whole purpose is to grow into an adult.
“Monitoring its progress requires skills and knowledge totally different from what’s needed to evaluate an adult,” he continued. “Children and immature technologies both progress counterintuitively and sometimes even appear to regress (losing baby teeth or entering adolescence).”
Neuromorphic engineering is at a more advanced stage than quantum computing: practical systems exist, albeit mostly at a small scale. But Blume-Kohout’s point remains valid for an adolescent technology. Just as over-testing children at school can make them proficient at passing tests but poor at independent study, using the wrong benchmarks at a formative stage can skew the development of neuromorphic engineering in the wrong direction.
Blume-Kohout is not the first to warn of the dangers of bad benchmarking. Grappling with the best ways to evaluate computers in the burgeoning computer industry of the 1980s, Jack Dongarra of Argonne National Laboratory wrote: “The value of a computer depends on the context in which it is used, and that context varies by application, by workload and in terms of time.
“An evaluation that is valid for one site may not be good for another, and an evaluation that is valid at one time may not hold true just a short time later,” Dongarra asserted.
Then there was this warning: “Although benchmarks are essential in performance evaluation, simple-minded application of them can produce misleading results. In fact, bad benchmarking can be worse than no benchmarking at all,” he concluded.
Next, I’ll examine ways in which researchers are trying to solve the benchmarking conundrum.
This article was originally published on EE Times.
Dr. Sunny Bains teaches at University College London, is author of Explaining the Future: How to Research, Analyze, and Report on Emerging Technologies, and is currently writing a book on neuromorphic engineering.