Photonic TPUs Beat GPUs in Accelerating Next-Gen Machine Learning

Article By : Nitin Dahad

Using photonic tensor cores rather than GPUs or TPUs can achieve 2-3 orders of magnitude higher performance, along with lower latency and power consumption...

A new approach to performing neural network computations for machine learning, using photonic tensor cores instead of graphics processing units (GPUs), suggests that performance 2-3 orders of magnitude higher can be achieved when processing optical data feeds. It also indicates that photonic processors have the potential to augment electronic systems and may perform exceptionally well in network-edge devices in 5G networks.

The work has been published in the journal Applied Physics Reviews, in a paper, “Photon-based processing units enable more complex machine learning,” by Mario Miscuglio and Volker Sorger from the department of electrical and computer engineering at George Washington University in the United States.

In their approach, a photonic tensor core multiplies matrices in parallel, improving the speed and efficiency of deep learning. In machine learning, neural networks are trained to make unsupervised decisions and classifications on unseen data. Once a neural network is trained on data, it can produce an inference to recognize and classify objects and patterns and find a signature within the data.

The photonic TPU stores and processes data in parallel, featuring an electro-optical interconnect, which allows the optical memory to be efficiently read and written, and the photonic TPU to interface with other architectures.

“We found that integrated photonic platforms that integrate efficient optical memory can obtain the same operations as a tensor processing unit, but they consume a fraction of the power and have higher throughput and, when opportunely trained, can be used for performing inference at the speed of light,” said Mario Miscuglio, one of the authors.

Most neural networks comprise multiple layers of interconnected neurons that aim to mimic the human brain. An efficient way to represent these networks is as a composite function that multiplies matrices and vectors together. This representation allows operations to be performed in parallel on architectures specialized in vectorized operations such as matrix multiplication.
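As a rough illustration of that composite-function view, the sketch below chains matrix-vector products through two small layers. The ReLU nonlinearity, the layer sizes and the weight values are assumptions chosen for the example; the article does not specify them.

```python
# Minimal sketch: a neural network as a composition of matrix-vector products.
# The ReLU activation and all weight values are illustrative assumptions.

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def relu(vector):
    """Element-wise nonlinearity applied between layers."""
    return [max(0.0, v) for v in vector]

def forward(layers, x):
    """Apply each layer in turn: x -> relu(W x)."""
    for weights in layers:
        x = relu(matvec(weights, x))
    return x

layers = [
    [[1.0, -1.0], [0.5, 0.5]],  # layer 1: 2 inputs -> 2 neurons
    [[2.0, 1.0]],               # layer 2: 2 inputs -> 1 neuron
]
output = forward(layers, [3.0, 1.0])
print(output)
```

Every row of a layer's weight matrix is an independent dot product, which is why hardware built for parallel matrix multiplication suits this workload.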



Photonic tensor core and dot product engine
(a) The photonic tensor core (PTC) consists of 16 dot-product engines that inherently and independently perform row-by-column pointwise multiplication and accumulation. (b) The dot-product engine performs the multiplication between two vectors. The ith row of the input matrix is given by WDM signals, which are modulated by high-speed (e.g., Mach–Zehnder) modulators. The jth column of the kernel matrix is loaded in the photonic memory by properly setting its weight states. Exploiting light-matter interaction with the phase-change memory, the inputs, suitably spectrally filtered by micro-ring resonators (MRR), are weighted in a seemingly quantized electro-absorption scheme (i.e., amplitude modulation), thus performing element-wise multiplication. The element-wise products are incoherently summed using a photodetector, which amounts to a MAC operation (Dij). (Image: Mario Miscuglio and Volker Sorger)
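The arithmetic described in the caption can be sketched in software as follows. This is only a numerical analogy under stated assumptions, not a model of the optics: the function names dot_product_engine and tensor_core are hypothetical, and the input values are made up. Each call to the engine computes one output element Dij from row i of the input matrix and column j of the kernel, so a 4 × 4 product uses 16 independent engines.

```python
# Numerical analogy of the PTC arithmetic (not an optical simulation).
# Function names and input values are hypothetical, for illustration only.

def dot_product_engine(input_row, kernel_column):
    """One engine: element-wise weighting, then summation (one MAC result)."""
    products = [a * w for a, w in zip(input_row, kernel_column)]  # weighting step
    return sum(products)                                          # summation step

def tensor_core(input_matrix, kernel_matrix):
    """Compute D[i][j] = row i of input . column j of kernel, one engine each."""
    kernel_columns = list(zip(*kernel_matrix))
    return [
        [dot_product_engine(row, column) for column in kernel_columns]
        for row in input_matrix
    ]

inputs = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
identity = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
result = tensor_core(inputs, identity)
print(result)
```

In software the 16 engine calls run one after another; the point of the hardware described here is that they operate independently and simultaneously.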

The more intelligent the task and the higher the accuracy of the prediction desired, the more complex the network becomes. Such networks demand larger amounts of data for computation and more power to process that data. Current digital processors suitable for deep learning, such as graphics processing units (GPUs) or tensor processing units (TPUs), are limited in performing more complex operations with greater accuracy by the power required to do so and by the slow transmission of electronic data between the processor and the memory.

The researchers showed that the performance of their photonic TPU could be 2-3 orders of magnitude higher than that of an electrical TPU. Photons may also be an ideal match for computing node-distributed networks and engines performing intelligent tasks with high throughput at the edge of networks, such as 5G. At network edges, data signals may already exist in the form of photons from surveillance cameras, optical sensors and other sources.

“Photonic specialized processors can save a tremendous amount of energy, improve response time and reduce data center traffic,” said Miscuglio. For the end user, that means data is processed much faster, because a large portion of the data is preprocessed, meaning only a portion of the data needs to be sent to the cloud or data center.

Making the case for optical versus electrical

The paper presents a case for taking the optical route for carrying out machine learning tasks. It says that in most neural networks (NNs), which comprise multiple layers of interconnected neurons or nodes, each neuron and layer, as well as the network interconnectivity, is essential for the task for which the network has been trained. In their connected layers, NNs rely strongly on vector-matrix math operations in which large matrices of input data and weights are multiplied, according to the training. Complex, multi-layered deep NNs require a sizeable amount of bandwidth and low latency to satisfy the vast number of operations required to perform large matrix multiplications without sacrificing efficiency and speed.

So how do you efficiently multiply these matrices? With general-purpose processors, the matrix operations take place serially while requiring continuous access to the cache memory, generating the von Neumann bottleneck. Specialized architectures such as GPUs and TPUs help reduce the effect of this bottleneck, enabling some effective machine learning models.

GPUs and TPUs are particularly beneficial compared to CPUs, but when used to implement deep NNs performing inference on large two-dimensional datasets such as images, they can be power-hungry and require longer computation runtimes (greater than tens of milliseconds). Smaller matrix multiplications for less complex inference tasks are still challenged by a non-negligible latency, predominantly due to the access overhead of the various memory hierarchies and the latency in executing each instruction in the GPU.

The authors of the paper suggest that given this context, it is necessary to explore and reinvent the operational paradigms of current logic computing platforms, in which matrix algebra relies on continuous access to memory. In this respect, the wave nature of light and related inherent operations, such as interference and diffraction, can play a major role in enhancing computational throughput and concurrently reducing the power consumption of neuromorphic platforms.

They suggest that future technologies should perform computing tasks in the domain in which their time-varying input signals lie, exploiting their intrinsic physical operations. In this view, photons are an ideal match for computing node-distributed networks and engines performing intelligent tasks over large data at the edge of a network (e.g., 5G), where the data signals may exist already in the form of photons (e.g., surveillance camera, optical sensor, etc.), thus pre-filtering and intelligently regulating the amount of data traffic that is allowed to proceed downstream toward data centers and cloud systems.

This is where they explore the approach using a photonic tensor core (PTC) able to perform 4 × 4 matrix multiplication and accumulation with a trained kernel in one shot (i.e., non-iteratively) and entirely passively; in other words, once an NN is trained, the weights are stored in a 4-bit multilevel photonic memory directly implemented on-chip, without the need for either additional electro-optic circuitry or off-chip dynamic random-access memory (DRAM). The photonic memories feature low-loss, phase-change, nanophotonic circuits based on wires of Ge2Sb2Se5 deposited on a planarized waveguide, which can be updated using electrothermal switching and can be read completely optically. Electrothermal switching is enabled by tungsten heating electrodes, which clamp the phase-change memory (PCM) wire.
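To give a feel for what a 4-bit multilevel memory implies for the stored weights, the sketch below maps a trained weight onto one of 2^4 = 16 discrete states. The uniform spacing of the levels and the normalized weight range of 0 to 1 are assumptions made for illustration; the article does not describe how the memory states are spaced or programmed.

```python
# Illustrative 4-bit weight quantization (16 states).
# Uniform level spacing and a normalized [0, 1] weight range are assumptions.

BITS = 4
LEVELS = 2 ** BITS  # 16 distinguishable memory states

def quantize(weight, levels=LEVELS):
    """Map a weight in [0, 1] to the nearest of `levels` integer states."""
    clamped = min(1.0, max(0.0, weight))
    return round(clamped * (levels - 1))

def dequantize(state, levels=LEVELS):
    """Recover the weight value that a stored state represents."""
    return state / (levels - 1)

trained_row = [0.0, 0.2, 0.6, 1.0]
stored_states = [quantize(w) for w in trained_row]
print(stored_states)
```

Under these assumptions the worst-case rounding error per weight is half a level, or 1/30 of the full range, which is the precision cost of holding the kernel in 4-bit cells.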

Photonic tensor core performance
Tensor core performance comparison. Electronic data-fed (left column) photonic tensor core (PTC) offers 2–8× throughput improvement over Nvidia’s T4 and A100, and for optical data (e.g., camera) improvements are ∼60× (chip area limited to a single die ∼800 mm²). (Table: Mario Miscuglio and Volker Sorger)

The authors said this work represents the first approach toward the realization of a photonic tensor processor storing data and processing in parallel, which could scale the number of multiply-accumulate (MAC) operations by several orders of magnitude while significantly suppressing power consumption and latency compared to the state-of-the-art hardware accelerators delivering real-time analytics.

Unlike digital electronics, which rely on logic gates, in integrated photonics, multiplication, accumulation and, more generally, linear algebraic operations can be performed inherently and non-iteratively, benefiting from the intrinsic parallelism provided by the electromagnetic nature of the signals and efficient light-matter interaction. In this regard, integrated photonics is an ideal platform for mapping specific complex operations one-to-one into hardware, and in some cases algorithms, achieving constant, O(1), time complexity.
