Get Ready for Transformational Transformer Networks

Article By: Sally Ward-Foxton

A transformer network's attention mechanism 'is going to really blow the doors off of research', experts say.

Got some grainy footage to enhance, or a miracle drug you need to discover? No matter the task, the answer is increasingly likely to be AI in the form of a transformer network.  

Transformers, as those familiar with the networks like to refer to them in shorthand, were invented at Google Brain in 2017 and are widely used in natural language processing (NLP). Now, though, they are spreading to almost all other AI applications, from computer vision to biological sciences.

Transformers are extremely good at finding relationships in unstructured, unlabeled data. They are also good at generating new data. But to generate data effectively, transformer algorithms often must grow to extreme proportions. Training the language model GPT-3, with its 175 billion parameters, is estimated to have cost between $11 million and $28 million. That’s to train one network, one time. And transformer size is not showing any sign of plateauing.

Transformer networks broaden their view 

Ian Buck (Source: Nvidia)

What makes transformers so effective at such a wide range of tasks?  

Ian Buck, general manager and VP of accelerated computing at Nvidia, explained to EE Times that, while earlier convolutional networks might look at neighboring pixels in an image to find correlations, transformer networks use a mechanism called “attention” to look at pixels further away from each other.  

“Attention focuses on remote connections: It’s not designed to look at what neighbors are doing but to identify distant connections and prioritize those,” he said. “The reason [transformers] are so good at language is because language is full of context that isn’t about the previous word but [dependent] on something that was said earlier in the sentence—or putting that sentence in the context of the whole paragraph.”  

For images, this means transformers can be used to contextualize pixels or groups of pixels. In other words, transformers can be used to look for features that are a similar size, shape, or color somewhere else in the image to try and better understand the whole image. 

“Convolutions are great, but you often had to build very deep neural networks to construct these remote relationships,” Buck said. “Transformers shorten that, so they can do it more intelligently, with fewer layers.”  
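To make the contrast with neighbor-focused convolutions concrete, here is a minimal sketch of scaled dot-product self-attention in plain Python. It is an illustration under simplifying assumptions, not any vendor's implementation: the token vectors are invented, the function names are hypothetical, and each vector doubles as its own query, key, and value, whereas real transformers learn separate projections for each.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Scaled dot-product self-attention over equal-length vectors.

    Assumption of this sketch: every token vector is used directly as
    query, key, and value (no learned projections).
    """
    dim = len(tokens[0])
    outputs, all_weights = [], []
    for query in tokens:
        # Score the query against every position, near or far.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        weights = softmax(scores)
        # Each output is a weighted mix of all token vectors.
        mixed = [sum(w * tok[i] for w, tok in zip(weights, tokens))
                 for i in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Toy sequence: positions 0 and 3 are similar but far apart.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = self_attention(tokens)
```

In this toy sequence, position 0 gives position 3 the same weight it gives itself, and more than it gives its immediate neighbor at position 1: distance plays no part in the score, only similarity does.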

The more remote the connections a transformer considers, the bigger it gets, and this trend doesn’t seem to have an end in sight. Buck referred to language models considering words in a sentence, then sentences in a paragraph, then paragraphs in a document, then documents across a corpus of the internet.  

Once they understand language, transformer networks may be able to learn about any subject where there is sufficient text, effectively absorbing knowledge by reading about it. Different types of transformers can also be used for computer vision and generation of images. The author created these images using Craiyon (formerly known as Dall-E Mini), a generative pre-trained transformer network, using the prompt “transformer robot reading large stack of books photorealistic”. (Source: EE Times)

So far, there doesn’t seem to be a theoretical limit on transformer size. Buck said studies on 500 billion parameter models have demonstrated they are not yet near the point of overfitting. (Overfitting occurs when models effectively memorize the training data.)  

“This is an active question in AI research,” Buck said. “No one has figured it out yet. It’s just a matter of courage,” he joked, noting that making models bigger isn’t as straightforward as just adding more layers; extensive design work and hyperparameter tuning are required.

There may be a practical limit, though.  

“The bigger the model, the more data you need to train on,” Buck said, noting that the vast amount of data required must also be high quality: It has to be screened so that language models aren’t trained on irrelevant or inappropriate content, and repetition has to be filtered out. The requirement for data may be a limiting factor in transformer size going forward.

Recognizing trends for extremely large networks, Nvidia’s Hopper GPU architecture includes a transformer engine—a combination of hardware and software features that enables more throughput while preserving accuracy. Buck argued that platforms like Hopper address economic limits on training transformers by allowing smaller infrastructure to train larger networks.  

Applications abound   

Transformers may have started in language, but they are being applied to fields as disparate as computer vision and drug discovery. One compelling use case is medical imaging, where transformers can be used to generate synthetic data for training other AIs.  

Nvidia, for example, has collaborated with researchers at King’s College London (KCL) to create a library of open-source, synthetic brain images.  

Kimberly Powell (Source: Nvidia)

Kimberly Powell, Nvidia’s VP of healthcare, told EE Times this solves two problems: the shortage of training data in the quantities required for large AI models, particularly for rare diseases, and the need to deidentify data, since synthetic images are not any person’s private medical data. Transformers’ attention mechanism can learn how brains look in patients of different ages or with different diseases, and generate images with different combinations of those variables.

“We can learn how female brains in neurodegenerative diseases atrophy different than male brains, so now you can start doing a lot more model development,” she said. “The fact of the matter is we don’t have that many anomalous, if you will, brain images to start with. Even if we amassed all the world’s data, we just didn’t have enough of it. This is going to really blow the doors off of research.” 

KCL investigators use these synthetic brain images to develop models that help detect stroke, or to study the effects of dementia, for starters.  

Researchers have also taught transformers the language of chemistry.  

Transformers can dream up new molecules, then fine-tune them to have specific properties, an application Powell called “revolutionary.” These biological models have the potential to be much larger than language models, since chemical space is so large.

“For spoken language, there’s only so many ways you can arrange it,” she said. “My genome is 3 billion base pairs and there are 7 billion of us. At some point, this type of biological model will need to be much, much larger.”  

Large language models are also used as a shortcut to teach AI about scientific fields where a large amount of unstructured language data already exists, particularly in medical sciences. 

“Because [the transformer] encoded the knowledge of whatever domain you’ve thrown at it, there are downstream tasks you can ask it to do,” Powell said, noting that once the model knows that certain words represent certain diseases or drugs, it can be used to look for relationships between drugs and diseases or between drugs and patient demographics.  

Nvidia has pioneered BioMegatron, a large language model trained on data from PubMed, the archive of biomedical journal articles. The model can be adapted for various medical applications, including searching for associations between symptoms and drugs in doctors’ notes.

Janssen, the pharmaceutical arm of Johnson & Johnson, is using this technology to scan medical literature for possible drug side effects, and recently improved accuracy by 12% using BioMegatron. 

Transformers can also learn about hospital behaviors like readmission rates from unstructured clinical text.  

The University of Florida has trained GatorTron-S, its 8.9-billion-parameter model, on discharge summaries so it can be used to improve healthcare delivery and patient outcomes. 

Challenges to scaling up 

Andrew Feldman (Source: Cerebras)

Training huge transformer networks presents specific challenges to hardware. 

“OpenAI showed that, for this particular class of networks, the bigger they are, the better they seem to do,” Cerebras CEO Andrew Feldman told EE Times. “That is a challenge to hardware. How do we go bigger? It’s a particular challenge on the front of multi-system scaling. The real challenge is: Can you deliver true linear scaling?”  

Hardware has historically struggled to scale linearly for AI compute: The movement of data requires a huge amount of communication between chips, which uses power and takes time. This communication overhead has been a limiting factor on system practicality at the large end.  
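A toy model, not a measurement of any real system, shows why communication overhead stands in the way of linear scaling. The assumptions are loud ones: each step's compute divides evenly across chips, a fixed communication cost is paid per step as soon as more than one chip is involved, and every number and function name below is invented for illustration.

```python
def step_time(compute_s, comm_s, chips):
    """Seconds per training step under the toy model."""
    overhead = comm_s if chips > 1 else 0.0  # a single chip has no one to talk to
    return compute_s / chips + overhead

def scaling_efficiency(compute_s, comm_s, chips):
    """Achieved speedup divided by ideal (linear) speedup; 1.0 is perfect."""
    single = step_time(compute_s, comm_s, 1)
    speedup = single / step_time(compute_s, comm_s, chips)
    return speedup / chips

# Hypothetical workload: 100 s of compute per step, 0.5 s of communication.
for chips in (1, 8, 64, 512):
    eff = scaling_efficiency(compute_s=100.0, comm_s=0.5, chips=chips)
    print(f"{chips:4d} chips: {eff:.0%} of linear scaling")
```

As the chip count grows, the compute share of each step shrinks while the communication share does not, so efficiency falls further and further below the linear ideal.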

“One of the fundamental challenges on the table is: Can we build systems that are large like transformers but build hardware that scales linearly? That is the Holy Grail,” Feldman said.  

Cerebras’ wafer-scale engine addresses this by effectively building a chip the size of an entire wafer, so that the communications bottleneck is drastically reduced.  

Feldman splits users of today’s Big AI broadly into two groups.  

In the first group are organizations with scientific research objectives. These organizations spend billions of dollars to create or gather the training data they require, including pharmaceutical and energy companies performing drug discovery or looking for oil. These companies work hard to extract insight from data they already have because it’s so expensive to create more. 

In the second group are hyperscalers like Google and Meta. “For them, the data is exhaust,” he said. “It’s gathered approximately for free from their primary business. And they approach it profoundly differently because they’ve paid nothing for it.”   

One player addressing affordability for all 

The size limit for transformers is also an economic one, Feldman said.  

“Part of the challenge is, how do we build models that are hundreds of billions or tens of trillions [of parameters in size] but build hardware so that more than six or eight companies in the world can afford to work on them?” he said, noting that if training costs tens of millions of dollars, it is out of reach for universities and many other organizations.  

One of Cerebras’ goals is to make large-model training accessible to universities and large enterprises at a cost they can afford. (Cerebras has made its WSE available in the cloud to try to tackle this.)

“Otherwise, Big AI becomes the domain of a very small number of companies, and I think historically that’s been bad for the industry,” he said.  

Transformer networks getting closer to issues  

Transformers are also spreading to the edge.  

While the largest networks remain out of reach, inference for smaller transformers on edge devices is gaining ground. 

Wajahat Qadeer (Source: Kinara)

Wajahat Qadeer, chief architect at Kinara, told EE Times the edge AI chip company is seeing demand for both natural language processing and vision transformers in edge applications. This includes ViT (vision transformer, for vision) and DETR (detection transformer, for object detection). 

“In either case, the transformer networks that work best at the edge are typically smaller than BERT-Large at 340 million parameters,” he said. “Bigger transformers have billions or even trillions of parameters and thus require huge amounts of external memory storage, big DRAMs, and high bandwidth interfaces, which are not feasible at the edge.” (BERT, bidirectional encoder representations from transformers, is a natural language processing model Google uses in its search engine). 
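Some rough arithmetic shows why parameter count translates so directly into memory pressure. The sketch below uses the parameter counts quoted in this article (340 million for BERT-Large, 175 billion for GPT-3) and the standard storage sizes of 4, 2, and 1 bytes for FP32, FP16, and INT8 values. It counts weights only; activations and any other runtime memory are ignored, so real requirements would be higher. The function name is hypothetical.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

def weight_gigabytes(params, dtype):
    """Storage for the weights alone, in GB (10**9 bytes)."""
    return params * BYTES_PER_PARAM[dtype] / 1e9

# Parameter counts as quoted in the article.
models = {"BERT-Large": 340e6, "GPT-3": 175e9}

for name, params in models.items():
    for dtype in BYTES_PER_PARAM:
        size = weight_gigabytes(params, dtype)
        print(f"{name:10s} {dtype}: {size:8.2f} GB")
```

Under these assumptions, BERT-Large's weights occupy about 1.4 GB in FP32, while GPT-3's need 350 GB even in FP16.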

There are ways to reduce the size of transformers so inference can be run in edge devices, Qadeer said.  

“For deployment on the edge, large models can be reduced in size through techniques, such as student-teacher training, to create lightweight transformers optimized for edge devices,” he said, providing MobileBERT as an example. “Further size reductions are possible by isolating the functionality that pertains to the deployment use cases and only training students for that use case.”

Student-teacher training is a method in which a smaller student network is trained to reproduce the outputs of a larger teacher network.
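One common form of student-teacher training, often called knowledge distillation, scores the student on how closely its output probabilities match the teacher's after both are softened by a temperature. The sketch below shows only that loss, in plain Python. The article does not say which variant Kinara or MobileBERT uses, so treat this as an illustration: the logits, the temperature value, and the function names are all assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence from the teacher's softened outputs to the student's."""
    p = softmax(teacher_logits, temperature)  # target distribution
    q = softmax(student_logits, temperature)  # student's attempt
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Invented logits for a three-class problem.
teacher = [4.0, 1.0, 0.5]
good_student = [3.8, 1.1, 0.4]  # close to the teacher
poor_student = [0.5, 4.0, 1.0]  # favors the wrong class

print(distillation_loss(teacher, good_student))
print(distillation_loss(teacher, poor_student))
```

Training would adjust the student's weights to push this loss toward zero, which is how a student that closely tracks the teacher ends up with a far lower loss than one that favors the wrong class.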

Techniques like this can bring transformer-powered NLP to applications like smart home assistants, where consumer privacy dictates data doesn’t enter the cloud. Smartphones are another key application here, Qadeer said.  

“In the second generation of our chips, we have specially enhanced our efficiency for pure matrix-matrix multiplications, have significantly increased our memory bandwidth, both internal and external, and have also added extensive vector support for floating point operations to accelerate activations and operations that may require higher precision,” he added.  

Transformer convergence is happening 

Marshall Choy (Source: SambaNova)

Marshall Choy, senior VP of product at SambaNova, told EE Times that while there was a vast proliferation of model types emerging five years ago, that period of AI’s history may well be over.  

“We’re starting to see some convergence,” Choy said. Five years ago, he added, “it was still something of an open research question for language models… The answer is pretty clear now: It’s transformers.”  

A typical scenario across SambaNova’s banking customer base, Choy said, might be hundreds or even thousands of disparate instances of BERT, a situation that hardly encourages repeatability. SambaNova’s hardware and software infrastructure offering includes pre-trained foundation models on a subscription basis. The company typically works with its customers to transition from BERT to SambaNova’s pre-trained version of GPT (generative pre-trained transformer, a model for producing human-like text).  

“We are not trying to be a drop-in replacement for thousands of BERT models,” he said. “We’re trying to give customers an onramp from where they are today to reimagining thousands of BERT models with one GPT instance… to get them to where they ought to be at enterprise scale.”  

A side effect of convergence on transformers so far has been enterprises’ shift from neural network engineering to focusing on data-set creation, Choy said, as they increasingly see data sets, and not models, as their IP.  

“You could be dramatic and say convergence leads to commoditization. I don’t think we’re there yet. But if you look at the trajectory we’re on, I think models are going to be commoditized at some point,” he said. “It may be sooner rather than later, because software development moves so fast.”


This article was originally published on EE Times.

Sally Ward-Foxton covers AI technology and related issues for EETimes.com and all aspects of the European industry for EE Times Europe magazine. Sally has spent more than 15 years writing about the electronics industry from London, UK. She has written for Electronic Design, ECN, Electronic Specifier: Design, Components in Electronics, and many more. She holds a Master’s degree in Electrical and Electronic Engineering from the University of Cambridge.

