The transformer model quietly arrived in 2017 and set off the modern AI explosion. Photo credit: Igor Omilaev via Unsplash
Artificial intelligence seems to have appeared almost overnight, generating artwork and answering questions in a way that would have been unthinkable just a few years ago. One breakthrough behind this sudden prevalence was the transformer architecture, a reimagining of how neural networks learn to focus on parts of their input. In 2017, researchers at Google released a paper titled Attention Is All You Need, which explained how, by making use of a mechanism known as “attention”, transformers could surpass many state-of-the-art deep learning techniques. Now, transformers and attention are used everywhere, but why were they such a big change?
…by making use of a mechanism known as “attention”, transformers could surpass many state-of-the-art deep learning techniques.
Before the development of the transformer architecture, deep learning methods of the time relied on far more flawed tools. Deep learning is the process of developing neural networks – biologically inspired structures of computer-run calculations. These typically include trainable parameters, which can be thought of as control dials that the system adjusts repeatedly until the network is competent at its purpose. Even before the advent of the transformer, deep learning technologies were widespread: language translation, image recognition, and even simple chatbots could be created from a series of parameter-driven calculations.
This raises an obvious question: if calculations with the right trained parameters can infer obscure details about an input, how are these parameters chosen? The answer is a large amount of training data. A neural network generally begins life with randomised parameters, which are improved iteratively as its output is compared against training data. For example, if a model’s aim is to perform calculations on an image and output a word specifying what the image is of, it would be reasonable to expect that word to be “cat” when an image of a cat is fed in. If it instead outputs “dog”, the parameters are tweaked so that it is more likely to succeed in this instance. With a large volume of data, a model can eventually become suited to its task.
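For readers who like to see the idea in code, the tiny Python sketch below plays the same game with a single “dial”. The data follow a hidden rule (each output is double its input), and the parameter is nudged repeatedly until the model’s guesses match. Every name and number here is invented purely for illustration; real networks adjust millions of such dials at once.

```python
# A minimal sketch of "adjust parameters until outputs match training data".
# The model is a single dial w; all values are illustrative, not from any real system.

training_data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # inputs paired with desired outputs

w = 0.5             # start from an arbitrary parameter (real networks start randomly)
learning_rate = 0.05

for step in range(200):
    for x, target in training_data:
        prediction = w * x              # the model's guess
        error = prediction - target     # how far off it was
        w -= learning_rate * error * x  # nudge the dial to shrink the error

print(round(w, 3))  # ends up close to 2.0, the rule hidden in the data
```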
This process works for a single instance of input data such as an image, but many important scenarios involve continually arriving information. Consider audio streams, sequential data like stock prices, or even just long paragraphs where meaning is spread across a text. In these situations, a single network is unlikely to be able to see the larger connections that emerge over time. Cases like these necessitated the development of tools like Recurrent Neural Networks (RNNs).
Essentially, this gives the system a sort of short-term memory, where future outputs will consider past information.
RNNs are one of the many deep learning advancements that have come about through carefully structured neural networks: they are networks that feed information from one calculation into the next. Essentially, this gives the system a sort of short-term memory, where future outputs take past information into account. They have a major flaw, however: information gradually fades over longer series of data, so learning to tweak parameters in response to earlier points becomes much more difficult. Later improvements include Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) models. These designs helped information last longer along the recurrent sequence, providing a long-term memory via a strategy similar to taking notes. However, they were still very slow: every output had to wait until the previous calculation was complete.
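A toy sketch makes both the memory and the bottleneck visible. Each step below mixes the current input with a “memory” carried over from the previous step, so later outputs depend on earlier ones – and each step must wait for the one before it. The mixing rule and numbers are purely illustrative, not taken from any real model.

```python
# A toy sketch of the "short-term memory" in an RNN.
import math

def rnn_step(hidden, x, w_hidden=0.5, w_input=1.0):
    # squash a mix of the old memory and the new input into a new memory
    return math.tanh(w_hidden * hidden + w_input * x)

sequence = [0.2, 0.7, -0.4, 0.1]
hidden = 0.0                      # empty memory before the sequence starts
for x in sequence:
    hidden = rnn_step(hidden, x)  # later steps "remember" earlier inputs
    print(round(hidden, 3))

# Note the bottleneck: step four cannot begin until step three has finished,
# and the influence of early inputs fades as the sequence grows.
```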
With this technological context, it’s easy to see why the transformer has made such an impact. Attention Is All You Need introduced this paradigm shift in deep learning in 2017. The core development associated with this new transformer design is the titular attention mechanism.
Like attention in humans, this provides the model with a method of focusing on the important parts of the input, and of deciding which sections are relevant to each other. Each “token” (a short series of input characters – this could be a word or a sub-word, for example) is encoded into three vectors: the query, the key, and the value. The query represents what the token is “looking for”, in a sense, and the key describes the features by which other tokens can decide it is relevant. By comparing one token’s query to another’s key, the network can constantly ask, and answer, “which other token is relevant to this one?”. The value, representing the token’s information, influences the output of the model more strongly when the query and key are highly similar. For example, in the sentence “It was a cat”, the word “cat” is highly relevant to the word “it”, and a well-designed transformer should find a high similarity between the former’s key and the latter’s query. Consequently, “cat” contributes more of its value to “it”, allowing meaning to be derived from each token’s context.
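In code, that comparison is just a table of similarity scores turned into weights. The Python sketch below uses random vectors to stand in for the learned queries, keys, and values of the four tokens in “It was a cat”; because nothing here has been trained, the weights are meaningless beyond showing the mechanics.

```python
# A compact sketch of the query/key/value comparison described above.
import numpy as np

def attention(Q, K, V):
    # compare every query with every key (higher score = more relevant)
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # turn scores into weights that sum to 1 for each token
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    # each token's output is a weighted mix of the tokens' values
    return weights @ V, weights

tokens = ["It", "was", "a", "cat"]
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 4)) for _ in range(3))  # stand-ins for learned vectors

output, weights = attention(Q, K, V)
for tok, row in zip(tokens, weights):
    print(tok, np.round(row, 2))  # how much each token attends to the others
```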
Many of these attention mechanisms, known as “heads”, can be run in parallel to focus on different patterns present in the input, gaining several understandings at once. Following this with ordinary neural network layers, and with some additional structuring, leads to a transformer design – though the specifics of this structure depend on the task at hand. This development brought almost immediate changes to the field of deep learning. Recurrence, on which RNNs depended, was no longer necessary to handle sequences of data, and the newer architecture could overcome many of its flaws. Attention analyses an entire input at once, so information is no longer lost across a long piece of text, for example.
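Continuing the sketch above, several heads can run side by side, each with its own projections of the same tokens, before their results are joined back together. Again, every number below is invented; the point is only the shape of the computation.

```python
# A sketch of multiple attention "heads" working in parallel on the same tokens.
import numpy as np

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(1)
tokens = rng.normal(size=(4, 8))   # four tokens, each an 8-number vector (toy values)
num_heads, head_dim = 2, 4

head_outputs = []
for h in range(num_heads):
    # each head uses its own projections, so it can pick up on a different pattern
    Wq, Wk, Wv = (rng.normal(size=(8, head_dim)) for _ in range(3))
    head_outputs.append(attention(tokens @ Wq, tokens @ Wk, tokens @ Wv))

# the heads' results are concatenated and mixed once more
combined = np.concatenate(head_outputs, axis=-1) @ rng.normal(size=(8, 8))
print(combined.shape)  # (4, 8): one enriched vector per token
```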
Perhaps more impactful was the parallelisation that was now possible with transformers. Training these new models depended on many simultaneous multiplications, unlike RNNs, which relied on sequential processing of input tokens. GPU (Graphics Processing Unit) technology was highly apt for this. These chips were optimised to run parallel calculations, and had been growing increasingly capable even before 2017 for applications like gaming and video editing. Now, GPUs are widely used to train AI models, too.
NLP (Natural Language Processing) is a field of computer science concerned with how computers can process and generate language. This field, along with many other applications, adopted the transformer as the standard model. The original paper became one of the most cited in AI – a mark of just how impactful the development has been. Interestingly, since the 2017 publication, referencing the title has become a playful trope in the AI research world, as seen with Diversity Is All You Need, Segmentation Is All You Need, and a host of other similarly named papers.
Transformers aren’t limited to just text… they have also been applied to and excelled at several other domains.
Transformers have long since left the laboratory and become a common sight in daily life. The most obvious examples may be the GPT-based (Generative Pretrained Transformer) chatbots woven into many services and applications. Transformers aren’t limited to just text, however. They have also been applied to, and excelled at, several other domains. Vision transformers (ViT) are used for image recognition, and image generation models like DALL-E have become quite popular. Beyond images, transformers can be seen processing audio, predicting financial data, and even trying their hand at computational chemistry. The same design is applicable to many forms of data, with faster training and deeper pattern discovery leading to advancements in several technologies and research fields.

This rate of advancement can be extended further thanks to the scaling law observed in Figure 1. Put simply, OpenAI found that a transformer could be made more effective by increasing the amount of computation, the amount of training data, or the size of the model. This can be seen in the three graphs as a decrease in test loss – a value describing model performance in testing, with lower loss indicating a more successful model. In this case, bigger is better (up to a point). Developers of AI models are therefore incentivised to pour more money and electrical energy into creating increasingly powerful models.
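The curves in such graphs follow a smooth power law: loss falls by a roughly constant fraction each time the model is scaled up by a factor of ten. The snippet below sketches a curve of that kind; the constant and exponent are of about the size reported for model-size scaling, but they are included here only for illustration.

```python
# An illustrative power law of the kind the scaling-law graphs describe:
# test loss falls smoothly as model size grows. Constants are illustrative only.
model_sizes = [1e6, 1e7, 1e8, 1e9]   # number of parameters

for n in model_sizes:
    loss = (8.8e13 / n) ** 0.076     # loss ~ (N_c / N) ** alpha, with assumed constants
    print(f"{n:.0e} parameters -> test loss ~ {loss:.2f}")

# Each tenfold increase in size shaves a roughly constant slice off the loss.
```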
Data centres used to train and deploy models are quickly becoming a worldwide phenomenon. In the UK, for example, data centres consume significant power and require tonnes of water to cool components. The volume of tap water used by Scottish data centres has quadrupled since 2021, now reaching nearly 20,000 m³ in a single year – equivalent to 40 million half-litre bottles. A single prompt to ChatGPT uses 0.34 Wh of energy, or about one second of an oven’s draw. This sounds negligible until you consider the hundreds of millions of users each prompting several times a day. The desire for more and better transformer-based models, and therefore the need for more data centres, will have consequences for both national infrastructure and the environment.
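A quick back-of-the-envelope calculation shows how the per-prompt figure adds up. The 0.34 Wh comes from the text above; the user and usage counts below are assumptions chosen only to illustrate the scale, not measured figures.

```python
# Rough arithmetic: why "negligible per prompt" still adds up at scale.
wh_per_prompt = 0.34              # figure quoted in the text
users = 300_000_000               # "hundreds of millions of users" (assumed)
prompts_per_user_per_day = 5      # assumed

daily_wh = wh_per_prompt * users * prompts_per_user_per_day
print(f"{daily_wh / 1e6:.0f} MWh per day")  # ~510 MWh/day under these assumptions
```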
These data centres typically rely on advanced GPUs optimised for training AI, which also exacerbates the demand for rare materials and manufacturing capacity. The growing call for GPUs across increasingly numerous data centres has, predictably, caused an unprecedented shift in the hardware industry. Nvidia, a company which designs GPUs, has recently become the first to be valued at $5 trillion, and their realignment from gaming to AI has been one of the driving factors behind that success. Companies like Microsoft and Amazon have also become key players in this environment. Whether today’s valuations will hold up in the future may be up for debate, but the underlying impact on industry is undeniable.
There are limits to this technology, however. A significant amount of the easily available training data is already in use, and scaling indefinitely, even if it continues to be viable, is unsustainable in terms of computing and electrical resources. This suggests a limit to how far transformers can be improved. Still, there remains progress to be made. Developments in hardware design are producing more efficient GPUs optimised for training and running deep learning systems, driving further advances in AI. Additionally, the AI sector is increasingly turning to synthetic data (artificially generated material to train on) to offset the dwindling supply of genuine data. For more reliable output, the use of multimodal AI (which draws on simultaneous data sources, such as both audio and video) has also been a recent trend. This allows output to be derived from more complete information, with the hope of increased accuracy.
Potential alternatives to the transformer are also being explored, like the Hyena architecture, which uses a recurrence of long convolutions and gating, rather than attention, to aim for higher efficiency. This idea and others like it are still largely experimental, but such discoveries suggest that development in this space may not yet be over.
From chatbots to scientific instruments and image generation, transformers have changed the way the world uses artificial intelligence.
The transformer has been making waves in the field of deep learning since its introduction. By resolving its predecessors’ flaws, and by suiting modern hardware so well, this technology arrived at the perfect time to reimagine the digital world. From chatbots to scientific instruments and image generation, transformers have changed the way the world uses artificial intelligence. This has not been without consequences: the development of the field has caused issues, environmental and otherwise, and the state of the AI industry seems utterly unpredictable. Regardless, making machines pay attention has caused a profound shift in modern computing.
