The title was wrong.
That’s what made it historic.
In 2014, Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio published a paper that introduced attention mechanisms to neural machine translation. It was brilliant. It worked. The field adopted it almost immediately.
Three years later, a team at Google published “Attention Is All You Need.” The world called it the attention paper. Textbooks treat it as the moment attention was discovered. Interview prep guides summarize it as “they invented a new way for models to attend to sequences.”
Every single one of those summaries misses the point.
Attention already existed. What the Vaswani et al. paper actually did was far more consequential and far less appreciated: it removed sequential computation as the load-bearing assumption of sequence modeling. And in doing so, it produced an architecture that would become the dominant substrate for frontier AI systems in the decade that followed.
This is the story of what actually happened.
I. The problem nobody was naming correctly
By 2017, the standard architecture for sequence modeling was an encoder-decoder built from recurrent neural networks, specifically LSTMs or GRUs. These models were genuinely impressive. They could translate sentences, caption images, summarize documents. The best ones used attention mechanisms on top: after encoding the input sequence, the decoder would “attend” to relevant encoder states when generating each output token.
The field was satisfied. Results kept improving. Attention-augmented RNNs were state of the art.
But there was a constraint so fundamental that most researchers had stopped seeing it. RNNs process sequences step by step. To compute the hidden state at position t, you need the hidden state at position t-1. Which means you need t-2 first. Which means the entire computation is a chain: each position waits on the one before it, so the work within a single sequence cannot be fully parallelized.
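A toy sketch makes the chain concrete. This is a scalar RNN with made-up weights (`w_x`, `w_h`, and the `tanh` update are illustrative assumptions, not any particular published model); the point is only that the loop body at step t reads the value written at step t-1.

```python
import math

def rnn_forward(inputs, w_x=0.5, w_h=0.9, h0=0.0):
    """Toy scalar RNN: h_t = tanh(w_x * x_t + w_h * h_{t-1}).

    The weights are arbitrary illustrative values.
    """
    h = h0
    states = []
    for x in inputs:
        # h on the right-hand side is the previous step's state, so
        # iteration t cannot start until iteration t-1 has finished.
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states

states = rnn_forward([1.0, 0.0, -1.0, 2.0])
print(states)
```

Changing the first input changes every later state, which is exactly why the steps cannot be computed independently.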
In 2017, NVIDIA’s P100 GPU could sustain roughly 9.5 teraflops of single-precision computation, the estimate the paper itself uses in its training-cost calculations. But for training an RNN, the sequential dependency across positions meant that parallelization within a sequence was structurally blocked. Extra hardware could help with batching across examples, but the within-sequence bottleneck remained: the architecture could not fully exploit the parallelism the hardware offered.
This was not a quirk. It was structural. It was baked into the definition of what a recurrent network was.
II. The actual innovation: a removal, not an invention
Here is what the Transformer paper did: it removed recurrence entirely.
Not improved it. Not replaced it with something smarter. Removed it. Gone. No hidden states passing between time steps. No sequential dependency. The encoder processes all input positions simultaneously, in parallel, on every layer. The decoder does the same during training, using masking to prevent positions from looking ahead. At inference, the decoder still generates outputs one token at a time, auto-regressively. The sequential bottleneck was not eliminated everywhere. It was dramatically narrowed: from a hard constraint on training, to the unavoidable minimum of generating new tokens.
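The masking trick is small enough to sketch directly. The helper names below (`causal_mask`, `apply_mask`) are hypothetical, chosen for illustration; the idea is that position i may only see positions j ≤ i, and disallowed scores are set to negative infinity so a later softmax assigns them zero weight.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def apply_mask(scores, mask):
    """Replace disallowed scores with -inf so softmax gives them zero weight."""
    neg_inf = float("-inf")
    return [[s if ok else neg_inf for s, ok in zip(row, mask_row)]
            for row, mask_row in zip(scores, mask)]

# Visualize the mask: 'x' = visible, '.' = hidden future position.
for row in causal_mask(4):
    print("".join("x" if ok else "." for ok in row))
# x...
# xx..
# xxx.
# xxxx
```

Because the mask is just a fixed pattern applied to the whole score matrix, every output position can still be trained in one parallel pass.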
To make this work, they needed attention, but not for the reasons you have been told. They needed attention because without recurrence, there is no natural way for the model to integrate information from across the sequence. Each position needs to be able to look at every other position directly. Attention is the mechanism that enables this. It is the enabling technology. A central goal was parallelism.
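For readers who want the mechanism in front of them, here is a minimal sketch of scaled dot-product attention, softmax(QKᵀ/√d_k)·V, in plain Python. It is a simplification under stated assumptions: a single head, no learned projection matrices (the example call simply passes the same toy matrix as queries, keys, and values), and lists instead of an array library.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    out = []
    # Each query row is handled independently of the others, so on real
    # hardware all rows can be computed at the same time.
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, V))
                    for c in range(len(V[0]))])
    return out

# Toy input: 3 positions, 2 features each (illustrative values).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Y = attention(X, X, X)
print(Y)
```

Note what is absent: nothing in the loop over query rows depends on a previous iteration. Every position looks at every other position directly.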
Look at Table 1 of the original paper, the one comparing layer types. The headline number is not about accuracy. It is about sequential operations required. Self-attention: O(1). Recurrent: O(n). This is the paper’s central architectural argument. The BLEU scores confirm it works. Table 1 explains why.
The tradeoff they accepted was this: self-attention requires O(n²·d) computation per layer, because every position attends to every other position, making the cost quadratic in sequence length. For 2017-era sequences, this was a good deal. For a 512-token sentence, the attention scores form a 512 × 512 matrix per head, which is manageable. And crucially, all that computation can run in parallel.
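A rough back-of-envelope comparison, using the per-layer complexity figures from the paper’s Table 1 (O(n²·d) for self-attention, O(n·d²) for a recurrent layer). These are asymptotic counts with the constants dropped, not measured FLOPs, and d = 512 is the base model’s width.

```python
def self_attention_ops(n, d):
    """Per-layer cost of self-attention, up to constants: n^2 * d."""
    return n * n * d

def recurrent_ops(n, d):
    """Per-layer cost of a recurrent layer, up to constants: n * d^2."""
    return n * d * d

d = 512  # model width of the base Transformer
for n in (64, 512, 4096):
    print(n, self_attention_ops(n, d), recurrent_ops(n, d))
# 64 2097152 16777216
# 512 134217728 134217728
# 4096 8589934592 1073741824
```

The two costs cross at n = d. Below that, self-attention is the cheaper layer as well as the more parallel one; above it, the quadratic term starts to bite.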
Training costs dropped dramatically. The base model trained in 12 hours on 8 P100s, at a fraction of the compute required by prior state-of-the-art systems. Not because the math got easier, but because for the first time, the architecture let the hardware actually work.
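The arithmetic behind that claim is easy to reproduce. The paper estimates training cost by multiplying training time, number of GPUs, and an estimate of each GPU’s sustained single-precision throughput; the sketch below plugs in the figures quoted in this essay (the variable names are my own).

```python
P100_SUSTAINED_FLOPS = 9.5e12  # per-GPU estimate quoted above
gpus = 8
hours = 12

# time (seconds) x number of GPUs x sustained throughput per GPU
train_flops = gpus * hours * 3600 * P100_SUSTAINED_FLOPS
print(f"{train_flops:.2e}")
# 3.28e+18
```

That lands on the roughly 3.3 × 10^18 FLOPs the paper reports for the base model.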
III. The smoking gun: positional encoding
There is a single design decision in the Transformer that reveals more about the true nature of the innovation than any other. It is hidden in Section 3.5, treated almost as an afterthought, and almost nobody talks about it.
When you remove recurrence, you lose something RNNs gave you for free: the model’s awareness that token 1 comes before token 2. An RNN processes the sequence in order. Position is implicit in the computation itself. The Transformer processes positions in parallel, which means positional order is invisible to the raw attention mechanism. Without some way of encoding position, the model is permutation-equivariant: shuffle the input tokens and the outputs are simply shuffled the same way, so the model cannot tell one word order from another. For language, that is a serious problem.
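You can check this order-blindness in a few lines. The sketch below is a stripped-down self-attention (single head, no learned projections, toy values; all of that is my simplification), run once on an input and once on a shuffled copy. The outputs come back shuffled in exactly the same way.

```python
import math

def self_attention(X):
    """Unprojected single-head self-attention over the rows of X (Q = K = V = X)."""
    d = len(X[0])
    out = []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        out.append([sum(e / z * v[c] for e, v in zip(exps, X)) for c in range(d)])
    return out

X = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]  # three toy "tokens"
perm = [2, 0, 1]                          # an arbitrary reordering

Y = self_attention(X)
Y_perm = self_attention([X[i] for i in perm])

# Row j of the shuffled run matches row perm[j] of the original run.
same = all(
    math.isclose(a, b, abs_tol=1e-9)
    for j, i in enumerate(perm)
    for a, b in zip(Y_perm[j], Y[i])
)
print(same)  # True
```

Each token gets the same representation wherever it sits in the sequence. Nothing in the computation knows what "first" means.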
So they had to add position back in. Manually. The paper’s solution was sinusoidal positional encodings, functions of different frequencies added to the input embeddings before any computation begins. But the paper is explicit that this was not a new idea. It states: “There are many choices of positional encodings, learned and fixed,” and cites prior work. What the Transformer did was not invent positional encoding. It made positional encoding a central, explicit architectural problem, one that had previously been hidden inside the sequential structure of RNNs.
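The sinusoidal scheme from Section 3.5 fits in a dozen lines. This sketch follows the paper’s formula, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)); the function name and the assumption that `d_model` is even are mine.

```python
import math

def sinusoidal_encoding(n_positions, d_model):
    """Return an n_positions x d_model table of sinusoidal position encodings.

    Assumes d_model is even. Even columns hold sines, odd columns cosines.
    """
    table = []
    for pos in range(n_positions):
        row = []
        for even_dim in range(0, d_model, 2):
            # even_dim plays the role of 2i in the paper's formula.
            angle = pos / (10000 ** (even_dim / d_model))
            row.append(math.sin(angle))
            row.append(math.cos(angle))
        table.append(row)
    return table

pe = sinusoidal_encoding(n_positions=3, d_model=4)

# The encoding is added elementwise to each token embedding (toy values here).
embedding = [0.2, -0.1, 0.4, 0.0]
with_position = [e + p for e, p in zip(embedding, pe[1])]
print(pe[0])  # [0.0, 1.0, 0.0, 1.0]
```

Position 0 always encodes as alternating zeros and ones; every later position gets a distinct pattern across the frequencies.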
The authors also tried learned positional embeddings and got nearly identical results. That result is more philosophically interesting than most papers published that year. It tells you that the specific positional signal matters less than the fact that you need to supply it explicitly at all. The problem is structural, not solved by any particular encoding scheme.
The positional encoding problem has been iterated on relentlessly ever since: RoPE (Rotary Position Embedding), ALiBi, relative position encodings, 2D and 3D encodings for images and video. Eight years later, how to properly tell a Transformer where things are in a sequence remains an active research question.
The Transformer did not eliminate position-awareness as a problem. It surfaced it: made it visible, explicit, and therefore improvable. Every architecture that followed is still working on the positional problem that RNNs had kept implicit.
IV. The scaling engine nobody understood they were building
Here is the thing about 2017: nobody was working from scaling laws yet. The first widely cited neural scaling-law results (Kaplan et al., 2020) were three years away, and the Chinchilla paper, which would refine the compute-optimal relationship between model parameters, training data, and compute, was five years away. The idea that you could predictably improve model capability simply by adding more compute, and that the right architecture would absorb that compute efficiently, was not yet a framework anyone was working from.
The Transformer became the dominant architecture for frontier AI not because attention is the most expressive possible mechanism. It became dominant because of three properties that only became legible after the fact:
It scales with compute. Because the encoder is fully parallelizable and training the decoder is largely so, throwing more GPUs at training actually helps. With RNNs, more hardware had sharply diminishing returns past a certain point because the sequential bottleneck remained.
It scales with parameters. The attention mechanism has scaled well empirically with model size. Wider layers, more heads, more depth: all translate predictably into capability improvements. This property was not established until years of empirical work had accumulated, but the architecture silently enabled it from day one.
It generalizes across modalities. The Transformer makes almost no assumptions about the structure of its input. Text is a sequence of tokens. Images can be patches: tokens. Audio is frames: tokens. Proteins are amino acids: tokens. The same architecture, with modest modification, has been adapted across all of them.
| Architecture | Within-sequence parallelism | Scales with compute | Cross-modal |
|---|---|---|---|
| LSTM / GRU | No | Weakly | Limited |
| Convolutional S2S | Yes | Yes | Limited |
| Transformer | Yes | Yes | Yes |
GPT, BERT, T5, PaLM, Gemini, Claude, ViT, CLIP, Whisper: the dominant family of frontier AI systems runs on the Transformer architecture or close derivatives. The reason is not that researchers ran out of ideas. Alternatives have been tried seriously: state-space models like Mamba take a deliberately non-Transformer approach and remain a serious ongoing challenge, particularly for long sequences; architectures like AlphaFold2’s Evoformer mix attention with other components; and hybrid architectures are attempting to combine both families rather than simply replace one with the other. In mainstream foundation-model practice, no alternative has yet clearly displaced the Transformer across scale, domains, and ecosystem adoption. Its position is not permanent. What is true is that it has held longer than most 2017 observers would have predicted.
V. The template for every infrastructure breakthrough
The story of the Transformer is not really a story about machine learning. It is a story about the structure of technological breakthroughs at infrastructure layers, and it follows a pattern that appears across computing history.
Every so often, a fundamental architectural constraint prevents a field from absorbing the most abundant resource available. In 2017, that resource was parallel compute. GPUs had been commoditized; cloud infrastructure had made them accessible; the hardware was there. But recurrent architectures could not use it. The bottleneck was not hardware. It was the sequential assumption baked into the software architecture.
The Transformer’s most consequential move was removing a constraint rather than adding a capability. And removing constraints is worth more than adding capabilities, because constraint removal is multiplicative: it does not improve one thing by 10%, it enables an entire category of future work that was previously impossible.
The framework: find the constraint preventing a system from scaling along the most abundant resource. Eliminate it architecturally. Accept the new constraints that come with the trade. They will be the next frontier’s problems to solve. The Transformer’s new constraint is the quadratic cost of attention over long sequences. Flash Attention, sparse attention, linear attention: these are all children of the 2017 trade.
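To see why the quadratic term became the next frontier’s problem, count the memory for one attention score matrix as the context grows. This is a rough sketch: it assumes 2 bytes per score (16-bit floats) and a naive implementation that stores the full n × n matrix for one head of one sequence.

```python
BYTES_PER_SCORE = 2  # assumption: 16-bit floating point

def score_matrix_bytes(n, bytes_per_score=BYTES_PER_SCORE):
    """Memory for one fully materialized n x n attention score matrix."""
    return n * n * bytes_per_score

for n in (512, 4096, 32768):
    print(n, score_matrix_bytes(n) / 2**20, "MiB")
# 512 0.5 MiB
# 4096 32.0 MiB
# 32768 2048.0 MiB
```

A 64× longer sequence costs 4096× the memory, per head, per layer. Avoiding ever storing that full matrix is the kind of saving the follow-up work goes after.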
VI. What the title got right by being wrong
“Attention Is All You Need” is a poor description of what the paper does. It is a great title for what the paper produces.
If you name the paper accurately, “Sequential Computation Elimination Enables Full GPU Utilization in Sequence Transduction,” it gets cited by 600 researchers and forgotten by everyone else. Nobody writes about it in 2026. No one explains it to their non-technical friends at dinner parties.
“Attention Is All You Need” is punchy, slightly provocative, and carries exactly enough mystery to make people want to open it. The title is a feature. The mismatch between title and content is, in a strange way, what made the paper’s actual ideas travel so far.
But there is a cost to that travel. The attention mechanism became the brand, and the brand obscured the mechanism’s purpose. A generation of practitioners learned that “transformers use attention” without learning why attention replaced what it replaced. They can compute QKV attention from memory but cannot answer the more important question: what was wrong with the thing before?
What was wrong was the sequential assumption. And the lesson, the one worth carrying, is that the most consequential architectural decisions are usually not additions. They are subtractions. They are the removal of something that everyone had stopped questioning because it had always been there.
RNNs processed sequences the way they did because sequences are ordered and it seemed natural to process them in order. It took until 2017 for someone to ask: does it have to be this way?
It did not. And that question, that simple, almost reckless willingness to remove the obvious thing, is what produced the architecture that the intelligence industry still builds on.
The real title should have been: Recurrence Is None Of What You Need. But that is a harder paper to love.
2 Vaswani et al. Table 1, the comparison of sequential operations across layer types, is the most important table in the paper and the one most often skipped in summaries.
3 Hoffmann et al. “Training Compute-Optimal Large Language Models” (Chinchilla, 2022), the paper that retroactively explained why the Transformer’s scaling properties mattered so much.
4 The quadratic attention problem spawned an entire research sub-field: Flash Attention (Dao et al., 2022), Longformer, BigBird, Mamba, and others. All trying to pay back the debt of the 2017 O(n²) trade.
5 On the decoder: the Transformer’s decoder remains auto-regressive at inference, generating one token at a time. The sequential bottleneck was removed from training, not from generation. The paper states this explicitly: “the decoder then generates an output sequence (y1, ..., ym) of symbols one element at a time.”
This is Essay No. 002. The topics: intelligence, AI, systems, knowledge, and the questions underneath the questions everyone else is asking. You can reach me directly. I read everything.