Source: Attention Is All You Need

Authors: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin
Published: June 11, 2017
Source: arxiv.org

Summary

The paper that introduced the Transformer architecture, the model design behind virtually all modern large language models and increasingly dominant in computer vision. The authors proposed replacing recurrent and convolutional networks entirely with a mechanism called self-attention, which connects every position in a sequence to every other position in a constant number of sequential operations. The resulting architecture is far more parallelizable than RNNs and achieved state-of-the-art machine translation results while training in a fraction of the time.

Key Claims

  • Self-attention replaces recurrence. Prior sequence models (LSTMs, GRUs) processed tokens sequentially — each hidden state depended on the previous one. This made parallelization impossible within a training example. The Transformer computes all positions simultaneously via self-attention, dramatically reducing training time.
  • Multi-head attention captures different relationship types. Rather than computing a single attention function, the model runs multiple attention heads in parallel, each learning to focus on different aspects of the input (e.g., syntactic structure vs. semantic similarity). The outputs are concatenated and projected.
  • Positional encoding is required. Because self-attention has no inherent notion of sequence order (unlike RNNs), the model adds sinusoidal positional encodings to the input embeddings so it can distinguish token positions.
  • The encoder-decoder structure is preserved. The Transformer maintains the encoder-decoder pattern from prior sequence-to-sequence models, but both components are built entirely from stacked self-attention and feed-forward layers.
  • Training is dramatically faster. The Transformer achieved 28.4 BLEU on WMT 2014 English-to-German (exceeding all prior results including ensembles) after training for 3.5 days on 8 GPUs — a fraction of the compute used by competing approaches.
  • The architecture generalizes beyond translation. Applied to English constituency parsing with both large and limited training data, the Transformer performed competitively, suggesting broad applicability.
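The sinusoidal positional encoding mentioned above can be written directly from the paper's formulas, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A minimal NumPy sketch (the function name and `max_len` parameter are illustrative, not from the paper):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Build the (max_len, d_model) matrix of sinusoidal position encodings:
    even dimensions use sine, odd dimensions use cosine."""
    positions = np.arange(max_len)[:, None]                      # (max_len, 1)
    div_terms = 10000.0 ** (np.arange(0, d_model, 2) / d_model)  # (d_model/2,)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(positions / div_terms)  # even dims: sin
    pe[:, 1::2] = np.cos(positions / div_terms)  # odd dims: cos
    return pe
```

In the model, this matrix is simply added to the input embeddings before the first layer; the geometric progression of wavelengths lets the network attend by relative position.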

Architecture Details

The Transformer consists of:

  1. Encoder — 6 identical layers, each containing multi-head self-attention and a position-wise feed-forward network, with residual connections and layer normalization
  2. Decoder — 6 identical layers with an additional cross-attention sub-layer attending to the encoder output, plus masking to prevent attending to future positions
  3. Scaled dot-product attention — $\text{Attention}(Q,K,V) = \text{softmax}(QK^T / \sqrt{d_k})V$, where the scaling factor prevents the dot products from growing too large
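The attention formula above is straightforward to sketch in NumPy. The mask argument below covers the decoder's causal masking from point 2; the function name and `-1e9` fill value are implementation conventions, not from the paper:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    mask: optional boolean array over (query, key) pairs;
    True where attention is allowed."""
    d_k = Q.shape[-1]
    # Scaling by sqrt(d_k) keeps the dot products from growing with
    # dimension, which would push softmax into near-zero-gradient regions.
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # blocked pairs get ~zero weight
    # Numerically stable softmax over the key axis.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Decoder self-attention uses a lower-triangular mask so position i
# cannot attend to future positions j > i:
#   causal = np.tril(np.ones((n, n), dtype=bool))
```

Multi-head attention runs this function h times on learned linear projections of Q, K, and V, then concatenates and projects the results.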

Relevance and Implications

This paper is the foundation of modern AI. The Transformer architecture, originally designed for machine translation, became the basis for GPT-3 and all subsequent large language models, vision transformers, diffusion model backbones, protein structure prediction, and more. The key insight — that attention alone is sufficient for sequence modeling — unlocked the scaling properties that define the current era of AI. Nearly every concept in this wiki traces back to this architecture.
