Transformer Architecture

The Transformer is the neural network architecture behind virtually all modern large language models, vision models, and an expanding range of AI systems. Introduced in 2017 by researchers at Google, it replaced recurrent and convolutional networks with a single mechanism — self-attention — that processes all positions in a sequence simultaneously. This design unlocked massive parallelism during training, enabling the scale-driven capabilities that define the current era of AI.

Why Transformers Matter

Before the Transformer, sequence models (LSTMs, GRUs) processed tokens one at a time. Each hidden state depended on the previous one, making training inherently sequential and slow. The Transformer eliminated this bottleneck: by computing relationships between all positions in parallel, it made training vastly more efficient. This efficiency enabled researchers to train on more data with more parameters, which led to emergent capabilities like in-context learning and multi-step reasoning.

Self-Attention

Self-attention is the core mechanism. Given a sequence of token representations, self-attention allows each token to “look at” every other token and compute a weighted combination based on relevance:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Each token is projected into three vectors — Query (what am I looking for?), Key (what do I contain?), and Value (what information do I provide?). The dot product between a query and all keys determines how much attention each token pays to every other token. The scaling factor $\sqrt{d_k}$ prevents the dot products from growing too large, which would push the softmax into regions with vanishingly small gradients.
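
The formula can be sketched directly in NumPy. This is a minimal single-head version for illustration: the learned projection matrices that produce Q, K, and V from the token embeddings are omitted, and the function name is ours, not a library API.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q, K, V: arrays of shape (seq_len, d_k) -- toy single-head case.
    Returns the output and the attention weights.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (seq_len, seq_len) relevance scores
    # Numerically stable softmax over the key dimension
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights                # weighted combination of Values

# Toy example: 3 tokens, d_k = 4; in self-attention Q, K, V all come from x
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
out, attn = scaled_dot_product_attention(x, x, x)
```

Each row of `attn` sums to 1: every token distributes a fixed budget of attention over the whole sequence.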

The key property: self-attention has no inherent notion of distance. A token can attend equally to its neighbor or to a token 1,000 positions away. This is what makes Transformers effective at capturing long-range dependencies — and what makes them fundamentally different from RNNs and CNNs.

Multi-Head Attention

Rather than computing a single attention function, the Transformer runs multiple attention “heads” in parallel, each with its own learned projections. Different heads learn to focus on different types of relationships — one might capture syntactic structure, another semantic similarity, another positional patterns. The outputs are concatenated and projected back to the model dimension. This gives the model multiple “perspectives” on the same input.
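
The split-attend-concatenate-project pattern can be sketched as follows. Shapes follow the original paper's convention ($d_k = d_{model}/h$ per head); the weight-matrix names are illustrative, not a real API.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, W_q, W_k, W_v, W_o, n_heads):
    """Split d_model across n_heads, attend per head, concatenate, project.

    x: (seq_len, d_model); each weight matrix: (d_model, d_model).
    """
    seq_len, d_model = x.shape
    d_head = d_model // n_heads

    # Project, then reshape to (n_heads, seq_len, d_head)
    def split(h):
        return h.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(x @ W_q), split(x @ W_k), split(x @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (n_heads, seq, seq)
    heads = softmax(scores) @ V                           # per-head outputs
    # Concatenate heads back to (seq_len, d_model), then output projection
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

rng = np.random.default_rng(1)
d_model, n_heads = 8, 2
x = rng.normal(size=(5, d_model))
W_q, W_k, W_v, W_o = [rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4)]
y = multi_head_attention(x, W_q, W_k, W_v, W_o, n_heads=n_heads)
```

Because each head works in a smaller subspace ($d_{head} = d_{model}/h$), the total cost is comparable to one full-width attention, but the model gets several independent "perspectives."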

Architecture Components

The original Transformer is an encoder-decoder architecture for sequence-to-sequence tasks (like translation):

Encoder (6 identical layers, each containing):

  1. Multi-head self-attention — each position attends to all positions
  2. Position-wise feed-forward network — two linear transformations with ReLU
  3. Residual connections + layer normalization around each sub-layer
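
Sub-layers 2 and 3 can be sketched like this: a position-wise feed-forward network wrapped in the paper's post-norm residual pattern, LayerNorm(x + Sublayer(x)). This is a simplified sketch — layer norm's learnable gain and bias are dropped, and the attention sub-layer would be wrapped the same way.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each position's vector to zero mean, unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)    # learnable gain/bias omitted

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise FFN: two linear maps with ReLU in between,
    applied independently to every position."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def residual_sublayer(x, sublayer):
    """Post-norm residual wrapper: LayerNorm(x + Sublayer(x))."""
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(2)
d_model, d_ff = 8, 16                       # paper uses 512 and 2048
x = rng.normal(size=(4, d_model))
W1 = rng.normal(size=(d_model, d_ff)) * 0.1
b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d_model)) * 0.1
b2 = np.zeros(d_model)
y = residual_sublayer(x, lambda h: feed_forward(h, W1, b1, W2, b2))
```

The residual path is what lets gradients flow straight through many stacked layers, a point the scaling discussion below relies on.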

Decoder (6 identical layers, each containing):

  1. Masked multi-head self-attention — each position can only attend to earlier positions (prevents “seeing the future”)
  2. Cross-attention — attends to the encoder’s output
  3. Position-wise feed-forward network
  4. Residual connections + layer normalization

Modern LLMs typically use decoder-only architectures (no encoder, no cross-attention), which simplifies the design while retaining the ability to generate text autoregressively.
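
The masking in decoder self-attention can be illustrated with a causal variant of the attention sketch above: future positions get a score of $-\infty$ before the softmax, so their attention weights become exactly zero. A minimal NumPy sketch, not tied to any particular library:

```python
import numpy as np

def causal_attention(Q, K, V):
    """Masked self-attention: position i attends only to positions <= i."""
    seq_len, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    # The strict upper triangle marks "future" positions; setting those
    # scores to -inf makes their softmax weights exactly zero.
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(3)
x = rng.normal(size=(4, 8))
out, attn = causal_attention(x, x, x)
```

The first token can only attend to itself (its weight row is [1, 0, 0, 0]), and the upper triangle of `attn` is all zeros — this is what lets a decoder-only model generate text one token at a time without "seeing the future."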

Positional Encoding

Because self-attention is permutation-invariant (it doesn’t inherently know token order), the Transformer adds positional information to the input embeddings. The original paper used fixed sinusoidal encodings; modern models typically use learned position embeddings or Rotary Position Embeddings (RoPE), which encode relative positions through rotation matrices.
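
The original sinusoidal scheme is simple to write down: even dimensions get a sine, odd dimensions a cosine, at wavelengths that increase geometrically across the embedding. A sketch (function name ours; `d_model` must be even here):

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Fixed encodings from the original paper:
    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    """
    pos = np.arange(seq_len)[:, None]            # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]         # (1, d_model/2)
    angles = pos / (10000.0 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions
    return pe

pe = sinusoidal_positions(seq_len=50, d_model=16)
# pe is simply added to the token embeddings before the first layer.
```

Each position gets a unique pattern, and positions at a fixed offset are related by a fixed linear transformation, which is what lets the model reason about relative distances.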

Why Scale Works

The Transformer’s design has two properties that make it uniquely amenable to scaling:

  1. Parallelism. All tokens in a sequence are processed simultaneously, making efficient use of GPU hardware. This is in stark contrast to RNNs, where tokens must be processed sequentially.
  2. Expressiveness grows with depth and width. Adding more layers and wider hidden dimensions increases the model’s capacity to represent complex relationships, and the residual connections ensure gradients can flow through deep networks.

These properties explain why Scaling Laws work: increasing a Transformer's parameters and training data predictably improves performance, enabling the progression from GPT to GPT-3 to modern frontier models.

Beyond Language

The Transformer’s generality extends far beyond text:

  • Vision. Vision Transformers apply the same architecture to image patches, achieving state-of-the-art image classification by treating an image as a sequence of 16×16 pixel patches.
  • Diffusion models. Diffusion Models use Transformer-based backbones (DiT) for image generation.
  • Multimodal. Models like GPT-4V and Gemini process text, images, and audio using Transformer-based architectures.
  • Science. Protein structure prediction (AlphaFold), weather forecasting, and molecular design all use Transformer variants.
