Source: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Authors: Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov
Published: October 21, 2020
Source: arxiv.org

Summary

This paper introduced the Vision Transformer (ViT), showing that a standard Transformer applied directly to sequences of image patches — with minimal modification — can match or exceed state-of-the-art convolutional neural networks on image classification. The key finding: the inductive biases that make CNNs work well on small datasets (translation equivariance, locality) become unnecessary when enough training data is available. Scale beats architecture.

Key Claims

  • A pure Transformer works for images. The model splits an image into fixed-size 16x16 patches, linearly embeds each patch, adds position embeddings, and feeds the resulting sequence to a standard Transformer encoder. No convolutions required.
  • Large-scale pretraining is essential. On mid-sized datasets like ImageNet alone, ViT underperforms comparable ResNets — it lacks the inductive biases (locality, translation equivariance) that help CNNs generalize from limited data. However, when pretrained on large datasets (14M–300M images), ViT matches or beats the best CNNs.
  • Scale trumps inductive bias. This is the paper’s central insight. CNNs build in assumptions about images (nearby pixels are related, patterns are translation-invariant). ViT makes no such assumptions — it learns spatial relationships entirely from data. Given enough data, learning these relationships is better than hard-coding them.
  • Best results on major benchmarks. ViT achieved 88.55% on ImageNet, 90.72% on ImageNet-ReaL, 94.55% on CIFAR-100, and 77.63% on VTAB (19 tasks) — while requiring substantially fewer computational resources to train than competing CNNs.
  • Position embeddings capture 2D structure. Despite using 1D position embeddings (not 2D-aware), the model learned spatial relationships between patches, with nearby patches having similar position embedding vectors.
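The patch-embedding pipeline described above can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: the weight shapes follow ViT-Base (16x16 patches, embedding dimension 768), and the random matrices stand in for learned parameters.

```python
import numpy as np

def patchify(image, patch_size=16):
    """Split an (H, W, C) image into a sequence of flattened patches."""
    H, W, C = image.shape
    p = patch_size
    patches = image.reshape(H // p, p, W // p, p, C)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, p * p * C)
    return patches  # shape: (num_patches, patch_size**2 * C)

rng = np.random.default_rng(0)
D = 768  # embedding dimension (ViT-Base)

image = rng.standard_normal((224, 224, 3))
patches = patchify(image)                       # (196, 768): 14x14 patches
W_embed = rng.standard_normal((patches.shape[1], D)) * 0.02  # stands in for the learned projection
tokens = patches @ W_embed                      # linear patch embedding

cls_token = rng.standard_normal((1, D))         # learnable [class] token
seq = np.concatenate([cls_token, tokens])       # (197, 768)
pos_embed = rng.standard_normal((seq.shape[0], D)) * 0.02  # 1D learned position embeddings
seq = seq + pos_embed                           # input to the Transformer encoder
```

The resulting sequence of 197 vectors is what a standard Transformer encoder consumes; classification reads off the final representation of the [class] token.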

Relevance and Implications

ViT demonstrated that the Transformer architecture is not specific to language — it is a general-purpose architecture for processing sequences of any kind. This finding accelerated the convergence toward Transformers as the universal model architecture across modalities (text, images, audio, video, proteins). It also reinforced the scaling paradigm: given enough data and compute, simpler architectures with fewer built-in assumptions outperform hand-crafted designs.