Vision Transformers
Vision Transformers (ViT) apply the Transformer architecture to images by treating an image as a sequence of patches. Introduced by Dosovitskiy et al. (2020), ViT demonstrated that a standard Transformer — with minimal modifications — can match or exceed state-of-the-art convolutional neural networks (CNNs) on image classification. The key finding: the inductive biases that make CNNs work (locality, translation equivariance) become unnecessary when enough training data is available.
How It Works
- Patch embedding. Split the image into fixed-size patches (typically 16x16 pixels). Flatten each patch and project it into the Transformer’s embedding dimension using a linear layer. An image becomes a sequence of patch embeddings, analogous to a sequence of word embeddings in NLP.
- Position embeddings. Add learnable position embeddings to retain spatial information. Despite using 1D embeddings (not 2D-aware), the model learns spatial relationships — nearby patches develop similar position vectors.
- Classification token. Prepend a learnable [CLS] token (borrowed from BERT). After processing through the Transformer, this token’s representation is used for classification.
- Standard Transformer encoder. Process the sequence through a standard Transformer encoder with multi-head self-attention and feed-forward layers. No convolutions anywhere.
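The tokenization steps above — patchify, project, prepend [CLS], add positions — can be sketched in NumPy. The dimensions here (a 32x32 RGB image, 16x16 patches, embedding size 64) are toy values chosen for illustration, not the paper's settings; a real ViT-Base uses 224x224 inputs (196 patches) and embedding size 768.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumptions for illustration only).
H = W = 32          # image height/width
P = 16              # patch size
C = 3               # channels
D = 64              # embedding dimension
N = (H // P) * (W // P)   # number of patches: (32/16)^2 = 4

image = rng.standard_normal((H, W, C))

# 1. Patch embedding: split into PxP patches, flatten, project linearly.
patches = image.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(N, P * P * C)          # (4, 768): one row per patch
W_proj = rng.standard_normal((P * P * C, D)) * 0.02  # stands in for a learned linear layer
tokens = patches @ W_proj                        # (4, 64): patch embeddings

# 2. Prepend the learnable [CLS] token.
cls = rng.standard_normal((1, D)) * 0.02
tokens = np.concatenate([cls, tokens], axis=0)   # (5, 64)

# 3. Add learnable 1D position embeddings (one per token, [CLS] included).
pos = rng.standard_normal((N + 1, D)) * 0.02
tokens = tokens + pos

print(tokens.shape)  # sequence ready for a standard Transformer encoder
```

In training, `W_proj`, `cls`, and `pos` would be learned parameters; here they are random placeholders so the shapes can be checked end to end.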
Scale Trumps Inductive Bias
This is ViT’s most important contribution. CNNs build in strong assumptions about images:
- Locality — nearby pixels are related (convolution kernels are small and local)
- Translation equivariance — patterns are recognized regardless of position (shared weights across spatial positions)
These biases help CNNs learn effectively from limited data. ViT makes no such assumptions — it treats every patch equally and learns all spatial relationships from data. On mid-sized datasets (ImageNet alone), this disadvantage is visible: ViT underperforms comparable CNNs. But when pretrained on large datasets (14M–300M images), ViT matches or exceeds the best CNNs while requiring fewer computational resources.
The implication is general: given enough data, learning relationships from scratch is better than hard-coding them. This same principle explains why Transformers dominate NLP despite lacking the sequential inductive bias of RNNs — with enough training data, the model learns sequence structure on its own.
Impact
ViT proved that Transformers are not language-specific — they are a general-purpose architecture for processing sequences of any kind. This accelerated convergence toward Transformers as the universal model architecture. Today, Transformers are used for text, images, audio, video, protein structures, weather prediction, and more. The architectural unification across modalities is what makes multimodal AI (models that process text and images together) practical.