Source: Language Models Are Few-Shot Learners

Authors: Tom B. Brown, Benjamin Mann, Nick Ryder
Published: May 27, 2020
Source: arxiv.org

Summary

The GPT-3 paper from OpenAI demonstrated that scaling a language model to 175 billion parameters produces a qualitative shift in capability: the model can perform new tasks from just a few examples provided in the prompt, without any gradient updates or fine-tuning. This property — in-context learning — showed that sufficiently large language models are general-purpose learners, not just text generators.

Key Claims

  • Scale enables few-shot learning. GPT-3 (175B parameters, 10x larger than any prior non-sparse model) achieved competitive performance on many NLP benchmarks using only task descriptions and a few examples in the prompt. No fine-tuning, no gradient updates.
  • Three evaluation modes establish a spectrum. The paper defined zero-shot (task description only), one-shot (one example), and few-shot (several examples) evaluation. Performance improved consistently from zero to few-shot, suggesting the model is genuinely learning from the examples rather than relying solely on pretraining.
  • Strong results across diverse tasks. GPT-3 achieved strong performance on translation, question answering, cloze tasks, and on-the-fly reasoning tasks like unscrambling words and 3-digit arithmetic — tasks requiring capabilities beyond memorization.
  • Some tasks remain hard. Natural language inference and certain reading comprehension tasks showed weaker few-shot performance, suggesting limitations in the model’s reasoning over complex multi-sentence relationships.
  • Generated text is hard to distinguish from human writing. Human evaluators could only identify GPT-3-generated news articles ~52% of the time (near chance), raising concerns about misuse.
  • Training data contamination is a real risk. The paper included an extensive analysis of benchmark contamination in the training corpus and found measurable overlap for some datasets, introducing a methodological concern for all subsequent large-model evaluations.
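The zero-/one-/few-shot distinction above is purely a matter of prompt construction. A minimal sketch, using a translation task in the style of the paper's examples (the helper function and the specific word pairs are illustrative, not the paper's exact prompts):

```python
def build_prompt(task_description, examples, query):
    """Assemble a prompt: task description, optional solved examples, then the query.

    The number of (input, output) pairs in `examples` determines the mode:
    0 -> zero-shot, 1 -> one-shot, several -> few-shot.
    """
    lines = [task_description]
    for src, tgt in examples:          # each example is a solved input/output pair
        lines.append(f"{src} => {tgt}")
    lines.append(f"{query} =>")        # the model completes the text after "=>"
    return "\n".join(lines)

# Illustrative demonstration pairs (assumed, not quoted from the paper)
demos = [("sea otter", "loutre de mer"), ("cheese", "fromage")]

zero_shot = build_prompt("Translate English to French:", [], "peppermint")
one_shot = build_prompt("Translate English to French:", demos[:1], "peppermint")
few_shot = build_prompt("Translate English to French:", demos, "peppermint")

print(few_shot)
```

No weights change between these three modes; the only difference is how many solved examples appear in the context window before the query.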

Relevance and Implications

GPT-3 established that language models are not just pattern completers — they are few-shot learners that can adapt to new tasks at inference time. This insight reshaped the field: instead of training specialized models for each task, researchers began exploring how to prompt a single large model. The paper directly motivated work on Chain-of-Thought Prompting (eliciting reasoning from these models), RLHF (aligning them with human preferences), and Scaling Laws (understanding the relationship between model size, data, and capability). It also established the paradigm that AI Agents would later build on — using LLMs as general-purpose reasoning engines.