In-Context Learning
In-context learning (ICL) is the ability of large language models to perform new tasks based on examples provided in the prompt, without any gradient updates or fine-tuning. The model’s weights do not change — it adapts its behavior purely from the input context. Demonstrated at scale by GPT-3, ICL was the first clear evidence that sufficiently large language models are general-purpose learners, not just text generators.
How It Works
In standard machine learning, adapting a model to a new task requires training: collecting labeled data, computing gradients, and updating model weights. In-context learning skips all of this. Instead, the user provides a few examples of the desired input-output behavior directly in the prompt:
Translate English to French:
sea otter => loutre de mer
cheese => fromage
hello =>
The model produces “bonjour” — not because it was fine-tuned on translation examples, but because the pattern in the prompt is sufficient for it to infer the task and execute it. This works across tasks the model was never explicitly trained on, as long as the underlying capability exists in its pretrained weights.
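Assembling such a prompt is purely string construction. The helper below is an illustrative sketch (the function name and `sep` format are assumptions, mirroring the `=>` style of the example above), not part of any particular API:

```python
def build_few_shot_prompt(task, examples, query, sep=" => "):
    """Assemble a few-shot prompt: task description, demonstration
    pairs, then the query left open for the model to complete."""
    lines = [task]
    lines += [f"{source}{sep}{target}" for source, target in examples]
    # The query ends with the separator (trailing space stripped),
    # cueing the model to produce the completion.
    lines.append(f"{query}{sep}".rstrip())
    return "\n".join(lines)
```

Calling it with the translation pairs above reproduces the prompt verbatim; the returned string is sent to the model as-is, with no weight updates involved.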
Three Evaluation Modes
The GPT-3 paper formalized three modes:
- Zero-shot — only a task description, no examples. Tests whether the model can infer the task from the instruction alone.
- One-shot — one input-output example plus the new input. Provides a concrete demonstration of the format and expectation.
- Few-shot — several examples, as many as fit in the context window (the GPT-3 paper typically used 10 to 100). More examples generally improve performance, up to context length limits.
Performance consistently improves from zero to few-shot, suggesting the model genuinely learns from the examples rather than relying solely on pretrained knowledge.
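The three modes differ only in how many demonstrations the prompt contains, so a single prompt builder parameterized by shot count covers all of them. This is a hedged sketch; the `Q:`/`A:` layout is one common convention, not the GPT-3 paper's exact format:

```python
def make_prompt(instruction, demos, query, k):
    """Build a prompt with k demonstrations:
    k=0 is zero-shot, k=1 one-shot, k>1 few-shot."""
    parts = [instruction]
    parts += [f"Q: {q}\nA: {a}" for q, a in demos[:k]]
    parts.append(f"Q: {query}\nA:")  # left open for the model
    return "\n\n".join(parts)
```

Sweeping `k` from 0 upward with a fixed evaluation set is exactly how the zero-to-few-shot performance curves are produced.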
Why It Works
The mechanism behind ICL is still debated, but the leading hypotheses are:
- Implicit Bayesian inference. The model’s pretraining on diverse text implicitly teaches it to identify patterns in context and extrapolate them. The examples in the prompt narrow down the “task distribution” the model should sample from.
- Induction heads. Mechanistic interpretability research has identified specific attention patterns (“induction heads”) that copy patterns from earlier in the context, providing a circuit-level mechanism for ICL.
- Task vectors in activation space. The examples create a “task vector” in the model’s internal representation space that steers generation toward the desired behavior.
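The induction-head hypothesis can be caricatured in a few lines: the circuit attends to the previous occurrence of the current token and copies whatever followed it. The toy rule below is an illustrative sketch of that copy behavior only, not a model of real attention:

```python
def induction_predict(tokens):
    """Toy induction-head rule: find the most recent earlier
    occurrence of the last token and predict its successor."""
    last = tokens[-1]
    # Scan backwards from the second-to-last position.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == last:
            return tokens[i + 1]
    return None  # no earlier occurrence to copy from
```

On a sequence like `A B ... A`, the rule predicts `B`, which is the pattern-completion behavior that makes few-shot prompts work at the circuit level.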
Scale Is Required
ICL is an emergent capability — it appears only in models above a certain size threshold. Small models shown the same few-shot examples perform no better (and sometimes worse) than zero-shot. The GPT-3 paper showed this scaling relationship clearly: as model size increased from 125M to 175B parameters, few-shot performance improved dramatically across nearly all tasks.
This has practical implications: in-context learning is not something you can get from a small model with clever prompting. It requires the representational capacity that comes with scale.
Limitations
- Context window constraints. The number of examples is bounded by the model’s context length. Complex examples with long inputs quickly consume the available space.
- Sensitivity to formatting. ICL performance can be brittle — small changes to the prompt format, example order, or label distribution can significantly affect results.
- Not real learning. ICL adapts behavior but doesn’t create new knowledge. The model can only perform tasks within the scope of its pretrained capabilities. Tasks that require knowledge not in the training data (like post-cutoff APIs) need external context, not just examples — see Agent Context Strategies.
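The context-window constraint is usually handled by budgeting: keep only as many demonstrations as fit alongside the query. A minimal sketch, assuming a whitespace token count as a stand-in for a real tokenizer (the function and its greedy most-recent-first policy are illustrative choices, not a standard algorithm):

```python
def pack_examples(examples, query, budget,
                  count_tokens=lambda s: len(s.split())):
    """Greedily keep the most recent examples that fit within
    a token budget, always reserving room for the query."""
    used = count_tokens(query)
    kept = []
    for ex in reversed(examples):  # prefer the most recent examples
        cost = count_tokens(ex)
        if used + cost > budget:
            break
        kept.append(ex)
        used += cost
    return list(reversed(kept))  # restore original order
```

Real systems swap in the model's actual tokenizer for `count_tokens` and may also reorder or deduplicate examples, since example order itself affects ICL performance.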
Relationship to Other Techniques
ICL is the foundation that other prompting techniques build on:
- Chain-of-Thought Prompting extends ICL by including reasoning steps in the examples, unlocking multi-step reasoning.
- Passive context (AGENTS.md, CLAUDE.md) provides domain knowledge rather than task examples, addressing ICL’s inability to supply new information.
- Fine-tuning approaches like LoRA address the cases where ICL isn’t sufficient — when the model needs to learn genuinely new capabilities or when prompt engineering becomes impractical.