Source: Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Authors: Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, Denny Zhou
Published: January 27, 2022
Source: arxiv.org

Summary

This paper from Google Research introduced Chain-of-Thought Prompting, demonstrating that providing large language models with examples that include intermediate reasoning steps (a “chain of thought”) dramatically improves their performance on tasks requiring multi-step reasoning. The technique is simple — instead of showing input→answer examples, show input→reasoning→answer — but the effect is striking, particularly for arithmetic, commonsense, and symbolic reasoning tasks.
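The input→reasoning→answer format is easiest to see side by side. The sketch below paraphrases the tennis-ball example the paper uses to illustrate the technique; the `build_prompt` helper is a hypothetical convenience, not part of the paper:

```python
# The same question, shown as a standard exemplar and as a
# chain-of-thought exemplar (wording paraphrased from the paper).
QUESTION = (
    "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?"
)

# Standard few-shot exemplar: input -> answer.
standard_exemplar = f"Q: {QUESTION}\nA: The answer is 11.\n\n"

# Chain-of-thought exemplar: input -> reasoning -> answer.
cot_exemplar = (
    f"Q: {QUESTION}\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is "
    "6 tennis balls. 5 + 6 = 11. The answer is 11.\n\n"
)

def build_prompt(exemplars, new_question):
    """Concatenate few-shot exemplars and append the unanswered question."""
    return "".join(exemplars) + f"Q: {new_question}\nA:"
```

Given CoT exemplars, the model tends to continue the pattern and emit its own reasoning chain before the final answer.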

Key Claims

  • Chain-of-thought prompting unlocks reasoning. By including step-by-step reasoning in few-shot examples, LLMs produce their own reasoning chains and arrive at correct answers far more often. PaLM 540B with CoT prompting achieved state-of-the-art on GSM8K math word problems (57%), surpassing even fine-tuned GPT-3 with a verifier.
  • The effect is emergent at scale. CoT prompting only helps models above a certain size threshold. Small models produce fluent but illogical chains of thought that don’t improve accuracy. This is an in-context learning capability that emerges with scale.
  • No training required. CoT is a prompting technique — it requires no fine-tuning, no gradient updates, no additional training data. A single model checkpoint can be applied to many different reasoning tasks simply by changing the prompt examples.
  • The reasoning chain provides interpretability. Unlike standard prompting where the model produces an answer without explanation, CoT outputs reveal the model’s reasoning process, making it possible to diagnose where and how errors occur.
  • Broad applicability across reasoning types. CoT improved performance on arithmetic (GSM8K, SVAMP, MultiArith), commonsense reasoning (CSQA, StrategyQA), and symbolic reasoning (last letter concatenation, coin flip tracking).
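The symbolic tasks have programmatic ground truth, which is what makes them useful probes. As a minimal sketch (function names are illustrative, not from the paper), last-letter concatenation can be scored exactly:

```python
def last_letter_concat(words):
    """Ground truth for the last-letter concatenation task:
    join the final letter of each word."""
    return "".join(w[-1] for w in words)

def score(prediction, words):
    """Exact-match scoring against the programmatic answer."""
    return prediction == last_letter_concat(words)
```

Because the answer is computable, accuracy on such tasks isolates whether the model's generated chain actually tracks state, rather than pattern-matching surface forms.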

Relevance and Implications

Chain-of-thought prompting established that LLMs can perform multi-step reasoning — they just need to be shown how to “think out loud.” This insight has had enormous downstream impact. It directly motivated zero-shot CoT (“Let’s think step by step”), tree-of-thought prompting, and ultimately the development of reasoning-focused models like DeepSeek-R1 that are trained to produce extended reasoning traces. CoT is now standard practice for any task requiring complex reasoning from an LLM, and the underlying principle — that generating intermediate steps improves final answers — is fundamental to how AI Agents decompose complex tasks.
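The zero-shot CoT variant mentioned above needs no worked exemplars at all, only a trigger phrase; a common follow-up step is to pull the final answer out of the generated chain. This is a sketch under those assumptions (the regex targets the "The answer is N" convention used in CoT exemplars):

```python
import re

def zero_shot_cot_prompt(question: str) -> str:
    # Trigger phrase from the zero-shot CoT follow-up work;
    # no reasoning exemplars are included.
    return f"Q: {question}\nA: Let's think step by step."

def extract_answer(completion: str):
    """Pull the last 'The answer is N' number from a reasoning chain."""
    matches = re.findall(r"The answer is (-?\d+(?:\.\d+)?)", completion)
    return matches[-1] if matches else None
```

Taking the *last* match matters: a chain may mention intermediate numbers before stating its final answer.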
