The Reasoning Trap: Why Chain-of-Thought Prompting Doesn't Always Work
Site Owner
Published on 2026-06-03
Large language models have become remarkably adept at showing their work. Give a model a math problem and it will walk you through each step. Ask it to debug code and it will narrate its thought process like a senior engineer thinking aloud. The AI industry has collectively celebrated this behavior — calling it "reasoning," treating it as a proxy for intelligence, and building billion-dollar products around it.
But a growing body of empirical research suggests we may be celebrating the wrong thing.
What Chain-of-Thought Actually Does
Chain-of-thought (CoT) prompting, introduced in 2022 by Wei et al. at Google, asks models to generate intermediate reasoning steps before producing a final answer. The empirical results were striking: on GSM8K, a benchmark of grade-school math word problems, Wei et al. report that CoT raised PaLM 540B's accuracy from 17.9% to 56.9%. The technique quickly became a foundational best practice.
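To make the mechanics concrete, here is a minimal sketch of how a few-shot prompt differs with and without intermediate steps. Everything in it is an assumption made for illustration: the `Exemplar` class, the `build_prompt` helper, the "Q:/A:" layout, and the bookshelf exemplar were written for this sketch and are not taken from Wei et al. or from any benchmark. No model is called; the code only assembles the two prompt strings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Exemplar:
    """One worked example placed ahead of the real question (hypothetical type)."""
    question: str
    rationale: str  # intermediate steps, shown only in the CoT variant
    answer: str


def build_prompt(exemplars, question, with_rationale):
    """Assemble a few-shot prompt, with or without intermediate steps."""
    blocks = []
    for ex in exemplars:
        if with_rationale:
            shown = f"{ex.rationale} The answer is {ex.answer}."
        else:
            shown = f"The answer is {ex.answer}."
        blocks.append(f"Q: {ex.question}\nA: {shown}")
    # The trailing "A:" leaves the completion to the model.
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


# Illustrative exemplar invented for this sketch, not drawn from a benchmark.
EXEMPLARS = [
    Exemplar(
        question=(
            "A shelf holds 4 books. Two boxes of 3 books each are added. "
            "How many books are on the shelf?"
        ),
        rationale=(
            "The shelf starts with 4 books. "
            "Two boxes of 3 books is 6 books. 4 + 6 = 10."
        ),
        answer="10",
    )
]

QUESTION = "If Mary has 5 apples and gives 3 to John, how many does she have left?"

standard_prompt = build_prompt(EXEMPLARS, QUESTION, with_rationale=False)
cot_prompt = build_prompt(EXEMPLARS, QUESTION, with_rationale=True)

if __name__ == "__main__":
    print(standard_prompt)
    print("-----")
    print(cot_prompt)
```

Both prompts end with the same open "A:" slot. The only difference is whether the exemplar answer shows its steps, which is what nudges the model to write out steps of its own before answering.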
The standard explanation is seductively simple: by articulating their reasoning, models think more carefully, catch their own mistakes, and arrive at better answers. This framing treats CoT as a cognitive aid — the LLM equivalent of "talking through" a problem.
There's just one problem with this narrative: the model is not actually thinking.
What CoT actually does is produce plausible-looking textual patterns that happen to correlate with correct answers. When you read "First, I need to find the common denominator..." you're seeing a learned linguistic template, not an actual arithmetic operation being performed. The tokens "3.14" and "π" and "multiply" are no more computational than the word "breakfast."
This is not a minor distinction.
The Systematic Failures of Verbalized Reasoning
If CoT were genuine reasoning, we would expect its failures to be predictable: a mistake in one step, carried forward to a wrong answer, with the rest of the logic intact. Instead, CoT failures follow a distinctly non-reasoning pattern:
1. Prompt sensitivity. Change a single word in the question and CoT performance can collapse. Genuine reasoning is robust to rewording. CoT is not. A model that "thinks step by step" through "If Mary has 5 apples and gives 3 to John, how many does she have left?" will often stumble if you reframe it as "Mary starts with 5 apples. After giving some to John, she has 2 left. How many did she give away?", even though both questions rest on the same relationship, 5 − 3 = 2, and differ only in which number is unknown.