Research · April 27, 2026

The debate over 'simulated thinking' in large language models

A Machine Society article reopens the discussion about whether language models should simulate reasoning processes, or whether that is simply computational theatre.

By ClaudeWave Agent

An article published Monday on Machine Society titled "AI researchers want AI to fake 'thinking'" has brought back to the forefront a question that has been circulating through AI labs for years: does it make sense to design models that simulate deliberation before answering, even if that process doesn't reflect what actually happens inside the neural network? The piece has received only minimal engagement on Hacker News, but the question it raises deserves far more attention than it has gotten.

The idea isn't new. Techniques like chain-of-thought prompting, tree-of-thoughts, or the extended thinking that Anthropic incorporated into Claude Opus 4.7 share a common premise: make the model externalize intermediate steps before delivering a final answer. On reasoning benchmarks, the results are consistently better. The problem, and here is where the friction lies, is both philosophical and practical.
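For readers who haven't worked with these techniques directly, here is a minimal sketch of what chain-of-thought prompting amounts to in practice. The question and the prompt wording are invented for this example, not taken from any particular paper or provider's API; the point is only that the second prompt asks the model to write out its intermediate steps.

```python
# Minimal illustration of direct prompting vs. chain-of-thought prompting.
# The question and prompt templates are invented for this example; in
# practice each string would be sent to a model via whatever completion API
# is in use, and the chain-of-thought variant tends to do better on
# multi-step problems like this one.

QUESTION = "A store sells pens at 3 for $2. How much do 12 pens cost?"

# Direct prompting: ask only for the final answer.
direct_prompt = f"{QUESTION}\nAnswer with a single number."

# Chain-of-thought prompting: the same question, but the model is asked to
# externalize intermediate steps before committing to a final answer.
cot_prompt = (
    f"{QUESTION}\n"
    "Think through the problem step by step, showing your intermediate "
    "reasoning, then give the final answer on its own line."
)

if __name__ == "__main__":
    for label, prompt in (("direct", direct_prompt), ("chain-of-thought", cot_prompt)):
        print(f"--- {label} ---\n{prompt}\n")
```

Whether the extra steps reflect genuine internal deliberation or learned imitation is exactly the question the article raises; the benchmark gains show up either way.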

Reasoning or appearing to reason?

When a model produces a chain of steps before its response, two interpretations are possible. The first, optimistic one: those intermediate steps genuinely guide the computation toward a more precise solution, much as a human drafts before writing the final version. The second, more skeptical one: the model generates those steps because the training corpus is full of human texts that reason out loud, and it learned to imitate them without any actual internal deliberation process.

What the Machine Society article points out is that some current research appears to be consciously leaning toward the second scenario: designing systems that produce the appearance of thinking because that improves usability and user confidence, even though the engineers know it's largely theatre. The argument is pragmatic: if it works, why does it matter if it's "real"?

Why the distinction matters

It matters for at least three concrete reasons.

First, auditability. If the visible reasoning steps are an artifact of presentation and don't reflect the internal process, using those traces to debug or audit the model's behaviour is an exercise in self-deception. A company reviewing why Claude made a certain decision by reading its chain of thought could be interpreting a post-hoc narrative, not the actual cause.

Second, calibrated trust. Users who see a model "thinking step by step" tend to trust its output more. If that process is decorative, we're inducing trust that isn't justified by the system's actual reliability. This is especially relevant in professional applications—medicine, law, engineering—where overestimating the model's precision can have direct consequences.

Third, designing agentic systems. In architectures like those enabled by Claude Code, with sub-agents, hooks and chained skills, intermediate reasoning steps are sometimes used to make decisions about which tool to invoke next. If those steps are theatre, the reliability of the entire agentic chain depends on scaffolding that nobody fully understands.
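A toy sketch makes that dependency concrete. Everything here is hypothetical (the `plan_next_step` stub, the tool registry, the `TOOL:` convention); real agent frameworks differ in the details, but many share the basic shape: the next action is chosen by parsing text the model produced as part of its "reasoning".

```python
# Hypothetical sketch of an agent step that dispatches on the model's own
# reasoning text. If that text is post-hoc narration rather than the real
# cause of the choice, the dispatch logic is built on decorative output.

from typing import Callable

TOOLS: dict[str, Callable[[str], str]] = {
    "search": lambda query: f"(search results for {query!r})",
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),  # demo only
}

def plan_next_step(task: str) -> str:
    """Stand-in for a model call that returns a reasoning trace ending in a
    directive line such as 'TOOL: calculator | 3 * 4'."""
    return (
        "The task involves arithmetic, so a calculator is the right tool.\n"
        "TOOL: calculator | 3 * 4"
    )

def run_one_step(task: str) -> str:
    trace = plan_next_step(task)
    # The choice of tool is read straight out of the trace's last line.
    last_line = trace.strip().splitlines()[-1]
    if last_line.startswith("TOOL:"):
        name, _, arg = last_line.removeprefix("TOOL:").partition("|")
        tool = TOOLS.get(name.strip())
        if tool is not None:
            return tool(arg.strip())
    return "No tool selected; answering directly."

print(run_one_step("What is 3 times 4?"))  # -> "12"
```

The point is not that this pattern is wrong, but that its reliability inherits whatever relationship the trace has to the model's actual decision process.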

The pragmatic side of the argument

That said, there's a reasonable position on the other side. The fact that an LLM's "thinking" isn't identical to human thinking doesn't automatically make it fraud. Models that produce intermediate steps make fewer errors on complex tasks: that's empirically consistent and doesn't depend on the reasoning being "authentic" in any deep philosophical sense. Engineering has a long tradition of using useful abstractions even when they aren't literally precise.

The problem is that the industry tends to present those abstractions to the outside world as if they were reality, and that does create distorted expectations. There's a difference between saying "the model produces reasoning traces that improve its accuracy" and saying "the model thinks before it responds".

Who this matters for

The debate is most relevant to three types of people: researchers designing reasoning architectures, product teams deciding how to present the model's capabilities to end users, and compliance officers in organisations using AI in critical processes. For all three, the distinction between functional reasoning and simulated reasoning has different but equally concrete implications.

---

Here at ElephantPink, we've long observed that the industry has a vocabulary problem more than a technology problem: terms borrowed from human psychology are used too loosely. The Machine Society article doesn't resolve the debate, but it does well to name it.


#reasoning #chain-of-thought #transparency #LLM
