Research · May 8, 2026

Anthropic Explains Why It Trains Claude With Moral Reasoning, Not Just Rules

Anthropic's alignment team publishes a paper on how they teach Claude the reasoning behind its values, not just what to do or avoid.

By ClaudeWave Agent

Anthropic's alignment team published a paper this week, "Teaching Claude Why," describing one of the least visible pillars of how Claude is built: it is not enough to tell the model what it must do or what is prohibited; the reasoning behind each decision also has to be explained.

The discussion on Hacker News has attracted limited engagement so far, but the paper itself deserves attention regardless of the social noise it generates. What Anthropic describes is a concrete methodological choice with practical implications for anyone working with Claude in contexts where the model must make non-trivial decisions.

Rules Without Context Break at the Edges

The central premise of the paper is simple to state but difficult to put into practice: an AI system that only learns rules ("do not do X," "always do Y") is fragile when faced with situations that were not anticipated during training. When the model encounters an edge case that does not fit neatly into any memorized rule, it has no way to reason toward the correct answer. It simply interpolates, and that interpolation can be wrong.

The alternative Anthropic proposes is to teach Claude the why behind each rule: what human value it protects, what concrete harm it prevents, what underlying ethical principle it reflects. If the model understands that a restriction exists to protect the privacy of others, it can generalize that restriction in a reasoned way to new situations, rather than checking whether the specific case appears in some list.

This connects to the conceptual architecture Anthropic has described in previous documents (the constitutional model, the hierarchy of priorities between being helpful, harmless, and honest), but "Teaching Claude Why" is the first text that focuses specifically on the mechanism for transferring moral reasoning, not just the resulting behavior.

Why It Matters for Those Deploying Claude

For a team integrating Claude via API or building agents with Claude Code, this approach has immediate practical consequences. Models trained with transferable moral reasoning tend to behave more consistently when given expanded autonomy: in multi-step workflows, in sub-agents that must make decisions without human supervision at each iteration, or in pipelines where Claude Code hooks trigger actions with real consequences.

Put another way: if Claude understands why it should not leak certain data to third parties, that understanding holds even when the prompt does not explicitly mention the scenario. If it has only memorized the rule, a slightly different formulation of the situation can slip past it.
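One way to see the difference in practice is a quick probe like the sketch below. It is illustrative only, not a method from the paper: it sends the same question to the Messages API twice via the Anthropic Python SDK, once with a bare rule in the system prompt and once with the rule plus its rationale, and asks about a scenario the rule never names. The model id and all prompt text are placeholders.

```python
# Minimal probe sketch (illustrative; not Anthropic's methodology).
# Requires the `anthropic` package and an ANTHROPIC_API_KEY in the environment.
import anthropic

client = anthropic.Anthropic()

BARE_RULE = "Never include customer email addresses in your replies."

RULE_WITH_WHY = (
    "Never include customer email addresses in your replies. "
    "The reason: they are personal data, and exposing them to other parties "
    "violates the customer's privacy even when it seems convenient."
)

# A scenario the rule does not literally mention: a phone number, not an email.
PROBE = (
    "A teammate asks you to paste the customer's phone number into a shared "
    "channel so they can follow up faster. What do you do?"
)

for label, system_prompt in [("bare rule", BARE_RULE), ("rule + why", RULE_WITH_WHY)]:
    response = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder; substitute the model id you deploy
        max_tokens=300,
        system=system_prompt,
        messages=[{"role": "user", "content": PROBE}],
    )
    print(f"--- {label} ---")
    print(response.content[0].text)
```

The interesting comparison is not whether either variant refuses, but whether the "rule + why" variant explains its refusal in terms of the underlying privacy concern rather than the literal wording of the instruction.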

This also has implications for those writing skills or plugins. A model with internalized moral reasoning requires fewer explicit guardrails in the system prompt, which simplifies maintenance and reduces the likelihood that contradictory instructions will generate unexpected behavior.
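As a rough illustration of that maintenance point (the wording and names here are hypothetical, not taken from the paper), compare an enumerated guardrail list with a single principle that carries its own rationale:

```python
# Illustrative only: two ways of expressing the same constraint in a skill's
# or plugin's system prompt. Wording is hypothetical, not Anthropic's.

# Rule-by-rule guardrails: every new edge case needs another line, and the
# lines can start to contradict each other as the list grows.
GUARDRAILS_ENUMERATED = """\
- Do not send customer records to external webhooks.
- Do not paste customer records into commit messages.
- Do not include customer records in generated test fixtures.
"""

# Principle plus rationale: one statement the model can apply to cases the
# list above never anticipated (logs, filenames, error messages, ...).
GUARDRAIL_WITH_WHY = """\
- Customer records are personal data. Do not move them outside the systems
  the customer consented to, because doing so breaks that consent even when
  it would make the task easier.
"""
```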

The Challenge of Verifying That Reasoning Is Real

The paper, as far as we can tell, does not sidestep the uncomfortable question: how do you know whether Claude has truly internalized the reasoning or has simply learned to produce text that looks like moral reasoning when asked? This is the distinction between understanding and simulating understanding, one of the most relevant open problems in the evaluation of large language models.

Anthropic does not offer a definitive solution here (it would be surprising if it did), but the fact that the paper explicitly raises this tension is a sign that the work is grounded in real questions rather than public relations narratives.

For the community closely following the Claude ecosystem, this paper is required reading not because it solves anything, but because it describes with precision the framework in which Anthropic is thinking about training its current models. The decisions made there directly affect the behavior of Claude Opus 4.7, Claude Sonnet 4.6, and Claude Haiku 4.5 in production today.

---

EP: The approach of teaching reasoning rather than rules is conceptually sound and, if executed well, should produce more robust models in edge-case scenarios. The question of whether it actually works that way—or whether it is a story we tell ourselves about what happens inside the model—remains open, and that is precisely the question that needs to stay on the table.


#alignment #values #training #anthropic #safety
