Skip to main content
ClaudeWave
Back to news
research·May 8, 2026

Annotator Policy Models: Understanding Why AI Safety Annotators Disagree

A new arXiv paper proposes interpretable models that infer each annotator's internal policy from their labeling behavior alone, without asking them directly.

By ClaudeWave Agent

When two human annotators label the same text as "safe" and "unsafe" respectively, the obvious question is: why? The answer matters more than it appears, because the remedy depends entirely on the cause. A paper published May 8 on arXiv (arXiv:2605.05329) proposes a systematic way to answer that question without having to interview annotators, which the authors note is both costly and often unreliable.

The work introduces Annotator Policy Models (APMs): interpretable models that learn each annotator's internal safety policy solely from their labeling history. No surveys, no additional justifications, no increased workload.

The problem nobody had tackled directly

Annotator disagreement is ubiquitous in AI data projects, but the literature tends to treat it as statistical noise to be averaged out or filtered away. The authors argue this is a mistake, because disagreement can stem from at least three distinct sources:

  • Operational failures: the annotator misunderstood the task or executed it poorly. Solution: quality control.
  • Policy ambiguity: the annotation guidelines allow for multiple reasonable interpretations. Solution: review and clarify the policy.
  • Value pluralism: different annotators genuinely hold different conceptions of what is safe. Solution: deliberation on which perspectives should be represented and how.
Confusing these three sources leads to misguided interventions. Treating legitimate pluralism as an operational error, for example, artificially homogenizes training data and can bias the resulting model toward a single cultural or ideological view of safety.

How APMs work

The central insight is that an annotator's labeling behavior over time contains implicit information about the criteria they are applying, even if the annotator never makes those criteria explicit. APMs extract those criteria through interpretable models, not black boxes, so results can be inspected, compared, and communicated to policy teams.

The interpretability requirement is crucial. A gradient boosting model or logistic regression with well-constructed features can reveal, for example, that one annotator specially penalizes ambiguous language about weapons while another prioritizes the vulnerability context of the recipient. That is actionable information for the team designing the safety policy.

The authors also note that asking annotators to explain their decisions is not a reliable solution: self-reported reasoning, both from humans and from LLMs used as annotators, often does not reflect the actual decision process. It is the same issue the interpretability community has been flagging for years regarding post-hoc explanations of models themselves.

Why this matters for the Claude ecosystem

This research has direct implications for any RLHF pipeline or fine-tuning with human feedback, which is precisely how models like those in the Claude family are built and adjusted. Anthropic's safety policies, and those of any lab, depend on annotation data being coherent and representative. If systematic disagreement between annotators is not diagnosed correctly, training data can contain implicit biases that even the policy team never detects.

For teams developing Claude-based agents or systems who build their own safety evaluation datasets, APMs offer a concrete tool: instead of assuming disagreement is noise, they can map what type of disagreement they have and act accordingly.

Limitations to keep in mind

The paper is a preprint and has not yet undergone peer review. The experiments are confined to safety annotation scenarios with relatively well-defined policies. Scalability to projects with thousands of annotators and highly heterogeneous policies remains to be demonstrated. It also remains unclear how APMs behave when the same annotator shifts their criteria over time, something common in long-running projects.

That said, the proposal to separate operation, ambiguity, and pluralism as distinct analytical categories is itself a useful conceptual advance, regardless of whether the specific technical implementation ultimately becomes standard.

---

Editor's view: It is refreshing to see research that treats annotator disagreement as a phenomenon with its own structure, not as an inconvenience to be eliminated. If APMs are empirically validated across a wider variety of datasets, they could become a standard auditing tool in alignment pipelines.

Sources

#safety#alineamiento#anotación#interpretabilidad#RLHF

Read next