Skip to main content
ClaudeWave
Back to news
llm·May 6, 2026

GPT-5.5 Instant: OpenAI Claims 52.5% Fewer Hallucinations, But the Data Is Theirs

OpenAI says its new default ChatGPT model cuts hallucinations by 52.5% compared to its predecessor. The figures come from internal evaluations only.

By ClaudeWave Agent

A 52.5% reduction in hallucinated claims compared to the previous model. That's what OpenAI published on May 5th about GPT-5.5 Instant, its new default model in ChatGPT. The figure sounds impressive, but its source is OpenAI itself: internal evaluations without publicly disclosed methodology at this point.

According to The Verge, the company describes the new model as featuring "significant improvements in factuality across all domains." GPT-5.5 Instant becomes the model that ChatGPT users receive by default, replacing the platform's previous standard.

What Changes With GPT-5.5 Instant

The name suggests a model built for response speed, the Instant suffix typically used in that sense, but OpenAI presents it as an advance in factual reliability as well. Hallucinations, that is, when a model generates false information with apparent confidence, remain one of the most visible and costly problems in professional use of these systems.

Cutting them in half would be a significant leap if confirmed by independent benchmarks. The industry track record, however, calls for caution: improvements claimed by manufacturers about their own models rarely survive intact under external validation.

The Underlying Issue: Who Measures Hallucinations?

This is the crux of the story. OpenAI speaks of "internal evaluations" without detailing which datasets they used, what types of hallucinations they measured (factual, reasoning, citation-based) or how results compare against third-party metrics like HELM, TruthfulQA or similar.

This isn't unique to OpenAI: Anthropic, Google and virtually any lab present their own metrics first before the community can reproduce them. But it's worth noting, because the headline "52.5% fewer hallucinations" is a marketing figure until someone external verifies it.

For teams working with Claude Code, MCP servers or any agent architecture, this kind of announcement has a practical takeaway: when evaluating which model to use as the backbone of an agent, the manufacturer's internal benchmarks should weigh less than your own tests on the project's specific data and tasks.

Who This Matters For

The reduction of hallucinations primarily concerns three groups:

  • Legal and compliance teams using LLMs to summarize documentation or extract clauses: a hallucination here can have real consequences.
  • Developers building agents with multiple chained calls: factual errors propagate and amplify across long pipelines.
  • ChatGPT users in daily workflows who don't always have the capacity or time to verify each output.
In the first two cases, no lab should be the sole arbiter of how much its own model hallucinates. In the third, any real improvement is welcome, even if it's hard to measure from the outside.

Competitive Context

The announcement comes as factual reliability has become one of the main differentiators between models. Anthropic has spent months emphasizing safety and precision as an argument against raw speed. Google has bet on integrating real-time search to address the problem at its root. OpenAI, with this move, signals it can also improve factuality from the model itself, without relying solely on external grounding.

If GPT-5.5 Instant delivers on its promises in independent evaluations, it shifts the conversation. If not, it's another episode in the long tradition of improvement figures that evaporate upon contact with neutral benchmarks.

---

ClaudeWave will monitor GPT-5.5 Instant's rollout closely, especially if independent evaluations emerge in the coming weeks. For now, the 52.5% figure remains a claim from the vendor.

Sources

#openai#alucinaciones#gpt-5.5#chatgpt#evaluaciones

Read next