When AI Papers Look Good But Aren't
Superficial improvements in AI-generated scientific articles are flooding peer review and distorting citations in fields like epidemiology.
Last summer, Peter Degen's postdoctoral supervisor came to him with an unusual problem: one of his papers was being cited too much. Citations are the currency of academia, but something was off about these. The paper in question, published in 2017, evaluated the accuracy of a specific type of statistical analysis on epidemiological data. It was a technical, specialized work, not exactly the kind of material that goes viral. What was happening, according to The Verge, was that a wave of new articles generated or assisted by AI were citing it, apparently without their authors having read or understood the original.
That episode sums up a problem that has been quietly developing for months: scientific papers produced with the help of language models are no longer easily identifiable by their poor writing or awkward structure. Now they sound coherent, they flow well, and they have the formal appearance of legitimate research. The problem is that beneath that polished surface, methodological rigor is nowhere to be found.
The Paper That Sounds Right But Fails in Substance
For years, editors and reviewers could filter out low-quality academic noise, in part, because it was recognizable: choppy prose, disconnected references, inflated conclusions. LLMs have removed those surface-level signals. An article produced with AI assistance can have a flawless introduction, a methods section that appears coherent, and conclusions that align with the abstract, and yet cite studies that don't say what the author claims, or apply statistical analyses incorrectly.
This makes peer review qualitatively more difficult. Before, a reviewer could dismiss mediocre work on the first read. Now they have to do the same foundational work they would with a serious paper, because the text gives no early clues to its deficiencies. The volume of submissions to scientific journals has grown notably in the past eighteen months, with some publications reporting increases of 30 to 50 percent, and a significant portion of that growth has not come with an equivalent increase in quality.
The Problem of Ghost Citations
Degen's case points to a specific dysfunction: language models, when assisting in the drafting of a paper, tend to search for references that support general claims. If the model has processed enough literature on a topic, it can retrieve real titles and associate them with plausible statements even if the connection is forced or outright incorrect. The result is real citations of real papers that don't support what the new article says they support.
This isn't hallucination in the classical sense (the model isn't inventing a nonexistent DOI) but something subtler: a distortion of the meaning of the original source. For academic metrics systems, those citations count the same. A researcher's h-index goes up. The cited paper gains visibility. And the chain of knowledge gets loaded with defective links that are difficult to identify without reading in depth.
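To make the metrics point concrete, here is a minimal sketch of how an h-index is computed and why it cannot tell a supporting citation from a misplaced one. The function name `h_index` and all citation counts are hypothetical, chosen only for illustration; they are not data from Degen's case.

```python
def h_index(citation_counts):
    """Return the largest h such that h papers have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


# Hypothetical citation counts for one researcher's six papers.
before = [12, 9, 6, 5, 4, 2]

# Hypothetical scenario: three new articles cite the fifth paper
# without it actually supporting their claims. The metric only sees counts.
ghost_citations = [0, 0, 0, 0, 3, 0]
after = [real + ghost for real, ghost in zip(before, ghost_citations)]

print(h_index(before))  # 4
print(h_index(after))   # 5
```

In this illustrative setup, three unsupported citations to a single paper are enough to raise the index by one, because nothing in the calculation inspects what the citing article claims.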
For Whom This Is an Urgent Problem
The most affected in the short term are editorial teams at scientific journals, especially in disciplines with high paper production like biomedicine, epidemiology, or climate science. Young researchers are affected too: they compete to publish in a market where the volume of superficially correct work has increased and review times are getting longer.
For those building tools on LLMs, including integrations with APIs like Anthropic's, this scenario is a reminder that a model's usefulness isn't measured solely by the fluency of its output. A system that helps draft scientific research without source verification mechanisms isn't helping science: it's adding well-dressed noise.
Proposals circulating in the sector range from detection systems specific to academic papers to changes in editorial processes that require authors to declare AI use and manually verify each citation. No solution is simple or quick.
---
From our perspective, this is one of the most serious and least spectacular side effects of LLM expansion: not the obvious errors, but the errors that pass the filter. Academia took decades to build standards of scientific integrity; the pace at which they're coming under pressure is, at minimum, concerning.