When AI Papers Look Good But Aren't

Last summer, Peter Degen's postdoctoral supervisor came to him with an unusual problem: one of his papers was being cited too much. Citations are the currency of academia, but something was off about these. The paper in question, published in 2017, evaluated the accuracy of a specific type of statistical analysis on epidemiological data: a technical, specialized work, not exactly the kind of material that goes viral. According to The Verge, a wave of new articles generated or assisted by AI was citing it, and their authors did not appear to have read or understood the original.

That episode sums up a problem that has been quietly developing for months: scientific papers produced with the help of language models are no longer easily identifiable by their poor writing or awkward structure. Now they sound coherent, they flow well, and they have the formal appearance of legitimate research. The problem is that beneath that polished surface, methodological rigor is nowhere to be found.

The Paper That Sounds Right But Fails in Substance

For years, editors and reviewers could filter out low-quality academic noise in part because it was recognizable: choppy prose, disconnected references, inflated conclusions. LLMs have removed those surface-level signals. An article produced with AI assistance can have a flawless introduction, a methods section that appears coherent, and conclusions that align with the abstract, and still cite studies that don't say what the author claims or apply statistical analyses incorrectly.

This makes peer review qualitatively more difficult. Before, a reviewer could dismiss mediocre work on the first read. Now they have to do the same foundational work they would with a serious paper, because the text gives no early clues to its deficiencies. Submissions to scientific journals have grown notably in the past eighteen months, with some publications reporting increases of 30 to 50 percent, and a significant portion of that growth has not come with an equivalent increase in quality.

The Problem of Ghost Citations

Degen's case points to a specific dysfunction: language models, when assisting in the drafting of a paper, tend to search for references that support general claims. If the model has processed enough literature on a topic, it can retrieve real titles and associate them with plausible statements even if the connection is forced or outright incorrect. The result is citations that exist, of papers that exist, but don't support what the new article says they support.

This isn't hallucination in the classical sense (the model isn't inventing a nonexistent DOI) but something subtler: a distortion of the meaning of the original source. For academic metrics systems, those citations count the same. A researcher's h-index goes up. The cited paper gains visibility. And the chain of knowledge gets loaded with defective links that are difficult to identify without reading in depth.

For Whom This Is an Urgent Problem

The most affected in the short term are editorial teams at scientific journals, especially in disciplines with high paper production like biomedicine, epidemiology, or climate science. Young researchers are affected too: they compete to publish in a market where the volume of superficially correct work has increased and review times are getting longer.

For those building tools on LLMs, including integrations with APIs like Anthropic's, this scenario is a reminder that a model's usefulness isn't measured solely by the fluency of its output. A system that helps draft scientific research without source verification mechanisms isn't helping science: it's adding well-dressed noise.

Proposals circulating in the sector range from detection systems specific to academic papers to changes in editorial processes that require authors to declare AI use and manually verify each citation. No solution is simple or quick.

---

From our perspective, this is one of the most serious and least spectacular side effects of LLM expansion: not the obvious errors, but the errors that pass the filter. Academia took decades to build standards of scientific integrity; the pace at which they're coming under pressure is, at minimum, concerning.
