Two results landed this spring, and almost everyone read only the first one.

The first is a scandal. An analysis by Pangram Labs found that roughly 21% of the 76,139 peer reviews submitted to ICLR 2026 were fully AI-generated, with more than half showing some level of AI involvement. In a 300-review sample, GPTZero flagged 50 hallucinated citations (Pangram analysis). The conference had already published a policy in late 2025 requiring disclosure of LLM use and threatening desk rejection. The 21% happened anyway. The number is up from 16.9% the year before, so the trend is moving in the wrong direction.

The second result barely registered. In June, a team introduced SocSci-Repro-Bench, a benchmark of 221 replication tasks drawn from 54 published social-science papers across four disciplines. They pointed coding agents at those papers and asked a blunt question: can you reproduce the finding? Anthropic's Claude Code agent reproduced 78% of the papers and 93% of the individual analysis tasks (Alizadeh et al., 2026). For context, the authors note that prior LLM agents rarely cleared 35–40% on comparable work.

These two facts are usually filed under opposite headings — one under "AI is ruining science," the other under "AI agent benchmark, ML." We think they belong on the same page, because they describe the same capability pointed in two directions. And the under-reported direction is the more important one.

The argument of this post is simple: the highest-value thing AI can do for science right now is not generate new claims. It is reproduce existing ones. Generation is where AI is dangerous and hard to trust. Reproduction is where it is trustworthy almost by definition. The field has been optimizing the wrong verb.

Why generation corrodes trust

[Illustration: a split image. The left half, labeled GENERATION, shows a sheet of scientific paper with a chart dissolving into scattered particles beside a broken-link symbol. The right half, labeled REPRODUCTION, shows the same paper passing through a structured grid of light and re-forming into a crisp, aligned chart with a green check mark.]

Figure 1. The two uses of AI on science are not symmetric. Generation produces something you have to take on faith; reproduction produces something you can check.

The deeper problem with an AI-written paper or an AI-written review is not that the prose is bad. Often it is fluent. The problem is structural: a generated claim carries no ground truth at the point where it is consumed. When a reviewer reads "the authors fail to cite the relevant work of Smith et al.," there is nothing in the sentence itself that tells them whether Smith et al. exists. When a reader sees a confident result, fluency is doing the work that evidence is supposed to do. A convincing fabrication and a real finding look identical on the page.

That is exactly why the ICLR numbers matter. The 50 hallucinated citations are not a quirk; they are the signature of a process whose output is unmoored from anything verifiable. It is also why generated text scales as a threat. A human reviewer can read carefully, but not at the rate a model can produce. The same asymmetry shows up in the broader literature: more than 10,000 papers were retracted in 2023, a record, as publishers struggled with paper mills and peer-review fraud (Nature, 2023). Cheap, fluent, unverifiable generation pours straight into that gap. We have already seen an AI-written paper pass peer review.

None of this is an argument that models are useless for science. It is an argument about where their output can be trusted. A claim you cannot check is a liability no matter who wrote it. The interesting question is whether AI can be turned from the thing producing unverifiable claims into the thing that checks them.

Reproduction is verifiable by construction

It can, and that is the quiet significance of SocSci-Repro-Bench. Reproduction inverts the asymmetry that makes generation dangerous. The output of a reproduction is not a paragraph you have to believe — it is a number you can diff against the published number. Either the re-run recovers the reported coefficient or it does not. Ground truth is not missing; it is the whole point of the exercise.
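
To make the diff concrete, here is a minimal Python sketch of one way to compare a re-run value against the printed one. It assumes the paper reports the number rounded to a fixed number of decimal places, so the check is "does the re-run round to what was printed." The function name `matches_reported` and the example values are hypothetical and are not taken from SocSci-Repro-Bench.

```python
from decimal import Decimal


def matches_reported(reported: str, reproduced: float) -> bool:
    """Return True if `reproduced` rounds to the value printed in the paper.

    `reported` is passed as a string so that its number of decimal places
    (the precision the authors chose to print) is preserved.
    """
    printed = Decimal(reported)
    # Round the re-run value to the same number of decimal places as the
    # printed value, then compare the two exactly.
    rounded = Decimal(repr(reproduced)).quantize(printed)
    return rounded == printed


print(matches_reported("0.41", 0.4123))  # True
print(matches_reported("0.41", 0.3987))  # False
```

Comparing at the printed precision avoids flagging a mismatch that is only floating-point noise, while still failing when the re-run lands on a different published digit.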

This is why reproduction is the natural home for an AI agent. The task plays to what these systems are actually good at — reading messy code, resolving dependencies, translating between Python, R, and Stata, executing long mechanical pipelines without getting bored — and it does so under a regime where the answer is checkable. The benchmark even surfaced the failure modes you would expect from honest engineering rather than hand-waving: the weaker agent (OpenAI's Codex, on GPT-5.3) lost most of its points to missing dependencies, hard-coded file paths, and environment drift, not to any deficit of scientific understanding.
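
One of those failure modes, hard-coded file paths, is easy to screen for before a re-run even starts. The sketch below is a rough heuristic in Python, not the benchmark's tooling: it scans the text of an analysis script for quoted strings that look like machine-specific absolute paths. The regular expression, the list of path prefixes, and the sample script are all assumptions made for illustration.

```python
import re

# Quoted strings that start with a drive letter, a common home or mount
# directory, or "~/". The prefix list is an illustrative assumption.
ABSOLUTE_PATH = re.compile(
    r"""["']((?:[A-Za-z]:[\\/]|/(?:Users|home|mnt|Volumes)/|~/)[^"']*)["']"""
)


def find_hardcoded_paths(source: str) -> list:
    """Return (line_number, path) pairs for suspicious path literals."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in ABSOLUTE_PATH.finditer(line):
            hits.append((lineno, match.group(1)))
    return hits


# A made-up script mixing R and Stata style lines.
sample = (
    'df <- read.csv("C:/Users/ana/Desktop/data.csv")\n'
    'use "/home/ana/project/panel.dta"\n'
    'clean <- read.csv("data/clean.csv")\n'
)

for lineno, path in find_hardcoded_paths(sample):
    print(f"line {lineno}: {path}")
```

A check like this only flags candidates. Relative paths such as `data/clean.csv` pass through, and a person or agent still has to decide how to remap the flagged ones.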

We have seen the same shape in our own domain. On BixBench-Verified-50, a cleaned benchmark of real bioinformatics tasks, a generalist agentic system scored 90% — not by being a better biologist than the specialists, but by reliably executing and checking real analysis. Verification, it turns out, is where a generalist with the right scaffolding shines. The institutions are beginning to agree: NeurIPS just made the Machine Learning Reproducibility Challenge an official track for 2026, on the grounds that "reproducibility has become a scientific question worthy of its own rigorous study," and it explicitly invites AI-assisted reproducibility work — valuing failures to reproduce as much as confirmations (NeurIPS, 2026).

For a field that has lived with a reproducibility crisis for a decade — Nature's 2016 survey found that more than 70% of researchers had failed to reproduce another scientist's experiment, and more than half had failed to reproduce their own (Baker, 2016) — an automated, scalable way to re-run published analyses is not a footnote. It is one of the most useful things AI could plausibly do.

The honest limits: what a reproduction is not

This is the part it would be irresponsible to skip, because the same benchmarks that justify the optimism also draw its boundary.

[Illustration: three cards. The first, EXECUTE, in green with gears, running code and a check mark, is the solved step. The second, JUDGE, in amber with a magnifying glass over a question-marked document, is only partly resolved. The third, ACCESS, in red with broken folders behind a barrier, is blocked.]

Figure 2. The three steps of verifying a result, color-coded by how well today's agents do them. Executing a reproduction is largely solved (around 78% of papers and 93% of individual tasks). Judging whether a result is sound sits near 21%. And when no code or data were shared, there is nothing to run at all.

First, reproducing a result is not the same as validating it. A re-run confirms that the code produces the number in the paper. It says nothing about whether the model was specified correctly, whether the design supports the causal claim, or whether the data were collected soundly. That judgment is a different and much harder task — and today's agents are bad at it. A separate benchmark, REPRO-Bench, asks agents to assess the reproducibility and soundness of a paper rather than just execute it, and the best off-the-shelf agent scored only 21.4%. Executing a reproduction is largely solved; judging one is not.

Second, the SocSci-Repro-Bench headline comes with a load-bearing caveat the authors are clear about: the benchmark was built only from papers whose materials they had already confirmed were runnable. It measures whether an agent can reproduce work that can be reproduced. In the wild, most papers never ship runnable code or complete data at all — which means the binding constraint is often not the agent but the absence of anything to run. That is the availability wall in Figure 2, and no amount of model capability gets you over it.

Put those two limits together and the picture is precise rather than deflating. The mechanical re-run — the part that is tedious, scalable, and checkable — is where AI is now genuinely strong. The judgment calls — is this sound, does it replicate conceptually, is the absent code a red flag — remain human work. This is the same line we keep drawing between an AI co-scientist and an AI scientist: the agent does the labor, the scientist owns the verdict.

Build the reproducibility layer

If reproduction is where the value is, the thing to build is a standing reproducibility layer: infrastructure that takes a claim — a published finding, a preprint, a number in an investment memo or a regulatory package — and returns checkable evidence about whether it holds.

[Illustration: a left-to-right pipeline. A sheet of paper labeled CLAIM feeds into a glowing machine labeled REPRODUCE, whose interior shows gears, running code and a compare-and-diff symbol. Its output flows to a scientist at a desk labeled DECIDE, reviewing a report and approving it with a check mark.]

Figure 3. The agent does the re-running (locating the data and code, re-executing in a pinned environment, diffing against the reported numbers, and preserving every artifact) and the scientist makes the call. Automating the re-run is what makes verification cheap enough to do at the scale AI now writes.

Concretely, that layer has to do what the benchmarks reward and route around what they expose. It locates the data and code across repositories, supplements, and archives. It re-executes in a pinned environment so a missing dependency is a logged fact, not a silent failure. It diffs its output against the reported numbers and reports match, drift, or contradiction. And critically, it preserves the whole trail — the inputs, the seeds, the intermediate tables, every figure it regenerated — so a human can audit not just the verdict but the path to it. The agent never gets the final say on soundness; it hands the scientist a dossier and gets out of the way.
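
As a rough illustration of the last two steps, the Python sketch below labels each re-run value as a match, drift, or contradiction and bundles the verdicts with an auditable trail. Everything named here is hypothetical: the `classify` and `build_dossier` functions, the tolerance values, the `not_run` label for a step that could not execute, and the sample claim. Recording the interpreter version and platform stands in for a real pinned environment, which this sketch does not create.

```python
import hashlib
import json
import platform
import sys
from typing import Optional


def classify(reported: float, reproduced: Optional[float],
             match_tol: float = 0.005, drift_tol: float = 0.10) -> str:
    """Label one re-run value against the published one.

    The tolerances are illustrative assumptions, not benchmark settings.
    """
    if reproduced is None:
        return "not_run"  # e.g. a missing dependency, recorded in the log
    diff = abs(reproduced - reported)
    if diff <= match_tol:
        return "match"
    sign_flip = reported * reproduced < 0
    relative = diff / abs(reported) if reported != 0 else float("inf")
    if sign_flip or relative > drift_tol:
        return "contradiction"
    return "drift"


def build_dossier(claim: str, checks: list, artifacts: dict,
                  seed: int, log: list) -> dict:
    """Bundle verdicts with what a human needs to audit the re-run."""
    return {
        "claim": claim,
        "environment": {"python": sys.version.split()[0],
                        "platform": platform.platform()},
        "seed": seed,
        "checks": [
            {"label": label, "reported": rep, "reproduced": got,
             "verdict": classify(rep, got)}
            for label, rep, got in checks
        ],
        # Content hashes let an auditor confirm preserved files are unchanged.
        "artifacts": {name: hashlib.sha256(data).hexdigest()
                      for name, data in sorted(artifacts.items())},
        "log": log,
        # The agent reports evidence only; soundness is left to a person.
        "soundness": "pending human review",
    }


dossier = build_dossier(
    claim="hypothetical-paper/table-2",
    checks=[("treatment effect", 0.41, 0.412),
            ("interaction term", 0.41, 0.43),
            ("placebo outcome", 0.41, None)],
    artifacts={"table2.csv": b"term,estimate\ntreatment,0.412\n"},
    seed=20260101,
    log=["placebo outcome: required package not installed"],
)
print(json.dumps(dossier["checks"], indent=2))
```

The design choice worth noting is that the dossier has no field for the agent to declare a result sound. It carries verdicts about numbers and a trail of evidence, and the soundness field stays open until a scientist fills it in.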

This is the same thesis we have argued from the other end. We have said that the model is no longer the bottleneck and that, as frontier science models arrive, the next bottleneck is verification. Reproduction is what that abstract claim looks like when you make it concrete. A reproducibility layer is the workflow layer doing the one job where AI's output can be fully trusted, and a stronger base model only makes that layer better at it.

What scientists, reviewers, and labs can do now

You do not have to wait for anyone to ship a platform to act on this. The reframing is itself useful.

If you review or edit, stop treating a re-run as a luxury. Re-running the analysis behind a paper is now cheap enough to be routine for any work that ships code and data, and the result is a checkable fact rather than an opinion. If you publish, treat runnable materials as the load-bearing part of the submission, not an afterthought — the availability wall is the single biggest thing standing between your result and anyone, human or machine, confirming it. If you lead a lab or a biotech and you are weighing where to point AI, notice that "have it verify what we already think we know" is lower-variance and higher-trust than "have it propose something new," and it compounds: every reproduction you bank is a result you can build on without flinching.

And whatever you automate, keep the line in the right place. Let the agent re-run, diff, and document. Keep the judgment of soundness — and the decision to believe — with a person.

The dominant story this spring was that AI is drowning science in claims nobody can check. It is the same technology, pointed the other way, that can check them. Generation got the hype. Reproduction is the killer app.
