---
title: "The Model Is No Longer the Bottleneck"
description: "Anthropic showed a general model can rival ChemDraw at NMR. The real story for scientists: the bottleneck has moved from the model to the workflow around it."
updatedAt: "2026-06-07"
tags: ["AI", "Research", "Chemistry", "Science"]
canonical: "https://k-dense.ai/blog/the-model-is-no-longer-the-bottleneck"
---
This week Anthropic published a quiet result that should change how scientists think about AI. In [Making Claude a chemist](https://www.anthropic.com/research/making-claude-a-chemist), they put a general-purpose model with no chemistry fine-tuning up against the dedicated software that chemists have relied on for decades, and it held its own, beating that software outright in several places.

Most people will take that as the headline, but we think it points somewhere more interesting. The real story is not that an AI can do chemistry, it is that the hardest part of the problem has moved. For years the limiting factor in scientific AI was the raw capability of the model itself, and that era is now ending, because what stands between today's models and real scientific work is no longer intelligence but the workflow built around it.

## What the chemistry result actually showed

![Schematic: forward prediction goes from molecular structure to NMR spectrum, inverse elucidation goes from spectrum back to structure, with key accuracy results below](./nmr-schematic.svg)
*Figure 1. Forward prediction goes from structure to spectrum, the task dedicated software was built for. Inverse elucidation runs it backward, from spectrum to structure. A general model is now competitive at the first and capable at the second.*

Anthropic tested three models against ChemDraw and MestReNova on NMR, one of the most common and most time-consuming analytical tasks in synthetic chemistry, and the results are worth stating plainly.

On forward prediction, taking a known structure and predicting where every hydrogen and carbon peak will fall, the best model (Opus 4.7) was the most accurate tool tested on hydrogen, with an average error of ±0.079 ppm, and was effectively tied with MestReNova on carbon. On the shape of the peaks, the splitting patterns and sub-peak spacing that carry structural information, all three models predicted the spacing to within half a hertz roughly 80% of the time, against 26 to 35% for the classical tools.

Then they ran the problem backwards. Given only a molecular formula and a 1D spectrum, could the model propose the structure that produced it? This is the inverse problem, the one existing software largely leaves to the human. The model recovered all eight of the simpler targets on every attempt from spectra and formula alone, and solved most of the harder fused-ring and spirocycle targets when given the starting material as a hint.

The point that matters here is the one Anthropic makes themselves: this is a general model with no chemistry-specific training, and it does from a pasted spectrum what used to require licensed, specialized, single-purpose tools.

The team is also refreshingly honest about the limits. The evaluation was small, with 20 compounds for the forward task and 15 for the inverse, two-dimensional experiments and stereochemistry were out of scope, and solvent coverage was narrow. On the densest inverse problems, without the starting material as a clue, the model would sometimes loop through its reasoning without ever committing to a final answer. These are real caveats, but that last one is worth holding onto, because it is not a knowledge problem but a workflow one, and we will come back to it.

## This is not a chemistry story

It is tempting to file this under chemistry and move on. That would be a mistake, because the same pattern is showing up across the sciences.

We saw it in biology. On [BixBench-Verified-50](/blog/bixbench-verified-50), a cleaned benchmark of real bioinformatics tasks, a generalist system scored 90%, ahead of specialized agents, without being tuned for the benchmark. The chemistry result is the same shape in a different field. A general model, asked to do work that a domain expert assumed required a dedicated tool, turns out to be competitive or better.

When the same surprising result appears in chemistry, in biology, and in domain after domain, it stops being a surprise and becomes a trend, which means the capability is general and it is already here.

So the interesting question is no longer whether the model can do it, because increasingly the answer is yes, but what has to be true for the answer to be trustworthy, complete, and reproducible, and that turns out to be a very different question with a very different answer.

## The gap between an answer and a result

![Schematic comparing a chatbot answer, which stops at a plausible response, with a research result, which runs through data, analysis, verification, and a reproducible answer](./evidence-chain.svg)
*Figure 2. A chatbot stops at a plausible answer. A research result requires the whole chain of evidence behind it, and the model is only one link in that chain.*

A chat response is not a research result. A research result is a chain of evidence, made up of the right data pulled from the right sources, the right method chosen and run, the output checked against what is already known, and a final answer you can defend and reproduce, and the model is only one link in that chain, however crucial.

Look again at the failure Anthropic reported, the model looping on the hardest structures without committing. That is not a gap in chemical knowledge but the absence of a system around the model that forces a decision, tests the candidates against the spectrum, and rules options out until one survives. Give a capable model the ability to act, to run cheminformatics code, to query a structure database, and to cross-check a proposed structure against its own predicted spectrum, and the loop closes, which is to say the bottleneck was never the chemistry but the scaffolding around it.

This is the part the field consistently underweights. We keep asking models to be smarter when what we actually need is for them to be operationalized. The frontier work in AI for science now lives in the seam between a frontier model and the system that lets it do real work:

- Reaching real data, not just what is in the weights. The 250-plus databases, the unstructured supporting information, the instrument files in their native formats.
- Executing real analysis. Writing and running code, not narrating what the code would do.
- Verifying instead of asserting. Checking a candidate answer against evidence, and committing only when it holds up.
- Producing outputs a human can audit. The method, the data, the script, the figure, not just a confident paragraph.

Anthropic's own roadmap points in exactly this direction. The next bottlenecks they name, reading and rendering chemical structures, retrosynthesis and synthetic reasoning, mechanism, and reading the chemical literature as it is actually published, are not requests for a smarter chemist but integration and translation problems, which is to say they are agent problems.

## The bet worth making

![Schematic stack: frontier models form the engine at the base, an agentic workflow layer of databases, code execution, verification, and auditable outputs sits above, producing reproducible science at the top, with a side arrow showing a better model raises the ceiling](./stack.svg)
*Figure 3. The workflow layer is what turns a frontier model into reproducible science. A more capable model raises the ceiling rather than making that layer redundant.*

This is the premise our work is built on. The frontier models are the engine, but the decisive advantage comes from the system that turns that engine into finished science, connecting it to real data, letting it write and run real code, reading the files instruments actually produce, and holding it to outputs that survive inspection. K-Dense Web is model-agnostic by design, built to run on the strongest frontier models available, Claude among them, precisely because the leverage is no longer in any single model but in everything around it.

That is also why the chemistry paper is encouraging rather than threatening, because a more capable base model does not make the workflow layer redundant, it raises the ceiling on what the workflow can deliver. Better NMR reasoning in the model means a chemistry agent that can elucidate a structure, propose a route, flag a likely byproduct, and hand back a documented result from end to end, so the model getting better is the best thing that can happen to the layer above it.

## Building it together

Anthropic closed their post by inviting researchers working on problems where Claude could help, and by expanding their AI for Science program toward chemistry. That instinct, to develop a model and its real-world applications in conversation with the people who use them, is exactly the right one.

The most valuable progress in AI for science over the next few years will not come only from larger models, and it will not come only from clever scaffolding, but from the two being built together, with the people who push the model's reasoning working alongside the people who turn that reasoning into reproducible research.

We see it the same way, and it already shapes how we build. We work closely with Google today, and we would like to open that same kind of collaboration to every lab building frontier models. The progress that matters now lives in the layer between a capable model and a finished, reproducible result, and we are glad to partner with anyone serious about getting that layer right. The easiest way to start a conversation is contact@k-dense.ai.
