Benchmarks for biology agents are messy for a simple reason: biology work is messy. Questions can be underspecified, grading can depend on method choices, and sometimes the benchmark answer key is just wrong.

That's why Phylo's recent writeup on biology-agent evaluation caught our attention. Their main point is hard to argue with: if you want to know how good an agent actually is, you first have to clean up the benchmark. BixBench-Verified-50 is their attempt to do exactly that.

We ran K-Dense Web on that verified subset and scored 45/50, or 90.0% accuracy.

Why the verified subset matters

The original BixBench benchmark is useful, but it mixes together a few different kinds of failures:

real agent mistakes
ambiguous or underspecified questions
incorrect or inconsistent ground truth

That distinction matters. If an agent picks a defensible method and gets marked wrong because the benchmark expected a different unstated choice, that does not tell you much about the agent. It tells you the eval needs work.

BixBench-Verified-50 is more interesting because it tries to remove that noise. According to Phylo's description, the subset was reviewed with domain experts, with problematic questions removed, wording clarified, and incorrect answers fixed. That makes the score more meaningful than a raw pass/fail number on the original benchmark.

Our result

K-Dense Web scored 45 out of 50 on the verified set.

Metric	Result
Total questions	50
Correct	45
Accuracy	90.0%
Total runtime	~103 minutes
Average runtime per question	123.6 seconds

Here is the benchmark comparison plot for this run:

BixBench-Verified-50 benchmark comparison

For additional context, Phylo reported the following BixBench-Verified-50 scores for other systems in their public post. We are placing our run alongside them here:

System	Accuracy
K-Dense Web	90.0%
Biomni Lab	88.7%
Edison Analysis	78.0%
Claude Code (Opus 4.6)	65.3%
OpenAI Agents SDK (GPT-5.2)	61.3%

Those numbers should always be read with some caution because prompting, tools, and run conditions matter. Still, the headline is straightforward: K-Dense Web performed very strongly on a version of BixBench that was explicitly designed to be less noisy and more fair.

That result is especially notable because K-Dense Web is a generalist intelligent system. It is not a bioinformatics-only product or a benchmark-specialized agent tuned just for BixBench. The same platform is built to handle research, coding, data analysis, machine learning, and multi-step technical workflows across domains. Even without being optimized specifically for bioinformatics, it performed at the top of a demanding biology benchmark.

Breakdown by verifier type

The verified subset uses three grading modes: LLM-based judging, exact-string matching, and numeric range checks. K-Dense Web performed well across all three.

Verifier type	Correct	Total	Accuracy
LLM verifier	18	20	90.0%
String verifier	15	17	88.2%
Range verifier	12	13	92.3%

That spread is useful. The score is not being carried by one easy slice of the benchmark. We saw strong performance on judged answers, exact-answer tasks, and quantitative outputs.

What the remaining misses tell us

A 90.0% score is strong, but it is not perfect. We still missed five questions:

2 on LLM-judged answers
2 on exact-string answers
1 on a numeric range check

Here are the five misses from the run:

Question ID	Verifier	Expected	K-Dense Web answer	What happened
`bix-43-q2`	String	`5.81`	`5.831005059276599`	Numerically close, but failed exact string matching
`bix-61-q5`	LLM judge	`2.68`	`2.56`	Small quantitative miss on a judged answer
`bix-27-q5`	Range	`(55,56)`	`56.471704739574804`	Slightly outside the accepted range
`bix-16-q1`	String	`CDKN1A`	`CCND1`	Wrong exact answer
`bix-45-q1`	LLM judge	`7.6968e-54`	`1.8037401775486344e-47`	Large numeric mismatch on a judged answer

This table is useful because the misses are not all the same. Two are basically formatting or precision problems, one is a narrow range miss, and two are substantive answer misses. That is a better place to be than having one systematic failure mode tank the entire benchmark, but it also makes clear where the next round of improvements should go.

Why this result matters

The bigger story is not just that K-Dense Web scored well. It is that verified benchmarks tell a clearer story about what these systems can already do.

If you look only at noisy benchmarks, you can come away thinking biology agents are much worse than they are in practice. If you clean up the questions and grading, a different picture emerges: good systems are already capable of completing a large share of real bioinformatics analysis tasks correctly.

That matches what we see with users. The value is not in producing a nice-looking answer and hoping for the best. It is in running the analysis, choosing reasonable methods, and getting to a result that holds up under inspection.

BixBench-Verified-50 is not the final word on evaluation, and Phylo is right to argue that biology needs more process-aware benchmarks over time. But as a checkpoint for current capability, this one is useful. On that checkpoint, K-Dense Web scored 90.0%.

At the same time, benchmarks only go so far. The hardest scientific work usually does not look like a fixed multiple-choice or short-answer eval. Real research involves messy datasets, unclear problem definitions, partial context, changing objectives, and output formats that depend on the downstream decision. No current benchmark really captures that full picture.

Try it yourself

If you want to run your own biology analyses, benchmark agent workflows, or pressure-test a scientific question end to end, K-Dense Web is built for exactly that kind of work. We would encourage anyone evaluating the platform to go beyond benchmark scores and test it on their own hardest research problems. That is the standard that matters most.

Start your own analysis →