
Beyond RDKit: Benchmarking the Rowan Agent Skill Against Experiment

A reproducible study of pKa, logD, tautomers, ADME, and docking run by an AI agent with the Rowan skill, measured against RDKit and experimental ground truth.


What is Rowan? Rowan is a cloud-native molecular-modeling platform that runs physics-based and machine-learning chemistry workflows (pKa, logD, tautomers, solubility, permeability, docking, MD, and more) through a scientist-friendly web app and a Python API. The skill benchmarked here was contributed by Rowan Scientific to K-Dense's open-source Scientific Agent Skills library.

Every computational chemist has RDKit open in a terminal. It is free, fast, local, and genuinely excellent at what it was built for: molecular I/O, substructure search, fingerprints, 2D descriptors, conformer geometries, tautomer enumeration. For a huge fraction of day-to-day cheminformatics, RDKit is all you need.

But RDKit is, by design, a cheminformatics toolkit, not a physics engine. Ask it for the pKa of a molecule and it has no answer. Ask for the logD at pH 7.4 that actually governs absorption, and it can only hand you a pH-blind logP. Ask which tautomer dominates in water, with populations, or for a docked pose, or a permeability estimate, and you are out of scope.

Rowan covers that missing physics layer, and it is now available as an agent skill that Rowan contributed to K-Dense's open-source Scientific Agent Skills library, so an LLM agent (here, Claude) can drive the whole platform from natural language. So we asked the practical question and answered it with numbers:

Moving from the free-tool baseline to the Rowan skill, what do you gain, how accurate is it against experiment, and what does it cost?

Every result below has three reference points: Rowan, the without-skill baseline (RDKit, or a competent functional-group heuristic where RDKit offers nothing), and experiment (literature values). All code, data, and the roughly 60 cloud workflows are reproducible. The full study used about 12 to 14 Rowan credits, or roughly $0.48 to $0.56 at Rowan's published pay-as-you-go rate of $0.04 per purchased credit (Rowan Credits FAQ).

Study 1: pKa, the clean kill

pKa determines how much of your molecule is charged at physiological pH, which cascades into solubility, permeability, and binding. It is also the single most glaring hole in the free toolbox: RDKit ships no pKa predictor. Without a dedicated tool you fall back on functional-group rules of thumb ("carboxylic acid ≈ 4.5, aliphatic amine ≈ 10.5") or pay for commercial software.
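To make the link between pKa and charge concrete, here is a minimal sketch of the Henderson-Hasselbalch relation for a single ionizable group. The function name `charged_fraction` is our own illustration, not part of RDKit or Rowan, and the inputs are the rule-of-thumb pKa values quoted above rather than measured data.

```python
def charged_fraction(pka: float, ph: float, kind: str) -> float:
    """Fraction of a monoprotic group that is charged at a given pH.

    kind="acid": HA <-> A- + H+, the charged form is the deprotonated A-.
    kind="base": BH+ <-> B + H+, the charged form is the protonated BH+.
    Assumes one ionizable site and ideal (activity = concentration) behavior.
    """
    if kind == "acid":
        return 1.0 / (1.0 + 10.0 ** (pka - ph))
    if kind == "base":
        return 1.0 / (1.0 + 10.0 ** (ph - pka))
    raise ValueError("kind must be 'acid' or 'base'")


PH_PHYSIOLOGICAL = 7.4

# Rule-of-thumb pKa values from the text, not experimental numbers.
acid_charged = charged_fraction(4.5, PH_PHYSIOLOGICAL, "acid")
amine_charged = charged_fraction(10.5, PH_PHYSIOLOGICAL, "base")

print(f"carboxylic acid charged at pH 7.4: {acid_charged:.1%}")
print(f"aliphatic amine charged at pH 7.4: {amine_charged:.1%}")
```

Because the pKa sits in an exponent, an error of one pKa unit shifts the neutral-to-charged ratio tenfold, which is why a rule of thumb that is off by several units changes the predicted charge state entirely.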

We took 32 drugs with reliable literature pKa values (carboxylic acids, phenols, N-H acids, anilines, pyridines, imidazoles, and aliphatic amines) and compared Rowan's starling model against a faithful functional-group heuristic (the best you can do by eye).

Predicted vs. experimental pKa for 32 drugs, colored by chemical class

| Method | MAE | R² | Max error |
| --- | --- | --- | --- |
| Rowan starling | 0.23 | 0.986 | 0.63 |
| RDKit functional-group heuristic | 1.15 | 0.41 | 5.91 |

Rowan's mean absolute error of 0.23 pKa units sits at the level of experimental reproducibility. The heuristic, meanwhile, breaks down exactly where chemistry gets interesting:

  • Substituent electronics. 2,4-dinitrophenol has an experimental pKa of 4.09; Rowan predicts 3.46; the heuristic says 10.0, because it cannot "see" that two nitro groups acidify the phenol by six log units.
  • Amine individuality. The heuristic assigns every aliphatic amine 10.5. Rowan resolves nicotine (7.70), lidocaine (7.54), diphenhydramine (8.53), propranolol (9.32), and ephedrine (9.62), each within about 0.4 of experiment.

pKa signed-error distributions and mean absolute error by chemical class

Field note for skill users. Rowan's pKa result exposes both strongest_acid and strongest_base. Match them to the physical ionization: acids use strongest_acid, bases use strongest_base. For propranolol, strongest_acid returns 10.0 (a spurious amide-like N-H deprotonation) while strongest_base gives 9.28, the number you actually want.

Study 2: lipophilicity, the right number versus the wrong one

Here is a subtler trap, and a more important one. RDKit's Crippen MolLogP gives you the neutral logP, a single pH-independent number. But the quantity that controls oral absorption is logD at pH 7.4, and for an ionizable drug it can be 2 to 4 log units lower, because almost all of the molecule is charged at physiological pH.
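The gap between logP and logD can be sketched with the textbook single-site relation, which assumes that only the neutral species partitions into octanol. This is a simplification for illustration, not Rowan's macropKa model; the function name `log_d` and the example molecule (a hypothetical base with logP 3.0 and pKa 9.5) are assumptions.

```python
import math


def log_d(log_p: float, pka: float, ph: float, kind: str) -> float:
    """Textbook logD for one ionizable site.

    Assumes the charged species does not partition into octanol at all,
    so logD = logP - log10(1 + ionized/neutral ratio).
    """
    if kind == "neutral":
        return log_p
    if kind == "acid":
        ionized_ratio = 10.0 ** (ph - pka)  # [A-]/[HA]
    elif kind == "base":
        ionized_ratio = 10.0 ** (pka - ph)  # [BH+]/[B]
    else:
        raise ValueError("kind must be 'acid', 'base', or 'neutral'")
    return log_p - math.log10(1.0 + ionized_ratio)


# Hypothetical base, chosen only to illustrate the size of the correction.
example_log_d = log_d(log_p=3.0, pka=9.5, ph=7.4, kind="base")
print(f"logP 3.00 -> logD(7.4) {example_log_d:.2f}")
```

With the pKa about two units above pH 7.4, the correction is about two log units, which is the scale of the shift described above.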

We took 16 drugs with experimental logP and logD₇.₄ (including four neutral controls) and compared RDKit's logP, used as the ADME-relevant lipophilicity as it often is in practice, against Rowan's pH-aware logD₇.₄ from the macropKa workflow.

Predicted vs. experimental logD at pH 7.4

| Predicting experimental logD₇.₄ (12 ionizable drugs) | MAE | R² | r |
| --- | --- | --- | --- |
| RDKit Crippen logP (pH-blind) | 2.24 | −2.36 | 0.84 |
| Rowan logD₇.₄ | 1.15 | 0.04 | 0.92 |

RDKit's logP-as-logD has a negative R². For ionizable drugs it is literally worse than guessing the dataset average, because it systematically overshoots. Rowan halves the error, correlates strongly (r = 0.92), and gets the ionization direction right for all 12 compounds. On the four neutral controls (including caffeine, antipyrine, and diazepam), Rowan correctly applies zero correction: logD ≡ logP, exactly as it should.
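A predictor can correlate well with experiment and still have a negative R², and the sketch below shows why. The numbers are synthetic toy values invented for illustration (a constant overshoot of 2 log units), not the benchmark data, and the helper names are our own.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Negative when the predictions are worse than always guessing the mean.
    """
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def pearson_r(x, y):
    """Pearson correlation, which ignores any constant offset."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Synthetic toy data: every prediction overshoots by exactly 2 log units.
experimental = [0.0, 0.5, 1.0, 1.5, 2.0]
overshoot = [value + 2.0 for value in experimental]

toy_mae = mae(experimental, overshoot)
toy_r2 = r_squared(experimental, overshoot)
toy_r = pearson_r(experimental, overshoot)
print(f"MAE {toy_mae:.2f}, R2 {toy_r2:.2f}, r {toy_r:.2f}")
```

The correlation is perfect because the ranking is preserved, but R² is strongly negative because every absolute value is wrong. That is the same pattern as a pH-blind logP used in place of logD.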

But the real payoff is not a single number. It is the whole curve, which RDKit fundamentally cannot draw:

logD and aqueous solubility as a function of pH

Acids shed lipophilicity above their pKa; bases below it; neutrals stay flat; and aqueous solubility swings the opposite way. This pH-resolved picture is what a formulation or ADME scientist actually reasons with.
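The opposite swing in solubility follows from the same ionization term with the sign flipped. Below is a minimal sketch of the Henderson-Hasselbalch solubility relation for one ionizable site; it ignores salt solubility limits, it is not Rowan's solubility model, and the intrinsic solubility and pKa used are hypothetical.

```python
import math


def log_s_at_ph(log_s0: float, pka: float, ph: float, kind: str) -> float:
    """Total aqueous solubility (log units) for one ionizable site.

    log_s0 is the intrinsic solubility of the neutral form. Ionization adds
    log10(1 + ionized/neutral ratio); no salt-precipitation ceiling is applied.
    """
    if kind == "acid":
        ionized_ratio = 10.0 ** (ph - pka)
    elif kind == "base":
        ionized_ratio = 10.0 ** (pka - ph)
    elif kind == "neutral":
        ionized_ratio = 0.0
    else:
        raise ValueError("kind must be 'acid', 'base', or 'neutral'")
    return log_s0 + math.log10(1.0 + ionized_ratio)


# Hypothetical acid: intrinsic logS0 = -4.0, pKa = 4.5.
ph_grid = [2.0, 4.0, 6.0, 7.4, 10.0]
acid_curve = [log_s_at_ph(-4.0, 4.5, ph, "acid") for ph in ph_grid]
for ph, log_s in zip(ph_grid, acid_curve):
    print(f"pH {ph:>4}: logS {log_s:6.2f}")
```

In this idealized picture the acid's solubility rises by one log unit per pH unit once the pH is well above the pKa, mirroring the loss of lipophilicity.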

Honest caveat. Rowan's neutral-logP plateau tracks RDKit's Crippen logP almost exactly. Rowan's value-add is the ionization correction layered on top, not the neutral logP itself. And its model lets carboxylate anions retain some octanol partitioning, so it under-penalizes acids by about 1 log unit. logD is a genuinely hard endpoint; trust the shape and the pKa, and treat absolute acid logD₇.₄ as mildly optimistic.

Study 3: tautomers, physics versus convention

Tautomers are where it pays to be skeptical. RDKit's TautomerEnumerator can enumerate tautomers and pick a canonical one, but that choice is a deterministic scoring convention (rules that favor amide over imidic acid, keto over enol, thione over thiol). It is perfect for registration and deduplication, but it carries no energies, no populations, and no awareness of solvent, temperature, or substituents.

Rowan's tautomer search optimizes and ranks tautomers by solvated free energy and returns ΔG and Boltzmann populations. We tested six classic systems with known aqueous-dominant forms.
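For readers who want to see how relative free energies turn into populations, here is a minimal Boltzmann-weighting sketch. It assumes ΔG values in kcal/mol at 298.15 K; the function name and the two-state example energies are illustrative and are not Rowan output.

```python
import math

R_KCAL_PER_MOL_K = 0.0019872041  # gas constant in kcal/(mol*K)


def boltzmann_populations(delta_g_kcal, temperature_k=298.15):
    """Convert relative free energies (kcal/mol) into fractional populations."""
    rt = R_KCAL_PER_MOL_K * temperature_k
    g_min = min(delta_g_kcal)  # shift so the lowest state has weight 1
    weights = [math.exp(-(g - g_min) / rt) for g in delta_g_kcal]
    total = sum(weights)
    return [w / total for w in weights]


# Illustrative two-tautomer system: the minor form lies 4.1 kcal/mol higher.
populations = boltzmann_populations([0.0, 4.1])
print([f"{p:.3%}" for p in populations])
```

A gap of roughly 4 kcal/mol is enough to put about 99.9% of the population in the lower-energy form at room temperature, so an error of 1 to 2 kcal/mol in a small gap can flip which tautomer is predicted to dominate.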

Relative free energies and an agreement matrix for six tautomer systems

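| System (aqueous dominant) | Rowan ΔG ranking | RDKit convention |
| --- | --- | --- |
| 2-pyridone | ✓ (99.9%) | ✓ |
| 4-pyridone | ✗ | ✓ |
| uracil (diketo) | ✓ | ✓ |
| cytosine (amino-keto) | ✓ | ✓ |
| acetylacetone (diketo, aqueous) | ✓ | ✓ |
| 2-mercaptopyridine (thione) | ✗ | ✓ |
| Total | 4 / 6 | 6 / 6 |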

The tautomer benchmark is useful precisely because it is mixed. Rowan nails the biologically canonical forms (2-pyridone, uracil, cytosine) but misranks the two equilibria most sensitive to explicit solvation (4-pyridone and 2-mercaptopyridine) because its default workflow uses an implicit solvent model. RDKit's convention matches all six because its hard-coded rules encode those common motifs, but it would give the same answer in chloroform, and it tells you nothing about how much of each form is present.

The lesson is "right tool for the job": if you need a canonical structure for a database, use RDKit; if you need ΔG and populations or substituent sensitivity, use a physics method, and validate the polar, solvation-sensitive cases.

(We also caught, and disclose, a ground-truth subtlety: acetylacetone's famous enol dominates in the gas phase and the neat liquid, but in water the diketo form wins at about 85%. Since Rowan models the aqueous phase, the correct aqueous reference is diketo.)

Study 4: a full ADME profile, orchestrated end-to-end

The skill is more than individual predictions: it is the agent chaining them. From a SMILES string, the agent builds a complete profile: descriptors → macropKa (logD and solubility vs pH) → Caco-2 permeability. We ran three drugs that span the Biopharmaceutics Classification System.

End-to-end ADME dashboard for three BCS-spanning drugs

| Drug (BCS class) | logD₇.₄ | logS₇.₄ | Caco-2 logP_app | Lipinski |
| --- | --- | --- | --- | --- |
| ibuprofen (II) | 1.76 | −3.97 (low sol) | −4.40 (high perm) | pass |
| propranolol (I) | 1.60 | −3.58 | −4.33 (high perm) | pass |
| atenolol (III) | −0.53 | −1.08 (high sol) | −6.03 (low perm) | pass |

The predictions reproduce textbook pharmacology: atenolol's notoriously poor passive permeability shows up as a Caco-2 logP_app about 1.7 log units below the others, while its high solubility and ibuprofen's low solubility match their BCS classes. The punchline: all three "pass Lipinski" in RDKit, so the free 2D filter cannot tell a BCS I drug from a BCS III drug.

Study 5: into 3D, docking pose recovery

Structure-based work is the last frontier RDKit does not touch. Rowan's docking is managed AutoDock Vina, so the comparison here is operational rather than accuracy-versus-RDKit: with the skill, a docking run is a PDB id, a SMILES, and a pocket box in a few lines.

We validated it the standard way: redock benzamidine into bovine trypsin (PDB 3PTB) and measure the heavy-atom RMSD of the top pose against the crystallographic ligand. The success criterion is RMSD < 2.0 Å.
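The pass/fail check itself is simple. The sketch below computes an in-place heavy-atom RMSD (no superposition, since a redocked pose shares the crystal frame) and assumes the two coordinate lists are already in matching atom order. It does not handle symmetry-equivalent atoms, which a production tool would, and the coordinates are made-up placeholders, not the 3PTB ligand.

```python
import math


def heavy_atom_rmsd(pose, reference):
    """In-place RMSD in Å between two equally ordered lists of (x, y, z)."""
    if len(pose) != len(reference) or not pose:
        raise ValueError("coordinate lists must be non-empty and equal length")
    squared = sum(
        (px - rx) ** 2 + (py - ry) ** 2 + (pz - rz) ** 2
        for (px, py, pz), (rx, ry, rz) in zip(pose, reference)
    )
    return math.sqrt(squared / len(pose))


RMSD_SUCCESS_CUTOFF = 2.0  # Å, the success criterion stated above

# Placeholder coordinates: the "pose" is the reference shifted 0.19 Å along x.
reference_xyz = [(0.0, 0.0, 0.0), (1.4, 0.0, 0.0), (2.1, 1.2, 0.0)]
pose_xyz = [(x + 0.19, y, z) for x, y, z in reference_xyz]

rmsd = heavy_atom_rmsd(pose_xyz, reference_xyz)
print(f"RMSD {rmsd:.2f} Å, pass: {rmsd < RMSD_SUCCESS_CUTOFF}")
```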

Benzamidine redocked into trypsin: top pose overlaid on the crystallographic ligand

From a PDB id and a SMILES string, the run produced 13 PoseBusters-valid poses (best Vinardo score −5.78 kcal/mol), and the top pose reproduced the crystal binding mode to a heavy-atom RMSD of 0.19 Å, a textbook pass. The same result with a local toolchain would have meant installing Vina, preparing a PDBQT receptor, and hand-defining a grid box; here it was about 1 to 2 credits.

The big picture

The capability matrix below tells the whole story in one frame.

Capability matrix comparing RDKit and Rowan across cheminformatics and physics-based properties

For the everyday work RDKit was built for (2D descriptors, neutral logP, enumeration), the free baseline is excellent and you should keep using it. But the moment a decision turns on ionization, pH-dependent behavior, solubility, permeability, tautomer energetics, or 3D structure, the Rowan skill moves the agent from descriptor calculation into chemistry workflows that RDKit was never meant to run.

| Property | Without skill | Rowan | vs. experiment |
| --- | --- | --- | --- |
| pKa | heuristic MAE 1.15; RDKit: none | MAE 0.23, R² 0.986 | at exp. reproducibility |
| logD₇.₄ (ionizable) | logP-as-logD MAE 2.24, R² < 0 | MAE 1.15, r 0.92 | right quantity and direction |
| logD / solubility vs pH | impossible | full curves | n/a |
| Permeability (Caco-2) | none | recovers BCS rank order | qualitative ✓ |
| Tautomer dominant form | 6/6 (no ΔG) | 4/6 + ΔG and populations | nuanced |
| Docking pose | DIY Vina pipeline | one call, managed Vina | RMSD 0.19 Å |
| Total cost | $0 / local | ≈ 13 credits, or ≈ $0.52 pay-as-you-go equivalent | n/a |

Keep RDKit for what it is great at. Reach for the skill when the answer depends on ionization, energetics, or 3D structure.

Reproducibility and methods

Every workflow result is checkpointed and keyed by molecule, so re-running the study re-collects cached results for zero additional credits. Baseline: RDKit 2026.03.2 (Crippen logP, TPSA, Lipinski, SMARTS matching, TautomerEnumerator, rdDetermineBonds). Rowan: rowan-python 3.x, called through the documented submit_*_workflow entry points. Ground-truth pKa, logP, and logD values are consensus literature numbers from the CRC Handbook, Avdeef's Absorption and Drug Development, and DrugBank-class compilations.
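The checkpointing idea can be sketched as a small JSON cache. This is an assumption about the general pattern, not the study's actual code: `cached_result`, the key format, and `fake_workflow` are hypothetical names, and the stand-in function replaces a real Rowan submission so the example runs offline.

```python
import json
import tempfile
from pathlib import Path


def cached_result(cache_path: Path, workflow: str, smiles: str, compute):
    """Return a stored result for (workflow, molecule), computing it only once."""
    cache = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    key = f"{workflow}::{smiles}"  # hypothetical key format
    if key not in cache:
        cache[key] = compute(smiles)  # the only step that would spend credits
        cache_path.write_text(json.dumps(cache, indent=2))
    return cache[key]


calls = []


def fake_workflow(smiles: str) -> dict:
    """Stand-in for a cloud workflow call; records each invocation."""
    calls.append(smiles)
    return {"smiles": smiles, "status": "done"}


with tempfile.TemporaryDirectory() as tmp:
    checkpoint = Path(tmp) / "checkpoint.json"
    first = cached_result(checkpoint, "pka", "CCO", fake_workflow)
    second = cached_result(checkpoint, "pka", "CCO", fake_workflow)

print(f"workflow calls: {len(calls)}")
```

One design note: keying on the raw SMILES string treats two different spellings of the same molecule as different entries, so a real pipeline would probably canonicalize the SMILES before building the key.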

Going further: rowan-autosearch

The studies above score molecules one at a time. To put the same pieces into an optimization loop, see rowan-autosearch, an open-source harness for agent-driven molecular optimization. You define a chemistry objective and a starting molecule, and an AI coding agent acts as a medicinal chemist: proposing analogs, scoring them with Rowan's quantum and ML workflows, gating drug-likeness with RDKit, and iterating until a winning candidate emerges. Every candidate, Rowan payload, constraint check, and design rationale is logged as append-only JSON and rendered into an auditable HTML report. It is joint work between K-Dense and Rowan Scientific.

Get the skill

The rowan skill benchmarked here was contributed by Rowan Scientific to K-Dense's open-source Scientific Agent Skills library; you can read its full instructions in the rowan skill folder. To learn more about the platform it drives, visit Rowan.
