---
title: "Beyond RDKit: Benchmarking the Rowan Agent Skill Against Experiment"
description: "A reproducible study of pKa, logD, tautomers, ADME, and docking run by an AI agent with the Rowan skill, measured against RDKit and experimental ground truth."
updatedAt: "2026-06-09"
tags: ["Research", "Drug Discovery", "Chemistry", "Benchmarks", "AI", "Skills", "Scientific Agent Skills"]
canonical: "https://k-dense.ai/blog/benchmarking-rowan-skill-chemistry"
---

What is Rowan? Rowan is a cloud-native molecular-modeling platform that runs physics-based and machine-learning chemistry workflows (pKa, logD, tautomers, solubility, permeability, docking, MD, and more) through a scientist-friendly web app and a Python API. The skill benchmarked here was contributed by Rowan Scientific to K-Dense's open-source Scientific Agent Skills library.

Every computational chemist has RDKit open in a terminal. It is free, fast, local, and genuinely excellent at what it was built for: molecular I/O, substructure search, fingerprints, 2D descriptors, conformer geometries, tautomer enumeration. For a huge fraction of day-to-day cheminformatics, RDKit is all you need.

But RDKit is, by design, a *cheminformatics* toolkit, not a physics engine. Ask it for the **pKa** of a molecule and it has no answer. Ask for the **logD at pH 7.4** that actually governs absorption, and it can only hand you a pH-blind logP. Ask which **tautomer** dominates in water, with populations, or for a docked **pose**, or a **permeability** estimate, and you are out of scope.

Rowan covers that different layer, and it is now available as an **agent skill** that Rowan contributed to K-Dense's open-source [Scientific Agent Skills](https://github.com/K-Dense-AI/scientific-agent-skills) library: an LLM agent (here, Claude) can drive the whole platform from natural language. So we asked the practical question, and answered it with numbers:

Moving from the free-tool baseline to the Rowan skill, **what do you gain, how accurate is it against experiment, and what does it cost?**

Every result below has three reference points: **Rowan**, the **without-skill baseline** (RDKit, or a competent functional-group heuristic where RDKit offers nothing), and **experiment** (literature values). All code, data, and the roughly 60 cloud workflows are reproducible. The full study used about **12 to 14 Rowan credits**, or roughly **$0.48 to $0.56** at Rowan's published pay-as-you-go rate of $0.04 per purchased credit ([Rowan Credits FAQ](https://www.rowansci.com/blog/credits-faq)).

## Study 1: pKa, the clean kill

pKa determines how much of your molecule is charged at physiological pH, which cascades into solubility, permeability, and binding.
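
To make "how much of your molecule is charged" concrete, here is a minimal sketch of the Henderson-Hasselbalch relation for a single ionizable group. It is a textbook approximation, not Rowan's method; the function name `fraction_ionized` and the single-site assumption are ours, and the two pKa values are ones quoted in Study 1.

```python
def fraction_ionized(pka: float, ph: float, kind: str) -> float:
    """Fraction of a monoprotic acid or base that carries a charge at a given pH.

    Single-site Henderson-Hasselbalch approximation (an assumption of this
    sketch); real drugs with several ionizable groups need a macroscopic model.
    """
    if kind == "acid":
        # Acids are charged (deprotonated) when pH is above the pKa.
        return 1.0 / (1.0 + 10.0 ** (pka - ph))
    if kind == "base":
        # Bases are charged (protonated) when pH is below the conjugate-acid pKa.
        return 1.0 / (1.0 + 10.0 ** (ph - pka))
    raise ValueError("kind must be 'acid' or 'base'")


if __name__ == "__main__":
    # pKa values as quoted in this post: Rowan's prediction for propranolol,
    # the experimental value for 2,4-dinitrophenol.
    examples = [("propranolol", 9.32, "base"), ("2,4-dinitrophenol", 4.09, "acid")]
    for name, pka, kind in examples:
        charged = fraction_ionized(pka, ph=7.4, kind=kind)
        print(f"{name}: {100 * charged:.1f}% charged at pH 7.4")
```

Both example molecules come out almost entirely charged at pH 7.4, which is why an error of several pKa units changes the predicted behavior qualitatively and not just numerically.
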
pKa prediction is also the single most glaring hole in the free toolbox: **RDKit ships no pKa predictor.** Without a dedicated tool you fall back on functional-group rules of thumb ("carboxylic acid ≈ 4.5, aliphatic amine ≈ 10.5") or pay for commercial software.

We took **32 drugs** with reliable literature pKa (carboxylic acids, phenols, N-H acids, anilines, pyridines, imidazoles, and aliphatic amines) and compared Rowan's `starling` model against a faithful functional-group heuristic (the best you can do by eye).

![Predicted vs. experimental pKa for 32 drugs, colored by chemical class](./fig1_pka_parity.png)

| Method | MAE | R² | max error |
|---|---|---|---|
| **Rowan `starling`** | **0.23** | **0.986** | 0.63 |
| RDKit functional-group heuristic | 1.15 | 0.41 | 5.91 |

Rowan's mean absolute error of **0.23 pKa units sits at the level of experimental reproducibility.** The heuristic, meanwhile, breaks down exactly where chemistry gets interesting:

- **Substituent electronics.** 2,4-dinitrophenol has an experimental pKa of **4.09**; Rowan predicts **3.46**; the heuristic says **10.0**, because it cannot "see" that two nitro groups acidify the phenol by six log units.
- **Amine individuality.** The heuristic assigns *every* aliphatic amine 10.5. Rowan resolves nicotine (7.70), lidocaine (7.54), diphenhydramine (8.53), propranolol (9.32), and ephedrine (9.62), each within about 0.4 of experiment.

![pKa signed-error distributions and mean absolute error by chemical class](./fig2_pka_errors.png)

**Field note for skill users.** Rowan's pKa result exposes both `strongest_acid` and `strongest_base`. Match them to the *physical* ionization: acids use `strongest_acid`, bases use `strongest_base`. For propranolol, `strongest_acid` returns 10.0 (a spurious amide-like N-H deprotonation) while `strongest_base` gives 9.28, the number you actually want.

## Study 2: lipophilicity, the right number versus the wrong one

Here is a subtler trap, and a more important one.
RDKit's Crippen `MolLogP` gives you the **neutral** logP, a single pH-independent number. But the quantity that controls oral absorption is **logD at pH 7.4**, and for an ionizable drug it can be 2 to 4 log units lower, because almost all of the molecule is charged at physiological pH.

We took 16 drugs with experimental logP *and* logD₇.₄ (including four neutral controls) and compared RDKit's logP, used as the ADME-relevant lipophilicity as it often is in practice, against Rowan's pH-aware logD₇.₄ from the macropKa workflow.

![Predicted vs. experimental logD at pH 7.4](./fig3_logd_parity.png)

| Predicting experimental logD₇.₄ (12 ionizable drugs) | MAE | R² | r |
|---|---|---|---|
| RDKit Crippen logP (pH-blind) | 2.24 | **−2.36** | 0.84 |
| **Rowan logD₇.₄** | **1.15** | 0.04 | **0.92** |

RDKit's logP-as-logD has a **negative R²**. For ionizable drugs it is literally worse than guessing the dataset average, because it systematically overshoots. Rowan **halves the error**, correlates strongly (r = 0.92), and gets the ionization direction right for **all 12** compounds. On the four neutral controls (among them caffeine, antipyrine, and diazepam), Rowan correctly applies *zero* correction: logD ≡ logP, exactly as it should.

But the real payoff is not a single number. It is the whole curve, which RDKit fundamentally cannot draw:

![logD and aqueous solubility as a function of pH](./fig4_ph_curves.png)

Acids shed lipophilicity above their pKa; bases below it; neutrals stay flat; and aqueous solubility swings the opposite way. This pH-resolved picture is what a formulation or ADME scientist actually reasons with.

**Honest caveat.** Rowan's neutral-logP plateau tracks RDKit's Crippen logP almost exactly. Rowan's value-add is the *ionization correction* layered on top, not the neutral logP itself. And its model lets carboxylate anions retain some octanol partitioning, so it under-penalizes acids by about 1 log unit.
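
For reference, the textbook way to turn a neutral logP and a pKa into a pH-dependent logD assumes that only the neutral species partitions into octanol. The sketch below implements that single-site approximation. It is not Rowan's macropKa model (which, as noted, lets anions retain some partitioning), and the function name and example inputs are illustrative assumptions, not benchmark data.

```python
import math


def logd_neutral_only(logp: float, pka: float, ph: float, kind: str) -> float:
    """Textbook logD for one ionizable site, assuming ions do not partition.

    kind is 'acid', 'base', or 'neutral'. This is a simplification for
    illustration, not the model used in the benchmark.
    """
    if kind == "neutral":
        # No ionizable group: logD equals logP at every pH.
        return logp
    if kind == "acid":
        # Ionization grows as pH rises above the pKa.
        return logp - math.log10(1.0 + 10.0 ** (ph - pka))
    if kind == "base":
        # Ionization grows as pH falls below the conjugate-acid pKa.
        return logp - math.log10(1.0 + 10.0 ** (pka - ph))
    raise ValueError("kind must be 'acid', 'base', or 'neutral'")


if __name__ == "__main__":
    # Hypothetical carboxylic acid: logP 3.0, pKa 4.5 (illustrative values only).
    for ph in (2.0, 4.5, 7.4, 10.0):
        print(f"pH {ph:>4}: logD = {logd_neutral_only(3.0, 4.5, ph, 'acid'):.2f}")
```

Under this approximation an acid loses roughly one log unit of logD for every pH unit above its pKa, which gives the general shape of the curves in the figure; a model that lets the anion partition flattens the high-pH end.
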
logD is a genuinely hard endpoint; trust the *shape* and the *pKa*, and treat absolute acid logD₇.₄ as mildly optimistic.

## Study 3: tautomers, physics versus convention

Tautomers are where it pays to be skeptical. RDKit's `TautomerEnumerator` can enumerate tautomers and pick a **canonical** one, but that choice is a deterministic *scoring convention* (rules that favor amide over imidic acid, keto over enol, thione over thiol). It is perfect for registration and deduplication, but it carries **no energies, no populations, and no awareness of solvent, temperature, or substituents.** Rowan's tautomer search optimizes and ranks tautomers by **solvated free energy** and returns ΔG and Boltzmann populations.

We tested six classic systems with known aqueous-dominant forms.

![Relative free energies and an agreement matrix for six tautomer systems](./fig5_tautomers.png)

| System (aqueous dominant) | Rowan ΔG ranking | RDKit convention |
|---|---|---|
| 2-pyridone | ✓ (99.9%) | ✓ |
| 4-pyridone | ✗ | ✓ |
| uracil (diketo) | ✓ | ✓ |
| cytosine (amino-keto) | ✓ | ✓ |
| acetylacetone (diketo, aqueous) | ✓ | ✓ |
| 2-mercaptopyridine (thione) | ✗ | ✓ |
| **Total** | **4 / 6** | **6 / 6** |

The tautomer benchmark is useful precisely because it is mixed. Rowan nails the biologically canonical forms (2-pyridone, uracil, cytosine) but **misranks the two equilibria most sensitive to explicit solvation** (4-pyridone and 2-mercaptopyridine) because its default uses an *implicit* solvent model. RDKit's convention matches all six because its hard-coded rules encode those common motifs, but it would give the same answer in chloroform, and tells you nothing about *how much* of each form is present.

The lesson is "right tool for the job": if you need a canonical structure for a database, use RDKit; if you need ΔG and populations or substituent sensitivity, use a physics method, and validate the polar, solvation-sensitive cases.
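
The populations reported alongside ΔG follow from Boltzmann weighting. The sketch below shows that conversion under stated assumptions: relative free energies in kcal/mol, a temperature of 298.15 K, and a hypothetical helper name `boltzmann_populations`. The energies in the example are illustrative, not Rowan outputs.

```python
import math

R_KCAL_PER_MOL_K = 0.0019872  # gas constant in kcal/(mol*K)


def boltzmann_populations(free_energies: dict, temperature_k: float = 298.15) -> dict:
    """Convert free energies (kcal/mol) of tautomers into fractional populations."""
    if not free_energies:
        raise ValueError("need at least one tautomer")
    rt = R_KCAL_PER_MOL_K * temperature_k
    # Shift by the minimum so the most stable form has weight 1 (numerical safety).
    g_min = min(free_energies.values())
    weights = {name: math.exp(-(g - g_min) / rt) for name, g in free_energies.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


if __name__ == "__main__":
    # Illustrative two-state system with a 4.1 kcal/mol gap (not a benchmark value).
    pops = boltzmann_populations({"lactam": 0.0, "lactim": 4.1})
    for name, p in pops.items():
        print(f"{name}: {100 * p:.2f}%")
```

At room temperature RT is about 0.59 kcal/mol, so a gap of roughly 4 kcal/mol already puts about 99.9% of the population in the lower-energy form, while a gap near 1 kcal/mol leaves a substantial minor tautomer. That sensitivity is why errors of a few kcal/mol from an implicit solvent model can flip a ranking.
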
*(We also caught, and disclose, a ground-truth subtlety: acetylacetone's famous enol dominates in gas and neat liquid, but in water the diketo form wins about 85%. Since Rowan models the aqueous phase, the correct aqueous reference is diketo.)*

## Study 4: a full ADME profile, orchestrated end-to-end

The skill is not just individual predictions; it is the agent chaining them. From a SMILES string, the agent builds a complete profile: **descriptors → macropKa (logD and solubility vs pH) → Caco-2 permeability**. We ran three drugs that span the Biopharmaceutics Classification System.

![End-to-end ADME dashboard for three BCS-spanning drugs](./fig6_adme_dashboard.png)

| Drug (BCS class) | logD₇.₄ | logS₇.₄ | Caco-2 logP_app | Lipinski |
|---|---|---|---|---|
| ibuprofen (II) | 1.76 | −3.97 (low sol) | −4.40 (high perm) | pass |
| propranolol (I) | 1.60 | −3.58 | −4.33 (high perm) | pass |
| atenolol (III) | −0.53 | −1.08 (high sol) | **−6.03 (low perm)** | pass |

The predictions reproduce textbook pharmacology: atenolol's notoriously poor passive permeability shows up as a Caco-2 logP_app about 1.6 to 1.7 log units below the others, while its high solubility and ibuprofen's low solubility match their BCS classes. The punchline: **all three "pass Lipinski"** in RDKit, so the free 2D filter cannot tell a BCS I drug from a BCS III drug.

## Study 5: into 3D, docking pose recovery

Structure-based work is the last frontier RDKit does not touch. Rowan's docking is **managed AutoDock Vina**, so the comparison here is not accuracy-versus-RDKit; it is *operational*: with the skill it is a PDB id, a SMILES, and a pocket box in a few lines.

We validated it the standard way: redock benzamidine into bovine trypsin (PDB **3PTB**) and measure the heavy-atom RMSD of the top pose against the crystallographic ligand. The success criterion is RMSD < 2.0 Å.
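
The pass/fail check itself is simple arithmetic. The sketch below computes an in-place heavy-atom RMSD, assuming both coordinate lists are already in the receptor frame (true for redocking), contain only heavy atoms, and list atoms in the same order. The function name and the toy coordinates are hypothetical, and this naive version ignores symmetry-equivalent atoms, which a production check (for example RDKit's `rdMolAlign.CalcRMS`) would handle.

```python
import math


def heavy_atom_rmsd(pose, reference):
    """In-place RMSD (no alignment) between two matched lists of (x, y, z) in Å."""
    if len(pose) != len(reference) or not pose:
        raise ValueError("pose and reference must be non-empty and the same length")
    squared = 0.0
    for (px, py, pz), (rx, ry, rz) in zip(pose, reference):
        # Squared displacement of each matched atom pair.
        squared += (px - rx) ** 2 + (py - ry) ** 2 + (pz - rz) ** 2
    return math.sqrt(squared / len(pose))


def redock_success(pose, reference, threshold=2.0):
    """Apply the RMSD < 2.0 Å criterion used in this study."""
    return heavy_atom_rmsd(pose, reference) < threshold


if __name__ == "__main__":
    # Toy coordinates for illustration only (not the 3PTB ligand).
    crystal = [(0.0, 0.0, 0.0), (1.4, 0.0, 0.0), (2.1, 1.2, 0.0)]
    docked = [(x + 0.1, y - 0.1, z) for x, y, z in crystal]
    print(f"RMSD = {heavy_atom_rmsd(docked, crystal):.2f} Å, "
          f"pass = {redock_success(docked, crystal)}")
```

No superposition is applied on purpose: aligning the pose to the crystal ligand first would hide translational and rotational errors, which are exactly what a redocking test is meant to expose.
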
![Benzamidine redocked into trypsin: top pose overlaid on the crystallographic ligand](./fig8_docking.png)

From a PDB id and a SMILES string, the run produced 13 PoseBusters-valid poses (best Vinardo score −5.78 kcal/mol), and the top pose reproduced the crystal binding mode to a **heavy-atom RMSD of 0.19 Å**, a textbook pass. The same result with a local toolchain would have meant installing Vina, preparing a PDBQT receptor, and hand-defining a grid box; here it was about 1 to 2 credits.

## The big picture

The capability matrix below tells the whole story in one frame.

![Capability matrix comparing RDKit and Rowan across cheminformatics and physics-based properties](./fig7_capability_matrix.png)

For the everyday work RDKit was built for (2D descriptors, neutral logP, enumeration), the free baseline is excellent and you should keep using it. But the moment a decision turns on **ionization, pH-dependent behavior, solubility, permeability, tautomer energetics, or 3D structure**, the Rowan skill moves the agent from descriptor calculation into chemistry workflows that RDKit was never meant to run.

| Property | Without skill | Rowan | vs. experiment |
|---|---|---|---|
| pKa | heuristic MAE 1.15; RDKit: none | **MAE 0.23, R² 0.986** | at exp. reproducibility |
| logD₇.₄ (ionizable) | logP-as-logD MAE 2.24, R² < 0 | **MAE 1.15, r 0.92** | right quantity and direction |
| logD / solubility vs pH | impossible | full curves | n/a |
| Permeability (Caco-2) | none | recovers BCS rank order | qualitative ✓ |
| Tautomer dominant form | 6/6 (no ΔG) | 4/6 + ΔG and populations | nuanced |
| Docking pose | DIY Vina pipeline | one call, managed Vina | RMSD 0.19 Å |
| **Total cost** | $0 / local | **≈ 13 credits, or ≈ $0.52 pay-as-you-go equivalent** | n/a |

*Keep RDKit for what it is great at.
Reach for the skill when the answer depends on ionization, energetics, or 3D structure.*

### Reproducibility and methods

Every workflow result is checkpointed and keyed by molecule, so re-running the study re-collects cached results for zero additional credits. Baseline: RDKit 2026.03.2 (Crippen logP, TPSA, Lipinski, SMARTS matching, `TautomerEnumerator`, `rdDetermineBonds`). Rowan: `rowan-python` 3.x, called through the documented `submit_*_workflow` entry points. Ground-truth pKa, logP, and logD values are consensus literature numbers from the CRC Handbook, Avdeef's *Absorption and Drug Development*, and DrugBank-class compilations.

### Going further: rowan-autosearch

The studies above score molecules one at a time. To put the same pieces into an optimization loop, see [rowan-autosearch](https://github.com/K-Dense-AI/rowan-autosearch), an open-source harness for agent-driven molecular optimization. You define a chemistry objective and a starting molecule, and an AI coding agent acts as a medicinal chemist: proposing analogs, scoring them with Rowan's quantum and ML workflows, gating drug-likeness with RDKit, and iterating until a winning candidate emerges. Every candidate, Rowan payload, constraint check, and design rationale is logged as append-only JSON and rendered into an auditable HTML report. It is joint work between K-Dense and Rowan Scientific.

### Get the skill

The `rowan` skill benchmarked here was contributed by Rowan Scientific to K-Dense's open-source [Scientific Agent Skills](https://github.com/K-Dense-AI/scientific-agent-skills) library; you can read its full instructions in the [`rowan` skill folder](https://github.com/K-Dense-AI/scientific-agent-skills/tree/main/skills/rowan). To learn more about the platform it drives, visit [Rowan](https://www.rowansci.com/).