Limited TimeFast mode is completely free!Try it now

Autonomous Drug Discovery: Mining 700,000 Natural Products for Antimicrobial Candidates

How K-Dense Web autonomously processed the COCONUT database to identify 50 prioritized antimicrobial candidates using unsupervised machine learning.

Share:
Autonomous Drug Discovery: Mining 700,000 Natural Products for Antimicrobial Candidates

The global antibiotic pipeline is in bad shape. Bacterial infections that were reliably treatable 30 years ago now kill people, and new antibiotic approvals have been slow. Natural products - compounds produced by bacteria, fungi, and plants - have historically been the best source of new ones. Penicillin, vancomycin, erythromycin: all came from organisms that had already figured out how to kill bacteria.

The problem is there are a lot of natural products to look through. The COCONUT database alone has 715,822 of them.

In this case study, K-Dense Web processed that entire database to produce 50 prioritized antimicrobial candidates ready for experimental screening. The whole pipeline ran in about 45 minutes.

Finding needles in a molecular haystack

Manually evaluating 715,000+ compounds isn't realistic. You need a way to filter, cluster, and prioritize before anything goes near a lab.

K-Dense Web was given a single prompt describing the research goal, then designed and ran the rest.

The pipeline

Autonomous workflow schematic showing the complete pipeline from data preparation through candidate selection

Step 1: Data preparation

K-Dense Web downloaded the full COCONUT database (664 MB), filtered for bacterial-derived compounds, and validated all SMILES structures using RDKit.

  • Compounds downloaded: 715,822
  • Bacterial-derived compounds: 24,911 (3.48%)
  • Validation rate: 99.996%
  • Final dataset: 24,910 unique compounds with validated structures

Step 2: Feature engineering

For each compound, K-Dense Web calculated physicochemical properties (molecular weight, LogP, TPSA, hydrogen bond donors/acceptors), structural descriptors (ring count, aromatic rings, fraction sp3 carbons), and drug-likeness metrics (QED score, Lipinski's Rule of 5 compliance, PAINS filtering).

Property Mean Range
Molecular Weight 539 Da 1 - 4,900 Da
LogP 2.1 -29 to 37
QED Score 0.36 0.01 - 0.94
Lipinski Compliant 44.8% -

Only 39.7% of compounds passed both Lipinski's Rule of 5 and PAINS filters. That's not surprising - bacterial natural products often sit outside traditional drug-like chemical space. Vancomycin and daptomycin wouldn't pass Lipinski either.

Step 3: Chemical space analysis

The original plan was to pull bioactivity training data from ChEMBL and build a supervised model. The ChEMBL API returned errors. Rather than stopping, K-Dense Web switched to an unsupervised approach - which worked fine.

PCA of the full compound set turned up two distinct clusters:

PCA visualization showing two distinct chemical clusters in the bacterial natural product space

Cluster 0 (75.4% of compounds): Small, drug-like molecules. Mean MW: 396 Da, mean QED: 0.44. Probably alkaloids, terpenoids, and smaller polyketides.

Cluster 1 (24.6% of compounds): Large, complex molecules. Mean MW: 978 Da - 2.5× larger - mean QED: 0.09. Probably glycopeptides, lipopeptides, and macrocyclic antibiotics.

This bimodal split maps cleanly onto what's already known about antimicrobial natural products: one population of small, simple molecules and another of the kind of complex scaffolds that produced vancomycin.

Heatmap showing normalized chemical profiles for each cluster

Step 4: Candidate selection

K-Dense Web selected 25 compounds from each cluster:

Group A: Drug-like leads (from Cluster 0)

  • Mean MW: 322 Da, mean QED: 0.93
  • 100% Lipinski compliant
  • Good starting points for traditional medicinal chemistry optimization

Group B: Complex scaffolds (from Cluster 1)

  • Mean MW: 1,930 Da, mean QED: 0.05
  • Structural types typical of clinically successful antibiotics
  • Lower development tractability, higher novelty potential

PCA plot showing the 50 selected candidates highlighted against the full dataset

Covering both groups hedges the bets. Group A is easier to develop and optimize. Group B is where the bigger swings are.

Step 5: Validation and reporting

K-Dense Web produced 10 publication-ready figures, a research manuscript with methods, results, and discussion sections, and detailed candidate profiles.

Property distribution comparison between Group A and Group B candidates

The two groups look quite different from each other:

Chemical structures of top candidates from each group

Results

Metric Value
Initial compounds screened 715,822
Bacterial compounds identified 24,910
Chemical clusters discovered 2
Prioritized candidates 50
Group A (drug-like) 25
Group B (complex) 25
Pipeline execution time ~45 minutes

What this workflow replaced

Traditional computational drug discovery involves real setup time: installing RDKit, figuring out ChEMBL's API, writing clustering code, debugging visualization scripts. When something breaks mid-analysis - like an API going down - you lose time replanning.

K-Dense Web ran the full pipeline, handled the API failure without stopping, and produced figures and a manuscript draft at the end. The 45-minute runtime is mostly compute, not planning or debugging.

The 50 candidates are ready for antimicrobial screening against resistant strains (MRSA, VRE, MDR pathogens), MIC determination, and structure-activity relationship analysis using the chemical space clusters.

Try it yourself

Start your autonomous research project with $50 free credits →


This case study was generated from K-Dense Web. View the complete example session including all analysis code, data files, figures, and the publication-ready research manuscript.

Enjoyed this article? Share it with others!

Share:
Back to all posts