
The Sandboxed AI Scientist: Pairing NVIDIA OpenShell with Scientific Agent Skills

Combine NVIDIA OpenShell's policy-governed runtime with Scientific Agent Skills to run autonomous research agents that are both highly capable and genuinely safe on patient data, proprietary molecules, and HPC credentials.

13 min read

A year ago, the limiting factor in using AI agents for real science was capability. Today, for most computational workflows, the limiting factor is trust.

Frontier models can read a VCF, write a Scanpy pipeline, design a Qiskit circuit, and draft a methods section in the same afternoon. They can also, in the same afternoon, pip install a typosquatted package, exfiltrate an ANTHROPIC_API_KEY to a host that looks legitimate, or overwrite the one copy of a dataset that took six months to curate. None of those failures is hypothetical. All of them have happened to people we know.

For scientists, this is a familiar shape of problem. It is the same tension we manage in wet labs with biosafety cabinets, in HPC with kerberized clusters, and in clinical research with IRBs and HIPAA controls: we want powerful tools, and we want the blast radius of any failure to be strictly smaller than the power of those tools.

Two recent open-source projects, used together, go a long way toward resolving that tension for AI-driven research:

  • NVIDIA OpenShell: a safe, private runtime for autonomous AI agents. It puts each agent inside a container with kernel-level filesystem and process isolation plus an application-layer proxy that enforces network policy, all declared in YAML.
  • Scientific Agent Skills: 133 curated Agent Skills that teach that agent how to do real science. RDKit, Scanpy, pysam, DiffDock, AlphaFold DB, ClinVar, COSMIC, PyMC, Astropy, and 120+ more.

They are designed for different layers of the stack and they compose exceptionally well. OpenShell answers "where should an autonomous agent run?". Scientific Agent Skills answers "what should the agent actually know how to do once it gets there?". This post is about what happens when you put them together, and why that pairing is especially well-suited to scientific work.

The two gaps, one stack

If you have been following the agent-skills space, you have seen versions of this diagram before: agents are pulled in two directions, upward toward domain knowledge and downward toward safe execution.

Scientific Agent Skills targets the upward gap. A frontier model already "knows" a lot about bioinformatics in the abstract, but knowing that Scanpy exists is different from knowing that QC thresholds for a 10x Genomics lung adenocarcinoma sample should cap pct_counts_mt around 20% and drop cells below 500 genes, then use scvi-tools for integration when batch effects are present. Each of the 133 skills in the K-Dense repository is a SKILL.md encoding exactly that kind of procedural knowledge, plus tested code snippets, references, and a concise description so the agent can decide, at runtime, which skills to load.
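The QC thresholds in that example boil down to simple per-cell predicates. A minimal pure-Python sketch of the gating logic (real pipelines compute these metrics with Scanpy's `calculate_qc_metrics`; the plain dicts here are illustrative stand-ins for an AnnData object):

```python
# Sketch of the QC gate described above: keep cells with at most
# 20% mitochondrial counts and at least 500 detected genes.
# Illustrative only -- not the skill's actual code.

def passes_qc(cell, max_pct_mt=20.0, min_genes=500):
    return cell["pct_counts_mt"] <= max_pct_mt and cell["n_genes"] >= min_genes

cells = [
    {"id": "AAACCTG", "pct_counts_mt": 4.2, "n_genes": 2100},   # healthy cell
    {"id": "AAACGGG", "pct_counts_mt": 35.0, "n_genes": 1800},  # dying: high mito fraction
    {"id": "AAAGATG", "pct_counts_mt": 3.1, "n_genes": 320},    # empty droplet: too few genes
]

kept = [c["id"] for c in cells if passes_qc(c)]
print(kept)  # → ['AAACCTG']
```

The value of the skill is not this arithmetic, of course; it is knowing which thresholds are sane for which tissue and chemistry, which is exactly the procedural knowledge a SKILL.md pins down.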

OpenShell targets the downward gap. Without a sandbox, when an agent runs code, that code runs as you. It inherits your shell environment, your filesystem permissions, your AWS credentials, your SSH agent, your ability to email dean@university.edu. The existing industry answer ("run it in Docker") is better than nothing, but Docker alone does not express fine-grained policy over which paths the agent may write, which syscalls are reachable, which hosts it may dial, and what happens when it tries to route an LLM call. OpenShell does express all of that, declaratively, with Landlock LSM for filesystem, seccomp for syscalls, a policy-enforcing HTTP proxy for network, and a privacy router for inference.

Neither project supersedes the other. A sandbox with no skills is a brilliant researcher locked in an empty room; a pile of skills without a sandbox is a brilliant researcher who has been given the keys to your cluster on day one. Together, they give you something close to the right shape: a capable research agent whose capability surface area is a proper subset of a policy you wrote down.

Why scientists, specifically, care about this pairing

It is tempting to treat agent sandboxing as a generic devops concern, the kind of thing an SRE team worries about after an incident. That framing understates how different scientific workloads are from typical enterprise workloads.

Credentials in science are unusually potent. An ANTHROPIC_API_KEY is a cost problem. A DNAnexus token, a Benchling token, a /home/shared/projects write mount on your HPC scratch, an AWS role that can read the lab's S3 bucket of patient CT scans: those are different categories of object. They cannot be rotated after an incident in the same way a leaked Stripe key can. They are attached to IRB protocols, data use agreements, or export-controlled materials.

Data in science is often irreplaceable. The result of a three-day GPU run, a pre-publication dataset, a VCF cohort that took two years of consenting to assemble: these are not recoverable from backup in a practical sense. An autonomous agent that rewrites the wrong file is not a service outage, it is a paper delayed by a quarter.

Outputs have external consequences. A POST /repos/.../issues call that an agent makes by mistake might only be embarrassing at a startup. The same call pattern, made by an agent acting against an FDA submission system or a clinical trials registry, is a regulatory incident. The blast radius of a mistake scales with where the mistake happens, and science happens in high-stakes systems more often than most engineering does.

Reproducibility is a first-class deliverable. In scientific work, the runtime is part of the method. "I ran this with Claude Code and a bunch of skills on my laptop" is not reproducible; "I ran this inside an OpenShell sandbox built from image scientific-base:v0.4.2 with policy lung-cancer-screen.yaml and the K-Dense scientific skills pinned to commit abc123" is.
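One lightweight way to make the runtime citable is to emit a run manifest alongside the results, recording the exact policy and skills version a run used. A hypothetical sketch (the field names are illustrative, not an OpenShell feature):

```python
import hashlib

def run_manifest(policy_path, image, skills_commit):
    """Record exactly which sandbox configuration produced a result.

    Hashing the policy file means a reviewer can verify, byte for byte,
    which constraints were in force -- not just which filename was used.
    """
    policy_bytes = open(policy_path, "rb").read()
    return {
        "image": image,
        "policy_file": policy_path,
        "policy_sha256": hashlib.sha256(policy_bytes).hexdigest(),
        "skills_commit": skills_commit,
    }
```

Drop the resulting JSON next to the figures, and the methods-section sentence above writes itself.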

Every one of those properties is exactly what OpenShell's declarative policies and Scientific Agent Skills' pinned SKILL.md files are good at.

The mental model: skills are the "what", policies are the "where"

A useful way to hold this in your head:

  • A skill is a description of a workflow the agent may perform, such as "run a variant annotation pipeline with Ensembl VEP, cross-reference with ClinVar and COSMIC, and produce a clinical report". It constrains behavior by telling the agent what good looks like.
  • A policy is a description of the environment the agent performs that workflow in, such as "you may read from /sandbox/vcfs and write to /sandbox/reports, you may reach rest.ensembl.org for annotation and eutils.ncbi.nlm.nih.gov for ClinVar queries, and you must not touch anything else". It constrains behavior by telling the runtime what good looks like.

A skill is authoritative about methodology; a policy is authoritative about authority. When they disagree (say, a skill suggests calling out to a new API the policy does not allow), the policy wins. That asymmetry is the point. You get to say "the agent may become more capable at runtime by loading new skills, but it cannot become more authorized."
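That precedence rule is easy to state in code. A hypothetical sketch (neither project exposes this function; it only illustrates the asymmetry):

```python
# Illustrative only: a skill may request any host, but the effective
# permission is the intersection with the policy's allowlist.
# Loading more skills can grow `skill_requested`, never `policy_allowed`.

def effective_hosts(skill_requested, policy_allowed):
    return sorted(set(skill_requested) & set(policy_allowed))

policy_allowed = {"www.ebi.ac.uk", "alphafold.ebi.ac.uk"}
skill_requested = {"www.ebi.ac.uk", "string-db.org"}  # skill wants a new API

print(effective_hosts(skill_requested, policy_allowed))  # → ['www.ebi.ac.uk']
```

The skill's request for string-db.org is simply dropped until a human widens the policy.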

A concrete recipe: a sandboxed virtual screening agent

Let us make this concrete with a workflow that shows up constantly in drug discovery: a virtual screening campaign against a target of interest. The agent needs to query ChEMBL, pull a protein structure from AlphaFold DB, filter compounds with RDKit, run docking, and produce a write-up. It should not, under any circumstances, be able to push results to GitHub, hit a random pastebin, or read your home directory.

Step 1: create the sandbox with a policy

First, we write a policy that describes the environment. The OpenShell schema has a small number of top-level fields: filesystem_policy, landlock, and process (all locked at sandbox creation), plus network_policies (hot-reloadable at runtime). Here is a starting point:

version: 1

filesystem_policy:
  include_workdir: true
  read_only:
    - /usr
    - /lib
    - /etc
    - /opt/skills
  read_write:
    - /sandbox
    - /tmp

landlock:
  compatibility: best_effort

process:
  run_as_user: sandbox
  run_as_group: sandbox

network_policies:
  pypi:
    name: pypi
    endpoints:
      - host: pypi.org
        port: 443
      - host: files.pythonhosted.org
        port: 443
    binaries:
      - { path: /usr/local/bin/uv }
      - { path: /usr/bin/pip }

  npm_and_github:
    name: npm_and_github
    endpoints:
      - host: registry.npmjs.org
        port: 443
      - host: github.com
        port: 443
      - host: api.github.com
        port: 443
        protocol: rest
        enforcement: enforce
        access: read-only
      - host: codeload.github.com
        port: 443
    binaries:
      - { path: /usr/bin/node }
      - { path: /usr/bin/npx }
      - { path: /usr/bin/gh }

  chembl:
    name: chembl
    endpoints:
      - host: www.ebi.ac.uk
        port: 443
        protocol: rest
        enforcement: enforce
        access: read-only
    binaries:
      - { path: /usr/local/bin/claude }
      - { path: /usr/bin/python3 }

  alphafold:
    name: alphafold
    endpoints:
      - host: alphafold.ebi.ac.uk
        port: 443
        protocol: rest
        enforcement: enforce
        access: read-only
    binaries:
      - { path: /usr/local/bin/claude }
      - { path: /usr/bin/python3 }

  pubchem:
    name: pubchem
    endpoints:
      - host: pubchem.ncbi.nlm.nih.gov
        port: 443
        protocol: rest
        enforcement: enforce
        access: read-only
    binaries:
      - { path: /usr/local/bin/claude }
      - { path: /usr/bin/python3 }

Read that policy the way you would read an experimental protocol. The agent runs as an unprivileged sandbox user. It can write only into /sandbox and /tmp. It can reach PyPI (so uv pip install rdkit-pypi works), a narrow slice of the npm/GitHub surface (enough to install Agent Skills in the next step, with api.github.com constrained to read-only), and three scientific data sources. Every other outbound connection gets a 403 policy_denied at the proxy, logged with method, path, and calling binary. There is no 0.0.0.0, no s3.amazonaws.com, no discord.com, no unconstrained api.github.com writes, because nothing in your workflow needs any of those, and the default is deny.
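The proxy's decision, as described, comes down to matching (binary, host, port, method) against the policy blocks. A simplified stand-in in Python (the rule shape mirrors the YAML above, but the function is illustrative, not OpenShell internals):

```python
# Illustrative model of the proxy's allow/deny logic: default-deny,
# with per-binary endpoint rules and L7 method filtering for
# read-only endpoints. Not OpenShell's actual implementation.

READ_METHODS = {"GET", "HEAD", "OPTIONS"}

def proxy_decision(policies, binary, host, port, method):
    for policy in policies:
        if binary not in policy["binaries"]:
            continue
        for ep in policy["endpoints"]:
            if ep["host"] != host or ep["port"] != port:
                continue
            # read-only endpoints reject mutating HTTP methods at L7
            if ep.get("access") == "read-only" and method not in READ_METHODS:
                return "403 policy_denied"
            return "allow"
    return "403 policy_denied"  # nothing matched: the default is deny

chembl = {
    "binaries": ["/usr/bin/python3"],
    "endpoints": [{"host": "www.ebi.ac.uk", "port": 443, "access": "read-only"}],
}

print(proxy_decision([chembl], "/usr/bin/python3", "www.ebi.ac.uk", 443, "GET"))   # → allow
print(proxy_decision([chembl], "/usr/bin/python3", "www.ebi.ac.uk", 443, "POST"))  # → 403 policy_denied
print(proxy_decision([chembl], "/usr/bin/curl", "discord.com", 443, "GET"))        # → 403 policy_denied
```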

Spin it up:

openshell sandbox create \
  --policy ./virtual-screening.yaml \
  --name vscreen-01 \
  -- claude

For workflows like DiffDock docking or downstream ML scoring you will want GPUs inside the sandbox. OpenShell supports this through a --gpu flag, but the default base image does not ship CUDA; you pass --from a GPU-enabled sandbox image (either a community one or your own BYOC image), and the CLI auto-selects CDI or NVIDIA's --gpus all path as available:

openshell sandbox create \
  --policy ./virtual-screening.yaml \
  --name vscreen-01 \
  --gpu --from gpu-enabled-sandbox \
  -- claude

Step 2: load the skills

Inside the sandbox, install Scientific Agent Skills. Because the policy explicitly allows the npm registry, codeload.github.com, and read-only access to api.github.com, this is a one-liner:

npx skills add K-Dense-AI/scientific-agent-skills

For production use you would almost certainly bake the skills into a custom sandbox image (via --from) so there is no install step at runtime and the npm_and_github block can be removed from the policy entirely. That is the stricter setup; the version above is the friendlier one for iteration.

Your agent can now discover and load, on demand, skills covering exactly the workflow:

  • database-lookup: unified REST access to ChEMBL, PubChem, UniProt, AlphaFold, and 74 other databases
  • rdkit: molecular manipulation, SAR, descriptor calculation
  • datamol: analog generation and lead optimization
  • diffdock: blind docking against protein structures
  • medchem: drug-likeness filters
  • scientific-writing, document-skills: the final report

Because skills use progressive disclosure, the agent does not drag all 133 SKILL.md files into context at the start of the conversation. It loads a compact index of names and descriptions at startup (a few thousand tokens), and only pulls the full instructions for a skill when it decides the task actually needs it.
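Progressive disclosure can be pictured as a two-phase lookup: a cheap index at startup, full instructions on demand. A hypothetical sketch (the one-SKILL.md-per-directory layout matches the repo's convention; the loader itself is illustrative):

```python
from pathlib import Path

def build_index(skills_root):
    """Phase 1: load only name + first line per skill (a few tokens each)."""
    index = {}
    for skill_md in Path(skills_root).glob("*/SKILL.md"):
        first_line = skill_md.read_text().splitlines()[0]
        index[skill_md.parent.name] = first_line
    return index

def load_skill(skills_root, name):
    """Phase 2: pull the full instructions only when the task needs them."""
    return (Path(skills_root) / name / "SKILL.md").read_text()
```

The agent reasons over the output of `build_index` for routing, and pays the full context cost of `load_skill` only for the handful of skills a given task touches.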

Step 3: run the science

With the environment and the knowledge both in place, you can hand the agent a task that would have been a multi-week rotation for a first-year graduate student a few years ago:

Query ChEMBL for EGFR inhibitors with IC50 < 50 nM and a molecular weight below 500 Da. Analyze structure–activity relationships with RDKit, generate 50 improved analogs with datamol, dock them against the AlphaFold structure of EGFR with DiffDock, filter with MedChem, rank the top 10, and produce a methods-and-results report.

The agent decomposes the task, loads the relevant skills, queries ChEMBL (allowed), downloads an AlphaFold structure (allowed), writes intermediate files in /sandbox (allowed), and produces a final PDF. It also silently tries to curl https://random-host.invalid/... at one point, because models do that; the proxy denies it, logs it, and the workflow continues. You review the logs after the run, not during.
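The MedChem-style filtering step in that run reduces, at its core, to property thresholds like the MW < 500 Da cut in the prompt. A stdlib sketch (the thresholds follow Lipinski's rule of five; the medchem skill's actual filters are richer, and the dict fields here are illustrative):

```python
def passes_rule_of_five(mol):
    """Lipinski's rule of five: the classic drug-likeness gate."""
    return (
        mol["mw"] <= 500       # molecular weight, Da
        and mol["logp"] <= 5   # lipophilicity
        and mol["hbd"] <= 5    # hydrogen-bond donors
        and mol["hba"] <= 10   # hydrogen-bond acceptors
    )

analogs = [
    {"id": "analog-01", "mw": 412.4, "logp": 3.1, "hbd": 2, "hba": 6},
    {"id": "analog-02", "mw": 587.7, "logp": 4.8, "hbd": 3, "hba": 9},  # too heavy
]
print([a["id"] for a in analogs if passes_rule_of_five(a)])  # → ['analog-01']
```

In the real pipeline these descriptors come from RDKit rather than hand-entered dicts, but the shape of the decision is the same.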

Step 4: iterate the policy without restarting

The most pleasant ergonomic property of this setup is that network_policies is hot-reloadable. If, while watching a run, you realize the agent legitimately needs access to string-db.org to look up protein-protein interactions, you do not lose state. You edit the YAML, then:

openshell policy set vscreen-01 --policy ./virtual-screening.yaml --wait

The policy is re-applied in place. The agent's next network call to STRING succeeds. The filesystem and process constraints stay locked at their original creation-time values, so widening the network does not give the agent new filesystem power. That separation ("network policy is liquid, filesystem and process policy is load-bearing structure") matches how scientists actually work: you discover data sources during a project, but you know from day one that ~/Desktop/paper_draft is off limits.

Second recipe: a clinical variant agent that sees patient data and nothing else

The virtual screening story is useful, but it is the easy case; none of that data is really sensitive. The more interesting use case, and the one that would make most compliance offices sit up, is running an autonomous agent against patient data.

Here is a sketch of a policy for a clinical variant interpretation pipeline, where the agent runs locally against a VCF cohort that must not leave the host:

version: 1

filesystem_policy:
  include_workdir: false
  read_only:
    - /usr
    - /lib
    - /etc
    - /opt/skills
    - /data/vcfs
    - /data/reference/grch38
  read_write:
    - /sandbox/reports
    - /tmp

landlock:
  compatibility: best_effort

process:
  run_as_user: sandbox
  run_as_group: sandbox

network_policies:
  ensembl_vep:
    name: ensembl_vep
    endpoints:
      - host: rest.ensembl.org
        port: 443
        protocol: rest
        enforcement: enforce
        access: read-only
    binaries:
      - { path: /usr/local/bin/claude }
      - { path: /usr/bin/python3 }

  clinvar_entrez:
    name: clinvar_entrez
    endpoints:
      - host: eutils.ncbi.nlm.nih.gov
        port: 443
        protocol: rest
        enforcement: enforce
        access: read-only
    binaries:
      - { path: /usr/local/bin/claude }
      - { path: /usr/bin/python3 }

  clinicaltrials_gov:
    name: clinicaltrials_gov
    endpoints:
      - host: clinicaltrials.gov
        port: 443
        protocol: rest
        enforcement: enforce
        access: read-only
    binaries:
      - { path: /usr/local/bin/claude }
      - { path: /usr/bin/python3 }

Three properties are worth calling out.

First, /data/vcfs is read-only. The agent can load, parse, and annotate variants with pysam (a Scientific Agent Skill), but it physically cannot mutate the cohort. The only writable path is /sandbox/reports, which is where the clinical write-up lands. If someone asks, later, "are you sure the agent didn't modify the VCFs?", the answer is not "we checked git blame", it is "the Landlock LSM policy made it a kernel-level impossibility".

Second, include_workdir: false. By default, OpenShell mounts the current working directory into the sandbox. For clinical work, you explicitly do not want that. You want to point at a curated data directory and nothing else.

Third, network access is exactly three domains, each with access: read-only at the application layer. A confused agent that tries to POST patient-identifying data to ClinVar is blocked not by politeness but by an HTTP proxy that denies the method at L7. Routine GET traffic for variant lookups proceeds normally.

Give this environment the right Scientific Agent Skills (pysam, database-lookup, clinical-reports, treatment-plans, citation-management), and you have an agent that can produce a hereditary cancer risk report on a cohort without ever being in a position to leak it. The skills give it the domain expertise; the policy gives it the permission to use that expertise only in ways your institution has signed off on.

Third recipe: keeping inference on-premises

The last protection layer is one people often forget. Even if the agent's data never leaves the sandbox, every token of patient context it reasons over must, by default, travel to a hosted LLM provider.

OpenShell's privacy router addresses this. It lets you reroute calls to inference.local (or to any named inference endpoint) to a controlled backend of your choice, whether an on-prem model, a BYOC endpoint, or a regional deployment, and strips caller credentials from the outbound call while injecting the backend's credentials. The sandbox believes it is calling the same API it always did; in reality, traffic never hits a public provider.

openshell inference set \
  --provider on-prem \
  --model llama-3.1-70b-instruct

For regulated workloads (patient data, trade secrets, unpublished results that will be published in 12 weeks), this is the difference between "the data stays here" being a thing you assert and a thing you enforce.
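The credential handling described above amounts to rewriting the Authorization header at the router before retargeting the call. A hypothetical sketch (the header names are standard HTTP, but the function is illustrative, not the router's implementation):

```python
# Illustrative model of the privacy router's credential swap: the
# caller's key is stripped, the backend's key is injected, and the
# request is retargeted at a controlled endpoint.

def reroute_inference(request, backend_url, backend_key):
    headers = {
        k: v for k, v in request["headers"].items()
        if k.lower() not in ("authorization", "x-api-key")  # strip caller creds
    }
    headers["Authorization"] = f"Bearer {backend_key}"      # inject backend creds
    return {"url": backend_url, "headers": headers, "body": request["body"]}

req = {
    "headers": {"x-api-key": "sk-caller-secret", "Content-Type": "application/json"},
    "body": '{"messages": []}',
}
routed = reroute_inference(req, "https://llm.internal.lab/v1/chat", "onprem-key")
# the caller's key never leaves the host; only the backend key goes out
```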

Why this is a better primitive than "just use Docker"

You may be thinking: I already run my agents in Docker, or in a VM, or in a Codespace. What does this buy me?

Three concrete things.

Kernel-level enforcement, not just namespace isolation. Landlock LSM and seccomp operate below the container runtime. When the agent tries a disallowed filesystem operation or syscall, the kernel says no. There is no prompt the model can emit that makes that no into a yes.

Application-layer network policy, not just allow-all egress. A standard Docker container with egress networking is one curl away from exfiltration. OpenShell's proxy enforces at HTTP method and path granularity, with per-binary rules. GET rest.ensembl.org/vep/human/hgvs/* can be allowed while POST * is denied, all without rebuilding anything.

Hot-reloadable policy for the liquid parts, locked policy for the structural parts. Iterating on filesystem or process policy should be painful: it represents decisions that have external compliance implications. Iterating on which data sources the agent can reach should be cheap, because that is what you learn during a research project. OpenShell gets this split right.

All three are the kind of primitive that researchers will want anyway, the first time they think carefully about what an autonomous agent is authorized to do with their environment.

Building it into your lab's workflow

If you are a single scientist who wants to try this tomorrow, the minimal path is:

# 1. Install OpenShell
curl -LsSf https://raw.githubusercontent.com/NVIDIA/OpenShell/main/install.sh | sh

# 2. Write a policy. Start from the examples in
#    examples/sandbox-policy-quickstart/ in the OpenShell repo
#    and adapt it for your project.
vim my-project.yaml

# 3. Create a sandbox with your agent of choice
openshell sandbox create --policy ./my-project.yaml -- claude

# 4. Inside the sandbox, add scientific skills
npx skills add K-Dense-AI/scientific-agent-skills

# 5. Do science

If you are a PI or a computational core thinking about setting this up for a group, the better abstraction is probably per-project sandbox templates: one YAML per project, checked into the project's repo alongside the code, reviewed during onboarding the same way environment.yml is. The policy becomes a first-class artifact of the science, reviewed by the PI, pinned in preregistration documents, cited in the methods section.

For the full experience (skills, sandboxing, managed compute, publication-ready outputs, and hundreds of additional workflow skills you cannot get in the open repo), K-Dense Web assembles all of this behind a single interface. But the open-source primitives are genuinely usable on their own, and for many labs that is the right starting point.

What this changes, concretely

If you internalize this pattern (a sandboxed, policy-governed runtime plus a library of curated skills), a few practical things change about how you run computational projects.

You stop hand-waving about agent safety. You do not have to defend "but my container is sandboxed" in front of a review board; you can point at a YAML that encodes which hosts, paths, and syscalls the agent was allowed to use during a run, and at logs of every denial.

You stop treating "the environment" as a soft artifact. The policy is as much a part of the experiment as the code, and evolves on the same clock as the science.

You stop conflating capability and authority. A new Scientific Agent Skill that teaches the agent to run an unfamiliar workflow does not also grant the agent network access to a new endpoint. The two systems are orthogonal, and you decide whether to extend each one separately.

And you start letting the agent do more. That is, paradoxically, the most important effect. When the downside of a rogue action is bounded by policy, you stop over-constraining the upside. The agent gets to be genuinely autonomous within a space you have already decided is survivable, and autonomy is where the productivity gains of modern AI for science actually live.


If you try this pattern and find a rough edge, or a recipe worth sharing, both repositories are built to receive that feedback. The NVIDIA OpenShell issue tracker is the right place for runtime questions; the Scientific Agent Skills repository is the right place for new skills or improvements to existing ones. The interesting work for the next several years of AI-for-science will happen at exactly this seam between "what the agent knows" and "what the agent is allowed to do". It is a good time to start pulling on that thread.


Try K-Dense Web for the managed experience: app.k-dense.ai →

Questions or a workflow you want to share? Join our Slack community or email contact@k-dense.ai.
