Skip to content

feat(server): support out-of-tree compute drivers via --compute-driver-socket#1703

Open
st-gr wants to merge 5 commits into
NVIDIA:mainfrom
st-gr:feat/external-compute-driver-socket
Open

feat(server): support out-of-tree compute drivers via --compute-driver-socket#1703
st-gr wants to merge 5 commits into
NVIDIA:mainfrom
st-gr:feat/external-compute-driver-socket

Conversation

@st-gr
Copy link
Copy Markdown

@st-gr st-gr commented Jun 3, 2026

Summary

Adds an opt-in --compute-driver-socket=<path> flag (env OPENSHELL_COMPUTE_DRIVER_SOCKET) to the gateway. When set, the gateway dispatches sandbox lifecycle to an out-of-tree compute driver speaking the existing compute_driver.proto contract over a Unix domain socket, instead of one of the four built-in drivers (Kubernetes / Podman / Docker / VM).

Related Issue

Supersedes the closed draft #1604 (same author, same patch shape — re-opened as a non-draft after rebasing onto current main, with all commits DCO-signed and st-gr-attributed). Same direction as cheese-head's vouch in #1345 ("extend or provide new out-of-tree compute drivers to OpenShell").

No upstream tracking issue filed — happy to file one if reviewers prefer.

Changes

  • crates/openshell-core/src/config.rs: adds External(PathBuf) variant to ComputeDriverKind. Drops the Copy derive and const fn as_str(self) (both incompatible with PathBuf). FromStr accepts external:<path> (case-insensitive prefix) and rejects bare external with a message pointing at the CLI flag.
  • crates/openshell-server/src/cli.rs: adds --compute-driver-socket=<path> flag (env OPENSHELL_COMPUTE_DRIVER_SOCKET) on RunArgs. When set, pins ComputeDriverKind::External(<path>) and skips both --drivers and the auto-detection probe.
  • crates/openshell-server/src/config_file.rs: surfaces the new field through the config-file path.
  • crates/openshell-server/src/lib.rs / crates/openshell-server/src/compute/mod.rs: in build_compute_runtime, the External(<path>) arm connects a tonic::Channel to the UDS via hyper-util::TokioIo and wraps it in RemoteComputeDriver — the same proxy used by the in-tree VM driver.
  • architecture/compute-runtimes.md: adds an "External" row to the Runtime Summary table, describing the trust boundary (operator-owned UDS file permissions) and the activation flag.

Testing

  • mise run pre-commit passes — not run end-to-end (mise not on author's dev environment), but the equivalent rust pieces verified independently: cargo fmt --all -- --check clean; cargo clippy --no-deps -p openshell-core -p openshell-server --all-targets -- -D warnings clean.
  • Unit tests added/updated — 9 compute_driver_kind_* tests in crates/openshell-core/src/config.rs::tests cover the new variant: parses external:<path>, rejects bare external and empty path, displays as external:<path>, round-trips through FromStr+Display, and is case-insensitive on the prefix. Plus 3 CLI-arg tests for the new flag (presence, override-of-drivers, env-var binding).
  • E2E tests added/updated — none — running an External driver in CI would require shipping a stub driver binary; deferring until at least one in-tree consumer wants to test against it. The patch is exercised in production at SAP via the st-gr/openshell-driver-kyma reference driver running against a Gardener-managed Kyma cluster since 2026-05-28 (sandbox CR creation, openshell sandbox exec, openshell sandbox upload/download all work).

Checklist

  • Follows Conventional Commits (feat(core):, feat(server):, docs(core,server):, docs(arch):)
  • Commits are signed off (DCO)
  • Architecture docs updated — added "External" row to architecture/compute-runtimes.md::Runtime Summary. Per-runtime implementation-notes section is intentionally NOT extended for External because external drivers ship their own documentation; the reference implementation lives at st-gr/openshell-driver-kyma.

Notes for reviewers

Out of scope (intentional):

  • No new in-tree compute driver implementations. This PR only adds the protocol surface (External(PathBuf) + UDS dispatch). Operators bring their own driver process and point --compute-driver-socket at its socket path.
  • No compute_driver.proto changes. The existing protocol is reused verbatim — the same one already used by the in-tree VM driver to talk to its tonic peer.
  • No security model changes. UDS endpoint security is the operator's responsibility (filesystem permissions, sidecar isolation), matching the --drivers vm posture.
  • No new dependencies. hyper-util is already in the workspace Cargo.toml; the patch reuses it via the same TokioIo connector path the VM driver uses.

Operator context:

The forked gateway image at ghcr.io/st-gr/openshell-gateway (carrying these patches plus an in-tree Dockerfile.gateway) has been driving production-shaped Kubernetes clusters with the st-gr/openshell-driver-kyma compute driver since 2026-05-28. The flag has needed zero changes since it landed — the API shape is stable. Merging this PR lets the driver consume the unmodified upstream gateway image, retiring the fork.

st-gr and others added 5 commits June 2, 2026 20:56
Carries the UDS path supplied by --compute-driver-socket. Drops Copy
from the enum derive (PathBuf is not Copy); existing callers use Clone
or owned values. FromStr accepts 'external:<path>' and rejects bare
'external' with a message pointing at the CLI flag.

Signed-off-by: st-gr <38470677+st-gr@users.noreply.github.com>
Adds an opt-in `--compute-driver-socket=PATH` flag (env
`OPENSHELL_COMPUTE_DRIVER_SOCKET`) on the gateway's `RunArgs`. When
set, the gateway pins `ComputeDriverKind::External(<path>)` and
skips both the `--drivers` list and the auto-detection probe. This
lets out-of-tree driver binaries (Kyma, custom backends) connect
to a stock gateway without a rebuild.

`effective_single_driver` and the `Config.compute_drivers` payload
both honour the new flag so pre-runtime checks and the runtime
factory dispatch agree on the configured driver. The companion
dispatch arm in `lib.rs::build_compute_runtime` is wired in the
follow-up commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: st-gr <38470677+st-gr@users.noreply.github.com>
`build_compute_runtime` now connects a tonic `Channel` to the
configured Unix domain socket using `hyper-util`'s `TokioIo`
connector and wraps it in `RemoteComputeDriver` -- the same proxy
used by the VM driver. Replaces the placeholder
`External(_) => Err(...)` arm.

Adds two helpers in `compute/mod.rs`:

* `connect_external_compute_driver(socket_path)` -- a small
  tonic-Endpoint + tower::service_fn + UnixStream connector,
  parallel to the one in `compute::vm` but with no VM-specific
  logging or capability probing. Out-of-tree drivers manage their
  own readiness; the gateway just dials.
* `ComputeRuntime::new_remote_external(channel, ...)` -- mirrors
  `new_remote_vm` but takes no `ManagedDriverProcess`. The
  external driver's lifecycle is the operator's responsibility
  (systemd unit, sidecar container, etc.).

Smoke-tested: `--compute-driver-socket /tmp/nonexistent.sock`
now starts the gateway, logs "Connecting to external compute
driver", and fails with a clear "failed to connect to external
compute driver socket '<path>': transport error" message that
points at the new arm.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: st-gr <38470677+st-gr@users.noreply.github.com>
…uteDriverKind

Adds three short inline comments at the points where the
non-`Copy`-ness of `ComputeDriverKind` becomes visible to readers,
so reviewers don't have to ask "why was Copy removed?" or "why
clone() instead of *deref?":

- `crates/openshell-core/src/config.rs`: comment above the enum
  derive explaining the `Copy` and `const fn as_str(self)` revert
  was forced by the `External(PathBuf)` variant, and that all
  in-tree call sites already handle `Clone`.
- `crates/openshell-server/src/cli.rs`: comment at the
  `effective_single_driver` match arm noting why we clone instead
  of deref.
- `crates/openshell-server/src/lib.rs`: same comment shape at the
  `configured_compute_driver` match arm.

Squash into the corresponding earlier commit if preferred — split
out as its own commit to make the rationale grep-able.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: st-gr <38470677+st-gr@users.noreply.github.com>
Adds a row to the Runtime Summary table describing the new
`--compute-driver-socket=<path>` flag and how the External arm relates
to the existing in-tree drivers (operator-managed driver process,
Unix domain socket, identical proto contract to the VM driver, same
trust boundary).

Keeps the rest of the document untouched — the per-runtime
implementation notes (e.g. which driver crate README to read) are
intentionally NOT extended for `External` because external drivers
ship their own documentation in their own repos. The reference
implementation lives at:

    https://github.com/st-gr/openshell-driver-kyma

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: st-gr <38470677+st-gr@users.noreply.github.com>
@st-gr st-gr requested review from a team, derekwaynecarr, maxamillion and mrunalp as code owners June 3, 2026 05:27
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented Jun 3, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 3, 2026

All contributors have signed the DCO ✍️ ✅
Posted by the DCO Assistant Lite bot.

@st-gr
Copy link
Copy Markdown
Author

st-gr commented Jun 3, 2026

I have read the DCO document and I hereby sign the DCO.

@st-gr
Copy link
Copy Markdown
Author

st-gr commented Jun 3, 2026

recheck

@drew drew self-assigned this Jun 3, 2026
@drew
Copy link
Copy Markdown
Collaborator

drew commented Jun 3, 2026

/ok to test 8689967

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants