Skip to content

Improve Firecracker fork concurrency#263

Merged
sjmiller609 merged 70 commits into
mainfrom
hypeship/fc-fork-concurrency-v2
Jun 4, 2026
Merged

Improve Firecracker fork concurrency#263
sjmiller609 merged 70 commits into
mainfrom
hypeship/fc-fork-concurrency-v2

Conversation

@sjmiller609
Copy link
Copy Markdown
Collaborator

@sjmiller609 sjmiller609 commented Jun 1, 2026

Summary

  • rewrites Firecracker snapshot-state data-dir paths for forks when an equal-length replacement is possible, including CRC recomputation
  • avoids the temporary source-dir alias on restore when the snapshot state can be rewritten directly
  • narrows alias locking to only the restore windows that still need a temporary source-dir alias
  • gates concurrent stopped/standby fork preparation behind SupportsConcurrentForkPrepare, with Firecracker opting in and other hypervisors keeping the previous exclusive source lock behavior

Tests

  • go test ./lib/hypervisor/firecracker ./lib/hypervisor ./lib/mailbox ./lib/guest ./lib/system/guest_agent ./lib/oapi -count=1
  • go test ./lib/instances -run 'TestForkInstanceStoppedSourceUsesReadLock|TestForkInstanceStoppedSourceRequiresWriteLockWithoutConcurrentPrepare' -count=1\n\n## Notes\n- the targeted ./lib/instances tests required local placeholder files for embedded test binaries that are not present in this checkout.

Note

Medium Risk
Changes fork, restore, and snapshot locking paths plus shared restore admission; behavior is well covered by tests but mistakes could cause rare restore/fork races or overload.

Overview
This PR improves Firecracker fork/restore concurrency and surfaces resource exhaustion more clearly on fork/restore APIs.

Firecracker fork prepare can now rewrite snapshot state file paths (with CRC64 validation) when source and target data-dir strings are the same length, so many forks avoid the temporary source data-dir symlink alias during restore. When rewrite is not possible, prepare reports RequiresSnapshotSourceAlias and restore still uses a narrow, shared alias lock (readers block on metadata/copy; writers only during the alias mutation window).

Instance fork uses a source read lock for stopped/standby sources when the hypervisor advertises SupportsConcurrentForkPrepare (Firecracker enables it); other hypervisors keep the exclusive source lock. Fork metadata admission is serialized with a final name check before save, and snapshot memory prepare is per snapshot directory.

Restore capacity is capped via configurable firecracker_max_concurrent_restores (default 32), with a shared slot pool across managers for that hypervisor/limit.

API maps ErrInsufficientResources to 409 with code insufficient_resources on restore, fork instance, and fork snapshot.

Reviewed by Cursor Bugbot for commit bb2602e. Bugbot is set up for automated code reviews on this repo. Configure here.

@sjmiller609 sjmiller609 force-pushed the hypeship/fc-fork-concurrency-v2 branch from cd4377a to c1b06c9 Compare June 1, 2026 19:55
@sjmiller609 sjmiller609 changed the base branch from hypeship/fc-resume-on-load-v2 to main June 1, 2026 19:55
Comment thread lib/hypervisor/firecracker/snapshot_state.go
Comment thread lib/hypervisor/firecracker/snapshot_state.go
Comment thread lib/hypervisor/firecracker/snapshot_state_fork.go
Comment thread lib/hypervisor/hypervisor.go
@sjmiller609 sjmiller609 marked this pull request as ready for review June 2, 2026 14:42
@firetiger-agent
Copy link
Copy Markdown

Firetiger deploy monitoring skipped

This PR didn't match the auto-monitor filter configured on your GitHub connection:

PRs in the kernel, infra, hypeman, and hypeship repos. kernel is a ~mono repo with many logical services underneath, ensure to focus on the implicated service for the PR

Reason: PR repository is not specified, making it impossible to confirm it belongs to one of the allowed repos (kernel, infra, hypeman, hypeship); please clarify which repo this PR is in or opt in manually for deploy monitoring.

To monitor this PR anyway, reply with @firetiger monitor this.

Comment thread lib/instances/snapshot_alias_lock.go
Comment thread lib/instances/snapshot_alias_lock.go
Comment thread lib/hypervisor/firecracker/snapshot_state_fork.go Outdated
Comment thread config.example.yaml
Comment thread lib/instances/manager.go
Comment thread lib/instances/snapshot_alias_lock.go
Comment thread lib/network/bridge_linux.go
Comment thread lib/instances/manager.go
Comment thread lib/instances/manager.go
Comment thread lib/instances/storage.go
Comment thread lib/instances/fork.go
Comment thread lib/instances/snapshot.go
Comment thread lib/instances/manager.go
Comment thread lib/instances/fork.go Outdated
Comment thread lib/instances/manager.go
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit ba08c4c. Configure here.

Comment thread lib/instances/manager.go
@sjmiller609 sjmiller609 force-pushed the hypeship/fc-fork-concurrency-v2 branch from cb2d85a to b3ae500 Compare June 3, 2026 15:06
Comment thread lib/instances/manager.go Outdated
@sjmiller609 sjmiller609 requested a review from rgarcia June 3, 2026 15:43
@sjmiller609 sjmiller609 force-pushed the hypeship/fc-fork-concurrency-v2 branch from 724ba4b to c48afe5 Compare June 4, 2026 17:48
@sjmiller609 sjmiller609 requested a review from hiroTamada June 4, 2026 18:46
Copy link
Copy Markdown
Contributor

@rgarcia rgarcia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did a read-through focused on the fork/restore concurrency changes. Overall the capability-gated approach is clean and the snapshot-state path rewrite is a nice way to avoid the symlink dance. Three things I'd like to discuss before approving — none are blockers on their own, but #1 and #2 are about long-term operability.

1. The symlink-alias fallback is load-bearing but invisible, and gives us two failure modes to reason about.

For the common instance→instance fork the rewrite is guaranteed to work (IDs are fixed-length cuid2, so guests/<id> source and target are always equal length). The fallback in recordSnapshotSourceAliasFallback really only matters for (a) symlink-resolved paths from EvalSymlinks and (b) diff-snapshot chains that retain an upstream base path (RetainSnapshotSourceDataDirAlias), where the alias is genuinely required for correctness.

My concern is operational: the alias path is slower and globally serialized, and right now it's entered silently. During an incident I don't want to have to reason about whether a host is hitting the rewrite path or the alias path. Two asks:

  • Emit a metric/log counter whenever the alias fallback fires, so "are we on the slow path?" is a one-query answer.
  • Consider making the non-rewritable case a hard error unless it's the legitimate retained-base diff-snapshot case. That way "alias path" only ever means the one well-understood scenario, instead of a catch-all.

If there's a path to giving diff-snapshot base reuse a stable fixed-length canonical path, we could eventually drop the alias entirely — but I don't think that needs to happen in this PR.

2. Firecracker-specific concept leaking into the generic instances manager.

This PR introduces the clean pattern for hypervisor-specific behavior — a capability plus supportsConcurrentForkPrepare(hvType) — and then doesn't use it for the restore limiter. acquireFirecrackerRestoreSlot hard-codes stored.HypervisorType != hypervisor.TypeFirecracker, and we now have a firecrackerRestoreSlots field on the generic manager, a firecrackerRestoreSlotPools package global, and FirecrackerMaxConcurrentRestores in the manager config. To stay consistent with the capability approach a few lines up, I'd rather see the concurrency limit advertised by the hypervisor (e.g. a MaxConcurrentRestores on Capabilities or typed config keyed by hypervisor.Type) and the manager keep a generic map[hypervisor.Type]chan struct{}, so manager.go/restore.go never name Firecracker directly. To be clear, the existence of a restore limiter is fine and probably inherent to Firecracker's restore cost — it's the naming and the hard-coded type switch in the generic layer I'd like to clean up.

3. ForkInstance's read-lock logic needs explanation.

The useReadLock block is a double-checked-locking pattern (load metadata under RLock, drop it, re-acquire and re-load to re-validate) and its correctness isn't self-evident. The escalation condition (useReadLock && sourceState == StateStandby && targetState == StateRunning && targetRestoreNeedsSourceLock) is also a dense boolean. Please add a comment explaining why we re-validate under the second lock and what each escalation condition guards against — this is exactly the kind of concurrency code the next person will be afraid to touch without context.

Also +1 on getting the Bugbot duplicate-name finding resolved (concurrent same-name forks both passing instanceNameExists/NameExists under the read lock) before merge, unless the serialized name-admission already covers it — if so, a comment pointing at where that's enforced would help.

@sjmiller609 sjmiller609 requested a review from rgarcia June 4, 2026 21:05
Copy link
Copy Markdown
Contributor

@rgarcia rgarcia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving — the review feedback is well addressed:

  • Hypervisor leakage: clean refactor. acquireRestoreSlot(ctx, hvType) + restoreSlotsByHypervisor map[hypervisor.Type]chan struct{} + MaxConcurrentRestoresByHypervisor, with the only Firecracker-specific wiring now living in providers.go. Exactly the capability-consistent shape I was hoping for. (Tiny nit, non-blocking: defaultManagerFirecrackerMaxConcurrentRestores is now applied as the generic per-hypervisor default, so the name reads a little oddly.)
  • Alias fallback visibility: the WarnContext lines in both fork paths are enough to answer "are we on the slow path?" during an incident. A counter metric would be a nice future add but I don't consider it blocking.
  • useReadLock docs: the three added comments make the double-checked-lock + escalation logic readable. Thank you.
  • Duplicate fork names: agreed that saveForkMetadata under forkMetadataMu with the re-check before save is the right serialized admission point.

Two open Bugbot threads still look valid to me and don't appear to be touched by 32b62db. Approving so as not to block, but I'd like these resolved (or explicitly dispositioned) before merge:

  1. VsockCID collision on concurrent standby forks. fork.go still assigns standby forks the source's CID (forkMeta.VsockCID = stored.VsockCID) while stopped forks get a fresh generateVsockCID. With the new read-lock concurrency, multiple standby forks of one source can be prepared in parallel and then started to running, all sharing the source CID → vsock routing / guest-agent breakage. The comment says rewriting CID in restored snapshots isn't reliable across hypervisors, which I get — but then we likely need to either reallocate the CID at restore time for forks or serialize "running" admission per source CID. Can you confirm what prevents two running guests with the same CID here?

  2. Stale source state in the escalated write-lock restore. applyForkTargetStateWithSourceLock re-takes the source write lock, but the targetRestoreNeedsSourceLock / sourceState == StateStandby decision was captured under an earlier RLock that was already released. Nothing re-validates the source state under the write lock before the aliased restore runs. If the source was restored to running in that gap, we could alias its data dir while its VMM is live. Does holding the source write lock actually guarantee the source isn't running by this point, or should we re-load + re-check source state under the write lock before aliasing?

Neither is a correctness concern for the common (rewrite-succeeds, single-fork) path, so I'm comfortable approving — but I'd treat #2 as the higher-risk of the two given it's the data-dir-corruption path.

@sjmiller609 sjmiller609 merged commit 2211346 into main Jun 4, 2026
11 checks passed
@sjmiller609 sjmiller609 deleted the hypeship/fc-fork-concurrency-v2 branch June 4, 2026 21:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants