Improve Firecracker fork concurrency#263
Conversation
cd4377a to
c1b06c9
Compare
|
Firetiger deploy monitoring skipped This PR didn't match the auto-monitor filter configured on your GitHub connection:
Reason: PR repository is not specified, making it impossible to confirm it belongs to one of the allowed repos (kernel, infra, hypeman, hypeship); please clarify which repo this PR is in or opt in manually for deploy monitoring. To monitor this PR anyway, reply with |
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 2 total unresolved issues (including 1 from previous review).
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit ba08c4c. Configure here.
cb2d85a to
b3ae500
Compare
This reverts commit a7bb3c5.
724ba4b to
c48afe5
Compare
rgarcia
left a comment
There was a problem hiding this comment.
Did a read-through focused on the fork/restore concurrency changes. Overall the capability-gated approach is clean and the snapshot-state path rewrite is a nice way to avoid the symlink dance. Three things I'd like to discuss before approving — none are blockers on their own, but #1 and #2 are about long-term operability.
1. The symlink-alias fallback is load-bearing but invisible, and gives us two failure modes to reason about.
For the common instance→instance fork the rewrite is guaranteed to work (IDs are fixed-length cuid2, so guests/<id> source and target are always equal length). The fallback in recordSnapshotSourceAliasFallback really only matters for (a) symlink-resolved paths from EvalSymlinks and (b) diff-snapshot chains that retain an upstream base path (RetainSnapshotSourceDataDirAlias), where the alias is genuinely required for correctness.
My concern is operational: the alias path is slower and globally serialized, and right now it's entered silently. During an incident I don't want to have to reason about whether a host is hitting the rewrite path or the alias path. Two asks:
- Emit a metric/log counter whenever the alias fallback fires, so "are we on the slow path?" is a one-query answer.
- Consider making the non-rewritable case a hard error unless it's the legitimate retained-base diff-snapshot case. That way "alias path" only ever means the one well-understood scenario, instead of a catch-all.
If there's a path to giving diff-snapshot base reuse a stable fixed-length canonical path, we could eventually drop the alias entirely — but I don't think that needs to happen in this PR.
2. Firecracker-specific concept leaking into the generic instances manager.
This PR introduces the clean pattern for hypervisor-specific behavior — a capability plus supportsConcurrentForkPrepare(hvType) — and then doesn't use it for the restore limiter. acquireFirecrackerRestoreSlot hard-codes stored.HypervisorType != hypervisor.TypeFirecracker, and we now have a firecrackerRestoreSlots field on the generic manager, a firecrackerRestoreSlotPools package global, and FirecrackerMaxConcurrentRestores in the manager config. To stay consistent with the capability approach a few lines up, I'd rather see the concurrency limit advertised by the hypervisor (e.g. a MaxConcurrentRestores on Capabilities or typed config keyed by hypervisor.Type) and the manager keep a generic map[hypervisor.Type]chan struct{}, so manager.go/restore.go never name Firecracker directly. To be clear, the existence of a restore limiter is fine and probably inherent to Firecracker's restore cost — it's the naming and the hard-coded type switch in the generic layer I'd like to clean up.
3. ForkInstance's read-lock logic needs explanation.
The useReadLock block is a double-checked-locking pattern (load metadata under RLock, drop it, re-acquire and re-load to re-validate) and its correctness isn't self-evident. The escalation condition (useReadLock && sourceState == StateStandby && targetState == StateRunning && targetRestoreNeedsSourceLock) is also a dense boolean. Please add a comment explaining why we re-validate under the second lock and what each escalation condition guards against — this is exactly the kind of concurrency code the next person will be afraid to touch without context.
Also +1 on getting the Bugbot duplicate-name finding resolved (concurrent same-name forks both passing instanceNameExists/NameExists under the read lock) before merge, unless the serialized name-admission already covers it — if so, a comment pointing at where that's enforced would help.
rgarcia
left a comment
There was a problem hiding this comment.
Approving — the review feedback is well addressed:
- Hypervisor leakage: clean refactor.
acquireRestoreSlot(ctx, hvType)+restoreSlotsByHypervisor map[hypervisor.Type]chan struct{}+MaxConcurrentRestoresByHypervisor, with the only Firecracker-specific wiring now living inproviders.go. Exactly the capability-consistent shape I was hoping for. (Tiny nit, non-blocking:defaultManagerFirecrackerMaxConcurrentRestoresis now applied as the generic per-hypervisor default, so the name reads a little oddly.) - Alias fallback visibility: the
WarnContextlines in both fork paths are enough to answer "are we on the slow path?" during an incident. A counter metric would be a nice future add but I don't consider it blocking. useReadLockdocs: the three added comments make the double-checked-lock + escalation logic readable. Thank you.- Duplicate fork names: agreed that
saveForkMetadataunderforkMetadataMuwith the re-check before save is the right serialized admission point.
Two open Bugbot threads still look valid to me and don't appear to be touched by 32b62db. Approving so as not to block, but I'd like these resolved (or explicitly dispositioned) before merge:
-
VsockCID collision on concurrent standby forks.
fork.gostill assigns standby forks the source's CID (forkMeta.VsockCID = stored.VsockCID) while stopped forks get a freshgenerateVsockCID. With the new read-lock concurrency, multiple standby forks of one source can be prepared in parallel and then started to running, all sharing the source CID → vsock routing / guest-agent breakage. The comment says rewriting CID in restored snapshots isn't reliable across hypervisors, which I get — but then we likely need to either reallocate the CID at restore time for forks or serialize "running" admission per source CID. Can you confirm what prevents two running guests with the same CID here? -
Stale source state in the escalated write-lock restore.
applyForkTargetStateWithSourceLockre-takes the source write lock, but thetargetRestoreNeedsSourceLock/sourceState == StateStandbydecision was captured under an earlier RLock that was already released. Nothing re-validates the source state under the write lock before the aliased restore runs. If the source was restored to running in that gap, we could alias its data dir while its VMM is live. Does holding the source write lock actually guarantee the source isn't running by this point, or should we re-load + re-check source state under the write lock before aliasing?
Neither is a correctness concern for the common (rewrite-succeeds, single-fork) path, so I'm comfortable approving — but I'd treat #2 as the higher-risk of the two given it's the data-dir-corruption path.

Summary
SupportsConcurrentForkPrepare, with Firecracker opting in and other hypervisors keeping the previous exclusive source lock behaviorTests
./lib/instancestests required local placeholder files for embedded test binaries that are not present in this checkout.Note
Medium Risk
Changes fork, restore, and snapshot locking paths plus shared restore admission; behavior is well covered by tests but mistakes could cause rare restore/fork races or overload.
Overview
This PR improves Firecracker fork/restore concurrency and surfaces resource exhaustion more clearly on fork/restore APIs.
Firecracker fork prepare can now rewrite snapshot state file paths (with CRC64 validation) when source and target data-dir strings are the same length, so many forks avoid the temporary source data-dir symlink alias during restore. When rewrite is not possible, prepare reports
RequiresSnapshotSourceAliasand restore still uses a narrow, shared alias lock (readers block on metadata/copy; writers only during the alias mutation window).Instance fork uses a source read lock for stopped/standby sources when the hypervisor advertises
SupportsConcurrentForkPrepare(Firecracker enables it); other hypervisors keep the exclusive source lock. Fork metadata admission is serialized with a final name check before save, and snapshot memory prepare is per snapshot directory.Restore capacity is capped via configurable
firecracker_max_concurrent_restores(default 32), with a shared slot pool across managers for that hypervisor/limit.API maps
ErrInsufficientResourcesto 409 with codeinsufficient_resourceson restore, fork instance, and fork snapshot.Reviewed by Cursor Bugbot for commit bb2602e. Bugbot is set up for automated code reviews on this repo. Configure here.