
Parallelize fork network setup #266

Merged
sjmiller609 merged 6 commits into main from hypeship/fork-network-parallel-v2 on Jun 1, 2026

Conversation

sjmiller609 (Collaborator) commented Jun 1, 2026

Summary

  • splits the network/TAP/TC portion out from the old combined UFFD/network branch
  • bases it on the new restore-network optimization branch
  • supersedes the network portion of "Parallelize Firecracker fork restore hot path" (#258); the old branch is preserved for comparison

Tests

  • git diff --check origin/main...hypeship/fork-network-parallel-v2
  • go test ./lib/network -count=1

Note

Medium Risk
Changes concurrency, pending allocation visibility, and async tc vs release ordering on shared bridge state; mistakes could cause duplicate identities, stale HTB classes, or briefly unshaped traffic after fork.

Overview
Fork/restore network setup is split into a fast blocking path and deferred shaping, so concurrent forks no longer serialize on bridge tc work.

CreateAllocation and RecreateAllocation now reserve name/IP/MAC/TAP under a short mutex (with pending allocations visible to NameExists / GetAllocation before metadata is written), create and bridge the TAP synchronously, and enqueue download/upload limits on a background goroutine guarded by tcMu. TAP creation no longer applies rate limits inline; Linux createTAPDevice only brings up the interface. Release takes tcMu when deleting TAPs so async tc cannot race HTB cleanup, and delete prefers the persisted class ID. Default bridge details are cached after init to avoid repeated netlink queries. Detached OTel spans cover async rate-limit work so restore/request traces show only blocking network steps. Docs and unit tests cover pending state, class ID on delete, and detached tracing.
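
The split described above can be illustrated with a minimal Go sketch. It is not the repository's code: the `Manager` fields, the `reserved` and `shaped` maps, the address format, and the liveness check inside the goroutine are all assumptions made for illustration. Only the overall shape follows the overview: a short mutex for reservation, a reservation that is visible to `NameExists` immediately, deferred shaping guarded by `tcMu`, and `Release` taking `tcMu` before cleanup.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Allocation is a hypothetical stand-in for the reserved network identity.
type Allocation struct {
	Name, IP, TAP string
}

// ErrNameExists is an illustrative error, not the project's actual one.
var ErrNameExists = errors.New("name already exists")

// Manager is an illustrative sketch, not the real network manager.
type Manager struct {
	mu       sync.Mutex            // short lock: guards reservations only
	reserved map[string]Allocation // visible before any metadata is persisted
	nextHost int

	tcMu   sync.Mutex      // serializes shaping work against release
	shaped map[string]bool // TAPs that currently have limits applied
	wg     sync.WaitGroup  // tracks deferred shaping goroutines
}

func NewManager() *Manager {
	return &Manager{reserved: map[string]Allocation{}, shaped: map[string]bool{}}
}

// NameExists reports reservations as soon as they are made.
func (m *Manager) NameExists(name string) bool {
	m.mu.Lock()
	defer m.mu.Unlock()
	_, ok := m.reserved[name]
	return ok
}

// CreateAllocation reserves an identity under the short lock, then defers
// rate limiting to a background goroutine so the caller is not blocked on it.
func (m *Manager) CreateAllocation(name string) (Allocation, error) {
	m.mu.Lock()
	if _, ok := m.reserved[name]; ok {
		m.mu.Unlock()
		return Allocation{}, ErrNameExists
	}
	m.nextHost++
	a := Allocation{Name: name, IP: fmt.Sprintf("10.0.0.%d", m.nextHost+1), TAP: "tap-" + name}
	m.reserved[name] = a
	m.mu.Unlock()

	// The synchronous TAP creation and bridge attach would happen here.

	m.wg.Add(1)
	go func() {
		defer m.wg.Done()
		m.tcMu.Lock()
		defer m.tcMu.Unlock()
		// Assumption: skip shaping if the allocation was released meanwhile.
		if m.NameExists(name) {
			m.shaped[a.TAP] = true
		}
	}()
	return a, nil
}

// Release holds tcMu so deferred shaping cannot interleave with cleanup.
func (m *Manager) Release(name string) {
	m.tcMu.Lock()
	defer m.tcMu.Unlock()
	m.mu.Lock()
	a, ok := m.reserved[name]
	delete(m.reserved, name)
	m.mu.Unlock()
	if ok {
		delete(m.shaped, a.TAP)
	}
}

// Wait blocks until all deferred shaping has finished.
func (m *Manager) Wait() { m.wg.Wait() }

func main() {
	m := NewManager()
	var forks sync.WaitGroup
	for i := 0; i < 8; i++ {
		forks.Add(1)
		go func(i int) {
			defer forks.Done()
			if _, err := m.CreateAllocation(fmt.Sprintf("vm-%d", i)); err != nil {
				fmt.Println("create failed:", err)
			}
		}(i)
	}
	forks.Wait()
	m.Wait()
	fmt.Println("vm-3 reserved:", m.NameExists("vm-3"))
}
```

In this sketch the lock order is always `tcMu` before `mu`, and `CreateAllocation` never holds `mu` while taking `tcMu`, which avoids a lock-order inversion between the deferred goroutine and `Release`.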

Reviewed by Cursor Bugbot for commit 5a26c31. Bugbot is set up for automated code reviews on this repo.

sjmiller609 force-pushed the hypeship/restore-network-v2 branch from 0e63fd2 to 4d71b40 on June 1, 2026 14:16
Base automatically changed from hypeship/restore-network-v2 to main on June 1, 2026 14:33
sjmiller609 force-pushed the hypeship/fork-network-parallel-v2 branch from f52d030 to e44d81e on June 1, 2026 14:37
sjmiller609 marked this pull request as ready for review on June 1, 2026 14:50
Comment thread lib/network/allocate.go
Comment thread lib/network/derive.go
Comment thread lib/network/derive.go Outdated
sjmiller609 force-pushed the hypeship/fork-network-parallel-v2 branch from e44d81e to fa56b15 on June 1, 2026 14:54
@firetiger-agent

Monitoring Plan: Async Network Rate Limiting and Allocation Refactor

What this PR does: Reduces VM launch latency by making network bandwidth configuration non-blocking — TAP devices are created synchronously but TC rate-limit rules are applied in the background, allowing the VMM to proceed immediately.

Intended effect:

  • VM spawn success rate: baseline ~87–100% across active hours; confirmed if spawn failures (failed to create instance errors) stay within pre-existing range of 12,700–13,100/hr
  • New network OTel spans (network.create_tap, network.rate_limit.apply): baseline none; confirmed if present in traces within 1 hour of deploy
  • TAP creation latency: baseline unchanged; confirmed if Railway 5xx stays at 3–9/hr

Risks:

  • Silent rate-limit miss: failed to apply async download/upload rate limit ERROR logs; alert if any appear post-deploy (baseline: 0)
  • Pending allocation leak / IP collision: already exists, can't assign into same network ERROR logs; alert if any appear (baseline: 0 expected; would cause spawn rejections during concurrent forks)
  • tcMu contention blocking release: Railway HTTP 5xx rate; alert if > 100/hr for 2+ consecutive hours (baseline: 3–9/hr)
  • Spawn failure regression: failed to create instance ERROR log count; alert if > 15,000/hr for 2+ hours (baseline: 12,700–13,100/hr, pre-existing infra noise)
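
The "for 2+ consecutive hours" conditions above amount to a run-length check over hourly counts. A minimal sketch of that rule follows; the function name `sustainedBreach` and the sample counts in `main` are hypothetical and chosen only to exercise the logic, not taken from any real monitor or measurement.

```go
package main

import "fmt"

// sustainedBreach reports whether the hourly counts exceed threshold for at
// least `hours` consecutive samples. Illustrative helper, not a monitor API.
func sustainedBreach(hourly []int, threshold, hours int) bool {
	run := 0
	for _, count := range hourly {
		if count > threshold {
			run++
			if run >= hours {
				return true
			}
		} else {
			run = 0 // a single hour back under the threshold resets the streak
		}
	}
	return false
}

func main() {
	// Hypothetical hourly "failed to create instance" counts.
	isolated := []int{12900, 15200, 13000, 15100}
	sustained := []int{12900, 15200, 15400}
	fmt.Println(sustainedBreach(isolated, 15000, 2))
	fmt.Println(sustainedBreach(sustained, 15000, 2))
}
```

Requiring a streak rather than a single sample keeps one noisy hour from paging anyone, which matters here because the stated baseline is already high.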

Status updates will be posted automatically on this PR as monitoring progresses.

Comment thread lib/network/bridge_linux.go Outdated
Comment thread lib/network/manager.go Outdated
…-parallel-v2

# Conflicts:
#	lib/network/allocate.go
#	lib/network/bridge_linux.go
#	lib/network/tracing.go
Comment thread lib/network/allocate.go

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 41c530e.

Comment thread lib/network/allocate.go
sjmiller609 requested review from hiroTamada and rgarcia on June 1, 2026 15:44
sjmiller609 merged commit fcb0faf into main on Jun 1, 2026
17 of 20 checks passed
sjmiller609 deleted the hypeship/fork-network-parallel-v2 branch on June 1, 2026 15:55

2 participants