fix(pool): clamp pool top-up to runners_maximum_count #5187

Open
jeff-french wants to merge 1 commit into github-aws-runners:main from jeff-french:fix/pool-respect-runners-maximum-count

Description

runners_maximum_count was enforced only by the scale-up lambda. The pool lambda (adjustPool) had no knowledge of the maximum and topped up purely against pool_size, so a warm pool could drive the total number of runners far past runners_maximum_count.

calculatePooSize() counts only idle runners. Under a sustained burst of queued jobs, runners created to fill the pool are immediately picked up and become busy, so they stop counting toward numberOfRunnersInPool. Every scheduled pool cycle therefore sees ~0 idle runners and launches another full pool_size batch — with no upper bound — while the scale-up lambda correctly refuses to launch ("maximum number of runners reached"). The two lambdas actively disagree about the cap.
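The runaway described above can be sketched as a small simulation. This is an illustration only: `simulateUncappedPool` and its parameters are hypothetical names, not code from `pool.ts`, and it assumes the worst case in which every pool runner is claimed before the next cycle and no runner terminates in between.

```typescript
// Hypothetical simulation of the uncapped top-up; not taken from pool.ts.
// Assumption: under a sustained burst, every runner is busy by the next cycle,
// so the idle-only pool count is always 0 and nothing terminates meanwhile.
function simulateUncappedPool(poolSize: number, cycles: number, startingTotal: number): number[] {
  const totals: number[] = [];
  let total = startingTotal;
  for (let cycle = 0; cycle < cycles; cycle++) {
    const idleRunners = 0; // all previously created runners picked up jobs
    const topUp = Math.max(0, poolSize - idleRunners); // idle-only comparison
    total += topUp; // no check against any maximum
    totals.push(total);
  }
  return totals;
}

// pool_size = 5, 10 runners already running, four scheduled pool cycles
console.log(simulateUncappedPool(5, 4, 10)); // [ 15, 20, 25, 30 ]
```

With a hypothetical `runners_maximum_count` of 10, the total reaches three times the cap after four cycles in this model, even though scale-up would refuse every launch.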

Fixes #5186.

Changes

  • lambdas/.../pool/pool.ts — read RUNNERS_MAXIMUM_COUNT (default -1 = unlimited, matching scale-up semantics) and clamp topUp to the remaining headroom under the cap. ec2runners already contains every running runner for the type (busy + idle), so its length is the current total — no extra API call. Logs when the cap limits the top-up.
  • Terraform — thread the value into the pool lambda's environment:
    • modules/runners/pool/main.tf: RUNNERS_MAXIMUM_COUNT = var.config.runners_maximum_count
    • modules/runners/pool/variables.tf: add runners_maximum_count to the config object
    • modules/runners/pool.tf: runners_maximum_count = var.runners_maximum_count
    • modules/runners/pool/README.md: regenerated docs
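A minimal sketch of the clamping rule, to make the arithmetic concrete. `clampTopUp`, `UNLIMITED`, and the parameter names are hypothetical and chosen for illustration; the actual change in `pool.ts` may be structured differently, and treating only `-1` as unlimited is an assumption based on the semantics described in this PR.

```typescript
// Hypothetical helper illustrating the clamp; not the code from this PR.
const UNLIMITED = -1; // assumption: only -1 disables the maximum check

function clampTopUp(
  poolSize: number,
  idleRunners: number, // what the idle-only pool count reports
  totalRunners: number, // busy + idle, e.g. the length of the runner list
  maximumRunners: number,
): number {
  // Pool-driven top-up: how many idle runners are missing from the pool
  const topUp = Math.max(0, poolSize - idleRunners);
  if (maximumRunners === UNLIMITED) {
    return topUp;
  }
  // Remaining headroom under the cap; never negative when already over the cap
  const headroom = Math.max(0, maximumRunners - totalRunners);
  return Math.min(topUp, headroom);
}

console.log(clampTopUp(5, 0, 10, 10)); // 0 (at the maximum)
console.log(clampTopUp(5, 0, 8, 10)); // 2 (clamped to headroom)
console.log(clampTopUp(5, 2, 3, 10)); // 3 (pool-driven, within headroom)
console.log(clampTopUp(5, 0, 50, -1)); // 5 (unlimited)
```

The inner `Math.max(0, …)` keeps the result at zero when the total already exceeds the cap, so the pool never requests a negative number of runners.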

Backward compatibility

Defaulting the environment variable to -1 preserves current behavior when it is unset and matches the documented "-1 disables the maximum check" semantics.
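For illustration, the default could be read along these lines. `readMaximumRunners` is a hypothetical helper, not the code in this PR, and falling back to `-1` for an empty or non-numeric value is an assumption made here rather than documented behavior.

```typescript
// Hypothetical helper; the PR may parse the variable differently.
function readMaximumRunners(env: Record<string, string | undefined>): number {
  const raw = env['RUNNERS_MAXIMUM_COUNT'];
  // Unset or empty: keep the pre-existing, uncapped behavior
  if (raw === undefined || raw.trim() === '') {
    return -1;
  }
  const parsed = Number.parseInt(raw, 10);
  // Assumption: a non-numeric value is also treated as "no cap"
  return Number.isNaN(parsed) ? -1 : parsed;
}

console.log(readMaximumRunners({})); // -1
console.log(readMaximumRunners({ RUNNERS_MAXIMUM_COUNT: '10' })); // 10
```

In a lambda this would be called with `process.env`; taking the environment as a parameter keeps the sketch easy to test.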

Relationship to #5062

#5062 added Math.max(0, …) in scale-up to stop a negative TotalTargetCapacity reaching CreateFleet when currentRunners already exceeds maximumRunners. That guards the crash symptom; this PR addresses the root cause of how currentRunners exceeds maximumRunners (the pool creating past the cap). The two are complementary.

Tests

pool.test.ts adds cap coverage: at-max ⇒ 0 created, over-max ⇒ 0, headroom-clamped ⇒ 2, within-headroom ⇒ pool-driven, and -1 ⇒ unlimited. The base RUNNERS_MAXIMUM_COUNT in the suite is set to -1 so the existing pool-logic tests remain cap-free.

  • control-plane vitest suite: 499 passed
  • eslint / prettier --check: clean
  • terraform validate / terraform fmt: clean

🤖 Generated with Claude Code

The pool lambda (`adjustPool`) topped up purely against `pool_size` and
never read `runners_maximum_count`, so under sustained load—where newly
created runners immediately become busy and stop counting toward the
idle-only pool size—it would launch a fresh `pool_size` batch every cycle
with no upper bound, driving total runners far past the configured maximum
while the scale-up lambda correctly refused to launch.

Clamp the top-up to the remaining headroom under `runners_maximum_count`
(busy + idle). `ec2runners` already holds every running runner for the
type, so its length is the current total—no extra API call. The env var
defaults to `-1` (unlimited), matching scale-up semantics and preserving
behavior when unset. Thread the value into the pool lambda's environment
via Terraform.

Fixes github-aws-runners#5186

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jeff-french jeff-french requested a review from a team as a code owner June 26, 2026 18:55

Development

Successfully merging this pull request may close these issues.

Pool lambda (adjustPool) ignores runners_maximum_count, causing unbounded over-provisioning under sustained load
