
CI: dump JVM thread stacks on hung test step, single attempt #5309

Open
adamw wants to merge 3 commits into master from ci/thread-dump-on-test-timeout

adamw commented Jun 12, 2026

Member

What this does

Adds a thread dump on a timed-out JVM Test step, so the recurring post-test hang can finally be root-caused.

The hang (seen on e.g. #5306, #5307): every test passes, then sbt goes silent and is killed at the timeout with no diagnostic. Because of Tags.limit(Tags.Test, 1) a single stuck teardown stalls the whole sbt JVM. There's currently no thread dump at kill time, so the offending thread is unknown.

Change

Wrap the sbt invocation in timeout -k 30s -s QUIT 10m:

  • At 10m (the same hard ceiling as the previous timeout_minutes: 10) it sends SIGQUIT → full JVM thread dump into the log, then SIGKILL 30s later. A healthy run exits well before and is unaffected.
  • timeout_minutes raised 10 → 11 purely as a backstop above the inner 10m + kill grace.
  • max_attempts stays 2.
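
For reviewers who want to see the signal sequence locally, here is a minimal sketch of the wrapper with scaled-down timings (2s deadline and 1s kill grace instead of 10m and 30s). It assumes GNU coreutils `timeout`. The `sh -c` commands are hypothetical stand-ins for the real sbt invocation, and the variable names are made up for illustration. The stand-in for the hung run traps SIGQUIT and keeps running, roughly the way a JVM prints a thread dump on SIGQUIT without exiting.

```shell
#!/bin/sh
# Sketch only: `sh -c ...` stands in for the real sbt command, and the
# durations are shrunk so the demo finishes in a few seconds.

# Healthy run: exits before the deadline, so timeout sends no signal
# and passes the command's exit status through.
healthy_status=0
timeout -k 1s -s QUIT 2s sh -c 'exit 0' || healthy_status=$?

# Hung run: reacts to SIGQUIT by logging (a JVM would print its thread
# dump here) but never exits, so timeout follows up with SIGKILL after
# the -k grace period.
hung_status=0
timeout -k 1s -s QUIT 2s sh -c '
  trap "echo dump-requested" QUIT
  while :; do sleep 1; done
' || hung_status=$?

echo "healthy=$healthy_status hung=$hung_status"
```

GNU `timeout` documents exit status 124 when the command times out and 137 (128+9) when the KILL signal had to be sent, so a step that ends with 137 is one where the dump was requested and the process still did not exit. In the real configuration the inner ceiling is 10m + 30s = 630s, which stays under the 660s of timeout_minutes: 11.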

Note on the first revision

An earlier revision set the SIGQUIT deadline to 9m and max_attempts: 1. The CI run showed both were wrong and they've been reverted:

  • 9m was too tight: the Native job was still actively Scala-Native linking (LLVM) at 9m, not hung, as the dump confirmed. 10m matches the prior ceiling and doesn't clip the linking step.
  • 1 attempt removed load-bearing retry: the 2.12/2.13 JVM jobs failed on genuinely flaky netty tests (websocket close-race / server-timeout timing), which pass on retry. (The IllegalStateException: Shutdown in progress noise that accompanies them is a side effect of clientTestServer/reStop being skipped when a test fails; it is not the #5298 race ("Fix flaky CI: stop client test server before sbt exits (revolver shutdown race)") resurfacing.)

The wrapper mechanism itself worked: that run captured a full thread dump on the timed-out job.

🤖 Generated with Claude Code

adamw and others added 3 commits June 12, 2026 07:28
The JVM `Test` step occasionally hangs after all tests pass: a server-interpreter
teardown never returns and, with `Tags.limit(Tags.Test, 1)`, stalls the whole sbt
JVM until the step times out. The retry then re-runs the full ~4.5min suite only to
hang again, so it never helped.

Wrap sbt in `timeout`: at 9m it sends SIGQUIT to dump all JVM thread stacks to the
log (capturing the offending thread), then SIGKILL 30s later. Only fires when the
run is actually hung; healthy runs exit immediately and are unaffected. Drop
max_attempts 2 -> 1 since retrying a true hang is wasted CI time.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Reverts the two regressions from the previous commit, keeping the diagnostic:

- max_attempts back to 2. Dropping to 1 unmasked the pre-existing sbt-revolver
  shutdown-hook race (IllegalStateException: Shutdown in progress) that retry was
  masking; it failed 2 JVM jobs outright. Retry legitimately clears that transient
  race (and gives slow Native a second chance).
- SIGQUIT deadline moved 9m -> 10m to match the previous hard timeout. At 9m the
  Native job was still actively Scala-Native linking (LLVM), not hung; the 9m cut
  killed a healthy build. 10m only swaps the silent kill for a kill-with-thread-dump.

The wrapper itself works: this run captured a full thread dump on the timed-out job.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>