CI: dump JVM thread stacks on hung test step, single attempt #5309
Open
adamw wants to merge 3 commits into
The JVM `Test` step occasionally hangs after all tests pass: a server-interpreter teardown never returns and, with `Tags.limit(Tags.Test, 1)`, stalls the whole sbt JVM until the step times out. The retry then re-runs the full ~4.5min suite only to hang again, so it never helped.

Wrap sbt in `timeout`: at 9m it sends SIGQUIT to dump all JVM thread stacks to the log (capturing the offending thread), then SIGKILL 30s later. Only fires when the run is actually hung; healthy runs exit immediately and are unaffected.

Drop max_attempts 2 -> 1 since retrying a true hang is wasted CI time.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Reverts the two regressions from the previous commit, keeping the diagnostic:

- max_attempts back to 2. Dropping to 1 unmasked the pre-existing sbt-revolver shutdown-hook race (IllegalStateException: Shutdown in progress) that retry was masking; it failed 2 JVM jobs outright. Retry legitimately clears that transient race (and gives slow Native a second chance).
- SIGQUIT deadline moved 9m -> 10m to match the previous hard timeout. At 9m the Native job was still actively Scala-Native linking (LLVM), not hung; the 9m cut killed a healthy build. 10m only swaps the silent kill for a kill-with-thread-dump.

The wrapper itself works: this run captured a full thread dump on the timed-out job.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
What this does

Adds a thread dump on a timed-out JVM `Test` step, so the recurring post-test hang can finally be root-caused.

The hang (seen on e.g. #5306, #5307): every test passes, then sbt goes silent and is killed at the timeout with no diagnostic. Because of `Tags.limit(Tags.Test, 1)`, a single stuck teardown stalls the whole sbt JVM. There's currently no thread dump at kill time, so the offending thread is unknown.

Change

Wrap the sbt invocation in `timeout -k 30s -s QUIT 10m`:

- At 10m (the previous hard ceiling, `timeout_minutes: 10`) it sends SIGQUIT → full JVM thread dump into the log, then SIGKILL 30s later. A healthy run exits well before and is unaffected.
- `timeout_minutes` raised 10 → 11 purely as a backstop above the inner 10m + kill grace.
- `max_attempts` stays 2.

Note on the first revision

An earlier revision set the SIGQUIT deadline to 9m and `max_attempts: 1`. The CI run showed both were wrong and they've been reverted:

- The `Native` job was still actively Scala-Native linking (LLVM) at 9m, not hung; the dump confirmed it. 10m matches the prior ceiling and doesn't clip it.
- The `2.12/2.13 JVM` jobs failed on genuinely flaky netty tests (websocket close-race / server-timeout timing), which pass on retry. (The `IllegalStateException: Shutdown in progress` noise that accompanies them is a side effect of `clientTestServer/reStop` being skipped when a test fails, not the #5298 race ("Fix flaky CI: stop client test server before sbt exits (revolver shutdown race)") resurfacing.)

The wrapper mechanism itself worked: that run captured a full thread dump on the timed-out job.
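For reviewers who want to try the wrapper locally, here is a minimal sketch of the `timeout -k ... -s QUIT ...` pattern, assuming GNU coreutils `timeout` is on the PATH. The helper name `run_with_dump_deadline` is hypothetical and not part of this PR. The durations are shortened from 10m/30s so the demo finishes in seconds, and a small shell loop that traps SIGQUIT stands in for the JVM, which prints a thread dump on SIGQUIT and keeps running.

```shell
# Hypothetical helper (not from the PR): run a command under a deadline.
# At the deadline `timeout` sends SIGQUIT; if the command is still alive
# after the grace period, it sends SIGKILL.
run_with_dump_deadline() {
  deadline=$1
  grace=$2
  shift 2
  timeout -k "$grace" -s QUIT "$deadline" "$@"
}

# Healthy run: exits long before the deadline, so no signal is ever sent
# and the command's own exit status passes through.
run_with_dump_deadline 5s 1s true
healthy_status=$?

# Simulated hang: the loop never exits by itself. The trap plays the role
# of the JVM's SIGQUIT handler (print diagnostics, keep running), so the
# process survives SIGQUIT and is only stopped by the later SIGKILL.
run_with_dump_deadline 1s 1s sh -c '
  trap "echo simulated-thread-dump" QUIT
  while :; do sleep 0.1; done
'
hung_status=$?

echo "healthy=$healthy_status hung=$hung_status"
```

In the real step the wrapped command would be the sbt invocation, with `10m` and `30s` in place of the demo values. GNU `timeout` documents exit status 124 for a command that timed out and 137 when it had to be sent SIGKILL, so a non-zero status from the wrapper still fails the step and triggers the retry.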
🤖 Generated with Claude Code