Skip to content

fix(pyamber): prevent indefinite worker shutdown hang#5326

Closed
Ma77Ball wants to merge 4 commits into
apache:mainfrom
Ma77Ball:fix/stoppable-queue-shutdown-timeout
Closed

fix(pyamber): prevent indefinite worker shutdown hang#5326
Ma77Ball wants to merge 4 commits into
apache:mainfrom
Ma77Ball:fix/stoppable-queue-shutdown-timeout

Conversation

@Ma77Ball

@Ma77Ball Ma77Ball commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

What changes were proposed in this PR?

Hardens the shutdown path of the three StoppableQueueBlockingRunnable threads (MainLoop, NetworkSender, PortStorageWriter) so a stop request no longer depends solely on a single queue wakeup.

  • Add a threading.Event stop flag to StoppableQueueBlockingRunnable. stop() now sets the flag and enqueues the existing RUNNABLE_STOP marker. interruptible_get blocks on get(timeout=STOP_POLL_INTERVAL) and treats a queue.Empty timeout as a cue to re-check the flag and exit.
  • Plumb an optional timeout through Getable.get, InternalQueue.get, and LinkedBlockingMultiQueue.get (via Condition.wait_for). The default timeout=None preserves the existing blocking behavior exactly, so the tuple data path is untouched and only the stoppable threads opt into polling.

Why?

This is defensive hardening, not a fix for a currently-reproducible hang. As the code stands today, stop() always sets the flag and enqueues the marker through correctly-locked queues, so the wakeup is reliably delivered and no thread parks indefinitely. The value of the change is to decouple "stop was requested" (the flag, checked on a timer) from "the queue delivered the marker" (the notify), so that a future refactor (a new stop path that forgets the marker, or a change to the queue's notify logic) cannot silently reintroduce a shutdown hang.

Performance

  • Data/tuple path: no change (timeout=None everywhere except the stop threads).
  • Shutdown latency: unchanged. The marker still wakes the thread immediately; the timeout is a fallback not hit in normal operation.
  • Idle cost: each stop-eligible thread wakes once per STOP_POLL_INTERVAL (1s) to re-check the flag instead of sleeping indefinitely, a no-op wakeup with negligible CPU cost. The interval can be raised if needed.

Any related issues, documentation, or discussions?

Closes: #5325

How was this PR tested?

  • test_stoppable_queue_blocking_thread.py: items reach receive() and stop() ends run(); and setting only the stop flag (no marker) still terminates run(), exercising the timeout recheck path. This last case injects a state the production code does not currently produce, verifying the safety-net mechanism directly.
  • Added 3 timeout cases to test_linked_blocking_multi_queue.py (raises Empty on expiry, returns an available item, returns an item arriving mid-wait); the file is 17 passing tests.

Was this PR authored or co-authored using generative AI tooling?

Co-authored with Claude Opus 4.7 in compliance with ASF

@Ma77Ball Ma77Ball force-pushed the fix/stoppable-queue-shutdown-timeout branch from 9168cdb to 83c15c5 Compare June 3, 2026 04:06
@codecov-commenter

codecov-commenter commented Jun 3, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 49.13%. Comparing base (6d604f3) to head (29432ed).
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #5326      +/-   ##
============================================
- Coverage     51.15%   49.13%   -2.02%     
+ Complexity     2413     2379      -34     
============================================
  Files          1054     1051       -3     
  Lines         40923    40205     -718     
  Branches       4381     4271     -110     
============================================
- Hits          20933    19756    -1177     
- Misses        18791    19292     +501     
+ Partials       1199     1157      -42     
Flag Coverage Δ *Carryforward flag
access-control-service 41.89% <ø> (ø) Carriedforward from 4cc7539
agent-service 33.76% <ø> (ø) Carriedforward from 4cc7539
amber 51.56% <ø> (-0.07%) ⬇️ Carriedforward from 4cc7539
computing-unit-managing-service 0.00% <ø> (-1.39%) ⬇️ Carriedforward from 4cc7539
config-service 0.00% <ø> (-54.69%) ⬇️ Carriedforward from 4cc7539
file-service 38.42% <ø> (ø) Carriedforward from 4cc7539
frontend 40.95% <ø> (-4.81%) ⬇️ Carriedforward from 4cc7539
pyamber 90.72% <100.00%> (+0.03%) ⬆️
python 90.81% <ø> (-0.04%) ⬇️ Carriedforward from 4cc7539
workflow-compiling-service 57.25% <ø> (-1.15%) ⬇️ Carriedforward from 4cc7539

*This pull request uses carry forward flags. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Harden StoppableQueueBlockingRunnable shutdown to not depend solely on the RUNNABLE_STOP wakeup

2 participants