Fix sync batch mode failing under non-root salt-master (#69418)#69444
Open
dwoz wants to merge 7 commits into
Open
Fix sync batch mode failing under non-root salt-master (#69418)#69444dwoz wants to merge 7 commits into
dwoz wants to merge 7 commits into
Conversation
The 3008.x batch refactor introduced a regression where the sync CLI batch driver writes batch-state persistence files (``.batch.p``, ``batch_active.p``) under the master's ``cachedir`` from the CLI process. When the salt CLI is invoked as ``root`` against a master running as user ``salt`` (the packaging default), the persistence writes pre-create the JID directory with root ownership, which trips a ``PermissionError`` in ``local_cache.prep_jid`` when the master subsequently tries to write the ``jid`` file. The cascading failure surfaces to the user as ``SaltClientError: Some exception handling minion payload``. ``BatchManager`` only ever acts on ``driver="master"`` batch state (``_handle_new``/``_handle_recover`` explicitly ignore ``driver="cli"``), so the sync CLI's persistence calls served no functional purpose for any consumer of the on-disk batch state. Remove them: ``write_batch_state``, ``add_to_active_index``, and ``remove_from_active_index`` are no longer called from the sync CLI driver. Add a code comment at the deletion site explaining why. Update two parity tests in ``test_batch_parity.py`` to capture the driver state via ``progress_batch`` instead of ``write_batch_state`` (same assertions, new capture mechanism). Add a regression test in ``test_batch.py`` that asserts ``Batch.run()`` calls neither ``write_batch_state`` nor ``add_to_active_index`` / ``remove_from_active_index``. Fixes saltstack#69418
Issue saltstack#69418's first commit removed the sync CLI driver's direct writes under the master's ``cachedir`` to fix the ``PermissionError`` that bricked ``salt -b`` whenever the master ran as a non-root user. That fix was correct for the regression but came at the cost of dropping the visibility feature added in PR saltstack#68964 — ``salt-run batch.list_active`` / ``batch.status`` / ``batch.stop`` could no longer see sync batches. This commit replaces the broken in-process disk writes with an event-bus handoff that keeps every cachedir write on the master daemon's side of the trust boundary: * ``Batch.run()`` fires ``salt/batch/<jid>/{new,progress,complete, halted}`` with the full ``BatchState`` embedded in the payload (new helper ``salt.utils.batch_output.state_payload``). All fire / subscribe / poll ops are best-effort — a missing master event bus degrades to "no visibility, batch still works." * ``Batch.run()`` subscribes to ``salt/batch/<jid>/halted`` for the run's duration and polls non-blocking each loop iteration so a ``salt-run batch.stop`` request actually halts the run. * ``BatchManager._handle_new`` learns to read the embedded state from the event data; for ``driver="cli"`` it persists ``.batch.p`` + the active index but does *not* drive the state machine (the CLI owns that). * New ``_handle_progress`` and ``_handle_terminal`` update the on-disk state for sync CLI batches as the run progresses and clear the active-index entry on completion. * ``_tick`` and ``_progress_one`` defensively skip ``driver="cli"`` JIDs so a stale index entry can never trigger a spurious re-publish or false timeout of in-flight minions. Tests: * New ``tests/pytests/unit/cli/test_batch_visibility.py`` wires ``Batch.run`` to a real ``BatchManager`` via a synchronous fire_event bridge and asserts the end-to-end visibility contract — ``batch.list_active`` and ``batch.status`` see the running sync batch, ``batch.stop`` halts it, the post-run active index is empty. * New unit tests in ``test_batch.py`` cover the event flow, halt-subscription lifecycle, halt observation, and best-effort failure handling. * New ``BatchManager`` tests cover CLI registration without adoption, ``_tick`` filtering, ``_handle_progress`` / ``_handle_terminal`` semantics, and the extended event dispatch. Full batch test suite: 167 passed. Fixes saltstack#69418
The new sync-batch visibility code (Batch._fire_event / _subscribe_to_halt / _consume_halt_event) calls fire_event on the LocalClient's master event handle. fire_event lazily creates a SyncWrapper(ipc_publish_server) — and on first publish a nested SyncWrapper(PublishServerClient) — each owning its own asyncio event loop. LocalClient.destroy did clean those up, but only on exit from Batch.run, racing Python 3.14's interpreter shutdown on Windows where Tornado's _AddThreadSelectorEventLoop closes the selector thread via the loop's shutdown_asyncgens path. When that close ran late, the wakeup Handles it scheduled onto _ready were still alive when the loop's own close() ran self._ready.clear(), GC'ing the Handles' wrapped shutdown_* coroutines unawaited and spilling "RuntimeWarning: coroutine 'BaseEventLoop.shutdown_*' was never awaited" onto the CLI's stderr — which the integration tests (test_batch_retcode / test_multiple_modules_in_batch) gate on. Fix in two complementary places: * salt/cli/batch.py — call event.destroy() explicitly inside Batch.run's finally block, before LocalClient.destroy, so the SyncWrapper cleanup happens deterministically while we still control the loop. SaltEvent.destroy is already idempotent, so LocalClient.destroy's follow-on call is a no-op. * salt/utils/asynchronous.py — after running shutdown_asyncgens and shutdown_default_executor, drain up to 8 asyncio.sleep(0) ticks so the selector-thread close path and any other call_soon-scheduled finalizers complete before asyncio_loop.close() clears _ready. Verified on Linux Python 3.10 with `pytest -W error::RuntimeWarning`: all 9 integration tests and 74 unit tests across the batch suite pass; the existing FD-leak regression tests still pass.
Root cause of "RuntimeWarning: coroutine 'BaseEventLoop.shutdown_asyncgens' was never awaited" on Python 3.14 / Windows in the batch CLI tests: loop.run_until_complete(loop.shutdown_asyncgens()) evaluates the inner argument FIRST -- creating the ``shutdown_asyncgens`` coroutine object -- and only THEN runs ``run_until_complete`` which calls ``_check_closed()`` / ``_check_running()``. If either check raises (the loop has already been closed or is already running for some other reason), the ``RuntimeError`` propagates before ``ensure_future`` can wrap the coroutine into a Task, the bare coroutine object is orphaned, and ``coroutine.__del__`` emits the warning when the GC reaps it. On Python 3.14 / Windows that happens to land while the outer loop's ``_ready.clear()`` is mid-flight, which is what put ``self._ready.clear()`` in the warning's traceback in the failing ``test_batch_retcode`` / ``test_multiple_modules_in_batch`` jobs. Same trap for ``shutdown_default_executor`` and for the ``asyncio.gather(...)`` over cancelled pending tasks. Fix: * New ``SyncWrapper._loop_can_run_until_complete(loop)`` helper — ``True`` iff ``loop is not None and not loop.is_closed() and not loop.is_running()``. Anything that can't be driven to completion is gated out before the coroutine is ever constructed. * In ``SyncWrapper.close``, gate every ``run_until_complete`` call through that helper. As a belt-and-braces, on the ``except`` arm explicitly ``.close()`` the (now-orphaned) coroutine so even a loop-state race between the guard and the call can't leak. Replaces the previous ``asyncio.sleep(0)`` drain workaround introduced in commit d3bfd53, which only papered over the leak by giving the GC more wallclock to run the scheduled-but-unawaited coroutines before ``close()`` cleared them. Acceptance bar (Linux Py3.10): python -W error::RuntimeWarning -X dev -m pytest \ tests/pytests/integration/cli/test_batch.py::test_batch_retcode \ --core-tests -xvs PASSES. Full sweep (9 integration + 74 unit batch tests + ``test_fd_leak_asyncgens_executor.py`` + ``test_fd_leak_task_cancellation.py``) under the same flags: 85 passed, 0 failed.
The asyncio teardown leak fixed in 7f30570 slipped past every unit test because the unit tests mock ``SyncWrapper`` out entirely. Adding eleven CLI-level integration tests that exercise the full ``salt`` / ``salt-run`` lifecycle on the real event loop and gate on ``-W error::RuntimeWarning -X dev``-clean stderr. New tests under tests/pytests/integration/cli/test_batch_options.py: * test_batch_integer_size -- ``-b 2`` * test_batch_percentage -- ``-b 50%`` * test_batch_wait_between_subbatches -- ``--batch-wait`` * test_batch_safe_limit_triggers_batching -- ``--batch-safe-limit`` * test_batch_safe_size_one -- ``--batch-safe-size`` * test_batch_failhard_stops_on_first_bad_return -- ``--failhard`` * test_batch_async_handoff_to_batch_manager -- ``--async`` (driver=master) * test_batch_state_apply -- state.apply in batches * test_batch_stop_halts_sync_batch -- salt-run batch.stop * test_batch_list_active_sees_sync_batch -- salt-run batch.list_active * test_batch_status_returns_live_progress -- salt-run batch.status Each test: * runs through salt-factories' ``salt_cli`` / ``salt_run_cli`` fixtures against a real master + 2 minions (no SyncWrapper patching) * asserts ``cmd.returncode == 0`` for non-failhard tests * runs an ``_assert_clean_stderr`` gate that rejects ``BaseEventLoop.shutdown_asyncgens`` / ``shutdown_default_executor`` markers, ``coroutine '...' was never awaited``, and any traceback on the CLI's stderr -- the exact signature of the regression that broke Windows zeromq 2 CI Runner-visibility tests (9, 10, 11) spawn the sync batch in a background thread targeting ``test.sleep 5`` so the runner has ~10s window to observe a mid-run batch; ``test.sleep`` avoids ``state.apply``'s per-minion lock so the tests are safe back-to-back. The async-handoff test (7) captures the JID from the CLI's "Executed batch command with job ID:" line, then polls until the BatchManager retires it from ``batch.list_active`` -- verifying the master-side event-bus path end-to-end. Verified locally on Linux Py3.10: python -W error::RuntimeWarning -X dev -m pytest \ tests/pytests/integration/cli/test_batch.py \ tests/pytests/integration/cli/test_batch_options.py \ --slow-tests --core-tests 20 passed in 66.13s (9 pre-existing batch tests + 11 new options tests.)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What does this PR do?
Two commits:
cachedir. Restores 3007.x behavior: the CLI process never touches<cachedir>/jobs/<jid_dir>/.batch.por<cachedir>/batch_active.p.salt/batch/<jid>/{new,progress,complete,halted}) so the master-sideBatchManagerpersists.batch.pon the CLI's behalf.salt-run batch.list_active,batch.status, andbatch.stopnow work for sync batches in the same deployment shape (non-root master, root CLI) where the feature was broken from day one.What issues does this PR fix or reference?
Fixes #69418
Previous Behavior
salt -b <N> '*' <fun>(sync batch mode) failed withSaltClientError: Some exception handling minion payloadwhenever the salt-master ran as a non-root user (the packaging defaultuser: salt).Root cause: the 3008.x batch refactor (PR #68964) introduced
salt/utils/batch_state.pyand made the sync CLI'sBatch.run()callwrite_batch_state()andadd_to_active_index().write_batch_state()doesos.makedirs(<cachedir>/jobs/<jid_dir>, exist_ok=True). When the salt CLI is invoked asroot, this pre-creates the JID directory owned by root. The CLI then publishes the job viacmd_iter_no_block(jid=batch_jid); the master receives the publish, callslocal_cache.prep_jid(passed_jid=batch_jid), finds the JID directory already present (so skipsmakedirs), and fails withPermissionErrorwhenfopen(<jid_dir>/jid, "wb+")tries to create thejidfile in a root-owned directory. The retry loop hits its 5-try ceiling, raisesSaltCacheError, which the master returns asPublishError, which the CLI surfaces asSaltClientError.The reporter pinpointed the root cause by capturing the master log; switching
user: salt->user: rootin the master config works around the bug. See issue #69418 for the full trace.New Behavior
salt -b <N> '*' <fun>works the same way it did on 3007.x: the master assigns and owns the JID directory, and the CLI process touches no files under the master's cachedir.The sync visibility feature added in PR #68964 (the runners
batch.list_active,batch.status,batch.stop) is restored through a clean event-bus handoff.Batch.run()firessalt/batch/<jid>/newcarrying the full state, thenprogressafter each tick, thencompleteorhaltedat teardown.BatchManagerlistens for those, persists.batch.pon the master daemon's side of the FS trust boundary, and maintains the active-batch index.BatchManager._tickand_progress_onenow defensively skipdriver="cli"JIDs so a stale index entry can never trigger a spurious re-publish or false timeout of in-flight minions — a hazard that existed even in the broken pre-fix code.Batch.run()subscribes tosalt/batch/<jid>/haltedfor the run's duration sosalt-run batch.stop <jid>actually halts a sync batch.Event-bus failures degrade gracefully: every
fire_event/subscribe/get_eventcall in the CLI catches and logs at debug — if the master event bus is unreachable the batch still completes, just without visibility from the runner commands (same as 3007.x).The async batch path (
salt --batch --async) is unchanged — it already lived insalt/cli/salt.py:_run_batch_asyncand buildsdriver="master"state inline.Tests
Full batch test suite: 167 passed.
tests/pytests/unit/cli/test_batch.py(4 new tests: event flow, halt-subscription lifecycle, halt observation, best-effort bus-failure handling).tests/pytests/unit/cli/test_batch_visibility.py(new file, 4 tests) wiresBatch.run+BatchManagerend-to-end and asserts:batch.list_active/batch.statussee the running sync batch;batch.stophalts it via the event round-trip; the post-completion active index is empty.tests/pytests/unit/utils/test_batch_manager.py(new tests for CLI registration,_tickfiltering,_handle_progress/_handle_terminal, extended event dispatch).test_batch_parity.pyupdated to capture driver state viaprogress_batch(sincewrite_batch_stateis no longer called from the CLI process).Merge requirements satisfied?
.batch.pfile)changelog/69418.fixed.mdCommits signed with GPG?
Yes