Fix sync batch mode failing under non-root salt-master (#69418) by dwoz · Pull Request #69444 · saltstack/salt

dwoz · 2026-06-13T00:19:37Z

What does this PR do?

Two commits:

Regression fix — removes the sync CLI batch driver's direct writes under the master's cachedir. Restores 3007.x behavior: the CLI process never touches <cachedir>/jobs/<jid_dir>/.batch.p or <cachedir>/batch_active.p.
Restored visibility through events — replaces the broken in-process disk writes with an event-bus handoff (salt/batch/<jid>/{new,progress,complete,halted}) so the master-side BatchManager persists .batch.p on the CLI's behalf. salt-run batch.list_active, batch.status, and batch.stop now work for sync batches in the same deployment shape (non-root master, root CLI) where the feature was broken from day one.

What issues does this PR fix or reference?

Previous Behavior

salt -b <N> '*' <fun> (sync batch mode) failed with SaltClientError: Some exception handling minion payload whenever the salt-master ran as a non-root user (the packaging default user: salt).

Root cause: the 3008.x batch refactor (PR #68964) introduced salt/utils/batch_state.py and made the sync CLI's Batch.run() call write_batch_state() and add_to_active_index(). write_batch_state() does os.makedirs(<cachedir>/jobs/<jid_dir>, exist_ok=True). When the salt CLI is invoked as root, this pre-creates the JID directory owned by root. The CLI then publishes the job via cmd_iter_no_block(jid=batch_jid); the master receives the publish, calls local_cache.prep_jid(passed_jid=batch_jid), finds the JID directory already present (so skips makedirs), and fails with PermissionError when fopen(<jid_dir>/jid, "wb+") tries to create the jid file in a root-owned directory. The retry loop hits its 5-try ceiling, raises SaltCacheError, which the master returns as PublishError, which the CLI surfaces as SaltClientError.
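The failure sequence above can be modeled in a few lines. This is a toy, in-memory sketch rather than Salt's code: `FakeCache`, `CacheError`, and this simplified `prep_jid` are hypothetical stand-ins, and the permission rule (only the directory owner or root may create files) is an assumption that approximates a typical POSIX setup.

```python
class FakeCache:
    """In-memory stand-in for <cachedir>/jobs: maps each directory to its owner."""

    def __init__(self):
        self.dir_owner = {}
        self.files = set()

    def makedirs(self, path, user):
        # Mirrors exist_ok=True: an existing directory keeps its original owner.
        self.dir_owner.setdefault(path, user)

    def create_file(self, directory, name, user):
        owner = self.dir_owner[directory]
        if user != "root" and owner != user:
            raise PermissionError(f"{user} cannot write in {directory} (owner: {owner})")
        self.files.add(f"{directory}/{name}")


class CacheError(Exception):
    """Hypothetical stand-in for the SaltCacheError named in the description."""


def prep_jid(cache, jid_dir, master_user, max_tries=5):
    """Simplified model of the master preparing a JID directory with a retry ceiling."""
    for _ in range(max_tries):
        try:
            cache.makedirs(jid_dir, master_user)  # no-op if the CLI pre-created it
            cache.create_file(jid_dir, "jid", master_user)
            return True
        except PermissionError:
            continue  # retrying cannot help: ownership never changes
    raise CacheError(f"could not write jid file in {jid_dir} after {max_tries} tries")
```

If a root-invoked CLI calls `makedirs` first, every retry by the `salt` user fails the same way, which is why removing the CLI-side write (rather than tuning the retry loop) is the fix.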

The reporter pinpointed the root cause by capturing the master log; switching user: salt -> user: root in the master config works around the bug. See issue #69418 for the full trace.

New Behavior

salt -b <N> '*' <fun> works the same way it did on 3007.x: the master assigns and owns the JID directory, and the CLI process touches no files under the master's cachedir.

The sync visibility feature added in PR #68964 (the runners batch.list_active, batch.status, batch.stop) is restored through a clean event-bus handoff. Batch.run() fires salt/batch/<jid>/new carrying the full state, then progress after each tick, then complete or halted at teardown. BatchManager listens for those, persists .batch.p on the master daemon's side of the FS trust boundary, and maintains the active-batch index. BatchManager._tick and _progress_one now defensively skip driver="cli" JIDs so a stale index entry can never trigger a spurious re-publish or false timeout of in-flight minions — a hazard that existed even in the broken pre-fix code. Batch.run() subscribes to salt/batch/<jid>/halted for the run's duration so salt-run batch.stop <jid> actually halts a sync batch.
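The master-side half of that handoff can be sketched as a small in-memory model. `ToyBatchManager` is hypothetical and not the real `BatchManager`; in particular, carrying the state under a `"state"` key in the event data is an assumption made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyBatchManager:
    store: dict = field(default_factory=dict)   # stands in for per-JID .batch.p files
    active: set = field(default_factory=set)    # stands in for batch_active.p
    driven: list = field(default_factory=list)  # JIDs the manager advanced itself

    def handle_event(self, tag, data):
        parts = tag.split("/")
        if len(parts) != 4 or parts[:2] != ["salt", "batch"]:
            return  # not a batch lifecycle event
        jid, kind = parts[2], parts[3]
        if kind in ("new", "progress"):
            # Persist on the master's side of the trust boundary.
            self.store[jid] = dict(data["state"])
            self.active.add(jid)
        elif kind in ("complete", "halted"):
            if "state" in data:
                self.store[jid] = dict(data["state"])
            self.active.discard(jid)

    def tick(self):
        for jid in sorted(self.active):
            if self.store[jid].get("driver") == "cli":
                continue  # the CLI owns this state machine; never re-publish
            self.driven.append(jid)
```

The `tick` guard is the important design choice: the manager records CLI-driven batches for visibility but only advances batches it owns.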

Event-bus failures degrade gracefully: every fire_event / subscribe / get_event call in the CLI catches and logs at debug — if the master event bus is unreachable the batch still completes, just without visibility from the runner commands (same as 3007.x).
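A minimal sketch of that best-effort pattern follows. The `best_effort` helper is hypothetical (the PR describes the behavior, not this function); it wraps any callable so a bus failure is logged at debug level and never propagates into the batch run.

```python
import logging

log = logging.getLogger(__name__)


def best_effort(op, *args, default=None, **kwargs):
    """Call op(*args, **kwargs); on any failure, log at debug and return default."""
    try:
        return op(*args, **kwargs)
    except Exception as exc:  # deliberately broad: visibility must never break the batch
        log.debug("batch visibility call %r failed: %s", op, exc)
        return default
```

A caller would wrap each fire, subscribe, or poll operation this way, for example `best_effort(event.fire_event, data, tag, default=False)`, where `event` is whatever event handle the caller holds.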

The async batch path (salt --batch --async) is unchanged — it already lived in salt/cli/salt.py:_run_batch_async and builds driver="master" state inline.

Tests

Full batch test suite: 167 passed.

tests/pytests/unit/cli/test_batch.py (4 new tests: event flow, halt-subscription lifecycle, halt observation, best-effort bus-failure handling).
tests/pytests/unit/cli/test_batch_visibility.py (new file, 4 tests) wires Batch.run + BatchManager end-to-end and asserts: batch.list_active / batch.status see the running sync batch; batch.stop halts it via the event round-trip; the post-completion active index is empty.
tests/pytests/unit/utils/test_batch_manager.py (new tests for CLI registration, _tick filtering, _handle_progress / _handle_terminal, extended event dispatch).
Two parity tests in test_batch_parity.py updated to capture driver state via progress_batch (since write_batch_state is no longer called from the CLI process).

Merge requirements satisfied?

Docs — N/A (no user-facing documentation describes the in-progress .batch.p file)
Changelog — changelog/69418.fixed.md
Tests written/updated

Commits signed with GPG?

Yes

The 3008.x batch refactor introduced a regression where the sync CLI batch driver writes batch-state persistence files (``.batch.p``, ``batch_active.p``) under the master's ``cachedir`` from the CLI process. When the salt CLI is invoked as ``root`` against a master running as user ``salt`` (the packaging default), the persistence writes pre-create the JID directory with root ownership, which trips a ``PermissionError`` in ``local_cache.prep_jid`` when the master subsequently tries to write the ``jid`` file. The cascading failure surfaces to the user as ``SaltClientError: Some exception handling minion payload``.

``BatchManager`` only ever acts on ``driver="master"`` batch state (``_handle_new``/``_handle_recover`` explicitly ignore ``driver="cli"``), so the sync CLI's persistence calls served no functional purpose for any consumer of the on-disk batch state.

Remove them: ``write_batch_state``, ``add_to_active_index``, and ``remove_from_active_index`` are no longer called from the sync CLI driver. Add a code comment at the deletion site explaining why.

Update two parity tests in ``test_batch_parity.py`` to capture the driver state via ``progress_batch`` instead of ``write_batch_state`` (same assertions, new capture mechanism).

Add a regression test in ``test_batch.py`` that asserts ``Batch.run()`` calls neither ``write_batch_state`` nor ``add_to_active_index`` / ``remove_from_active_index``.

Fixes saltstack#69418

The first commit for issue saltstack#69418 removed the sync CLI driver's direct writes under the master's ``cachedir`` to fix the ``PermissionError`` that bricked ``salt -b`` whenever the master ran as a non-root user. That fix was correct for the regression but came at the cost of dropping the visibility feature added in PR saltstack#68964: ``salt-run batch.list_active`` / ``batch.status`` / ``batch.stop`` could no longer see sync batches.

This commit replaces the broken in-process disk writes with an event-bus handoff that keeps every cachedir write on the master daemon's side of the trust boundary:

* ``Batch.run()`` fires ``salt/batch/<jid>/{new,progress,complete,halted}`` with the full ``BatchState`` embedded in the payload (new helper ``salt.utils.batch_output.state_payload``). All fire / subscribe / poll ops are best-effort: a missing master event bus degrades to "no visibility, batch still works."
* ``Batch.run()`` subscribes to ``salt/batch/<jid>/halted`` for the run's duration and polls non-blocking each loop iteration so a ``salt-run batch.stop`` request actually halts the run.
* ``BatchManager._handle_new`` learns to read the embedded state from the event data; for ``driver="cli"`` it persists ``.batch.p`` + the active index but does *not* drive the state machine (the CLI owns that).
* New ``_handle_progress`` and ``_handle_terminal`` update the on-disk state for sync CLI batches as the run progresses and clear the active-index entry on completion.
* ``_tick`` and ``_progress_one`` defensively skip ``driver="cli"`` JIDs so a stale index entry can never trigger a spurious re-publish or false timeout of in-flight minions.

Tests:

* New ``tests/pytests/unit/cli/test_batch_visibility.py`` wires ``Batch.run`` to a real ``BatchManager`` via a synchronous fire_event bridge and asserts the end-to-end visibility contract: ``batch.list_active`` and ``batch.status`` see the running sync batch, ``batch.stop`` halts it, the post-run active index is empty.
* New unit tests in ``test_batch.py`` cover the event flow, halt-subscription lifecycle, halt observation, and best-effort failure handling.
* New ``BatchManager`` tests cover CLI registration without adoption, ``_tick`` filtering, ``_handle_progress`` / ``_handle_terminal`` semantics, and the extended event dispatch.

Full batch test suite: 167 passed.

Fixes saltstack#69418
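The CLI side of the handoff described in this commit can be sketched as a toy loop. Everything here is illustrative: `run_batch` and `snapshot` are hypothetical names, a `queue.Queue` stands in for the halt subscription, and the `"state"` payload key is an assumption rather than the output of the real `state_payload` helper.

```python
import queue


def snapshot(state):
    """Copy the state so later mutation cannot alter payloads already fired."""
    return {key: list(val) if isinstance(val, list) else val for key, val in state.items()}


def run_batch(minions, batch_size, fire_event, halt_queue, jid="20260613001937000000"):
    """Toy sync batch loop: fire lifecycle events and poll for a halt each iteration."""
    state = {"driver": "cli", "pending": list(minions), "done": []}
    fire_event(f"salt/batch/{jid}/new", {"state": snapshot(state)})
    outcome = "complete"
    while state["pending"]:
        try:
            halt_queue.get_nowait()  # non-blocking poll, once per iteration
        except queue.Empty:
            pass
        else:
            outcome = "halted"
            break
        chunk = state["pending"][:batch_size]
        state["pending"] = state["pending"][batch_size:]
        state["done"].extend(chunk)  # a real run would publish to the chunk here
        fire_event(f"salt/batch/{jid}/progress", {"state": snapshot(state)})
    fire_event(f"salt/batch/{jid}/{outcome}", {"state": snapshot(state)})
    return outcome, state["done"]
```

Because the poll never blocks, a stop request is observed at the next sub-batch boundary; minions already in flight are not interrupted in this model.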

The new sync-batch visibility code (Batch._fire_event / _subscribe_to_halt / _consume_halt_event) calls fire_event on the LocalClient's master event handle. fire_event lazily creates a SyncWrapper(ipc_publish_server) and, on first publish, a nested SyncWrapper(PublishServerClient), each owning its own asyncio event loop. LocalClient.destroy did clean those up, but only on exit from Batch.run, racing Python 3.14's interpreter shutdown on Windows where Tornado's _AddThreadSelectorEventLoop closes the selector thread via the loop's shutdown_asyncgens path. When that close ran late, the wakeup Handles it scheduled onto _ready were still alive when the loop's own close() ran self._ready.clear(), GC'ing the Handles' wrapped shutdown_* coroutines unawaited and spilling "RuntimeWarning: coroutine 'BaseEventLoop.shutdown_*' was never awaited" onto the CLI's stderr, which the integration tests (test_batch_retcode / test_multiple_modules_in_batch) gate on.

Fix in two complementary places:

* salt/cli/batch.py: call event.destroy() explicitly inside Batch.run's finally block, before LocalClient.destroy, so the SyncWrapper cleanup happens deterministically while we still control the loop. SaltEvent.destroy is already idempotent, so LocalClient.destroy's follow-on call is a no-op.
* salt/utils/asynchronous.py: after running shutdown_asyncgens and shutdown_default_executor, drain up to 8 asyncio.sleep(0) ticks so the selector-thread close path and any other call_soon-scheduled finalizers complete before asyncio_loop.close() clears _ready.

Verified on Linux Python 3.10 with `pytest -W error::RuntimeWarning`: all 9 integration tests and 74 unit tests across the batch suite pass; the existing FD-leak regression tests still pass.
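The ordering argument in the first fix (explicit destroy in `finally`, followed by a harmless second call) can be illustrated with toy classes. `ToyEvent`, `ToyClient`, and `run` are hypothetical stand-ins and do not reproduce `SaltEvent` or `LocalClient`.

```python
class ToyEvent:
    """Stand-in for an event handle whose destroy() is idempotent."""

    def __init__(self):
        self.closed = False
        self.real_closes = 0

    def destroy(self):
        if self.closed:
            return  # second and later calls are no-ops
        self.closed = True
        self.real_closes += 1


class ToyClient:
    """Stand-in for a client that also destroys its event handle on teardown."""

    def __init__(self):
        self.event = ToyEvent()

    def destroy(self):
        self.event.destroy()


def run(client, work):
    try:
        return work()
    finally:
        client.event.destroy()  # deterministic cleanup while we still control teardown
        client.destroy()        # follow-on call finds the handle already closed
```

Calling `destroy()` twice is safe only because the handle tracks its own closed state; without that guard, the explicit call in `finally` would risk a double close.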

Root cause of "RuntimeWarning: coroutine 'BaseEventLoop.shutdown_asyncgens' was never awaited" on Python 3.14 / Windows in the batch CLI tests:

    loop.run_until_complete(loop.shutdown_asyncgens())

evaluates the inner argument FIRST -- creating the ``shutdown_asyncgens`` coroutine object -- and only THEN runs ``run_until_complete`` which calls ``_check_closed()`` / ``_check_running()``. If either check raises (the loop has already been closed or is already running for some other reason), the ``RuntimeError`` propagates before ``ensure_future`` can wrap the coroutine into a Task, the bare coroutine object is orphaned, and ``coroutine.__del__`` emits the warning when the GC reaps it. On Python 3.14 / Windows that happens to land while the outer loop's ``_ready.clear()`` is mid-flight, which is what put ``self._ready.clear()`` in the warning's traceback in the failing ``test_batch_retcode`` / ``test_multiple_modules_in_batch`` jobs.

Same trap for ``shutdown_default_executor`` and for the ``asyncio.gather(...)`` over cancelled pending tasks.

Fix:

* New ``SyncWrapper._loop_can_run_until_complete(loop)`` helper: ``True`` iff ``loop is not None and not loop.is_closed() and not loop.is_running()``. Anything that can't be driven to completion is gated out before the coroutine is ever constructed.
* In ``SyncWrapper.close``, gate every ``run_until_complete`` call through that helper. As a belt-and-braces measure, on the ``except`` arm explicitly ``.close()`` the (now-orphaned) coroutine so even a loop-state race between the guard and the call can't leak.

Replaces the previous ``asyncio.sleep(0)`` drain workaround introduced in commit d3bfd53, which only papered over the leak by giving the GC more wallclock to run the scheduled-but-unawaited coroutines before ``close()`` cleared them.

Acceptance bar (Linux Py3.10):

    python -W error::RuntimeWarning -X dev -m pytest \
        tests/pytests/integration/cli/test_batch.py::test_batch_retcode \
        --core-tests -xvs

PASSES. Full sweep (9 integration + 74 unit batch tests + ``test_fd_leak_asyncgens_executor.py`` + ``test_fd_leak_task_cancellation.py``) under the same flags: 85 passed, 0 failed.
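The guard-then-construct pattern described in this commit can be sketched with the standard library alone. The function names below are hypothetical and this is not the `SyncWrapper` code; it only shows the ordering: check the loop first, create the coroutine second, and close the coroutine if `run_until_complete` still raises.

```python
import asyncio


def loop_can_run_until_complete(loop):
    """True only if the loop exists, is open, and is not already running."""
    return loop is not None and not loop.is_closed() and not loop.is_running()


def safe_shutdown_asyncgens(loop):
    """Run loop.shutdown_asyncgens() without ever orphaning the coroutine."""
    if not loop_can_run_until_complete(loop):
        return False  # gated out before any coroutine object is created
    coro = loop.shutdown_asyncgens()
    try:
        loop.run_until_complete(coro)
    except RuntimeError:
        # Loop state changed between the guard and the call: close the
        # never-started coroutine so it cannot warn when garbage collected.
        coro.close()
        return False
    return True
```

With this shape, calling the helper on an already-closed loop returns `False` and emits no "was never awaited" warning, since no coroutine was constructed.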

The asyncio teardown leak fixed in 7f30570 slipped past every unit test because the unit tests mock ``SyncWrapper`` out entirely. Add eleven CLI-level integration tests that exercise the full ``salt`` / ``salt-run`` lifecycle on the real event loop and gate on ``-W error::RuntimeWarning -X dev``-clean stderr.

New tests under tests/pytests/integration/cli/test_batch_options.py:

1. test_batch_integer_size -- ``-b 2``
2. test_batch_percentage -- ``-b 50%``
3. test_batch_wait_between_subbatches -- ``--batch-wait``
4. test_batch_safe_limit_triggers_batching -- ``--batch-safe-limit``
5. test_batch_safe_size_one -- ``--batch-safe-size``
6. test_batch_failhard_stops_on_first_bad_return -- ``--failhard``
7. test_batch_async_handoff_to_batch_manager -- ``--async`` (driver=master)
8. test_batch_state_apply -- state.apply in batches
9. test_batch_stop_halts_sync_batch -- salt-run batch.stop
10. test_batch_list_active_sees_sync_batch -- salt-run batch.list_active
11. test_batch_status_returns_live_progress -- salt-run batch.status

Each test:

* runs through salt-factories' ``salt_cli`` / ``salt_run_cli`` fixtures against a real master + 2 minions (no SyncWrapper patching)
* asserts ``cmd.returncode == 0`` for non-failhard tests
* runs an ``_assert_clean_stderr`` gate that rejects ``BaseEventLoop.shutdown_asyncgens`` / ``shutdown_default_executor`` markers, ``coroutine '...' was never awaited``, and any traceback on the CLI's stderr -- the exact signature of the regression that broke Windows zeromq 2 CI

Runner-visibility tests (9, 10, 11) spawn the sync batch in a background thread targeting ``test.sleep 5`` so the runner has a ~10s window to observe a mid-run batch; ``test.sleep`` avoids ``state.apply``'s per-minion lock so the tests are safe back-to-back.

The async-handoff test (7) captures the JID from the CLI's "Executed batch command with job ID:" line, then polls until the BatchManager retires it from ``batch.list_active`` -- verifying the master-side event-bus path end-to-end.

Verified locally on Linux Py3.10:

    python -W error::RuntimeWarning -X dev -m pytest \
        tests/pytests/integration/cli/test_batch.py \
        tests/pytests/integration/cli/test_batch_options.py \
        --slow-tests --core-tests

20 passed in 66.13s (9 pre-existing batch tests + 11 new options tests).
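A stderr gate of the kind described above might look like the following sketch. The function names and the exact patterns are assumptions for illustration; the real ``_assert_clean_stderr`` helper in the test module may match differently.

```python
import re

# Markers named in the commit message, expressed as regular expressions.
_FORBIDDEN_PATTERNS = (
    re.compile(r"BaseEventLoop\.shutdown_asyncgens"),
    re.compile(r"shutdown_default_executor"),
    re.compile(r"coroutine '.*' was never awaited"),
    re.compile(r"Traceback \(most recent call last\)"),
)


def find_stderr_violations(stderr):
    """Return the pattern strings that match anywhere in the captured stderr."""
    return [pat.pattern for pat in _FORBIDDEN_PATTERNS if pat.search(stderr)]


def assert_clean_stderr(stderr):
    """Fail loudly, listing every forbidden marker found in stderr."""
    violations = find_stderr_violations(stderr)
    assert not violations, f"unclean CLI stderr, matched: {violations}"
```

Gating on stderr text rather than on the return code matters here because the leaked-coroutine warning does not change the CLI's exit status.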

dwoz requested a review from a team as a code owner June 13, 2026 00:19

dwoz added this to the Argon v3008.1 milestone Jun 13, 2026

dwoz added the test:full Run the full test suite label Jun 13, 2026

dwoz wants to merge 7 commits into saltstack:3008.x from dwoz:fix/issue-69418
