Skip to content

Fix sync batch mode failing under non-root salt-master (#69418)#69444

Open
dwoz wants to merge 7 commits into
saltstack:3008.xfrom
dwoz:fix/issue-69418
Open

Fix sync batch mode failing under non-root salt-master (#69418)#69444
dwoz wants to merge 7 commits into
saltstack:3008.xfrom
dwoz:fix/issue-69418

Conversation

@dwoz

@dwoz dwoz commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

What does this PR do?

Two commits:

  1. Regression fix — removes the sync CLI batch driver's direct writes under the master's cachedir. Restores 3007.x behavior: the CLI process never touches <cachedir>/jobs/<jid_dir>/.batch.p or <cachedir>/batch_active.p.
  2. Restored visibility through events — replaces the broken in-process disk writes with an event-bus handoff (salt/batch/<jid>/{new,progress,complete,halted}) so the master-side BatchManager persists .batch.p on the CLI's behalf. salt-run batch.list_active, batch.status, and batch.stop now work for sync batches in the same deployment shape (non-root master, root CLI) where the feature was broken from day one.

What issues does this PR fix or reference?

Fixes #69418

Previous Behavior

salt -b <N> '*' <fun> (sync batch mode) failed with SaltClientError: Some exception handling minion payload whenever the salt-master ran as a non-root user (the packaging default user: salt).

Root cause: the 3008.x batch refactor (PR #68964) introduced salt/utils/batch_state.py and made the sync CLI's Batch.run() call write_batch_state() and add_to_active_index(). write_batch_state() does os.makedirs(<cachedir>/jobs/<jid_dir>, exist_ok=True). When the salt CLI is invoked as root, this pre-creates the JID directory owned by root. The CLI then publishes the job via cmd_iter_no_block(jid=batch_jid); the master receives the publish, calls local_cache.prep_jid(passed_jid=batch_jid), finds the JID directory already present (so skips makedirs), and fails with PermissionError when fopen(<jid_dir>/jid, "wb+") tries to create the jid file in a root-owned directory. The retry loop hits its 5-try ceiling, raises SaltCacheError, which the master returns as PublishError, which the CLI surfaces as SaltClientError.

The reporter pinpointed the root cause by capturing the master log; switching user: salt -> user: root in the master config works around the bug. See issue #69418 for the full trace.

New Behavior

salt -b <N> '*' <fun> works the same way it did on 3007.x: the master assigns and owns the JID directory, and the CLI process touches no files under the master's cachedir.

The sync visibility feature added in PR #68964 (the runners batch.list_active, batch.status, batch.stop) is restored through a clean event-bus handoff. Batch.run() fires salt/batch/<jid>/new carrying the full state, then progress after each tick, then complete or halted at teardown. BatchManager listens for those, persists .batch.p on the master daemon's side of the FS trust boundary, and maintains the active-batch index. BatchManager._tick and _progress_one now defensively skip driver="cli" JIDs so a stale index entry can never trigger a spurious re-publish or false timeout of in-flight minions — a hazard that existed even in the broken pre-fix code. Batch.run() subscribes to salt/batch/<jid>/halted for the run's duration so salt-run batch.stop <jid> actually halts a sync batch.

Event-bus failures degrade gracefully: every fire_event / subscribe / get_event call in the CLI catches and logs at debug — if the master event bus is unreachable the batch still completes, just without visibility from the runner commands (same as 3007.x).

The async batch path (salt --batch --async) is unchanged — it already lived in salt/cli/salt.py:_run_batch_async and builds driver="master" state inline.

Tests

Full batch test suite: 167 passed.

  • tests/pytests/unit/cli/test_batch.py (4 new tests: event flow, halt-subscription lifecycle, halt observation, best-effort bus-failure handling).
  • tests/pytests/unit/cli/test_batch_visibility.py (new file, 4 tests) wires Batch.run + BatchManager end-to-end and asserts: batch.list_active / batch.status see the running sync batch; batch.stop halts it via the event round-trip; the post-completion active index is empty.
  • tests/pytests/unit/utils/test_batch_manager.py (new tests for CLI registration, _tick filtering, _handle_progress / _handle_terminal, extended event dispatch).
  • Two parity tests in test_batch_parity.py updated to capture driver state via progress_batch (since write_batch_state is no longer called from the CLI process).

Merge requirements satisfied?

  • Docs — N/A (no user-facing documentation describes the in-progress .batch.p file)
  • Changelog — changelog/69418.fixed.md
  • Tests written/updated

Commits signed with GPG?

Yes

The 3008.x batch refactor introduced a regression where the sync CLI
batch driver writes batch-state persistence files (``.batch.p``,
``batch_active.p``) under the master's ``cachedir`` from the CLI
process.  When the salt CLI is invoked as ``root`` against a master
running as user ``salt`` (the packaging default), the persistence
writes pre-create the JID directory with root ownership, which
trips a ``PermissionError`` in ``local_cache.prep_jid`` when the
master subsequently tries to write the ``jid`` file.  The cascading
failure surfaces to the user as
``SaltClientError: Some exception handling minion payload``.

``BatchManager`` only ever acts on ``driver="master"`` batch state
(``_handle_new``/``_handle_recover`` explicitly ignore
``driver="cli"``), so the sync CLI's persistence calls served no
functional purpose for any consumer of the on-disk batch state.
Remove them: ``write_batch_state``, ``add_to_active_index``, and
``remove_from_active_index`` are no longer called from the sync CLI
driver.  Add a code comment at the deletion site explaining why.

Update two parity tests in ``test_batch_parity.py`` to capture the
driver state via ``progress_batch`` instead of ``write_batch_state``
(same assertions, new capture mechanism).  Add a regression test in
``test_batch.py`` that asserts ``Batch.run()`` calls neither
``write_batch_state`` nor ``add_to_active_index`` /
``remove_from_active_index``.

Fixes saltstack#69418
Issue saltstack#69418's first commit removed the sync CLI driver's direct
writes under the master's ``cachedir`` to fix the
``PermissionError`` that bricked ``salt -b`` whenever the master
ran as a non-root user.  That fix was correct for the regression
but came at the cost of dropping the visibility feature added in
PR saltstack#68964 — ``salt-run batch.list_active`` / ``batch.status`` /
``batch.stop`` could no longer see sync batches.

This commit replaces the broken in-process disk writes with an
event-bus handoff that keeps every cachedir write on the master
daemon's side of the trust boundary:

* ``Batch.run()`` fires ``salt/batch/<jid>/{new,progress,complete,
  halted}`` with the full ``BatchState`` embedded in the payload
  (new helper ``salt.utils.batch_output.state_payload``).  All
  fire / subscribe / poll ops are best-effort — a missing master
  event bus degrades to "no visibility, batch still works."
* ``Batch.run()`` subscribes to ``salt/batch/<jid>/halted`` for
  the run's duration and polls non-blocking each loop iteration
  so a ``salt-run batch.stop`` request actually halts the run.
* ``BatchManager._handle_new`` learns to read the embedded state
  from the event data; for ``driver="cli"`` it persists
  ``.batch.p`` + the active index but does *not* drive the state
  machine (the CLI owns that).
* New ``_handle_progress`` and ``_handle_terminal`` update the
  on-disk state for sync CLI batches as the run progresses and
  clear the active-index entry on completion.
* ``_tick`` and ``_progress_one`` defensively skip
  ``driver="cli"`` JIDs so a stale index entry can never trigger
  a spurious re-publish or false timeout of in-flight minions.

Tests:

* New ``tests/pytests/unit/cli/test_batch_visibility.py`` wires
  ``Batch.run`` to a real ``BatchManager`` via a synchronous
  fire_event bridge and asserts the end-to-end visibility
  contract — ``batch.list_active`` and ``batch.status`` see the
  running sync batch, ``batch.stop`` halts it, the post-run
  active index is empty.
* New unit tests in ``test_batch.py`` cover the event flow,
  halt-subscription lifecycle, halt observation, and best-effort
  failure handling.
* New ``BatchManager`` tests cover CLI registration without
  adoption, ``_tick`` filtering, ``_handle_progress`` /
  ``_handle_terminal`` semantics, and the extended event
  dispatch.

Full batch test suite: 167 passed.

Fixes saltstack#69418
The new sync-batch visibility code (Batch._fire_event /
_subscribe_to_halt / _consume_halt_event) calls fire_event on the
LocalClient's master event handle.  fire_event lazily creates a
SyncWrapper(ipc_publish_server) — and on first publish a nested
SyncWrapper(PublishServerClient) — each owning its own asyncio
event loop.  LocalClient.destroy did clean those up, but only on
exit from Batch.run, racing Python 3.14's interpreter shutdown on
Windows where Tornado's _AddThreadSelectorEventLoop closes the
selector thread via the loop's shutdown_asyncgens path.  When that
close ran late, the wakeup Handles it scheduled onto _ready were
still alive when the loop's own close() ran self._ready.clear(),
GC'ing the Handles' wrapped shutdown_* coroutines unawaited and
spilling "RuntimeWarning: coroutine 'BaseEventLoop.shutdown_*'
was never awaited" onto the CLI's stderr — which the integration
tests (test_batch_retcode / test_multiple_modules_in_batch) gate on.

Fix in two complementary places:

* salt/cli/batch.py — call event.destroy() explicitly inside
  Batch.run's finally block, before LocalClient.destroy, so the
  SyncWrapper cleanup happens deterministically while we still
  control the loop.  SaltEvent.destroy is already idempotent, so
  LocalClient.destroy's follow-on call is a no-op.

* salt/utils/asynchronous.py — after running shutdown_asyncgens
  and shutdown_default_executor, drain up to 8
  asyncio.sleep(0) ticks so the selector-thread close path and
  any other call_soon-scheduled finalizers complete before
  asyncio_loop.close() clears _ready.

Verified on Linux Python 3.10 with `pytest -W error::RuntimeWarning`:
all 9 integration tests and 74 unit tests across the batch suite
pass; the existing FD-leak regression tests still pass.
Root cause of "RuntimeWarning: coroutine 'BaseEventLoop.shutdown_asyncgens'
was never awaited" on Python 3.14 / Windows in the batch CLI tests:

  loop.run_until_complete(loop.shutdown_asyncgens())

evaluates the inner argument FIRST -- creating the
``shutdown_asyncgens`` coroutine object -- and only THEN runs
``run_until_complete`` which calls ``_check_closed()`` /
``_check_running()``.  If either check raises (the loop has already
been closed or is already running for some other reason), the
``RuntimeError`` propagates before ``ensure_future`` can wrap the
coroutine into a Task, the bare coroutine object is orphaned, and
``coroutine.__del__`` emits the warning when the GC reaps it.  On
Python 3.14 / Windows that happens to land while the outer loop's
``_ready.clear()`` is mid-flight, which is what put
``self._ready.clear()`` in the warning's traceback in the failing
``test_batch_retcode`` / ``test_multiple_modules_in_batch`` jobs.

Same trap for ``shutdown_default_executor`` and for the
``asyncio.gather(...)`` over cancelled pending tasks.

Fix:

* New ``SyncWrapper._loop_can_run_until_complete(loop)`` helper —
  ``True`` iff ``loop is not None and not loop.is_closed() and not
  loop.is_running()``.  Anything that can't be driven to completion
  is gated out before the coroutine is ever constructed.

* In ``SyncWrapper.close``, gate every ``run_until_complete`` call
  through that helper.  As a belt-and-braces, on the ``except`` arm
  explicitly ``.close()`` the (now-orphaned) coroutine so even a
  loop-state race between the guard and the call can't leak.

Replaces the previous ``asyncio.sleep(0)`` drain workaround
introduced in commit d3bfd53, which only papered over the leak
by giving the GC more wallclock to run the scheduled-but-unawaited
coroutines before ``close()`` cleared them.

Acceptance bar (Linux Py3.10):

  python -W error::RuntimeWarning -X dev -m pytest \
    tests/pytests/integration/cli/test_batch.py::test_batch_retcode \
    --core-tests -xvs

PASSES.  Full sweep (9 integration + 74 unit batch tests +
``test_fd_leak_asyncgens_executor.py`` +
``test_fd_leak_task_cancellation.py``) under the same flags: 85
passed, 0 failed.
The asyncio teardown leak fixed in 7f30570 slipped past every
unit test because the unit tests mock ``SyncWrapper`` out entirely.
Adding eleven CLI-level integration tests that exercise the full
``salt`` / ``salt-run`` lifecycle on the real event loop and gate on
``-W error::RuntimeWarning -X dev``-clean stderr.

New tests under tests/pytests/integration/cli/test_batch_options.py:

* test_batch_integer_size                   -- ``-b 2``
* test_batch_percentage                     -- ``-b 50%``
* test_batch_wait_between_subbatches        -- ``--batch-wait``
* test_batch_safe_limit_triggers_batching   -- ``--batch-safe-limit``
* test_batch_safe_size_one                  -- ``--batch-safe-size``
* test_batch_failhard_stops_on_first_bad_return  -- ``--failhard``
* test_batch_async_handoff_to_batch_manager -- ``--async`` (driver=master)
* test_batch_state_apply                    -- state.apply in batches
* test_batch_stop_halts_sync_batch          -- salt-run batch.stop
* test_batch_list_active_sees_sync_batch    -- salt-run batch.list_active
* test_batch_status_returns_live_progress   -- salt-run batch.status

Each test:

* runs through salt-factories' ``salt_cli`` / ``salt_run_cli``
  fixtures against a real master + 2 minions (no SyncWrapper
  patching)
* asserts ``cmd.returncode == 0`` for non-failhard tests
* runs an ``_assert_clean_stderr`` gate that rejects
  ``BaseEventLoop.shutdown_asyncgens`` /
  ``shutdown_default_executor`` markers, ``coroutine '...' was
  never awaited``, and any traceback on the CLI's stderr -- the
  exact signature of the regression that broke Windows zeromq 2 CI

Runner-visibility tests (9, 10, 11) spawn the sync batch in a
background thread targeting ``test.sleep 5`` so the runner has
~10s window to observe a mid-run batch; ``test.sleep`` avoids
``state.apply``'s per-minion lock so the tests are safe back-to-back.

The async-handoff test (7) captures the JID from the CLI's
"Executed batch command with job ID:" line, then polls until the
BatchManager retires it from ``batch.list_active`` -- verifying the
master-side event-bus path end-to-end.

Verified locally on Linux Py3.10:

  python -W error::RuntimeWarning -X dev -m pytest \
    tests/pytests/integration/cli/test_batch.py \
    tests/pytests/integration/cli/test_batch_options.py \
    --slow-tests --core-tests

  20 passed in 66.13s

(9 pre-existing batch tests + 11 new options tests.)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

test:full Run the full test suite

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant