
perf: reduce srun overhead in ray.sub and gate driver on sandbox readiness#2827

Open
ananthsub wants to merge 1 commit into
NVIDIA-NeMo:mainfrom
ananthsub:ansubramania/raysub-srun-optimizations


ananthsub (Contributor) commented Jun 15, 2026

What does this PR do?

ray.sub previously scaled its srun/slurmctld RPCs with the node count: it launched one srun per worker (each followed by a 3s sleep), slept a fixed 120s before starting workers, polled head readiness via srun --overlap test -f, and polled cluster status via repeated srun ... ray status. For larger jobs, this dominated cluster bringup time and risked throttling slurmctld.

This PR replaces the per-node fan-out with shared-filesystem file signaling so the number of srun calls stays roughly constant as the node count grows:

  1. Launch all workers with a single batched srun (--nodes/--ntasks=N-1, --ntasks-per-node=1, --exclude=head); workers self-identify via SLURM_PROCID / SLURMD_NODENAME. This change drops the per-worker loop, the 3s stagger, and the 120s sleep.
  2. Start the head without --block and touch STARTED_RAY_HEAD only after the GCS is listening, so workers connect on the first try; workers poll that file before ray start --address and retry the GCS connect.
  3. A head-side ray-status sidecar publishes the live worker_units count to a file; the submit host reads it instead of issuing srun ... ray status RPCs.
  4. The head container runs the driver inline (via a driver_command.sh file) and the submit host just waits on the head srun, removing the dedicated driver srun.
  5. Sandbox readiness becomes a blocking gate: each per-node sandbox task polls its local port and touches SANDBOX_READY_<host>; the head waits for all instances before launching the driver. The SANDBOX_PORTS_DIR signal dir is always mounted and exported, independent of the SANDBOX_EXTRA_MOUNTS / SANDBOX_ENV_VARS knobs.
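As a rough illustration of item 1, the batched worker launch could be assembled along these lines. This is a dry-run sketch, not the script from this PR: `NUM_NODES`, `HEAD_NODE`, and `worker_entrypoint.sh` are hypothetical names, and the command is only printed, never executed.

```shell
# Sketch: one batched srun for all workers instead of one srun per worker.
# NUM_NODES, HEAD_NODE and worker_entrypoint.sh are hypothetical stand-ins;
# the real ray.sub derives its values from the Slurm allocation.
NUM_NODES=4
HEAD_NODE=node000

build_worker_srun() {
    # One task per non-head node. Prints nothing for a single-node job.
    num_workers=$((NUM_NODES - 1))
    if [ "$num_workers" -le 0 ]; then
        return 0
    fi
    printf 'srun --nodes=%d --ntasks=%d --ntasks-per-node=1 --exclude=%s %s\n' \
        "$num_workers" "$num_workers" "$HEAD_NODE" "worker_entrypoint.sh"
}

# Inside each task, a worker would identify itself from the environment that
# srun provides, e.g. "worker ${SLURM_PROCID} on ${SLURMD_NODENAME}".
WORKER_CMD=$(build_worker_srun)
echo "$WORKER_CMD"
```

With four nodes this prints a single command covering three workers, so the number of srun invocations does not grow with the allocation.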

This change assumes that LOG_DIR is on a shared filesystem visible to all compute nodes and the submit host.
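Because readers on other hosts see these files through the shared filesystem, a publisher such as the ray-status sidecar would probably want to avoid exposing partially written files. The sketch below shows one common approach, write-then-rename, under the assumption that rename within a single directory is atomic on the filesystem in use (worth verifying for a given shared filesystem). The `worker_units` file name and both function names are hypothetical.

```shell
# Sketch: publish a value to a signal file so readers never see a partial write.
SIGNAL_DIR=$(mktemp -d)                # stands in for a directory under LOG_DIR
COUNT_FILE="$SIGNAL_DIR/worker_units"  # hypothetical file name

publish_count() {
    # Write to a temp file in the same directory, then rename over the target.
    tmp="$COUNT_FILE.tmp.$$"
    printf '%s\n' "$1" > "$tmp"
    mv -f "$tmp" "$COUNT_FILE"
}

read_count() {
    # Report 0 until the publisher has written at least once.
    if [ -f "$COUNT_FILE" ]; then
        cat "$COUNT_FILE"
    else
        echo 0
    fi
}

BEFORE=$(read_count)
publish_count 3
AFTER=$(read_count)
echo "before=$BEFORE after=$AFTER"   # prints: before=0 after=3
```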

This change also does the following:

  1. Cleans stale Ray session state inside the head and worker retry loops
  2. Adds 0-indexed, zero-padded worker logs for natural sorting
  3. Suffixes LOG_DIR with SLURM_RESTART_COUNT so a requeue does not clobber the previous attempt's logs and signal files
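A minimal sketch of the naming in items 2 and 3 follows. The exact suffix format and pad width used by ray.sub are not stated here, so the `_<count>` suffix, the default of 0 when `SLURM_RESTART_COUNT` is unset, and the width rule (digits of the largest worker index) are all assumptions; both function names are hypothetical.

```shell
# Sketch: per-attempt log directory and zero-padded worker log names.
log_dir_for_attempt() {
    # $1 = base directory, $2 = restart count (empty on the first attempt).
    # The "_<count>" suffix format is an assumption for illustration.
    printf '%s_%s\n' "$1" "${2:-0}"
}

worker_log_name() {
    # $1 = 0-indexed worker id, $2 = total number of workers.
    last=$(($2 - 1))
    width=${#last}    # digits needed for the largest worker index
    printf "ray-worker-%0${width}d.log\n" "$1"
}

log_dir_for_attempt /tmp/ray-logs "${SLURM_RESTART_COUNT:-}"
worker_log_name 7 128
```

Zero-padding keeps `ls` output in worker order, so for example index 7 of 128 workers sorts before index 10.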

Issues

List issues that this PR closes (syntax):


Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.


…iness

ray.sub previously scaled srun/slurmctld RPCs with the node count: it launched
one srun per worker (each followed by a 3s sleep), slept a fixed 120s before
workers, polled head readiness via `srun --overlap test -f`, and polled cluster
status via repeated `srun ... ray status`. At scale this dominated cluster
bringup time and risked throttling slurmctld.

Replace the per-node fan-out with shared-filesystem file signaling so srun calls
stay roughly constant in the node count:

- Launch all workers with a single batched srun (--nodes/--ntasks=N-1,
  --ntasks-per-node=1, --exclude=head); workers self-identify via SLURM_PROCID /
  SLURMD_NODENAME. Drops the per-worker loop, the 3s stagger, and the 120s sleep.
- Start the head without --block and touch STARTED_RAY_HEAD only after the GCS is
  listening, so workers connect on the first try; workers poll that file before
  `ray start --address` and retry the GCS connect.
- A head-side ray-status sidecar publishes the live worker_units count to a file;
  the submit host reads it instead of issuing `srun ... ray status` RPCs.
- The head container runs the driver inline (via a driver_command.sh file) and the
  submit host just waits on the head srun, removing the dedicated driver srun.
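The head-readiness handshake described in the bullets above can be sketched as a bounded poll on a marker file. This is a local simulation under stated assumptions: a temporary directory stands in for the shared signal directory, a background subshell stands in for the head, and `wait_for_file` is a hypothetical helper rather than a function from ray.sub.

```shell
# Sketch: a worker waits for the head's marker file before connecting.
wait_for_file() {
    # $1 = path, $2 = timeout in whole seconds, $3 = poll interval in whole seconds.
    waited=0
    while [ ! -e "$1" ]; do
        if [ "$waited" -ge "$2" ]; then
            return 1
        fi
        sleep "$3"
        waited=$((waited + $3))
    done
    return 0
}

SIGNAL_DIR=$(mktemp -d)   # stands in for the shared signal directory

# Stand-in for the head: touch the marker only once the service is up.
( sleep 1; touch "$SIGNAL_DIR/STARTED_RAY_HEAD" ) &

if wait_for_file "$SIGNAL_DIR/STARTED_RAY_HEAD" 10 1; then
    HEAD_STATUS=ready     # a real worker would now run `ray start --address ...`
else
    HEAD_STATUS=timeout
fi
echo "head: $HEAD_STATUS"
wait
```

Polling a file on the shared filesystem costs no slurmctld RPCs, which is the point of replacing `srun --overlap test -f`.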

Sandbox readiness becomes a blocking gate: each per-node sandbox task polls its
local port and touches SANDBOX_READY_<host>; the head waits for all instances
before launching the driver. The SANDBOX_PORTS_DIR signal dir is always mounted
and exported, independent of the SANDBOX_EXTRA_MOUNTS / SANDBOX_ENV_VARS knobs.
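One way to picture the blocking gate is a loop that counts `SANDBOX_READY_<host>` markers until the expected number is present. The sketch below simulates two sandbox tasks locally; the helper names, the timeout, and the host names are hypothetical, and the real script's port polling is not reproduced.

```shell
# Sketch: the head blocks until every sandbox instance has signalled readiness.
count_ready() {
    # Count SANDBOX_READY_* markers in directory $1.
    n=0
    for f in "$1"/SANDBOX_READY_*; do
        if [ -e "$f" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

wait_for_sandboxes() {
    # $1 = signal directory, $2 = expected instances, $3 = timeout in seconds.
    waited=0
    while [ "$(count_ready "$1")" -lt "$2" ]; do
        if [ "$waited" -ge "$3" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

PORTS_DIR=$(mktemp -d)                                    # stands in for SANDBOX_PORTS_DIR
touch "$PORTS_DIR/SANDBOX_READY_node001"                  # first sandbox is already up
( sleep 1; touch "$PORTS_DIR/SANDBOX_READY_node002" ) &   # second comes up later

if wait_for_sandboxes "$PORTS_DIR" 2 10; then
    GATE=open      # the head would launch the driver here
else
    GATE=timeout
fi
echo "sandbox gate: $GATE"
wait
```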

Also: clean stale Ray session state inside the head and worker retry loops
(avoids the persisted-session AssertionError on retry); 0-indexed, zero-padded
worker logs (ray-worker-%0Nt.log) for natural sorting; support single-node jobs
(no workers needed); and suffix LOG_DIR with SLURM_RESTART_COUNT so a requeue
does not clobber the previous attempt's logs and signal files.
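The session cleanup inside the retry loops might look roughly like the following. It is a sketch only: `start_with_retries` and `fake_start` are hypothetical, the session directory is a temporary path rather than Ray's actual session location, and `fake_start` merely imitates a start command that refuses to run when stale state is present.

```shell
# Sketch: clear stale session state before every start attempt.
start_with_retries() {
    # $1 = max attempts, $2 = session directory to clear, remaining args = command.
    max=$1
    session_dir=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$max" ]; do
        rm -rf "$session_dir"    # drop state left behind by a failed attempt
        if "$@"; then
            return 0
        fi
        attempt=$((attempt + 1))
        sleep 1
    done
    return 1
}

WORK=$(mktemp -d)

fake_start() {
    # Stand-in for a start command: refuses to run on top of a stale session,
    # and fails on its first call after creating one.
    if [ -e "$WORK/session" ]; then
        return 1
    fi
    mkdir -p "$WORK/session"
    calls=$(cat "$WORK/calls" 2>/dev/null || echo 0)
    calls=$((calls + 1))
    echo "$calls" > "$WORK/calls"
    [ "$calls" -ge 2 ]
}

if start_with_retries 3 "$WORK/session" fake_start; then
    RESULT=ok
else
    RESULT=failed
fi
echo "start: $RESULT"
```

Without the `rm -rf` at the top of the loop, the second attempt in this simulation would hit the leftover session directory and every retry would fail.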

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
@ananthsub ananthsub requested review from macandro96 and yfw June 15, 2026 17:55
@ananthsub ananthsub requested a review from a team as a code owner June 15, 2026 17:55
copy-pr-bot (Bot) commented Jun 15, 2026

Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@ananthsub ananthsub added the CI:Lfast Runs a fast test suite and re-use nightly `main` container (but sync dependencies to PRs version) label Jun 15, 2026
ananthsub (Contributor, Author) commented

/ok to test 2bcbb17

