perf: reduce srun overhead in ray.sub and gate driver on sandbox readiness by ananthsub · Pull Request #2827 · NVIDIA-NeMo/RL

ananthsub · 2026-06-15T17:55:24Z

What does this PR do ?

ray.sub previously scaled its srun/slurmctld RPCs with the node count: it launched one srun per worker (each followed by a 3s sleep), slept a fixed 120s before starting workers, polled head readiness via srun --overlap test -f, and polled cluster status via repeated srun ... ray status. For larger jobs, this dominated cluster bringup time and risked throttling slurmctld.
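To make the scaling concrete, here is a rough back-of-the-envelope sketch. It counts only the two sleeps named above (the fixed 120s and the 3s after each per-worker srun) and ignores srun latency and polling, so it is a lower bound on the old bringup time, not a measurement. The function name `old_fixed_sleep_seconds` is hypothetical and does not come from ray.sub.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of ray.sub: total time spent in the fixed
# sleeps of the old launch path for a job with the given node count.
# Assumes one head node and (nodes - 1) workers.
old_fixed_sleep_seconds() {
  local nodes="$1"
  local workers=$(( nodes - 1 ))
  # 120s fixed sleep, plus a 3s stagger after each per-worker srun.
  echo $(( 120 + 3 * workers ))
}

old_fixed_sleep_seconds 64    # 120 + 3 * 63 = 309
```

Under these assumptions the sleep floor grows linearly with the worker count, which is the cost the batched launch removes.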

This PR replaces the per-node fan-out with shared-filesystem file signaling so srun calls stay constant relative to the node count:

Launch all workers with a single batched srun (--nodes/--ntasks=N-1, --ntasks-per-node=1, --exclude=head); workers self-identify via SLURM_PROCID / SLURMD_NODENAME. This change drops the per-worker loop, the 3s stagger, and the 120s sleep.
Start the head without --block and touch STARTED_RAY_HEAD only after the GCS is listening, so workers connect on the first try; workers poll that file before ray start --address and retry the GCS connect.
A head-side ray-status sidecar publishes the live worker_units count to a file; the submit host reads it instead of issuing srun ... ray status RPCs.
The head container runs the driver inline (via a driver_command.sh file) and the submit host just waits on the head srun, removing the dedicated driver srun.
Sandbox readiness becomes a blocking gate: each per-node sandbox task polls its local port and touches SANDBOX_READY_<host>; the head waits for all instances before launching the driver. The SANDBOX_PORTS_DIR signal dir is always mounted and exported, independent of the SANDBOX_EXTRA_MOUNTS / SANDBOX_ENV_VARS knobs.
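The file-signaling gates described in the list above (workers waiting on STARTED_RAY_HEAD, the head waiting on every SANDBOX_READY_<host>) can be sketched as one polling helper. This is a minimal sketch under stated assumptions, not the code in ray.sub: the function name `wait_for_signals`, its argument order, and the default timeout are all hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not taken from ray.sub.
# Block until at least <expected> regular files named <prefix>* exist in
# <dir>, or until <timeout> seconds have passed.
# Usage: wait_for_signals <dir> <prefix> <expected> [timeout_s] [interval_s]
wait_for_signals() {
  local dir="$1" prefix="$2" expected="$3"
  local timeout="${4:-600}" interval="${5:-1}"
  local deadline=$(( $(date +%s) + timeout ))
  local count
  while :; do
    # Arithmetic expansion strips any padding that wc adds on some systems.
    count=$(( $(find "$dir" -maxdepth 1 -type f -name "${prefix}*" 2>/dev/null | wc -l) ))
    if [ "$count" -ge "$expected" ]; then
      return 0
    fi
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "timed out: ${count}/${expected} ${prefix}* files in ${dir}" >&2
      return 1
    fi
    sleep "$interval"
  done
}
```

With this helper, a sandbox task would `touch "$SANDBOX_PORTS_DIR/SANDBOX_READY_$(hostname)"` once its local port answers, and the head would call something like `wait_for_signals "$SANDBOX_PORTS_DIR" SANDBOX_READY_ "$NUM_SANDBOXES"` before launching the driver (`NUM_SANDBOXES` is an assumed variable name). The same helper with an expected count of 1 covers a worker waiting on the head-ready file.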

This change assumes that the LOG_DIR is on a shared filesystem visible across compute nodes.
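Because the signals are plain files read from other hosts, a writer such as the ray-status sidecar should avoid exposing a half-written value. One common way to do that is to write to a temporary file in the same directory and rename it into place. The sketch below illustrates that pattern only; `publish_count` and `read_count` are hypothetical names, and whether ray.sub publishes the worker_units count this way is not stated in this PR.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not taken from ray.sub.

# Write <value> to <file> via a temp file in the same directory, then rename.
# A rename within one directory replaces the target in a single step, so a
# reader sees either the old value or the new one.
publish_count() {
  local file="$1" value="$2" tmp
  tmp=$(mktemp "${file}.XXXXXX") || return 1
  printf '%s\n' "$value" > "$tmp" && mv -f "$tmp" "$file"
}

# Print the published value, or 0 if nothing has been published yet.
read_count() {
  local file="$1"
  if [ -r "$file" ]; then
    cat "$file"
  else
    echo 0
  fi
}
```

The submit host can then poll `read_count` on the shared path in a plain loop, with no srun involved.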

This change also does the following:

Cleans stale Ray session state inside the head and worker retry loops
Adds 0-indexed, zero-padded worker logs for natural sorting
Suffixes LOG_DIR with SLURM_RESTART_COUNT so a requeue does not clobber the previous attempt's logs and signal files
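The last two items can be illustrated with a short sketch. It assumes the worker logs use srun's `%t` task-ID filename pattern with a zero-pad width, and that the restart count defaults to 0 when Slurm has not set it. The helper names and the `-restart<N>` suffix format are hypothetical; the PR does not state the exact suffix ray.sub uses.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not taken from ray.sub.

# Build an srun output filename pattern whose task index is zero-padded
# wide enough for the largest 0-indexed worker, so logs sort naturally.
worker_log_pattern() {
  local workers="$1"
  [ "$workers" -ge 1 ] || return 1
  local max_index=$(( workers - 1 ))
  local width=${#max_index}   # number of digits in the largest index
  printf 'ray-worker-%%0%dt.log\n' "$width"
}

# Derive a per-attempt log directory from a base path. The "-restart<N>"
# format is an assumption made for illustration.
attempt_log_dir() {
  local base="$1"
  printf '%s-restart%s\n' "$base" "${SLURM_RESTART_COUNT:-0}"
}
```

For 12 workers, `worker_log_pattern 12` yields `ray-worker-%02t.log`, which srun would expand per task to names like `ray-worker-00.log` through `ray-worker-11.log`.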

Issues

List issues that this PR closes (syntax):

Usage

You can potentially add a usage example below

# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

...

…iness

ray.sub previously scaled srun/slurmctld RPCs with the node count: it launched one srun per worker (each followed by a 3s sleep), slept a fixed 120s before workers, polled head readiness via `srun --overlap test -f`, and polled cluster status via repeated `srun ... ray status`. At scale this dominated cluster bringup time and risked throttling slurmctld.

Replace the per-node fan-out with shared-filesystem file signaling so srun calls stay roughly constant in the node count:

- Launch all workers with a single batched srun (--nodes/--ntasks=N-1, --ntasks-per-node=1, --exclude=head); workers self-identify via SLURM_PROCID / SLURMD_NODENAME. Drops the per-worker loop, the 3s stagger, and the 120s sleep.
- Start the head without --block and touch STARTED_RAY_HEAD only after the GCS is listening, so workers connect on the first try; workers poll that file before `ray start --address` and retry the GCS connect.
- A head-side ray-status sidecar publishes the live worker_units count to a file; the submit host reads it instead of issuing `srun ... ray status` RPCs.
- The head container runs the driver inline (via a driver_command.sh file) and the submit host just waits on the head srun, removing the dedicated driver srun.

Sandbox readiness becomes a blocking gate: each per-node sandbox task polls its local port and touches SANDBOX_READY_<host>; the head waits for all instances before launching the driver. The SANDBOX_PORTS_DIR signal dir is always mounted and exported, independent of the SANDBOX_EXTRA_MOUNTS / SANDBOX_ENV_VARS knobs.

Also: clean stale Ray session state inside the head and worker retry loops (avoids the persisted-session AssertionError on retry); 0-indexed, zero-padded worker logs (ray-worker-%0Nt.log) for natural sorting; support single-node jobs (no workers needed); and suffix LOG_DIR with SLURM_RESTART_COUNT so a requeue does not clobber the previous attempt's logs and signal files.

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

copy-pr-bot · 2026-06-15T17:55:29Z

Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

ananthsub · 2026-06-15T21:14:52Z

/ok to test 2bcbb17

ananthsub requested review from macandro96 and yfw June 15, 2026 17:55

ananthsub requested a review from a team as a code owner June 15, 2026 17:55

ananthsub added the CI:Lfast Runs a fast test suite and re-use nightly `main` container (but sync dependencies to PRs version) label Jun 15, 2026

copy-pr-bot Bot temporarily deployed to public June 15, 2026 21:15 Inactive

copy-pr-bot Bot temporarily deployed to public June 15, 2026 21:16 Inactive

copy-pr-bot Bot temporarily deployed to test June 15, 2026 21:18 Inactive

copy-pr-bot Bot temporarily deployed to public June 15, 2026 21:19 Inactive

ananthsub wants to merge 1 commit into NVIDIA-NeMo:main from ananthsub:ansubramania/raysub-srun-optimizations
