perf: reduce srun overhead in ray.sub and gate driver on sandbox readiness #2827
Open
ananthsub wants to merge 1 commit into
…iness

ray.sub previously scaled srun/slurmctld RPCs with the node count: it launched one srun per worker (each followed by a 3s sleep), slept a fixed 120s before workers, polled head readiness via `srun --overlap test -f`, and polled cluster status via repeated `srun ... ray status`. At scale this dominated cluster bringup time and risked throttling slurmctld.

Replace the per-node fan-out with shared-filesystem file signaling so srun calls stay roughly constant in the node count:

- Launch all workers with a single batched srun (--nodes/--ntasks=N-1, --ntasks-per-node=1, --exclude=head); workers self-identify via SLURM_PROCID / SLURMD_NODENAME. Drops the per-worker loop, the 3s stagger, and the 120s sleep.
- Start the head without --block and touch STARTED_RAY_HEAD only after the GCS is listening, so workers connect on the first try; workers poll that file before `ray start --address` and retry the GCS connect.
- A head-side ray-status sidecar publishes the live worker_units count to a file; the submit host reads it instead of issuing `srun ... ray status` RPCs.
- The head container runs the driver inline (via a driver_command.sh file) and the submit host just waits on the head srun, removing the dedicated driver srun.

Sandbox readiness becomes a blocking gate: each per-node sandbox task polls its local port and touches SANDBOX_READY_<host>; the head waits for all instances before launching the driver. The SANDBOX_PORTS_DIR signal dir is always mounted and exported, independent of the SANDBOX_EXTRA_MOUNTS / SANDBOX_ENV_VARS knobs.

Also: clean stale Ray session state inside the head and worker retry loops (avoids the persisted-session AssertionError on retry); 0-indexed, zero-padded worker logs (ray-worker-%0Nt.log) for natural sorting; support single-node jobs (no workers needed); and suffix LOG_DIR with SLURM_RESTART_COUNT so a requeue does not clobber the previous attempt's logs and signal files.

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
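The single batched worker launch can be sketched as a small helper that derives the srun arguments from the node count. This is only an illustration under stated assumptions: the function name `worker_srun_args` and its two parameters are hypothetical and do not come from `ray.sub`; only the four srun flags are taken from the commit message.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from ray.sub): print the srun flags for launching
# all workers in one batched step, one flag per line.
#   $1 = total node count in the allocation (head + workers)
#   $2 = hostname of the head node, which is excluded from the worker step
worker_srun_args() {
  local num_nodes="$1" head_node="$2"
  local num_workers=$((num_nodes - 1))

  # Single-node job: there are no workers, so print nothing.
  if [ "$num_workers" -le 0 ]; then
    return 0
  fi

  printf '%s\n' \
    "--nodes=${num_workers}" \
    "--ntasks=${num_workers}" \
    "--ntasks-per-node=1" \
    "--exclude=${head_node}"
}

# Possible use inside a batch script (the worker command itself is elided):
#   mapfile -t args < <(worker_srun_args "$SLURM_JOB_NUM_NODES" "$head_node")
#   [ "${#args[@]}" -gt 0 ] && srun "${args[@]}" ... &
```

Because the flag list is computed once, the number of srun invocations for workers is one regardless of how many nodes the job has, which is the property the commit message is after.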
/ok to test 2bcbb17
What does this PR do ?
`ray.sub` previously scaled srun/slurmctld RPCs with the node count: it launched one srun per worker (each followed by a 3s sleep), slept a fixed 120s before workers, polled head readiness via `srun --overlap test -f`, and polled cluster status via repeated `srun ... ray status`. For larger jobs, this dominated cluster bringup time and risked throttling slurmctld.

This PR replaces the per-node fan-out with shared-filesystem file signaling so srun calls stay roughly constant relative to the node count:
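- Launch all workers with a single batched srun (`--nodes`/`--ntasks=N-1`, `--ntasks-per-node=1`, `--exclude=head`); workers self-identify via `SLURM_PROCID` / `SLURMD_NODENAME`. This drops the per-worker loop, the 3s stagger, and the 120s sleep.
- Start the head without `--block` and touch `STARTED_RAY_HEAD` only after the GCS is listening, so workers connect on the first try; workers poll that file before `ray start --address` and retry the GCS connect.
- A head-side ray-status sidecar publishes the live worker_units count to a file; the submit host reads it instead of issuing `srun ... ray status` RPCs.
- The head container runs the driver inline (via a `driver_command.sh` file) and the submit host just waits on the head srun, removing the dedicated driver srun.
- Sandbox readiness becomes a blocking gate: each per-node sandbox task polls its local port and touches `SANDBOX_READY_<host>`; the head waits for all instances before launching the driver. The `SANDBOX_PORTS_DIR` signal dir is always mounted and exported, independent of the `SANDBOX_EXTRA_MOUNTS` / `SANDBOX_ENV_VARS` knobs.

This change assumes that the `LOG_DIR` is on a shared filesystem visible across compute nodes.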
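The file-signaling gates described in this PR (workers waiting on `STARTED_RAY_HEAD`, the head waiting on every `SANDBOX_READY_<host>` marker) might look roughly like the sketch below. It is a minimal illustration, not the code in `ray.sub`: the function names, the timeout and interval defaults, and the whole-second polling are assumptions, and it relies on the signal directory being on a filesystem shared across nodes.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (names and defaults are illustrative, not from ray.sub).

# Block until a signal file exists, or fail after timeout_s seconds.
#   $1 = path, $2 = timeout in whole seconds, $3 = poll interval in whole seconds
wait_for_file() {
  local path="$1" timeout_s="${2:-600}" interval_s="${3:-2}"
  local waited=0
  while [ ! -e "$path" ]; do
    if [ "$waited" -ge "$timeout_s" ]; then
      echo "timed out after ${waited}s waiting for $path" >&2
      return 1
    fi
    sleep "$interval_s"
    waited=$((waited + interval_s))
  done
  return 0
}

# Count SANDBOX_READY_<host> markers currently present in a directory.
count_ready() {
  local dir="$1" n=0 f
  for f in "$dir"/SANDBOX_READY_*; do
    # An unmatched glob stays literal and does not exist, so it is skipped.
    if [ -e "$f" ]; then
      n=$((n + 1))
    fi
  done
  echo "$n"
}

# Block until at least `expected` sandbox markers exist, or fail on timeout.
#   $1 = signal dir, $2 = expected marker count, $3 = timeout, $4 = interval
wait_for_sandboxes() {
  local dir="$1" expected="$2" timeout_s="${3:-600}" interval_s="${4:-2}"
  local waited=0 ready
  while :; do
    ready=$(count_ready "$dir")
    if [ "$ready" -ge "$expected" ]; then
      return 0
    fi
    if [ "$waited" -ge "$timeout_s" ]; then
      echo "only ${ready}/${expected} sandboxes ready after ${waited}s" >&2
      return 1
    fi
    sleep "$interval_s"
    waited=$((waited + interval_s))
  done
}
```

Both gates only read the shared filesystem, so waiting costs no srun or slurmctld RPCs. A worker could call `wait_for_file "$LOG_DIR/STARTED_RAY_HEAD"` before `ray start --address`, and the head could call `wait_for_sandboxes "$SANDBOX_PORTS_DIR" "$num_nodes"` before launching the driver; where the markers live and how the expected count is computed are assumptions here.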
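This change also does the following:

- Cleans stale Ray session state inside the head and worker retry loops (avoids the persisted-session AssertionError on retry).
- Uses 0-indexed, zero-padded worker logs (`ray-worker-%0Nt.log`) for natural sorting.
- Supports single-node jobs (no workers needed).
- Suffixes `LOG_DIR` with `SLURM_RESTART_COUNT` so a requeue does not clobber the previous attempt's logs and signal files.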
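The requeue-safe log directory and the zero-padded worker log names mentioned in the commit message could be derived along these lines. This is a sketch under assumptions: the function names, the `_restart<N>` suffix format, and the rule for picking the padding width are all hypothetical; the source only states that `LOG_DIR` is suffixed with `SLURM_RESTART_COUNT` and that worker logs follow `ray-worker-%0Nt.log`.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not from ray.sub).

# Return a per-attempt log directory so a requeued job does not overwrite the
# previous attempt's logs and signal files. The "_restart<N>" suffix format is
# an assumption; SLURM_RESTART_COUNT is treated as 0 when it is unset.
attempt_log_dir() {
  local base="$1" restart="${SLURM_RESTART_COUNT:-0}"
  echo "${base}_restart${restart}"
}

# Return an srun output filename pattern whose task index (%t) is zero-padded
# to the width of the largest 0-indexed worker id, so that logs sort naturally.
# Choosing the width this way is an assumption.
worker_log_pattern() {
  local num_workers="$1"
  local max_index=$(( num_workers > 0 ? num_workers - 1 : 0 ))
  local width=${#max_index}
  echo "ray-worker-%0${width}t.log"
}
```

With 128 workers the largest index is 127, so the pattern pads to three digits and `ray-worker-002.log` sorts before `ray-worker-010.log`.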
Issues
List issues that this PR closes (syntax):
Usage
# Add a code snippet demonstrating how to use this
Before your PR is "Ready for review"
Pre checks:
Additional Information