Skip to content

feat: single controller (w/o sync_weight)#2819

Draft
yuki-97 wants to merge 22 commits into
mainfrom
yukih/sc-entrypoint
Draft

feat: single controller (w/o sync_weight)#2819
yuki-97 wants to merge 22 commits into
mainfrom
yukih/sc-entrypoint

Conversation

@yuki-97

@yuki-97 yuki-97 commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Summary

Adds the SC (Single Controller) entrypoint for async GRPO on the TransferQueue data plane.

  • SingleControllerActor (nemo_rl/algorithms/single_controller.py): CPU-only Ray actor with three asyncio pumps.
    • _rollout_pump: iterates the dataloader → RolloutManager.generate_and_pushTQReplayBuffer.add writes N training rows per group. Gated by _buffer_capacity / max_inflight_prompts / _rollout_permitted.
    • _train_pump: sampler.evictsampler.select_advantage_pumpTQPolicy split API (begin/train_microbatch/finish_train_step) → dp_client.clear_samples.
    • _sync_weights: drain gate + (stubbed) WeightSynchronizer.sync_weights.
  • single_controller_utils/:
    • SC MasterConfig + AsyncRLConfig;
    • One driver-side setup() builds the full SingleControllerBundle via inlined _build_clusters / _build_generation / _build_trainer.
  • async_utils/: TQReplayBuffer (group-granular, owns producer-side tensorization) + StalenessSampler (filter-only).
  • Launcher + exemplar: examples/run_grpo_single_controller.py, examples/configs/grpo_math_1B_single_controller.yaml.
  • Megatron worker fix: begin_train_step also nulls config.no_sync_func so the outer model.no_sync() covers the last MB — otherwise register_grad_ready leaks counts past the explicit sync and asserts on step 2.

Known limits (TODOs in code)

  • max_train_steps not bounded by max_num_epochs * len(dataloader).
  • compute_prev_logprobs / compute_reference_logprobs gating is provisional.
  • sampler.select only enforces min_prompt_groups_per_batch; no max_* yet.
  • _sync_weights stubbed.
  • Validation dataset wiring not in.
  • Log stuffs.
  • Integrate nemogym setup.

Current state

  • Runnable end-to-end via the launcher; functional test (grpo_dp_single_controller.sh, Qwen3-0.6B / 2 GPU / 2 steps).
  • Accuracy not validatedsync_weights not wired, generator never refreshes.
  • Perf not measured — no profiling vs grpo_train_sync yet.

@copy-pr-bot

copy-pr-bot Bot commented Jun 15, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

mehraakash added a commit to mehraakash/RL that referenced this pull request Jun 16, 2026
Two bugs around the split-API state machine fixed in one pass:

1. **no_sync_func leak (latent assertion on step 2 with PP=1).**
   `forward_backward_no_pipelining` (the PP=1 path, which is the common
   one) wraps inner microbatches in `model.config.no_sync_func` but runs
   the *last* microbatch OUTSIDE of it. Our outer `with self.model.no_sync():`
   in `train_microbatch` was therefore bypassed for the trailing MB.
   `register_grad_ready` leaked per-param counts past the explicit DP
   sync at `finish_train_step`, and the next `begin_train_step` then
   asserted on the stale counts (typically step 2).

   Fix: in `begin_train_step` also save and null `no_sync_func` (set to
   `contextlib.nullcontext`). Restore both hooks at finish/abort via the
   new `_restore_saved_grad_sync_func` helper.

   Spotted in yuki-97's PR NVIDIA-NeMo#2819 (`begin_train_step` also nulls
   `config.no_sync_func`); same fix landed here.

2. **No exception safety around the open step (terrykong, NVIDIA-NeMo#2683:640).**
   If `train_microbatch` or `finish_train_step` raised mid-body, both
   `grad_sync_func` and (now also) `no_sync_func` would stay nulled and
   future steps would run with the PP scheduler bypass disabled silently.

   Fix: extract `_train_microbatch_body` and `_finish_train_step_body`,
   wrap each entry method in try/except that calls
   `_restore_saved_grad_sync_func` before re-raising. Caller is still
   expected to invoke `abort_train_step` (idempotent on the saved
   values) to drop `_train_step_state`.

3. **Cleanup (terrykong, NVIDIA-NeMo#2683:582).** Drop dead `no_sync_active` field
   from `_split_step_state_init` — never read or written; `no_sync` is
   applied via the `with self.model.no_sync():` context manager.

Also: add module-level `log = logging.getLogger(__name__)` so the
try/except handlers can `log.exception(...)` if the restore itself fails.

Signed-off-by: Akash Mehra <akamehra@nvidia.com>
@yuki-97 yuki-97 force-pushed the yukih/sc-entrypoint branch 3 times, most recently from c1fc94f to 3a55904 Compare June 19, 2026 07:18
yuki-97 and others added 7 commits June 19, 2026 00:54
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Co-authored-by: Akash Mehra <akamehra@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>

update functional test

Signed-off-by: Yuki Huang <yukih@nvidia.com>
… reserve/commit slots

Signed-off-by: Yuki Huang <yukih@nvidia.com>

update unit test

Signed-off-by: Yuki Huang <yukih@nvidia.com>
@yuki-97 yuki-97 force-pushed the yukih/sc-entrypoint branch from 3a55904 to 22b9e48 Compare June 19, 2026 07:58
yuki-97 added 7 commits June 20, 2026 23:48
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
@yuki-97 yuki-97 force-pushed the yukih/sc-entrypoint branch from d2b5182 to 98daa14 Compare June 21, 2026 13:30
yuki-97 added 2 commits June 21, 2026 06:46
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
@yuki-97 yuki-97 force-pushed the yukih/sc-entrypoint branch from 98daa14 to 7976488 Compare June 21, 2026 13:47
Signed-off-by: Yuki Huang <yukih@nvidia.com>
@yuki-97 yuki-97 force-pushed the yukih/sc-entrypoint branch 3 times, most recently from a049782 to 36c7e50 Compare June 21, 2026 15:00
Signed-off-by: Yuki Huang <yukih@nvidia.com>
@yuki-97 yuki-97 force-pushed the yukih/sc-entrypoint branch from 36c7e50 to 44fc9a4 Compare June 21, 2026 15:25
yuki-97 added 4 commits June 22, 2026 02:10
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant