Skip to content

feat(ppo): Megatron value-model sequence packing + context parallelism#2839

Open
bg51717 wants to merge 6 commits into
NVIDIA-NeMo:mainfrom
bg51717:ppo-mcore-value-seqpack-cp
Open

feat(ppo): Megatron value-model sequence packing + context parallelism#2839
bg51717 wants to merge 6 commits into
NVIDIA-NeMo:mainfrom
bg51717:ppo-mcore-value-seqpack-cp

Conversation

@bg51717

@bg51717 bg51717 commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

What does this PR do ?

Adds sequence packing and context parallelism (CP) support to the Megatron PPO value model, building on the in-model value head from PR #2825. Also fixes a value loss-scaling bug (the per-microbatch value loss was averaged instead of summed).

Issues

Closes #2687 (PR 2 of 2). Stacked on #2825 — the base branch is ppo-mcore-value-head until #2825 merges, after which this retargets to main.

Usage

Context parallelism for the value model requires sequence packing (matching Megatron-Core). Both are configured like the policy:

value:
  sequence_packing:
    enabled: true                    # now supported
  megatron_cfg:
    context_parallel_size: 2         # now supported (requires sequence_packing)
  make_sequence_length_divisible_by: 4   # must be a multiple of 2*CP

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Test

Tests run (all pass):

  • Unit (tests/unit/models/value/test_megatron_value_worker.py, mcore, 2 GPU): packed-unpack (variable-length, CPU), packed train step (incl. PP2), and get_values equivalence across SP / dynamic batching / PP / sequence packing / CP (CP=1 vs CP=2, bf16), plus DP/TP regression. 19 passed.
  • Functional (tests/functional/ppo_megatron.sh): passes.
  • Convergence (Qwen2.5-1.5B, GSM8K, 1n8g, 40 steps), train reward / val accuracy — the two nightly value-parallelism recipes:

valuetp2sp-dynbatch (TP2 + SP + dynamic batching; re-validates the loss-scaling fix does not regress convergence): 0.85 / 0.76
截屏2026-06-19 20 12 29

valuetp2sp-pp2cp2-pack (TP2 + SP + PP2 + CP2 + sequence packing): 0.84 / 0.72
截屏2026-06-19 20 13 46

@copy-pr-bot

copy-pr-bot Bot commented Jun 16, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@bg51717 bg51717 force-pushed the ppo-mcore-value-seqpack-cp branch from dde8f94 to b75f7d3 Compare June 19, 2026 09:01
Signed-off-by: bg51717 <biguo@nvidia.com>
@bg51717

bg51717 commented Jun 19, 2026

Copy link
Copy Markdown
Contributor Author

/ok to test 6cfe536

@bg51717 bg51717 added the CI:L1 Run doctests, unit tests, and functional tests label Jun 19, 2026
@bg51717

bg51717 commented Jun 19, 2026

Copy link
Copy Markdown
Contributor Author

/ok to test 6cfe536

@bg51717 bg51717 marked this pull request as ready for review June 19, 2026 15:32
@bg51717 bg51717 requested review from a team as code owners June 19, 2026 15:32
@bg51717 bg51717 requested a review from yuki-97 June 19, 2026 15:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CI:L1 Run doctests, unit tests, and functional tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Sequence Parallelism, Sequence Packing, Dynamic Batching, and Context Parallelism support in PPO Mcore value model

1 participant