fix(examples): enable activation checkpointing for phi_4_squad #2634

Merged
akoumpa merged 3 commits into main from akoumpa/fix/ckpt-robustness-oom
Jun 19, 2026
@akoumpa akoumpa commented Jun 18, 2026

What

Fixes the recurring OOM in the phi_4_squad sft_ckpt_robustness functional test.

Two commits:

  1. fix(examples) — the actual fix: enable activation_checkpointing: true for phi_4_squad.
  2. test(checkpoint) — memory hygiene in the robustness test (reclaim each phase's model + optimizer before the next builds its own). Useful on its own, but not what fixes this OOM.

Root cause (it was not a leak)

Profiled per-step CUDA memory on 8×H100. The OOM is not cross-phase accumulation, not FSDP, not validation, not the optimizer/_supports:

step=0 TRAIN alloc=10.33 peak=68.03 GiB
step=1 TRAIN alloc=10.33 peak=73.94 GiB
...
step=4 VAL   alloc=12.88 peak=15.44 GiB
→ step 5 OOM

  • alloc is flat at ~10 GiB every step → no leak / no accumulation.
  • Every training step peaks at ~70–74 GiB — phi-4 (14B) keeps ~60 GiB of activations per fwd/bwd, running right at the 80 GiB edge. SQuAD's variable-length batches spike some steps past 80 GiB.
  • The resume phase trains max_steps + 3 = 8 steps versus 5 in the base phase, so it gets more rolls of the dice and reliably hits a spiking batch → OOM. (The base phase survives by a hair.)
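
The "more rolls of the dice" argument can be made concrete with a small calculation. This is only an illustrative sketch: it treats each training step as an independent draw with a fixed chance of hitting a spiking batch, which is a simplification, and the value of `p_spike` is a hypothetical number, not something measured in this PR.

```python
def p_at_least_one_spike(p_spike: float, steps: int) -> float:
    """Chance that at least one of `steps` independent steps hits a spiking batch."""
    if not 0.0 <= p_spike <= 1.0:
        raise ValueError("p_spike must be a probability in [0, 1]")
    return 1.0 - (1.0 - p_spike) ** steps


if __name__ == "__main__":
    p_spike = 0.2  # hypothetical per-step spike probability, for illustration only
    for name, steps in (("base", 5), ("resume", 8)):
        print(f"{name:>6}: {steps} steps -> {p_at_least_one_spike(p_spike, steps):.3f}")
```

Whatever the real per-step probability is, the 8-step phase is always at least as likely to hit a spike as the 5-step phase, which is consistent with the resume phase being the one that fails.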

Enabling activation checkpointing bounds the per-step peak so it fits.
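
For readers unfamiliar with the technique, the following is a minimal sketch of what activation checkpointing does, written against plain PyTorch rather than this repository's recipe code. `Block`, `TinyStack`, and `grads_for` are hypothetical names for illustration; the only library call assumed is `torch.utils.checkpoint.checkpoint`. Each checkpointed block keeps only its input during the forward pass and recomputes its internal activations during backward, trading extra compute for a lower activation peak.

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint


class Block(nn.Module):
    """Hypothetical stand-in for one transformer layer."""

    def __init__(self, dim: int) -> None:
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.ff(x)


class TinyStack(nn.Module):
    """Toy model with a flag that mirrors `activation_checkpointing: true`."""

    def __init__(self, dim: int = 16, depth: int = 4, activation_checkpointing: bool = False) -> None:
        super().__init__()
        self.blocks = nn.ModuleList([Block(dim) for _ in range(depth)])
        self.activation_checkpointing = activation_checkpointing

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            if self.activation_checkpointing and self.training:
                # Keep only the block input; recompute the rest during backward.
                x = checkpoint(block, x, use_reentrant=False)
            else:
                x = block(x)
        return x


def grads_for(activation_checkpointing: bool) -> list[torch.Tensor]:
    """Run one forward/backward with a fixed seed and return parameter gradients."""
    torch.manual_seed(0)
    model = TinyStack(activation_checkpointing=activation_checkpointing)
    x = torch.randn(2, 8, 16)
    model(x).sum().backward()
    return [p.grad.clone() for p in model.parameters()]
```

Calling `grads_for(False)` and `grads_for(True)` should give matching gradients: checkpointing changes when activations are materialized, not the result of the backward pass.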

Verification

Reproduced and fixed end-to-end on EOS (8×H100), running only the phi_4_squad checkpoint-robustness test in the CI container:

  • Without activation_checkpointing: OOM at step 5 (75 GiB allocated), matching CI.
  • With activation_checkpointing: true: passes — all phases incl. resume, peak ~70 GiB, 1 passed in 624s.

Failing CI before the fix (reference): https://gitlab-master.nvidia.com/dl/JoC/nemo-ci/-/jobs/343744188

copy-pr-bot Bot commented Jun 18, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@akoumpa akoumpa marked this pull request as draft June 18, 2026 17:16
@akoumpa akoumpa added the labels r0.3.0 (Add for cherry-pick into release branch r0.3.0) and r0.5.0 (Auto-cherrypick to release branch. Apply before merge; cherrypick happens after merge.) and removed the label r0.3.0 Jun 18, 2026
akoumpa added 2 commits June 18, 2026 19:12

test(checkpoint): reclaim model+optimizer between robustness phases

The checkpoint-robustness test builds a full FSDP2 model + optimizer in every
phase. A bare `del <trainer>` did not reclaim them between phases: the per-part
optimizers are reachable from the model (built over model.parts), so the Adam
state lingered, and ModelSupports pinned the model strongly via model._supports.

- Add `_release_recipe_memory()`: clear optimizer state in place (the Adam
  moments are the bulk) + drop the recipe's model/optimizer/scheduler refs +
  gc, so each phase's state is reclaimed before the next phase allocates its own.
- Hold the model weakly in `ModelSupports` so the capability descriptor can
  never be the reason a multi-GiB model stays resident after its owner is gone.

Memory hygiene only (verified: model + optimizer reclaim to ~0 between phases).
This is NOT the OOM fix — see the activation-checkpointing change.

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
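
The cleanup described in this commit message can be sketched roughly as follows. This is not the `_release_recipe_memory()` added by the PR; it is a hypothetical helper that assumes a recipe object exposing `model`, `optimizer` (a single optimizer or a list of per-part optimizers), and `lr_scheduler` attributes. The attribute names are assumptions made for illustration.

```python
import gc


def release_recipe_memory_sketch(recipe, attrs=("model", "optimizer", "lr_scheduler")) -> None:
    """Clear optimizer state in place, drop the recipe's references, then collect."""
    optimizer = getattr(recipe, "optimizer", None)
    # The recipe may hold one optimizer or a list of per-part optimizers.
    optimizers = optimizer if isinstance(optimizer, (list, tuple)) else [optimizer]
    for opt in optimizers:
        state = getattr(opt, "state", None)
        if state is not None:
            # Clearing in place frees the Adam moments even if something
            # else still holds a reference to the optimizer object.
            state.clear()

    for name in attrs:
        if hasattr(recipe, name):
            setattr(recipe, name, None)

    gc.collect()

    try:
        import torch  # optional: only needed to return cached blocks to the driver
    except ImportError:
        return
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
```

Clearing the state in place, rather than only deleting the reference, matters when the optimizer is still reachable from somewhere else, which is the situation the commit message describes.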

fix(examples): enable activation checkpointing for phi_4_squad

phi-4 (14B) stores ~60 GiB of activations per training step; without
recomputation a single forward/backward peaks at ~74 GiB on an 80 GiB H100 and
OOMs on long SQuAD batches. This surfaced in the sft_ckpt_robustness suite,
whose resume phase trains max_steps+3 steps and reliably hits the spike
(steady alloc is a flat ~10 GiB, so it is a per-step activation peak, not a
leak or cross-phase accumulation).

Enabling activation_checkpointing bounds the per-step peak; verified end-to-end
on 8xH100 (EOS): the phi_4_squad checkpoint-robustness test passes
(peak ~70 GiB, no OOM, all phases incl. resume).

Ref (failing CI before the fix):
https://gitlab-master.nvidia.com/dl/JoC/nemo-ci/-/jobs/343744188

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
@akoumpa akoumpa force-pushed the akoumpa/fix/ckpt-robustness-oom branch from d9e7937 to a42e855 June 19, 2026 02:12
@akoumpa akoumpa changed the title from "test(checkpoint): free recipe memory between robustness phases to avoid OOM" to "fix(examples): enable activation checkpointing for phi_4_squad" Jun 19, 2026
akoumpa commented Jun 19, 2026

/ok to test a42e855

@akoumpa akoumpa marked this pull request as ready for review June 19, 2026 02:15
@akoumpa akoumpa requested a review from a team as a code owner June 19, 2026 02:15
HuiyingLi previously approved these changes Jun 19, 2026
akoumpa commented Jun 19, 2026

Verified on EOS (8×H100), running just the phi_4_squad checkpoint-robustness test in the CI container (automodel:pipe.55165794):

  • without activation_checkpointing: OOMs at step 5 of the resume phase (~75 GiB allocated), reproducing the CI failure (ref job 343744188).
  • with activation_checkpointing: true (this PR): passes: 1 passed in 624s, per-step peak ~70 GiB, all phases incl. resume.

Per-step profiling: steady alloc flat at ~10 GiB, peak ~70–74 GiB — i.e. the OOM is a per-step activation peak, not a leak or cross-phase accumulation.
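
Per-step numbers of this kind can be collected with PyTorch's CUDA memory counters. The sketch below is an assumption about how such a log might be produced, not the profiling code used for this PR; `format_memory_line` and `log_cuda_memory` are hypothetical helpers. Resetting the peak counter after each step is what makes the peak a per-step value instead of a running maximum.

```python
GIB = 1024 ** 3


def format_memory_line(step: int, phase: str, alloc_bytes: int, peak_bytes: int) -> str:
    """Format one log line in the same shape as the profile shown in this PR."""
    return f"step={step} {phase:<5} alloc={alloc_bytes / GIB:.2f} peak={peak_bytes / GIB:.2f} GiB"


def log_cuda_memory(step: int, phase: str) -> str:
    """Print steady-state and peak allocation for the step that just finished."""
    import torch  # imported lazily so the formatter is usable without a GPU stack

    alloc = torch.cuda.memory_allocated()    # bytes currently held by tensors
    peak = torch.cuda.max_memory_allocated()  # high-water mark since the last reset
    line = format_memory_line(step, phase, alloc, peak)
    print(line)
    torch.cuda.reset_peak_memory_stats()      # start a fresh peak for the next step
    return line
```

A flat `alloc` with a high `peak` is the signature described above: memory is released after each step, so the pressure comes from activations held during a single forward/backward.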

ModelSupports now holds the model via weakref.ref (this PR), so the magi
capability helper that built a throwaway `_BackendModel(attn)` let CPython
collect it before `.supports_*` was read -> ReferenceError. Bind the model in
each caller so it outlives the capability check; this mirrors production, where
the model owns its `_supports` and is always live at `model.supports.X` access.

Fixes the 6 test_capabilities_magi.py failures in L0_Unit_Tests_CPU
(GHA run 27801277153, job 82274392657).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
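
The failure mode and the fix described in this commit message can be reproduced with a small standalone sketch. `ModelSupportsSketch` and `BackendModel` are hypothetical stand-ins, not the repository's classes, and the behaviour relies on CPython's reference counting collecting the temporary as soon as the constructor call returns.

```python
import weakref


class BackendModel:
    """Hypothetical stand-in for the throwaway model built by the test helper."""

    def __init__(self, attn: str) -> None:
        self.attn = attn


class ModelSupportsSketch:
    """Capability descriptor that holds its model weakly."""

    def __init__(self, model: BackendModel) -> None:
        self._model_ref = weakref.ref(model)

    def _model(self) -> BackendModel:
        model = self._model_ref()
        if model is None:
            raise ReferenceError("model was collected before the capability check")
        return model

    @property
    def supports_flash(self) -> bool:
        return self._model().attn == "flash"


def check_throwaway() -> bool:
    # The temporary BackendModel has no other owner, so CPython frees it
    # before `.supports_flash` is read and the weak reference is dead.
    return ModelSupportsSketch(BackendModel("flash")).supports_flash


def check_bound() -> bool:
    # Binding the model to a local keeps it alive for the whole check.
    model = BackendModel("flash")
    return ModelSupportsSketch(model).supports_flash
```

On CPython, `check_throwaway()` raises `ReferenceError` while `check_bound()` returns normally, which is the difference between the failing helper and the fixed callers.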
akoumpa commented Jun 19, 2026

/ok to test 7eae20f

@akoumpa akoumpa merged commit c75d9cf into main Jun 19, 2026
80 checks passed
@akoumpa akoumpa deleted the akoumpa/fix/ckpt-robustness-oom branch June 19, 2026 05:33
akoumpa added a commit that referenced this pull request Jun 20, 2026
* test(checkpoint): reclaim model+optimizer between robustness phases

The checkpoint-robustness test builds a full FSDP2 model + optimizer in every
phase. A bare `del <trainer>` did not reclaim them between phases: the per-part
optimizers are reachable from the model (built over model.parts), so the Adam
state lingered, and ModelSupports pinned the model strongly via model._supports.

- Add `_release_recipe_memory()`: clear optimizer state in place (the Adam
  moments are the bulk) + drop the recipe's model/optimizer/scheduler refs +
  gc, so each phase's state is reclaimed before the next phase allocates its own.
- Hold the model weakly in `ModelSupports` so the capability descriptor can
  never be the reason a multi-GiB model stays resident after its owner is gone.

Memory hygiene only (verified: model + optimizer reclaim to ~0 between phases).
This is NOT the OOM fix — see the activation-checkpointing change.

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix(examples): enable activation checkpointing for phi_4_squad

phi-4 (14B) stores ~60 GiB of activations per training step; without
recomputation a single forward/backward peaks at ~74 GiB on an 80 GiB H100 and
OOMs on long SQuAD batches. This surfaced in the sft_ckpt_robustness suite,
whose resume phase trains max_steps+3 steps and reliably hits the spike
(steady alloc is a flat ~10 GiB, so it is a per-step activation peak, not a
leak or cross-phase accumulation).

Enabling activation_checkpointing bounds the per-step peak; verified end-to-end
on 8xH100 (EOS): the phi_4_squad checkpoint-robustness test passes
(peak ~70 GiB, no OOM, all phases incl. resume).

Ref (failing CI before the fix):
https://gitlab-master.nvidia.com/dl/JoC/nemo-ci/-/jobs/343744188

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
(cherry picked from commit c75d9cf)
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
HuiyingLi pushed a commit that referenced this pull request Jun 20, 2026
…2661)

fix(examples): enable activation checkpointing for phi_4_squad (#2634)

(cherry picked from commit c75d9cf)

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Labels

r0.5.0 Auto-cherrypick to release branch. Apply before merge; cherrypick happens after merge.
