fix(examples): enable activation checkpointing for phi_4_squad by akoumpa · Pull Request #2634 · NVIDIA-NeMo/Automodel

akoumpa · 2026-06-18T17:09:22Z

What

Fixes the recurring OOM in the phi_4_squad sft_ckpt_robustness functional test.

Two commits:

fix(examples) — the actual fix: enable activation_checkpointing: true for phi_4_squad.
test(checkpoint) — memory hygiene in the robustness test (reclaim each phase's model + optimizer before the next builds its own). Useful on its own, but not what fixes this OOM.

Root cause (it was not a leak)

Profiled per-step CUDA memory on 8×H100. The OOM is not cross-phase accumulation, not FSDP, not validation, not the optimizer/_supports:

step=0 TRAIN alloc=10.33 peak=68.03 GiB
step=1 TRAIN alloc=10.33 peak=73.94 GiB
...
step=4 VAL   alloc=12.88 peak=15.44 GiB
→ step 5 OOM

alloc is flat at ~10 GiB every step → no leak / no accumulation.
Every training step peaks at ~70–74 GiB — phi-4 (14B) keeps ~60 GiB of activations per fwd/bwd, running right at the 80 GiB edge. SQuAD's variable-length batches spike some steps past 80 GiB.
The resume phase trains max_steps + 3 (8) steps vs the base 5, so it gets more rolls of the dice and reliably hits a spiking batch → OOM. (Base phase survives by a hair.)
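The numbers above can be sanity-checked with a little arithmetic. This is a minimal sketch, not part of the PR: it parses the log lines quoted above (the elided steps stay elided), derives the per-step transient as peak minus steady allocation, and then illustrates why 8 steps are riskier than 5. The per-step spike probability `P_SPIKE` is a purely hypothetical value chosen for illustration, and steps are assumed to be independent.

```python
import re

# Log lines copied from the profile above; elided steps are left out, not guessed.
LOG = """\
step=0 TRAIN alloc=10.33 peak=68.03 GiB
step=1 TRAIN alloc=10.33 peak=73.94 GiB
step=4 VAL   alloc=12.88 peak=15.44 GiB
"""

CAPACITY_GIB = 80.0  # per-GPU capacity quoted in the PR description
LINE = re.compile(r"step=(\d+)\s+(\w+)\s+alloc=([\d.]+)\s+peak=([\d.]+)")


def parse(log):
    """Turn each log line into a dict of step, phase, alloc and peak (GiB)."""
    rows = []
    for m in LINE.finditer(log):
        step, phase, alloc, peak = m.groups()
        rows.append({"step": int(step), "phase": phase,
                     "alloc": float(alloc), "peak": float(peak)})
    return rows


def p_at_least_one(p_spike, n_steps):
    """Chance that at least one of n independent steps hits a spiking batch."""
    return 1.0 - (1.0 - p_spike) ** n_steps


train = [r for r in parse(LOG) if r["phase"] == "TRAIN"]

# Transient (mostly activations) = peak minus the steady allocation.
transients = [round(r["peak"] - r["alloc"], 2) for r in train]
# Headroom left below the 80 GiB capacity at each training step's peak.
headroom = [round(CAPACITY_GIB - r["peak"], 2) for r in train]

P_SPIKE = 0.2  # HYPOTHETICAL per-step probability, not measured in the PR
base_risk = p_at_least_one(P_SPIKE, 5)    # base phase: 5 steps
resume_risk = p_at_least_one(P_SPIKE, 8)  # resume phase: max_steps + 3 = 8 steps

print("transient GiB:", transients)
print("headroom GiB:", headroom)
print(f"risk over 5 steps: {base_risk:.2f}, over 8 steps: {resume_risk:.2f}")
```

The transients come out near the ~60 GiB quoted above, and whatever the true spike probability is, the longer resume phase always has the higher chance of hitting one.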

Enabling activation checkpointing bounds the per-step peak so it fits.

Verification

Reproduced and fixed end-to-end on EOS (8×H100), running only the phi_4_squad checkpoint-robustness test in the CI container:

Without activation_checkpointing: OOM at step 5 (75 GiB allocated), matching CI.
With activation_checkpointing: true: passes — all phases incl. resume, peak ~70 GiB, 1 passed in 624s.

Failing CI before the fix (reference): https://gitlab-master.nvidia.com/dl/JoC/nemo-ci/-/jobs/343744188

copy-pr-bot · 2026-06-18T17:09:26Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

The checkpoint-robustness test builds a full FSDP2 model + optimizer in every phase. A bare `del <trainer>` did not reclaim them between phases: the per-part optimizers are reachable from the model (built over model.parts), so the Adam state lingered, and ModelSupports pinned the model strongly via model._supports.

- Add `_release_recipe_memory()`: clear optimizer state in place (the Adam moments are the bulk) + drop the recipe's model/optimizer/scheduler refs + gc, so each phase's state is reclaimed before the next phase allocates its own.
- Hold the model weakly in `ModelSupports` so the capability descriptor can never be the reason a multi-GiB model stays resident after its owner is gone.

Memory hygiene only (verified: model + optimizer reclaim to ~0 between phases). This is NOT the OOM fix; see the activation-checkpointing change.

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
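The release pattern described in this commit message can be sketched roughly as follows. This is not the PR's `_release_recipe_memory()`: the function name `release_recipe_memory_sketch` and the recipe attribute names (`model`, `optimizer`, `lr_scheduler`) are assumptions made for illustration, and stand-in objects replace the real model and optimizer. The point it shows is why the state is cleared in place: if something else still references the optimizer, dropping the recipe's own reference frees nothing, whereas emptying `optimizer.state` releases the moments regardless of who else holds the optimizer.

```python
import gc
from types import SimpleNamespace


def release_recipe_memory_sketch(recipe, attrs=("model", "optimizer", "lr_scheduler")):
    """Hypothetical helper: clear optimizer state in place, drop refs, collect."""
    optimizers = getattr(recipe, "optimizer", None)
    if optimizers is not None:
        if not isinstance(optimizers, (list, tuple)):
            optimizers = [optimizers]
        for opt in optimizers:
            # In-place clear: frees the per-parameter state even if another
            # object (e.g. the model) still keeps the optimizer alive.
            opt.state.clear()

    # Drop the recipe's own strong references (attribute names are assumed).
    for name in attrs:
        if hasattr(recipe, name):
            setattr(recipe, name, None)

    gc.collect()

    # Return cached CUDA blocks to the driver when torch and a GPU are present.
    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    except ImportError:
        pass


# Stand-ins: the optimizer is also reachable from the model, as in the commit message.
lingering_opt = SimpleNamespace(state={"w": {"exp_avg": [0.0] * 4, "exp_avg_sq": [0.0] * 4}})
model = SimpleNamespace(parts=[], optimizers=[lingering_opt])
recipe = SimpleNamespace(model=model, optimizer=[lingering_opt], lr_scheduler=object())

release_recipe_memory_sketch(recipe)
print(lingering_opt.state, recipe.model, recipe.optimizer)
```

Here `model` still references `lingering_opt` after the call, yet its state dictionary is empty, which is the effect the in-place clear is meant to achieve.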

phi-4 (14B) stores ~60 GiB of activations per training step; without recomputation a single forward/backward peaks at ~74 GiB on an 80 GiB H100 and OOMs on long SQuAD batches.

This surfaced in the sft_ckpt_robustness suite, whose resume phase trains max_steps+3 steps and reliably hits the spike (steady alloc is a flat ~10 GiB, so it is a per-step activation peak, not a leak or cross-phase accumulation).

Enabling activation_checkpointing bounds the per-step peak; verified end-to-end on 8xH100 (EOS): the phi_4_squad checkpoint-robustness test passes (peak ~70 GiB, no OOM, all phases incl. resume).

Ref (failing CI before the fix): https://gitlab-master.nvidia.com/dl/JoC/nemo-ci/-/jobs/343744188

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
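For readers unfamiliar with the recomputation this commit message refers to, here is a generic sketch of activation checkpointing using PyTorch's `torch.utils.checkpoint.checkpoint`. It does not show how Automodel wires up `activation_checkpointing: true`; the `Block` and `TinyStack` classes and the `activation_checkpointing` constructor flag are illustrative assumptions. Each checkpointed block discards its intermediate activations in the forward pass and recomputes them during backward, trading extra compute for a lower memory peak while producing the same gradients.

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint


class Block(nn.Module):
    """Toy residual feed-forward block (stand-in for a transformer layer)."""

    def __init__(self, dim):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        return x + self.ff(x)


class TinyStack(nn.Module):
    def __init__(self, dim=16, depth=4, activation_checkpointing=False):
        super().__init__()
        self.blocks = nn.ModuleList([Block(dim) for _ in range(depth)])
        self.activation_checkpointing = activation_checkpointing  # hypothetical flag

    def forward(self, x):
        for block in self.blocks:
            if self.activation_checkpointing and self.training:
                # Keep only the block input; recompute the rest in backward.
                x = checkpoint(block, x, use_reentrant=False)
            else:
                x = block(x)
        return x


def grads(activation_checkpointing):
    """Run one forward/backward on CPU and return all parameter gradients."""
    torch.manual_seed(0)  # same weights and inputs for both runs
    model = TinyStack(activation_checkpointing=activation_checkpointing).train()
    x = torch.randn(2, 8, 16)
    model(x).sum().backward()
    return [p.grad.clone() for p in model.parameters()]


plain, ckpt = grads(False), grads(True)
same = all(torch.allclose(a, b) for a, b in zip(plain, ckpt))
print("gradients match:", same)
```

The example only demonstrates that results are unchanged; the memory saving itself depends on model size and sequence length and is not measured here.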

akoumpa · 2026-06-19T02:15:16Z

/ok to test a42e855

akoumpa · 2026-06-19T02:23:49Z

Verified on EOS (8×H100), running just the phi_4_squad checkpoint-robustness test in the CI container (automodel:pipe.55165794):

without activation_checkpointing: OOMs at step 5 of the resume phase (~75 GiB allocated), reproducing the CI failure (ref job 343744188).
with activation_checkpointing: true (this PR): passes — 1 passed in 624s, per-step peak ~70 GiB, all phases incl. resume.

Per-step profiling: steady alloc flat at ~10 GiB, peak ~70–74 GiB — i.e. the OOM is a per-step activation peak, not a leak or cross-phase accumulation.

ModelSupports now holds the model via weakref.ref (this PR). As a result, the magi capability helper, which built a throwaway `_BackendModel(attn)`, let CPython collect the model before `.supports_*` was read, which raised ReferenceError.

Bind the model in each caller so it outlives the capability check; this mirrors production, where the model owns its `_supports` and is always live at `model.supports.X` access.

Fixes the 6 test_capabilities_magi.py failures in L0_Unit_Tests_CPU (GHA run 27801277153, job 82274392657).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
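The failure mode described in this commit message can be reproduced with the standard library alone. The sketch below is an illustration under assumptions, not the project's code: `Supports`, `Model`, `supports_flash`, `broken_helper` and `fixed_caller` are hypothetical names. Since dereferencing a dead `weakref.ref` returns `None`, the sketch raises `ReferenceError` explicitly; how the real `ModelSupports` surfaces the error is not shown in the source.

```python
import gc
import weakref


class Supports:
    """Hypothetical capability descriptor that holds its model weakly."""

    def __init__(self, model):
        self._model_ref = weakref.ref(model)  # does not keep the model alive

    def _model(self):
        model = self._model_ref()
        if model is None:
            raise ReferenceError("model was collected before the capability check")
        return model

    @property
    def supports_flash(self):
        return getattr(self._model(), "attn", None) == "flash"


class Model:
    def __init__(self, attn):
        self.attn = attn
        # Strong ref model -> supports, weak ref supports -> model: no cycle.
        self._supports = Supports(self)

    @property
    def supports(self):
        return self._supports


def broken_helper(attn):
    # Throwaway model: nothing holds it once this expression has been evaluated.
    return Model(attn).supports


def fixed_caller(attn):
    model = Model(attn)  # bound to a local, so it outlives the check
    return model.supports.supports_flash


supports = broken_helper("flash")
gc.collect()  # not needed on CPython (refcounting), kept for other runtimes
try:
    supports.supports_flash
    outcome = "ok"
except ReferenceError:
    outcome = "ReferenceError"

print(outcome, fixed_caller("flash"))
```

Binding the model to a name for the duration of the check is the same remedy the commit applies in each test caller.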

akoumpa · 2026-06-19T03:18:07Z

/ok to test 7eae20f

* test(checkpoint): reclaim model+optimizer between robustness phases
* fix(examples): enable activation checkpointing for phi_4_squad

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
(cherry picked from commit c75d9cf)
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

akoumpa marked this pull request as draft June 18, 2026 17:16

akoumpa added labels r0.3.0 (Add for cherry-pick into release branch r0.3.0) and r0.5.0 (Auto-cherrypick to release branch. Apply before merge; cherrypick happens after merge.) and removed label r0.3.0, Jun 18, 2026

akoumpa added 2 commits June 18, 2026 19:12

akoumpa force-pushed the akoumpa/fix/ckpt-robustness-oom branch from d9e7937 to a42e855 June 19, 2026 02:12

akoumpa changed the title ~~test(checkpoint): free recipe memory between robustness phases to avoid OOM~~ fix(examples): enable activation checkpointing for phi_4_squad Jun 19, 2026

akoumpa marked this pull request as ready for review June 19, 2026 02:15

akoumpa requested a review from a team as a code owner June 19, 2026 02:15

copy-pr-bot Bot temporarily deployed to test June 19, 2026 02:15 Inactive

copy-pr-bot Bot temporarily deployed to nemo-ci June 19, 2026 02:15 Inactive

copy-pr-bot Bot temporarily deployed to public June 19, 2026 02:15 Inactive

HuiyingLi previously approved these changes Jun 19, 2026

copy-pr-bot Bot temporarily deployed to public June 19, 2026 02:17 Inactive

copy-pr-bot Bot temporarily deployed to public June 19, 2026 02:18 Inactive

akoumpa enabled auto-merge (squash) June 19, 2026 02:18

copy-pr-bot Bot temporarily deployed to nemo-ci June 19, 2026 02:20 Inactive

akoumpa dismissed HuiyingLi’s stale review via 7eae20f June 19, 2026 03:04

copy-pr-bot Bot temporarily deployed to test June 19, 2026 03:18 Inactive

copy-pr-bot Bot temporarily deployed to nemo-ci June 19, 2026 03:18 Inactive

copy-pr-bot Bot temporarily deployed to public June 19, 2026 03:18 Inactive

copy-pr-bot Bot temporarily deployed to public June 19, 2026 03:21 Inactive

copy-pr-bot Bot temporarily deployed to nemo-ci June 19, 2026 03:23 Inactive

HuiyingLi approved these changes Jun 19, 2026

akoumpa merged commit c75d9cf into main Jun 19, 2026
80 checks passed

akoumpa deleted the akoumpa/fix/ckpt-robustness-oom branch June 19, 2026 05:33

akoumpa mentioned this pull request Jun 20, 2026

cp: fix(examples): enable ac for phi_4_squad (2634) into r0.5.0 #2661

Merged

akoumpa merged 3 commits into main from akoumpa/fix/ckpt-robustness-oom

2 participants
