fix(train): allow zero-step training with bias adjustment#5477
fix(train): allow zero-step training with bias adjustment#5477njzjz-bot wants to merge 7 commits into
Conversation
Skip post-training bias adjustment when no training step has run, so zero-step jobs can keep the existing initial-checkpoint behavior without evaluating step -1 learning rates. Authored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5)
|
Caution Review failedAn error occurred during the review process. Please try again later. 📝 WalkthroughWalkthroughThis PR adds a step-count guard to the PyTorch trainer's post-training bias-change block, reformats the corresponding Paddle trainer conditional, and adds tests for zero-step training that verify initial checkpoint creation and metadata. ChangesTrainer Zero-Step Guard and Test Coverage
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@source/tests/pd/test_training.py`:
- Around line 167-181: This test method
test_zero_step_with_change_bias_saves_initial_checkpoint runs training and needs
a 60s timeout to prevent CI hangs; add a pytest timeout decorator to the method
(e.g. `@pytest.mark.timeout`(60)) and ensure pytest is imported in the test module
so the decorator is available; locate the method by name in the test class in
test_training.py and place the decorator immediately above the def to enforce
the <=60s limit.
- Line 176: The assertion compares a Path object to a raw string
(Path("model.ckpt-0.pd") vs Path("checkpoint").read_text()), causing spurious
failures; change the test to compare Path to Path by wrapping the read text as a
Path and stripping whitespace/newline: replace the RHS with
Path(Path("checkpoint").read_text().strip()) so the assertion becomes
self.assertEqual(Path("model.ckpt-0.pd"),
Path(Path("checkpoint").read_text().strip())). This ensures both sides are Path
objects and ignores trailing newlines.
In `@source/tests/pt/test_training.py`:
- Around line 266-282: Add the 60s timeout decorator to the test function by
annotating test_zero_step_with_change_bias_saves_initial_checkpoint with
`@TRAINING_TEST_TIMEOUT` (place the decorator immediately above the def). If
TRAINING_TEST_TIMEOUT is not in scope in that module, import it where other test
helpers are imported so the symbol is available before use; keep the rest of the
test unchanged.
- Line 275: The assertion mixes a Path object and a raw string; change the
comparison so both sides use the same type and strip any newline: replace the
RHS Path("checkpoint").read_text() with
Path(Path("checkpoint").read_text().strip()) (or alternatively compare
str(Path("model.ckpt-0.pt")) to Path("checkpoint").read_text().strip()) so the
call in self.assertEqual compares two strings or two Path objects consistently.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: c57bc0f6-4dcf-4067-87e2-99022da10b56
📒 Files selected for processing (4)
deepmd/pd/train/training.pydeepmd/pt/train/training.pysource/tests/pd/test_training.pysource/tests/pt/test_training.py
There was a problem hiding this comment.
Pull request overview
This PR fixes zero-step training when change_bias_after_training is enabled for PyTorch and Paddle, ensuring the initial checkpoint path remains valid without running post-training bias adjustment.
Changes:
- Adds a
num_steps > start_stepguard before bias adjustment in PT/PD trainers. - Adds regression tests for zero-step training with bias adjustment enabled.
- Verifies saved checkpoint metadata reports
step=0andlr=0.0.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
deepmd/pt/train/training.py |
Skips PT bias adjustment when no training step ran. |
deepmd/pd/train/training.py |
Skips Paddle bias adjustment when no training step ran. |
source/tests/pt/test_training.py |
Adds PT regression coverage for zero-step checkpoint save. |
source/tests/pd/test_training.py |
Adds Paddle regression coverage for zero-step checkpoint save. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #5477 +/- ##
==========================================
- Coverage 82.25% 82.16% -0.09%
==========================================
Files 833 896 +63
Lines 89100 102586 +13486
Branches 4225 4339 +114
==========================================
+ Hits 73290 84291 +11001
- Misses 14518 16958 +2440
- Partials 1292 1337 +45 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
Compare checkpoint pointers as paths and add timeout guards to zero-step training regression tests. Authored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5)
|
Thanks, fixed in 631039c:
Validation:
Authored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5) |
This reverts commit 631039c.
Compare checkpoint pointers as paths without adding timeout guards, since the regression covers the zero-step no-op path. Authored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5)
|
Update: I kept the checkpoint pointer assertion fix but intentionally removed the added timeout guards in d27334c. This regression covers Authored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5) |
Build the expected zero-step checkpoint path from trainer.save_ckpt so the regression follows each test fixture's configured checkpoint prefix. Authored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5)
|
Fixed the failing tests in 3d7168f. The fixtures configure Validation:
Authored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5) |
Paddle's load helper treats pathlib.Path as a buffer on the tested version, so pass the checkpoint path as a string in the zero-step regression test.\n\nAuthored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5)
| self.assertEqual(0, train_infos["step"]) | ||
| self.assertEqual(0.0, train_infos["lr"]) |
There was a problem hiding this comment.
This regression test passes on the unfixed code, so it does not guard the bug it targets. With numb_steps=0, the pre-existing if self.num_steps == 0: block re-saves <ckpt>-0.pt with step=0, lr=0 and rewrites the checkpoint pointer after the bias block runs. So every assertion here — file exists, latest_model == -0, pointer content, train_infos["step"]==0, train_infos["lr"]==0.0 — holds whether or not the num_steps > start_step guard is applied.
The only behavior the fix actually changes is whether model_change_out_bias mutates the output bias, which this test never checks. Verified: reverting the guard in training.py while keeping this test still yields 1 passed.
Suggest mirroring test_ema_checkpoint_keeps_changed_out_bias — patch model_change_out_bias and assert it is not called (or assert out_bias is unchanged) when numb_steps=0. That makes it a genuine regression test.
There was a problem hiding this comment.
Fixed in 1273e6e. The PT zero-step regression test now patches deepmd.pt.train.training.model_change_out_bias, returns the original model if it is called, and asserts assert_not_called() after trainer.run(). This makes the test fail if the num_steps > start_step guard is removed, while keeping the existing checkpoint metadata assertions.
Validation:
pytest source/tests/pt/test_training.py::TestEnergyModelSeA::test_zero_step_with_change_bias_saves_initial_checkpoint -vuvx ruff check .uvx ruff format --check .
| self.assertEqual(0, train_infos["step"]) | ||
| self.assertEqual(0.0, train_infos["lr"]) |
There was a problem hiding this comment.
Same issue as the PT test: this passes on the unfixed code. With numb_steps=0, the pre-existing if self.num_steps == 0: block re-saves <ckpt>-0.pd with step=0, lr=0 and rewrites the checkpoint pointer after the bias block, so the assertions here (existence, latest_model, pointer, step==0, lr==0.0) are satisfied regardless of the num_steps > start_step guard. The only thing the fix changes — whether model_change_out_bias runs — is never asserted.
Suggest patching model_change_out_bias and asserting it is not called for numb_steps=0. Note the PD suite also has no test exercising the true branch (bias adjustment running for numb_steps>0), unlike PT's test_ema_checkpoint_keeps_changed_out_bias.
There was a problem hiding this comment.
Fixed in 1273e6e. The Paddle zero-step regression test now patches deepmd.pd.train.training.model_change_out_bias, returns the original model if it is called, and asserts assert_not_called() after trainer.run(). This directly checks the behavior changed by the guard instead of only checking the checkpoint rewrite path.
Validation:
uvx ruff check .uvx ruff format --check .
I could not run the Paddle test locally because this environment is missing paddle (ModuleNotFoundError: No module named paddle).
Problem
numb_steps=0is a valid no-optimization path that should save the initial checkpoint.change_bias_after_trainingis enabled, the post-training bias adjustment still ran after zero steps and evaluated learning-rate/checkpoint metadata at step-1.Change
change_bias_after_training=trueand verify the saved*-0checkpoint metadata.Notes
python3 -m pytest ...could not run in this workspace because pytest is not installed in the available Python environment.uvx ruff check deepmd/pd/train/training.py deepmd/pt/train/training.py source/tests/pd/test_training.py source/tests/pt/test_training.pypassed.uvx ruff format --check deepmd/pd/train/training.py deepmd/pt/train/training.py source/tests/pd/test_training.py source/tests/pt/test_training.pypassed.Authored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5)
Summary by CodeRabbit
Bug Fixes
Refactor
Tests