fix(train): allow zero-step training with bias adjustment by njzjz-bot · Pull Request #5477 · deepmodeling/deepmd-kit

njzjz-bot · 2026-05-30T06:14:20Z

Problem

numb_steps=0 is a valid no-optimization path that should save the initial checkpoint.
When change_bias_after_training is enabled, the post-training bias adjustment still ran after zero steps and evaluated learning-rate/checkpoint metadata at step -1.

Change

Skip post-training bias adjustment unless at least one training step has run.
Keep the existing zero-step initial checkpoint save path for both PyTorch and Paddle backends.
Add PT/PD regression tests that run zero-step training with change_bias_after_training=true and verify the saved *-0 checkpoint metadata.
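The control flow described above can be sketched roughly as follows. This is a minimal illustration, not the deepmd-kit trainer: `ToyTrainer`, `bias_adjusted`, and `saved_steps` are hypothetical names, and only the `num_steps > start_step` guard, the rank-0 check, and the zero-step save path are taken from this PR's discussion.

```python
class ToyTrainer:
    """Hypothetical stand-in for a trainer; not the deepmd-kit class."""

    def __init__(self, num_steps, start_step=0, change_bias_after_training=True, rank=0):
        self.num_steps = num_steps
        self.start_step = start_step
        self.change_bias_after_training = change_bias_after_training
        self.rank = rank
        self.bias_adjusted = False  # records whether the bias block ran
        self.saved_steps = []  # records which checkpoint steps were saved

    def run(self):
        for _step in range(self.start_step, self.num_steps):
            pass  # an optimization step would go here

        # Post-training bias adjustment: skipped unless at least one step ran.
        if (
            self.change_bias_after_training
            and self.rank == 0
            and self.num_steps > self.start_step
        ):
            self.bias_adjusted = True

        # Existing zero-step path: still save the initial checkpoint.
        if self.num_steps == 0:
            self.saved_steps.append(0)


zero = ToyTrainer(num_steps=0)
zero.run()
trained = ToyTrainer(num_steps=3)
trained.run()
```

With `num_steps=0` the loop body never executes, the guard is false, and only the step-0 save happens; with `num_steps=3` the bias block runs as before.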

Notes

python3 -m pytest ... could not run in this workspace because pytest is not installed in the available Python environment.
uvx ruff check deepmd/pd/train/training.py deepmd/pt/train/training.py source/tests/pd/test_training.py source/tests/pt/test_training.py passed.
uvx ruff format --check deepmd/pd/train/training.py deepmd/pt/train/training.py source/tests/pd/test_training.py source/tests/pt/test_training.py passed.
Closes #4988 (Runtime Error when Step is 0).

Authored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5)

Summary by CodeRabbit

Bug Fixes
- Prevented unintended bias-adjustment during zero-step PyTorch training so the initial checkpoint is created and recorded correctly.
Refactor
- Clarified the post-training bias-adjustment conditional in Paddle for readability (no behavior change).
Tests
- Added tests for zero-step training with bias-adjustment enabled for both Paddle and PyTorch, verifying initial checkpoint creation and training metadata.

Skip post-training bias adjustment when no training step has run, so zero-step jobs can keep the existing initial-checkpoint behavior without evaluating step -1 learning rates. Authored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5)

coderabbitai · 2026-05-30T06:17:50Z

Caution

Review failed

An error occurred during the review process. Please try again later.

Walkthrough

This PR adds a step-count guard to the PyTorch trainer's post-training bias-change block, reformats the corresponding Paddle trainer conditional, and adds tests for zero-step training that verify initial checkpoint creation and metadata.

Changes

Trainer Zero-Step Guard and Test Coverage

| Layer / File(s) | Summary |
|---|---|
| PyTorch trainer step-count guard for bias-change block (`deepmd/pt/train/training.py`) | The `change_bias_after_training` conditional block at the end of `Trainer.run()` now requires `self.num_steps > self.start_step` in addition to the rank-0 check, preventing execution when the training loop performed zero steps. |
| Paddle trainer post-training block formatting (`deepmd/pd/train/training.py`) | Reformatted the `change_bias_after_training` conditional in Paddle's `Trainer.run()` to span multiple lines, preserving the identical rank-0 check logic. |
| Zero-step training test coverage, Paddle (`source/tests/pd/test_training.py`) | Adds `import paddle` and `test_zero_step_with_change_bias_saves_initial_checkpoint`, which runs zero-step training with `change_bias_after_training=True`, asserts `trainer.save_ckpt-0.pd` is created and matches `trainer.latest_model`, and verifies `_extra_state.train_infos.step == 0` and `lr == 0.0`. |
| Zero-step training test coverage, PyTorch (`source/tests/pt/test_training.py`) | Adds `test_zero_step_with_change_bias_saves_initial_checkpoint`, which runs zero-step training with `change_bias_after_training=True`, asserts `trainer.save_ckpt-0.pt` is created and is the latest model, checks the `checkpoint` pointer file, and verifies `model._extra_state.train_infos.step == 0` and `lr == 0.0`. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

Chengqian-Zhang
iProzd

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: enabling zero-step training with bias adjustment, which is the core functional fix in this PR. |
| Linked Issues check | ✅ Passed | The PR addresses issue `#4988` by allowing zero-step training with bias adjustment; it skips post-training bias adjustment when no steps have run and saves the initial checkpoint correctly. |
| Out of Scope Changes check | ✅ Passed | All changes directly address the zero-step training issue: the logic updates to `Trainer.run()` in both backends and the regression tests verifying the fix align with the requirements of issue `#4988`. |

coderabbitai

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@source/tests/pd/test_training.py`:
- Around line 167-181: This test method
test_zero_step_with_change_bias_saves_initial_checkpoint runs training and needs
a 60s timeout to prevent CI hangs; add a pytest timeout decorator to the method
(e.g. `@pytest.mark.timeout(60)`) and ensure pytest is imported in the test module
so the decorator is available; locate the method by name in the test class in
test_training.py and place the decorator immediately above the def to enforce
the <=60s limit.
- Line 176: The assertion compares a Path object to a raw string
(Path("model.ckpt-0.pd") vs Path("checkpoint").read_text()), causing spurious
failures; change the test to compare Path to Path by wrapping the read text as a
Path and stripping whitespace/newline: replace the RHS with
Path(Path("checkpoint").read_text().strip()) so the assertion becomes
self.assertEqual(Path("model.ckpt-0.pd"),
Path(Path("checkpoint").read_text().strip())). This ensures both sides are Path
objects and ignores trailing newlines.

In `@source/tests/pt/test_training.py`:
- Around line 266-282: Add the 60s timeout decorator to the test function by
annotating test_zero_step_with_change_bias_saves_initial_checkpoint with
`@TRAINING_TEST_TIMEOUT` (place the decorator immediately above the def). If
TRAINING_TEST_TIMEOUT is not in scope in that module, import it where other test
helpers are imported so the symbol is available before use; keep the rest of the
test unchanged.
- Line 275: The assertion mixes a Path object and a raw string; change the
comparison so both sides use the same type and strip any newline: replace the
RHS Path("checkpoint").read_text() with
Path(Path("checkpoint").read_text().strip()) (or alternatively compare
str(Path("model.ckpt-0.pt")) to Path("checkpoint").read_text().strip()) so the
call in self.assertEqual compares two strings or two Path objects consistently.
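The type mismatch flagged in these comments can be reproduced in isolation. The sketch below is illustrative only: it writes a temporary `checkpoint` pointer file with a trailing newline (an assumption about the file's content, not something confirmed by the PR) and compares it against the expected name three ways.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    pointer = Path(tmp) / "checkpoint"
    # Assumed content: the checkpoint name followed by a newline.
    pointer.write_text("model.ckpt-0.pt\n")
    raw = pointer.read_text()

expected = Path("model.ckpt-0.pt")

# A Path never compares equal to a str, so a mixed comparison always fails.
mixed_equal = expected == raw

# Path against Path, after stripping surrounding whitespace.
path_equal = expected == Path(raw.strip())

# Alternatively, str against str.
str_equal = str(expected) == raw.strip()
```

Either consistent form works; the Path-to-Path form also normalizes things like redundant separators, which may or may not matter for a given test.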

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: c57bc0f6-4dcf-4067-87e2-99022da10b56

📥 Commits

Reviewing files that changed from the base of the PR and between e679b8d and ef84d6c.

📒 Files selected for processing (4)

deepmd/pd/train/training.py
deepmd/pt/train/training.py
source/tests/pd/test_training.py
source/tests/pt/test_training.py

Copilot

Pull request overview

This PR fixes zero-step training when change_bias_after_training is enabled for PyTorch and Paddle, ensuring the initial checkpoint path remains valid without running post-training bias adjustment.

Changes:

Adds a num_steps > start_step guard before bias adjustment in PT/PD trainers.
Adds regression tests for zero-step training with bias adjustment enabled.
Verifies saved checkpoint metadata reports step=0 and lr=0.0.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

| File | Description |
|---|---|
| `deepmd/pt/train/training.py` | Skips PT bias adjustment when no training step ran. |
| `deepmd/pd/train/training.py` | Skips Paddle bias adjustment when no training step ran. |
| `source/tests/pt/test_training.py` | Adds PT regression coverage for zero-step checkpoint save. |
| `source/tests/pd/test_training.py` | Adds Paddle regression coverage for zero-step checkpoint save. |

codecov · 2026-05-30T07:02:38Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 82.16%. Comparing base (e679b8d) to head (1273e6e).
⚠️ Report is 48 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #5477      +/-   ##
==========================================
- Coverage   82.25%   82.16%   -0.09%     
==========================================
  Files         833      896      +63     
  Lines       89100   102586   +13486     
  Branches     4225     4339     +114     
==========================================
+ Hits        73290    84291   +11001     
- Misses      14518    16958    +2440     
- Partials     1292     1337      +45

njzjz-bot · 2026-05-30T07:55:44Z

Thanks, fixed in 631039c:

compare the checkpoint pointer as a Path after stripping the file content
add timeout guards to the zero-step training regression tests

Validation:

uvx ruff check deepmd/pd/train/training.py deepmd/pt/train/training.py source/tests/pd/test_training.py source/tests/pt/test_training.py
uvx ruff format --check deepmd/pd/train/training.py deepmd/pt/train/training.py source/tests/pd/test_training.py source/tests/pt/test_training.py

Authored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5)

njzjz-bot · 2026-05-30T08:05:40Z

Update: I kept the checkpoint pointer assertion fix but intentionally removed the added timeout guards in d27334c.

This regression covers numb_steps=0, so it verifies the no-op path and should not enter the training loop. A training-test timeout is useful for tests that actually run optimization, but it adds noise here, especially for the Paddle test file where no timeout helper existed.

Authored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5)

njzjz-bot · 2026-05-30T09:58:23Z

Fixed the failing tests in 3d7168f.

The fixtures configure training.save_ckpt as model, so the zero-step checkpoint is model-0.{pt,pd}, not model.ckpt-0.{pt,pd}. The tests now derive the expected path from trainer.save_ckpt and still verify that the checkpoint pointer and saved metadata are correct.
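Deriving the expected file name from the configured prefix, rather than hard-coding it, might look like the sketch below. The helper `expected_zero_step_checkpoint` is hypothetical and exists only for illustration; the `<prefix>-0.<ext>` naming is taken from the comment above.

```python
from pathlib import Path


def expected_zero_step_checkpoint(save_ckpt: str, suffix: str) -> Path:
    """Hypothetical helper: build '<save_ckpt>-0<suffix>' for the zero-step checkpoint."""
    return Path(f"{save_ckpt}-0{suffix}")


# With the fixture prefix "model", the names match those stated in the comment.
pt_path = expected_zero_step_checkpoint("model", ".pt")
pd_path = expected_zero_step_checkpoint("model", ".pd")

# A different prefix produces a different name, which is why hard-coding failed.
other_path = expected_zero_step_checkpoint("model.ckpt", ".pt")
```

Building the path from the trainer's own setting keeps the assertion valid if a fixture changes its checkpoint prefix.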

Validation:

uvx ruff check deepmd/pd/train/training.py deepmd/pt/train/training.py source/tests/pd/test_training.py source/tests/pt/test_training.py
uvx ruff format --check deepmd/pd/train/training.py deepmd/pt/train/training.py source/tests/pd/test_training.py source/tests/pt/test_training.py

Authored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5)

wanghan-iapcm · 2026-06-18T09:26:23Z

+        self.assertEqual(0, train_infos["step"])
+        self.assertEqual(0.0, train_infos["lr"])


This regression test passes on the unfixed code, so it does not guard the bug it targets. With numb_steps=0, the pre-existing if self.num_steps == 0: block re-saves <ckpt>-0.pt with step=0, lr=0 and rewrites the checkpoint pointer after the bias block runs. So every assertion here — file exists, latest_model == -0, pointer content, train_infos["step"]==0, train_infos["lr"]==0.0 — holds whether or not the num_steps > start_step guard is applied.

The only behavior the fix actually changes is whether model_change_out_bias mutates the output bias, which this test never checks. Verified: reverting the guard in training.py while keeping this test still yields 1 passed.

Suggest mirroring test_ema_checkpoint_keeps_changed_out_bias — patch model_change_out_bias and assert it is not called (or assert out_bias is unchanged) when numb_steps=0. That makes it a genuine regression test.

Fixed in 1273e6e. The PT zero-step regression test now patches deepmd.pt.train.training.model_change_out_bias, returns the original model if it is called, and asserts assert_not_called() after trainer.run(). This makes the test fail if the num_steps > start_step guard is removed, while keeping the existing checkpoint metadata assertions.

Validation:

pytest source/tests/pt/test_training.py::TestEnergyModelSeA::test_zero_step_with_change_bias_saves_initial_checkpoint -v

uvx ruff check .

uvx ruff format --check .

wanghan-iapcm · 2026-06-18T09:26:25Z

+        self.assertEqual(0, train_infos["step"])
+        self.assertEqual(0.0, train_infos["lr"])


Same issue as the PT test: this passes on the unfixed code. With numb_steps=0, the pre-existing if self.num_steps == 0: block re-saves <ckpt>-0.pd with step=0, lr=0 and rewrites the checkpoint pointer after the bias block, so the assertions here (existence, latest_model, pointer, step==0, lr==0.0) are satisfied regardless of the num_steps > start_step guard. The only thing the fix changes — whether model_change_out_bias runs — is never asserted.

Suggest patching model_change_out_bias and asserting it is not called for numb_steps=0. Note the PD suite also has no test exercising the true branch (bias adjustment running for numb_steps>0), unlike PT's test_ema_checkpoint_keeps_changed_out_bias.

Fixed in 1273e6e. The Paddle zero-step regression test now patches deepmd.pd.train.training.model_change_out_bias, returns the original model if it is called, and asserts assert_not_called() after trainer.run(). This directly checks the behavior changed by the guard instead of only checking the checkpoint rewrite path.

Validation:

uvx ruff check .

uvx ruff format --check .

I could not run the Paddle test locally because this environment is missing paddle (ModuleNotFoundError: No module named paddle).

iProzd

LGTM. The guard is correct, and after 1273e6e the zero-step test patches model_change_out_bias and asserts it isn't called, so it now genuinely fails if the fix is reverted. CI is green — approving.

dosubot Bot added the bug label May 30, 2026

github-actions Bot added the Python label May 30, 2026

coderabbitai Bot reviewed May 30, 2026

njzjz requested a review from Copilot May 30, 2026 06:50

Copilot started reviewing on behalf of njzjz May 30, 2026 06:50

Copilot AI reviewed May 30, 2026

test(train): fix zero-step checkpoint assertions

631039c

Compare checkpoint pointers as paths and add timeout guards to zero-step training regression tests. Authored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5)

njzjz-bot added 2 commits May 30, 2026 08:03

Revert "test(train): fix zero-step checkpoint assertions"

5da1412

This reverts commit 631039c.

test(train): fix zero-step checkpoint assertions

d27334c

Compare checkpoint pointers as paths without adding timeout guards, since the regression covers the zero-step no-op path. Authored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5)

test(train): use configured zero-step checkpoint path

3d7168f

Build the expected zero-step checkpoint path from trainer.save_ckpt so the regression follows each test fixture's configured checkpoint prefix. Authored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5)

njzjz marked this pull request as draft June 1, 2026 06:10

test(train): pass string path to paddle.load

c8b454d

Paddle's load helper treats pathlib.Path as a buffer on the tested version, so pass the checkpoint path as a string in the zero-step regression test. Authored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5)

njzjz marked this pull request as ready for review June 17, 2026 16:24

njzjz requested review from iProzd and wanghan-iapcm June 17, 2026 16:25

wanghan-iapcm reviewed Jun 18, 2026

test(train): assert zero-step skips bias adjustment

1273e6e

iProzd approved these changes Jun 18, 2026

njzjz requested a review from wanghan-iapcm June 19, 2026 06:58
