fix(train): allow zero-step training with bias adjustment #5477

Open
njzjz-bot wants to merge 7 commits into deepmodeling:master from njzjz-bothub:fix-4988-zero-step-training
Conversation

@njzjz-bot

@njzjz-bot njzjz-bot commented May 30, 2026

Contributor

Problem

  • numb_steps=0 is a valid no-optimization path that should save the initial checkpoint.
  • When change_bias_after_training is enabled, the post-training bias adjustment still ran after zero steps and evaluated learning-rate/checkpoint metadata at step -1.

Change

  • Skip post-training bias adjustment unless at least one training step has run.
  • Keep the existing zero-step initial checkpoint save path for both PyTorch and Paddle backends.
  • Add PT/PD regression tests that run zero-step training with change_bias_after_training=true and verify the saved *-0 checkpoint metadata.
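
The ordering described above can be sketched in isolation. This is a minimal illustration, not the code in deepmd/pt/train/training.py: `SketchTrainer`, `bias_adjusted`, and `saved_steps` are hypothetical names, and only the `num_steps > start_step` condition and the zero-step save path are taken from this PR.

```python
class SketchTrainer:
    """Hypothetical stand-in for the real Trainer; only the guard is modelled."""

    def __init__(self, num_steps, start_step=0, change_bias_after_training=True, rank=0):
        self.num_steps = num_steps
        self.start_step = start_step
        self.change_bias_after_training = change_bias_after_training
        self.rank = rank
        self.bias_adjusted = False  # records whether the bias block ran
        self.saved_steps = []       # records zero-step checkpoint saves

    def run(self):
        for _step in range(self.start_step, self.num_steps):
            pass  # an optimization step would run here

        # Post-training bias adjustment: rank 0 only, and only if at least
        # one training step was requested beyond the starting step.
        if (
            self.change_bias_after_training
            and self.rank == 0
            and self.num_steps > self.start_step
        ):
            self.bias_adjusted = True

        # Existing zero-step path: save the initial checkpoint as step 0.
        if self.num_steps == 0:
            self.saved_steps.append(0)


zero = SketchTrainer(num_steps=0)
zero.run()
three = SketchTrainer(num_steps=3)
three.run()
```

With the guard in place, the zero-step run skips the bias block and still records the step-0 save, while the three-step run adjusts the bias as before.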

Notes

  • python3 -m pytest ... could not run in this workspace because pytest is not installed in the available Python environment.
  • uvx ruff check deepmd/pd/train/training.py deepmd/pt/train/training.py source/tests/pd/test_training.py source/tests/pt/test_training.py passed.
  • uvx ruff format --check deepmd/pd/train/training.py deepmd/pt/train/training.py source/tests/pd/test_training.py source/tests/pt/test_training.py passed.
  • Closes #4988 (Runtime Error when Step is 0).

Authored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5)

Summary by CodeRabbit

  • Bug Fixes

    • Prevented unintended bias-adjustment during zero-step PyTorch training so the initial checkpoint is created and recorded correctly.
  • Refactor

    • Clarified the post-training bias-adjustment conditional in Paddle for readability (no behavior change).
  • Tests

    • Added tests for zero-step training with bias-adjustment enabled for both Paddle and PyTorch, verifying initial checkpoint creation and training metadata.

Skip post-training bias adjustment when no training step has run, so zero-step jobs can keep the existing initial-checkpoint behavior without evaluating step -1 learning rates.

Authored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5)
@dosubot dosubot Bot added the bug label May 30, 2026
@coderabbitai

coderabbitai Bot commented May 30, 2026

Contributor

Caution

Review failed

An error occurred during the review process. Please try again later.

Walkthrough

This PR adds a step-count guard to the PyTorch trainer's post-training bias-change block, reformats the corresponding Paddle trainer conditional, and adds tests for zero-step training that verify initial checkpoint creation and metadata.

Changes

Trainer Zero-Step Guard and Test Coverage

| Layer / File(s) | Summary |
| --- | --- |
| PyTorch trainer step-count guard for bias-change block (deepmd/pt/train/training.py) | The change_bias_after_training conditional block at the end of Trainer.run() now requires self.num_steps > self.start_step in addition to the rank-0 check, preventing execution when the training loop performed zero steps. |
| Paddle trainer post-training block formatting (deepmd/pd/train/training.py) | Reformatted the change_bias_after_training conditional in Paddle's Trainer.run() to span multiple lines, preserving the identical rank-0 check logic. |
| Zero-step training test coverage, Paddle (source/tests/pd/test_training.py) | Adds import paddle and test_zero_step_with_change_bias_saves_initial_checkpoint which runs zero-step training with change_bias_after_training=True, asserts trainer.save_ckpt-0.pd is created and matches trainer.latest_model, and verifies _extra_state.train_infos.step == 0 and lr == 0.0. |
| Zero-step training test coverage, PyTorch (source/tests/pt/test_training.py) | Adds test_zero_step_with_change_bias_saves_initial_checkpoint which runs zero-step training with change_bias_after_training=True, asserts trainer.save_ckpt-0.pt is created and is the latest model, checks the checkpoint pointer file, and verifies model._extra_state.train_infos.step == 0 and lr == 0.0. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

  • Chengqian-Zhang
  • iProzd
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: enabling zero-step training with bias adjustment, which is the core functional fix in this PR. |
| Linked Issues check | ✅ Passed | The PR addresses issue #4988 by allowing zero-step training with bias adjustment; it skips post-training bias adjustment when no steps have run and saves the initial checkpoint correctly. |
| Out of Scope Changes check | ✅ Passed | All changes directly address the zero-step training issue: logic updates to Trainer.run() in both backends and regression tests verifying the fix align with issue #4988 requirements. |

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@source/tests/pd/test_training.py`:
- Around line 167-181: This test method
test_zero_step_with_change_bias_saves_initial_checkpoint runs training and needs
a 60s timeout to prevent CI hangs; add a pytest timeout decorator to the method
(e.g. `@pytest.mark.timeout(60)`) and ensure pytest is imported in the test module
so the decorator is available; locate the method by name in the test class in
test_training.py and place the decorator immediately above the def to enforce
the <=60s limit.
- Line 176: The assertion compares a Path object to a raw string
(Path("model.ckpt-0.pd") vs Path("checkpoint").read_text()), causing spurious
failures; change the test to compare Path to Path by wrapping the read text as a
Path and stripping whitespace/newline: replace the RHS with
Path(Path("checkpoint").read_text().strip()) so the assertion becomes
self.assertEqual(Path("model.ckpt-0.pd"),
Path(Path("checkpoint").read_text().strip())). This ensures both sides are Path
objects and ignores trailing newlines.

In `@source/tests/pt/test_training.py`:
- Around line 266-282: Add the 60s timeout decorator to the test function by
annotating test_zero_step_with_change_bias_saves_initial_checkpoint with
`@TRAINING_TEST_TIMEOUT` (place the decorator immediately above the def). If
TRAINING_TEST_TIMEOUT is not in scope in that module, import it where other test
helpers are imported so the symbol is available before use; keep the rest of the
test unchanged.
- Line 275: The assertion mixes a Path object and a raw string; change the
comparison so both sides use the same type and strip any newline: replace the
RHS Path("checkpoint").read_text() with
Path(Path("checkpoint").read_text().strip()) (or alternatively compare
str(Path("model.ckpt-0.pt")) to Path("checkpoint").read_text().strip()) so the
call in self.assertEqual compares two strings or two Path objects consistently.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: c57bc0f6-4dcf-4067-87e2-99022da10b56

📥 Commits

Reviewing files that changed from the base of the PR and between e679b8d and ef84d6c.

📒 Files selected for processing (4)
  • deepmd/pd/train/training.py
  • deepmd/pt/train/training.py
  • source/tests/pd/test_training.py
  • source/tests/pt/test_training.py

Comment thread source/tests/pd/test_training.py Outdated
Comment thread source/tests/pd/test_training.py Outdated
Comment thread source/tests/pt/test_training.py Outdated
Comment thread source/tests/pt/test_training.py Outdated

Copilot AI left a comment

Contributor

Pull request overview

This PR fixes zero-step training when change_bias_after_training is enabled for PyTorch and Paddle, ensuring the initial checkpoint path remains valid without running post-training bias adjustment.

Changes:

  • Adds a num_steps > start_step guard before bias adjustment in PT/PD trainers.
  • Adds regression tests for zero-step training with bias adjustment enabled.
  • Verifies saved checkpoint metadata reports step=0 and lr=0.0.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| deepmd/pt/train/training.py | Skips PT bias adjustment when no training step ran. |
| deepmd/pd/train/training.py | Skips Paddle bias adjustment when no training step ran. |
| source/tests/pt/test_training.py | Adds PT regression coverage for zero-step checkpoint save. |
| source/tests/pd/test_training.py | Adds Paddle regression coverage for zero-step checkpoint save. |

Comment thread source/tests/pt/test_training.py Outdated
Comment thread source/tests/pd/test_training.py Outdated
Comment thread source/tests/pt/test_training.py Outdated
Comment thread source/tests/pd/test_training.py Outdated
@codecov

codecov Bot commented May 30, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 82.16%. Comparing base (e679b8d) to head (1273e6e).
⚠️ Report is 48 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5477      +/-   ##
==========================================
- Coverage   82.25%   82.16%   -0.09%     
==========================================
  Files         833      896      +63     
  Lines       89100   102586   +13486     
  Branches     4225     4339     +114     
==========================================
+ Hits        73290    84291   +11001     
- Misses      14518    16958    +2440     
- Partials     1292     1337      +45     

Compare checkpoint pointers as paths and add timeout guards to zero-step training regression tests.

Authored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5)
@njzjz-bot

Contributor Author

Thanks, fixed in 631039c:

  • compare the checkpoint pointer as a Path after stripping the file content
  • add timeout guards to the zero-step training regression tests
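
Why the pointer comparison needed this change can be shown with a small standalone sketch. The file name `model-0.pt` and the trailing newline in the pointer file are assumptions made for illustration; the tests in this PR read a file named `checkpoint` in the working directory.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical pointer file whose content names the latest checkpoint.
    pointer = Path(tmp) / "checkpoint"
    pointer.write_text("model-0.pt\n")  # trailing newline is an assumption

    raw = pointer.read_text()

    # A Path never compares equal to a str, whatever the content is.
    path_vs_str = Path("model-0.pt") == raw

    # Strip whitespace and wrap in Path so both sides have the same type.
    path_vs_path = Path("model-0.pt") == Path(raw.strip())
```

Here `path_vs_str` is false and `path_vs_path` is true, which is the difference between the original assertion and the corrected one.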

Validation:

  • uvx ruff check deepmd/pd/train/training.py deepmd/pt/train/training.py source/tests/pd/test_training.py source/tests/pt/test_training.py
  • uvx ruff format --check deepmd/pd/train/training.py deepmd/pt/train/training.py source/tests/pd/test_training.py source/tests/pt/test_training.py

Authored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5)

njzjz-bot added 2 commits May 30, 2026 08:03
Compare checkpoint pointers as paths without adding timeout guards, since the regression covers the zero-step no-op path.

Authored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5)
@njzjz-bot

Contributor Author

Update: I kept the checkpoint pointer assertion fix but intentionally removed the added timeout guards in d27334c.

This regression covers numb_steps=0, so it verifies the no-op path and should not enter the training loop. A training-test timeout is useful for tests that actually run optimization, but it adds noise here, especially for the Paddle test file where no timeout helper existed.

Authored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5)

Build the expected zero-step checkpoint path from trainer.save_ckpt so the regression follows each test fixture's configured checkpoint prefix.

Authored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5)
@njzjz-bot

Contributor Author

Fixed the failing tests in 3d7168f.

The fixtures configure training.save_ckpt as model, so the zero-step checkpoint is model-0.{pt,pd}, not model.ckpt-0.{pt,pd}. The tests now derive the expected path from trainer.save_ckpt and still verify that the checkpoint pointer and saved metadata are correct.
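
A minimal sketch of deriving the expected path from the configured prefix instead of hard-coding it. The helper name `expected_initial_checkpoint` is hypothetical and is not part of the test suite; only the `<save_ckpt>-0.{pt,pd}` naming comes from the description above.

```python
from pathlib import Path


def expected_initial_checkpoint(save_ckpt: str, suffix: str) -> Path:
    """Return the step-0 checkpoint path for a given prefix (hypothetical helper)."""
    # The zero-step checkpoint is named "<prefix>-0<suffix>".
    return Path(f"{save_ckpt}-0{suffix}")


# Prefix configured by the fixtures in this PR.
pt_path = expected_initial_checkpoint("model", ".pt")
pd_path = expected_initial_checkpoint("model", ".pd")

# A different prefix gives a different file name, so a hard-coded
# expectation only holds for one configuration.
other_path = expected_initial_checkpoint("model.ckpt", ".pt")
```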

Validation:

  • uvx ruff check deepmd/pd/train/training.py deepmd/pt/train/training.py source/tests/pd/test_training.py source/tests/pt/test_training.py
  • uvx ruff format --check deepmd/pd/train/training.py deepmd/pt/train/training.py source/tests/pd/test_training.py source/tests/pt/test_training.py

Authored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5)

@njzjz njzjz marked this pull request as draft June 1, 2026 06:10
Paddle's load helper treats pathlib.Path as a buffer on the tested version, so pass the checkpoint path as a string in the zero-step regression test.

Authored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5)
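
The kind of dispatch that makes this conversion necessary can be illustrated without Paddle installed. `load_like` below is a hypothetical loader written for this sketch; it is not Paddle's implementation, and the `isinstance(..., str)` check is an assumption about how such a helper might decide between a path and a buffer.

```python
from pathlib import Path


def load_like(path_or_buffer):
    """Hypothetical loader that treats only `str` as a filesystem path."""
    if isinstance(path_or_buffer, str):
        return ("path", path_or_buffer)
    # Anything else, including pathlib.Path, falls through to the buffer branch.
    return ("buffer", path_or_buffer)


ckpt = Path("model-0.pd")
kind_from_path, _ = load_like(ckpt)            # Path goes to the buffer branch
kind_from_str, value = load_like(str(ckpt))    # str goes to the path branch
```

Converting with `str(...)` at the call site keeps the test independent of how a particular loader version handles `Path` objects.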
@njzjz njzjz marked this pull request as ready for review June 17, 2026 16:24
@njzjz njzjz requested review from iProzd and wanghan-iapcm June 17, 2026 16:25
Comment on lines +282 to +283
self.assertEqual(0, train_infos["step"])
self.assertEqual(0.0, train_infos["lr"])

Collaborator

This regression test passes on the unfixed code, so it does not guard the bug it targets. With numb_steps=0, the pre-existing if self.num_steps == 0: block re-saves <ckpt>-0.pt with step=0, lr=0 and rewrites the checkpoint pointer after the bias block runs. So every assertion here — file exists, latest_model == -0, pointer content, train_infos["step"]==0, train_infos["lr"]==0.0 — holds whether or not the num_steps > start_step guard is applied.

The only behavior the fix actually changes is whether model_change_out_bias mutates the output bias, which this test never checks. Verified: reverting the guard in training.py while keeping this test still yields 1 passed.

Suggest mirroring test_ema_checkpoint_keeps_changed_out_bias — patch model_change_out_bias and assert it is not called (or assert out_bias is unchanged) when numb_steps=0. That makes it a genuine regression test.
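
The suggested pattern can be sketched with `unittest.mock` on a stand-in module. Everything here except the name `model_change_out_bias` and the `num_steps > start_step` condition is hypothetical: `training` is a `SimpleNamespace` used in place of the real module, and `run` is a reduced version of the trainer's post-training block.

```python
from types import SimpleNamespace
from unittest import mock

# Hypothetical stand-in for the training module that owns the bias helper.
training = SimpleNamespace(model_change_out_bias=lambda model: model)


def run(num_steps, start_step=0):
    """Reduced post-training block: adjust the bias only if a step ran."""
    model = {"out_bias": 0.0}
    if num_steps > start_step:
        model = training.model_change_out_bias(model)
    return model


with mock.patch.object(
    training, "model_change_out_bias", side_effect=lambda m: m
) as patched:
    run(num_steps=0)
    patched.assert_not_called()  # raises AssertionError if the guard is removed
    zero_step_calls = patched.call_count

    run(num_steps=2)
    two_step_calls = patched.call_count
```

Unlike the checkpoint metadata assertions, `assert_not_called()` depends directly on the guard, so removing the `num_steps > start_step` condition makes this check fail.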

Member

Fixed in 1273e6e. The PT zero-step regression test now patches deepmd.pt.train.training.model_change_out_bias, returns the original model if it is called, and asserts assert_not_called() after trainer.run(). This makes the test fail if the num_steps > start_step guard is removed, while keeping the existing checkpoint metadata assertions.

Validation:

  • pytest source/tests/pt/test_training.py::TestEnergyModelSeA::test_zero_step_with_change_bias_saves_initial_checkpoint -v
  • uvx ruff check .
  • uvx ruff format --check .

Comment on lines +183 to +184
self.assertEqual(0, train_infos["step"])
self.assertEqual(0.0, train_infos["lr"])

Collaborator

Same issue as the PT test: this passes on the unfixed code. With numb_steps=0, the pre-existing if self.num_steps == 0: block re-saves <ckpt>-0.pd with step=0, lr=0 and rewrites the checkpoint pointer after the bias block, so the assertions here (existence, latest_model, pointer, step==0, lr==0.0) are satisfied regardless of the num_steps > start_step guard. The only thing the fix changes — whether model_change_out_bias runs — is never asserted.

Suggest patching model_change_out_bias and asserting it is not called for numb_steps=0. Note the PD suite also has no test exercising the true branch (bias adjustment running for numb_steps>0), unlike PT's test_ema_checkpoint_keeps_changed_out_bias.

Member

Fixed in 1273e6e. The Paddle zero-step regression test now patches deepmd.pd.train.training.model_change_out_bias, returns the original model if it is called, and asserts assert_not_called() after trainer.run(). This directly checks the behavior changed by the guard instead of only checking the checkpoint rewrite path.

Validation:

  • uvx ruff check .
  • uvx ruff format --check .

I could not run the Paddle test locally because this environment is missing paddle (ModuleNotFoundError: No module named paddle).

@iProzd iProzd left a comment

Member

LGTM. The guard is correct, and after 1273e6e the zero-step test patches model_change_out_bias and asserts it isn't called, so it now genuinely fails if the fix is reverted. CI is green — approving.

@njzjz njzjz requested a review from wanghan-iapcm June 19, 2026 06:58
Development

Successfully merging this pull request may close these issues.

Runtime Error when Step is 0

5 participants