fix: Support CP2K 2025 output format for energy and forces (fixes #850) #978

Open: newtontech wants to merge 5 commits into deepmodeling:master from newtontech:fix-cp2k-2025-format-v2

Conversation

@newtontech newtontech commented Jun 18, 2026

This is a recreation of #947 which was closed because the head repository was deleted.

Adds support for parsing CP2K 2025 version output files.

Changes in CP2K 2025 format:

  1. Energy line format changed from `ENERGY| Total FORCE_EVAL ( QS ) energy (a.u.):` to `ENERGY| Total FORCE_EVAL ( QS ) energy [hartree]`
  2. Forces output format changed from the `ATOMIC FORCES in [a.u.]` table to `FORCES| Atomic forces [hartree/bohr]` with `FORCES| Atom x y z |f|` prefix lines

Implementation:

  • Detect CP2K 2025 format by checking for 'energy [hartree]' in the content
  • Parse energy from new '[hartree]' format
  • Parse forces from new 'FORCES|' prefixed lines
  • Maintain backward compatibility with CP2K 2023 format
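
To illustrate the detection and energy steps listed above, here is a minimal sketch. It is not the code in `dpdata/formats/cp2k/output.py`; the helper names `detect_cp2k_2025` and `parse_energy_hartree` are hypothetical, and taking the last whitespace-separated token as the energy is an assumption about how such a line could be read. The two sample lines are the ones quoted in the commit message.

```python
def detect_cp2k_2025(content: str) -> bool:
    """Hypothetical helper: flag 2025-style output by its energy header."""
    return "energy [hartree]" in content


def parse_energy_hartree(line: str) -> float:
    """Hypothetical helper: read the energy as the last token on the line."""
    tokens = line.split()
    try:
        # Assumption: the numeric value is the final whitespace-separated token.
        return float(tokens[-1])
    except (IndexError, ValueError) as exc:
        raise RuntimeError(
            f"Cannot parse energy from CP2K 2025 output: {line!r}"
        ) from exc


old_line = "ENERGY| Total FORCE_EVAL ( QS ) energy (a.u.): -7.997403996236343"
new_line = "ENERGY| Total FORCE_EVAL ( QS ) energy [hartree] -7.364190264587725"

print(detect_cp2k_2025(old_line), detect_cp2k_2025(new_line))
print(parse_energy_hartree("  " + new_line + "   "))
```

Splitting on whitespace rather than slicing fixed columns is what makes a sketch like this tolerant of extra leading or trailing spaces, which is the case the edge-case tests target.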

Review comments addressed since #947:

  • Raise clear RuntimeError when energy cannot be parsed from CP2K 2025 format
  • Fix literal \n in test fixture
  • Replace truthiness checks with explicit None checks in test helper
  • Add numeric value assertions to edge case tests
  • Add force value assertions to header filtering tests
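
For context on the truthiness item above, the following sketch shows why an explicit `None` check matters in a helper that accepts optional overrides. The function names and the default line are hypothetical and do not come from the PR's test helper; the point is only that an empty string is falsy, so a truthiness check silently replaces it with the default.

```python
from typing import Optional

# Hypothetical default used only for this illustration.
DEFAULT_LINE = " ENERGY| Total FORCE_EVAL ( QS ) energy [hartree] -1.0"


def pick_line_truthy(override: Optional[str]) -> str:
    """Buggy variant: an intentionally empty override is discarded."""
    return override if override else DEFAULT_LINE


def pick_line_explicit(override: Optional[str]) -> str:
    """Fixed variant: only a missing override (None) falls back to the default."""
    return override if override is not None else DEFAULT_LINE


print(repr(pick_line_truthy("")))    # falls back to the default line
print(repr(pick_line_explicit("")))  # keeps the empty override
```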

Testing:

  • Added test file for CP2K 2025 format (tests/cp2k/cp2k_2025_output/)
  • Added regression test for CP2K 2023 backward compatibility
  • Added edge case tests for whitespace, header lines, and atomic forces variants
  • All 110 CP2K tests pass
  • Previously approved by @wanghan-iapcm

Summary by CodeRabbit

Release Notes

  • New Features

    • CP2K 2025 output format is now supported with enhanced energy and force data extraction capabilities
  • Tests

    • Comprehensive test coverage added for CP2K 2025 format parsing, including edge cases and backward compatibility validation for earlier CP2K versions

OpenClaw Bot and others added 5 commits June 18, 2026 23:10
…pmodeling#850)

This commit adds support for parsing CP2K 2025 version output files:

**Changes in CP2K 2025 format:**
1. Energy line format changed from:
   'ENERGY| Total FORCE_EVAL ( QS ) energy (a.u.): -7.997403996236343'
   to:
   'ENERGY| Total FORCE_EVAL ( QS ) energy [hartree] -7.364190264587725'

2. Forces output format changed from:
   'ATOMIC FORCES in [a.u.]' table with ' Atom   Kind   Element X Y Z' header
   to:
   'FORCES| Atomic forces [hartree/bohr]' with 'FORCES| Atom x y z |f|' prefix lines

**Implementation:**
- Detect CP2K 2025 format by checking for 'energy [hartree]' in the content
- Parse energy from new '[hartree]' format
- Parse forces from new 'FORCES|' prefixed lines
- Maintain backward compatibility with CP2K 2023 format

**Testing:**
- Added test file for CP2K 2025 format (tests/cp2k/cp2k_2025_output/)
- Added test case TestCp2k2025Output to verify parsing
- Added regression test TestCp2k2023FormatStillWorks to ensure backward compatibility
- All existing CP2K tests pass
- Add tests for energy parsing with extra whitespace
- Add tests for FORCES| header line filtering (Atom x y z, Atomic forces)
- Add integration test for CP2K 2025 format with LabeledSystem
- Improve code coverage for CP2K 2025 format support
- Raise clear RuntimeError when energy cannot be parsed from CP2K 2025 line
- Fix literal backslash-n in test fixture line 71
- Replace truthiness checks with explicit None checks in test helper
- Add numeric value assertions to edge case tests
- Add force value assertions to header filtering tests
dosubot Bot added the labels size:M (This PR changes 30-99 lines, ignoring generated files.), cp2k, dpdata, and enhancement (New feature or request) on Jun 18, 2026
coderabbitai Bot commented Jun 18, 2026

Walkthrough

`get_frames` in the CP2K output parser gains an `is_cp2k_2025` flag, set by detecting `energy [hartree]` in the file content. Energy and force parsing then branch on this flag: the 2025 path uses token-based extraction and `FORCES|` line scanning, while the prior fixed-field/state-machine path is kept for older formats. A fixture file and a new test module with integration, regression, and edge-case tests are added.
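
The `FORCES|` line scanning described above could look roughly like the sketch below. It is an illustration under stated assumptions, not the parser in this PR: the function name `parse_forces_block` is hypothetical, and the data-row layout (atom index, x, y, z, then |f|) and the sample values are assumed from the `FORCES| Atom x y z |f|` header quoted in the description.

```python
from typing import List


def parse_forces_block(lines: List[str]) -> List[List[float]]:
    """Hypothetical sketch: collect x, y, z from 'FORCES|' prefixed data rows."""
    prefix = "FORCES|"
    forces = []
    for line in lines:
        stripped = line.strip()
        if not stripped.startswith(prefix):
            continue
        fields = stripped[len(prefix):].split()
        # Assumed row layout: index x y z |f|. Header lines such as
        # 'Atomic forces [hartree/bohr]' and 'Atom x y z |f|' fail this check.
        if len(fields) != 5 or not fields[0].isdigit():
            continue
        forces.append([float(value) for value in fields[1:4]])
    return forces


# Made-up sample values, for illustration only.
sample = [
    " FORCES| Atomic forces [hartree/bohr]",
    " FORCES| Atom x y z |f|",
    " FORCES|    1   0.1  -0.2   0.3  0.374",
    " FORCES|    2  -0.1   0.2  -0.3  0.374",
]
print(parse_forces_block(sample))
```

Filtering on the row shape instead of on specific header strings keeps the header handling in one place, at the cost of silently skipping any row that does not match the assumed layout.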

Changes

CP2K 2025 format support

Layer / File(s) Summary
Format detection and energy/force parsing
dpdata/formats/cp2k/output.py
Adds is_cp2k_2025 detection via energy [hartree] header check, then branches energy extraction to a token/float-fallback path and force extraction to `FORCES
CP2K 2025 test fixture
tests/cp2k/cp2k_2025_output/cp2k_2025_output, tests/cp2k/cp2k_2025_output/deepmd/type.raw, tests/cp2k/cp2k_2025_output/deepmd/type_map.raw
Adds a complete CP2K 2025 run transcript (banner through timing footer) and matching deepmd reference type files used by integration tests.
Test module
tests/test_cp2k_2025_output.py
Adds TestCp2k2025Output (energy and forces assertions against fixture), TestCp2k2023FormatStillWorks (regression guard), and TestCp2k2025EdgeCases (temporary-file tests for whitespace in energy lines and header skipping in force blocks).

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 70.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: adding CP2K 2025 output format support with backward compatibility, directly matching the PR's primary objective. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
dpdata/formats/cp2k/output.py (1)

405-483: ⚠️ Potential issue | 🟠 Major

Fix ruff linting errors in this file.

This file has 2 linting issues found by ruff check that must be fixed to comply with coding guidelines:

  • Line 118: Rename unused loop variable ii to _ii
  • Line 534: Prefix unused variable tmp_names with an underscore

While the code changes at lines 405-483 themselves appear compliant, the file contains linting violations elsewhere that must be resolved before committing.
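
As a generic illustration of the two renames requested above, the sketch below shows the underscore convention on made-up functions. The real code at lines 118 and 534 of `output.py` is not shown in this thread, so `count_steps` and `split_types` are hypothetical stand-ins.

```python
from typing import List, Tuple


def count_steps(n_steps: int) -> int:
    """Loop whose control variable is never read inside the body."""
    total = 0
    for _ii in range(n_steps):  # leading underscore marks it as unused (B007)
        total += 1
    return total


def split_types(pair: Tuple[List[int], List[str]]) -> List[int]:
    """Unpack a pair but use only the first element."""
    atom_types, _tmp_names = pair  # unused unpacked name prefixed (RUF059)
    return atom_types


print(count_steps(3))
print(split_types(([0, 0, 1], ["H", "O"])))
```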

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@dpdata/formats/cp2k/output.py` around lines 405 - 483, The file has two ruff
linting violations that need to be fixed: (1) on line 118, rename the unused
loop variable `ii` to `_ii` to indicate it is intentionally unused, and (2) on
line 534, prefix the unused variable `tmp_names` with an underscore to make it
`_tmp_names`. These changes follow Python naming conventions for variables that
are intentionally not used in the code.

Source: Coding guidelines

🧹 Nitpick comments (1)
tests/test_cp2k_2025_output.py (1)

11-212: ⚡ Quick win

Consider adding test for energy parsing error condition.

The parser raises a RuntimeError when energy parsing fails (dpdata/formats/cp2k/output.py:455-457), but there's no test coverage for this error path. Adding a test that provides a malformed energy line and asserts the expected exception would improve coverage.

🧪 Suggested test for error condition
def test_cp2k2025_energy_parsing_failure_raises_error(self):
    """Test that malformed energy line raises RuntimeError with clear message."""
    fname = self.create_cp2k_output_2025(
        energy_line=" ENERGY| Total FORCE_EVAL ( QS ) energy [hartree] invalid"
    )
    try:
        with self.assertRaises(RuntimeError) as cm:
            dpdata.LabeledSystem(fname, fmt="cp2k/output")
        self.assertIn("Cannot parse energy from CP2K 2025 output", str(cm.exception))
    finally:
        os.unlink(fname)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_cp2k_2025_output.py` around lines 11 - 212, Add a new test method
to the TestCp2k2025EdgeCases class to verify that the energy parser properly
raises a RuntimeError when encountering a malformed energy line. The test should
call create_cp2k_output_2025() with an energy_line parameter containing invalid
data (e.g., a non-numeric value where the energy should be), then use
assertRaises to verify that dpdata.LabeledSystem raises a RuntimeError when
attempting to parse the file, and optionally verify the error message contains
expected text like "Cannot parse energy from CP2K 2025 output". Remember to
clean up the temporary file in a finally block after the test completes.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 7409312a-3219-49a4-8a0e-ceb4c476fc28

📥 Commits

Reviewing files that changed from the base of the PR and between 1b63c9b and 62a3892.

📒 Files selected for processing (10)
  • dpdata/formats/cp2k/output.py
  • tests/cp2k/cp2k_2025_output/cp2k_2025_output
  • tests/cp2k/cp2k_2025_output/deepmd/set.000/box.npy
  • tests/cp2k/cp2k_2025_output/deepmd/set.000/coord.npy
  • tests/cp2k/cp2k_2025_output/deepmd/set.000/energy.npy
  • tests/cp2k/cp2k_2025_output/deepmd/set.000/force.npy
  • tests/cp2k/cp2k_2025_output/deepmd/set.000/virial.npy
  • tests/cp2k/cp2k_2025_output/deepmd/type.raw
  • tests/cp2k/cp2k_2025_output/deepmd/type_map.raw
  • tests/test_cp2k_2025_output.py

Comment on lines +405 to +406
# Check if this is CP2K 2025 format
is_cp2k_2025 = "energy [hartree]" in content

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Verify that "energy [hartree]" does not appear in CP2K 2023 format files

# Search for the detection string in CP2K 2023 test fixtures
rg -n 'energy \[hartree\]' tests/cp2k/ -g '!*2025*'

Repository: deepmodeling/dpdata

Length of output: 45


🏁 Script executed:

# Verify the detection string appears in CP2K 2025 format files
rg -n 'energy \[hartree\]' tests/cp2k/ -g '*2025*'

Repository: deepmodeling/dpdata

Length of output: 180


🏁 Script executed:

# Check ruff linting compliance for the file
cd dpdata && ruff check formats/cp2k/output.py

Repository: deepmodeling/dpdata

Length of output: 1286


🏁 Script executed:

# Inspect the code context around lines 405-406
sed -n '400,490p' dpdata/formats/cp2k/output.py | cat -n

Repository: deepmodeling/dpdata

Length of output: 4362


Fix ruff linting violations before committing.

The format detection via substring search works as intended—the string "energy [hartree]" correctly discriminates CP2K 2025 format from 2023 format and appears only in the former. However, the file has linting violations that must be resolved per the coding guidelines:

  • Line 118: Rename unused loop variable ii to _ii (B007)
  • Line 534: Prefix unused unpacked variable tmp_names with underscore (RUF059)

Run ruff check --fix dpdata/formats/cp2k/output.py to address these issues.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@dpdata/formats/cp2k/output.py` around lines 405 - 406, Fix the ruff linting
violations in the file dpdata/formats/cp2k/output.py by renaming the unused loop
variable ii to _ii on line 118 to comply with the B007 rule, and by prefixing
the unused unpacked variable tmp_names with an underscore to become _tmp_names
on line 534 to comply with the RUF059 rule. These changes follow the convention
of marking unused variables with a leading underscore to satisfy linting
requirements.
