fix(checkpoint): write consolidated safetensors without append by huahuajhu · Pull Request #2627 · NVIDIA-NeMo/Automodel

huahuajhu · 2026-06-18T05:07:06Z

What does this PR do ?

Fixes consolidated HF safetensors export to write each output shard in a single `wb` pass instead of writing metadata first and reopening the file in append mode.

Changelog

Add specific line-by-line info of high-level changes in this PR.

Before your PR is "Ready for review"

Pre checks:

Make sure you read and follow Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

Closes: staging-free consolidation for Databricks (#1092).

copy-pr-bot · 2026-06-18T05:07:10Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Copilot

Pull request overview

This PR updates the HuggingFace safetensors consolidation path so that it no longer reopens output shards in append mode. The safetensors header and payload are written in a single `wb` stream, which improves compatibility with filesystems that do not support append.

Changes:

- Refactors safetensors consolidation to compute header metadata/offsets and write header+tensor bytes in one `wb` pass per output shard (no `ab` reopen).
- Changes HF storage writer consolidation to only use staging when `staging_dir` is explicitly provided (direct consolidation by default).
- Updates unit tests and Databricks guide examples to reflect staging being optional.
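For readers unfamiliar with the layout, the single-pass idea can be sketched as follows. This is a minimal illustration of the safetensors file layout (an 8-byte little-endian header length, a JSON header with per-tensor `data_offsets`, then the raw tensor bytes), not the code in this PR. The function names `write_safetensors_single_pass` and `read_header` are hypothetical, and tensors are passed as raw bytes to keep the sketch dependency-free.

```python
import json
import struct
import tempfile
from pathlib import Path


def write_safetensors_single_pass(path, tensors):
    """Write `tensors` to `path` using exactly one open(..., "wb").

    `tensors` maps name -> (dtype, shape, raw_bytes). Hypothetical helper.
    """
    header = {}
    offset = 0
    # Step 1 (in memory): assign every tensor its byte range in the payload,
    # so the complete header is known before anything is written.
    for name, (dtype, shape, data) in tensors.items():
        header[name] = {
            "dtype": dtype,
            "shape": list(shape),
            "data_offsets": [offset, offset + len(data)],
        }
        offset += len(data)
    header_bytes = json.dumps(header, separators=(",", ":")).encode("utf-8")

    # Step 2: one sequential stream, with no seek and no reopen in "ab".
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(header_bytes)))  # header length (u64 LE)
        f.write(header_bytes)                          # JSON header
        for _dtype, _shape, data in tensors.values():  # payload, header order
            f.write(data)


def read_header(path):
    """Return (header_dict, payload_start_offset) for a file written above."""
    with open(path, "rb") as f:
        (n,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(n)), 8 + n


with tempfile.TemporaryDirectory() as tmp:
    shard = Path(tmp) / "model-00001-of-00001.safetensors"
    write_safetensors_single_pass(
        shard,
        {
            "a": ("F32", [2], struct.pack("<2f", 1.0, 2.0)),
            "b": ("I64", [1], struct.pack("<q", 7)),
        },
    )
    header, payload_start = read_header(shard)
    raw = shard.read_bytes()
```

Because all offsets are computed before the file is opened, the writer never needs to go back and patch the header, which is what makes a plain sequential `wb` stream sufficient.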

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Show a summary per file

| File | Description |
| --- | --- |
| `tests/unit_tests/checkpoint/test_consolidate_safetensors.py` | Adds regression tests for "no append-mode opens" and verifies direct consolidation defaults. |
| `nemo_automodel/components/checkpoint/config.py` | Clarifies `staging_dir` semantics in the checkpointing config comments. |
| `nemo_automodel/components/checkpoint/_backports/hf_storage.py` | Makes staging opt-in based on `staging_dir` presence for consolidation. |
| `nemo_automodel/components/checkpoint/_backports/consolidate_hf_safetensors.py` | Implements single-stream (`wb`) header+payload writing and removes append-mode usage. |
| `docs/guides/llm/databricks.mdx` | Removes `staging_dir` from example invocations and describes it as optional for consolidation. |
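The opt-in staging behavior described for `hf_storage.py` can be pictured roughly as below. This is only a sketch of the decision rule as summarized above. The helper name `resolve_consolidation_target` and its return shape are assumptions made for illustration and are not taken from the changed files. Treating only `None` as "not provided" is likewise an assumption.

```python
from typing import Optional, Tuple


def resolve_consolidation_target(
    output_dir: str, staging_dir: Optional[str] = None
) -> Tuple[str, bool]:
    """Return (directory_to_write_into, use_staging). Hypothetical helper."""
    # Staging is opt-in: it is used only when a staging_dir was passed.
    use_staging = staging_dir is not None
    # Without staging, shards are consolidated directly into output_dir.
    write_dir = staging_dir if use_staging else output_dir
    return write_dir, use_staging


# Default: direct consolidation into the final location.
direct = resolve_consolidation_target("/Volumes/example/checkpoints")
# Explicit staging: write locally first, then copy to the final location.
staged = resolve_consolidation_target("/Volumes/example/checkpoints", "/tmp/stage")
```

The paths in the usage lines are placeholders, not paths from this PR.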


akoumpa · 2026-06-18T19:31:33Z

Hi @huahuajhu, thank you for making this!

Since we don't have Databricks on our CI, I want to ask whether you have tested this on Databricks and what the difference in perf is (before and after). I'll try to find someone to review, but that will probably be next week.

Thank you.

huahuajhu · 2026-06-18T21:13:34Z

Thanks for the question. I tested the checkpoint consolidation path on Databricks using a Unity Catalog volume, since that is the filesystem behavior this PR changes.

Test environment:

Databricks workspace with Unity Catalog enabled
Catalogs available: samples, system, workspace
Test volume path: /Volumes/workspace/automodel_pr2627/checkpoints/automodel-pr2627
Created:
- schema: workspace.automodel_pr2627
- volume: workspace.automodel_pr2627.checkpoints
Verified the volume path exists from Python.

Test scope:

This was a CPU-only Databricks UC-volume smoke test for the safetensors consolidation writer.
It directly exercises consolidate_safetensors_files(..., use_staging=False) on a UC volume.
It does not measure end-to-end GPU training throughput yet.

Before:

Code: NVIDIA-NeMo/Automodel@main
Resolved commit: 83e4aad1ed49068c22f8ce527742e727215c0323
Test: write sharded safetensors input, then consolidate to the UC volume with use_staging=False
Result: failed
Error:
```
OSError: [Errno 29] Illegal seek
```

Failure occurred inside:

nemo_automodel/components/checkpoint/_backports/consolidate_hf_safetensors.py
consolidate_safetensors_files(...)
_consolidate_safetensors_files(...)
_write_data(...)

After:
Code: this PR branch
Resolved commit: c2158e7
Same UC volume path and same input tensors
Test wrote:

/Volumes/workspace/automodel_pr2627/checkpoints/automodel-pr2627/cpu_smoke_output/model-00001-of-00001.safetensors
/Volumes/workspace/automodel_pr2627/checkpoints/automodel-pr2627/cpu_smoke_output/model.safetensors.index.json
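A quick way to check whether a given mount is affected is to probe it for append support before consolidating. The sketch below is an assumption-laden illustration, not part of the smoke test above. `supports_append` is a hypothetical helper, and it only tells you whether a write-then-append round trip succeeds in that directory.

```python
import contextlib
import os
import tempfile


def supports_append(directory: str) -> bool:
    """Return True if a file in `directory` can be reopened in "ab" mode.

    Hypothetical probe: writes a small file, appends to it, reads it back.
    """
    probe = os.path.join(directory, ".append_probe")
    try:
        with open(probe, "wb") as f:
            f.write(b"head")
        with open(probe, "ab") as f:  # the operation some mounts reject
            f.write(b"tail")
        with open(probe, "rb") as f:
            return f.read() == b"headtail"
    except OSError:
        # For example, an error such as "[Errno 29] Illegal seek".
        return False
    finally:
        # Best-effort cleanup; ignore failures on restrictive mounts.
        with contextlib.suppress(OSError):
            os.remove(probe)


# On an ordinary local filesystem the probe is expected to succeed.
with tempfile.TemporaryDirectory() as tmp:
    local_ok = supports_append(tmp)
```

Running the same probe against the target volume path should indicate whether the pre-PR append-based writer would be expected to fail there.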

yuhezhang-ai · 2026-06-22T20:11:09Z

/ok to test c8aee48

yuhezhang-ai · 2026-06-22T20:15:42Z

Hi @huahuajhu, thanks, the PR looks great to me. I just added one unit test that covers multi-input/multi-output safetensors consolidation and verifies the output tensors/index metadata, plus the no-append behavior.

Hi @akoumpa, FYI I reviewed this PR and it looks good to me. I just triggered CI.
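For context, a "no append" regression check can be sketched by spying on `open` and recording the modes a writer uses. This is an illustrative pattern only, not the test added to this PR. `record_open_modes` and `toy_writer` are hypothetical names, and the spy only sees calls made through the builtin `open` name (not `io.open` or `Path.open`).

```python
import builtins
import tempfile
from pathlib import Path
from unittest import mock


def record_open_modes(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, list_of_open_modes)."""
    real_open = builtins.open  # capture before patching
    modes = []

    def spy(file, mode="r", *a, **kw):
        modes.append(mode)
        return real_open(file, mode, *a, **kw)

    # Only calls that resolve to builtins.open are observed by the spy.
    with mock.patch("builtins.open", side_effect=spy):
        result = fn(*args, **kwargs)
    return result, modes


def toy_writer(path):
    """Stand-in for a single-pass writer: header and payload in one stream."""
    with open(path, "wb") as f:
        f.write(b"header")
        f.write(b"payload")


with tempfile.TemporaryDirectory() as tmp:
    _, modes = record_open_modes(toy_writer, Path(tmp) / "shard.bin")

# The assertion a regression test would make: no mode contains "a".
no_append = not any("a" in m for m in modes)
```

In a real test, `toy_writer` would be replaced by the consolidation entry point under test.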

akoumpa · 2026-06-24T06:29:20Z

/ok to test fcadea1


akoumpa · 2026-06-24T20:52:00Z

/ok to test a9eb684

akoumpa · 2026-06-24T21:06:37Z

Thanks a lot for testing this on Unity, @huahuajhu!

I've merged main and retriggered CI, which should get around the temporary job issue.

github-actions Bot added the community-request label Jun 18, 2026

huahuajhu marked this pull request as ready for review June 18, 2026 05:14

huahuajhu requested review from a team and jgerh as code owners June 18, 2026 05:14

Copilot AI review requested due to automatic review settings June 18, 2026 05:14

Copilot started reviewing on behalf of huahuajhu June 18, 2026 05:14

Copilot AI reviewed Jun 18, 2026


Comment thread tests/unit_tests/checkpoint/test_consolidate_safetensors.py

huahuajhu force-pushed the huahuajhu/fix/issue-1092-single-pass-consolidation branch from eab01b7 to 9b4b32d Compare June 18, 2026 05:32

svcnvidia-nemo-ci added the waiting-on-customer Waiting on the original author to respond label Jun 18, 2026

svcnvidia-nemo-ci added waiting-on-maintainers Waiting on maintainers to respond and removed waiting-on-customer Waiting on the original author to respond labels Jun 19, 2026


svcnvidia-nemo-ci removed the waiting-on-maintainers Waiting on maintainers to respond label Jun 22, 2026


Merge branch 'main' into huahuajhu/fix/issue-1092-single-pass-consolidation (a9eb684)


huahuajhu wants to merge 34 commits into NVIDIA-NeMo:main from huahuajhu:huahuajhu/fix/issue-1092-single-pass-consolidation


6 participants
