fix(data): stabilize multi-turn chat chunking and tokenization #2856

Open
jinglinglingling wants to merge 3 commits into main from fix/issue-2821-2844-message-log-tokenization-main

Conversation

jinglinglingling (Contributor) commented Jun 17, 2026

Use overlap-aware chunk extraction and context-aware token slicing in get_formatted_message_log so non-monotonic reasoning templates do not duplicate prior assistant text and sentencepiece tokenizers do not produce leading-space drift on assistant targets.
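The two ideas named in the description can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the helper names (`extract_new_chunk`, `slice_tokens_in_context`), the toy chat tags, and `toy_encode` (a stand-in that mimics a SentencePiece-style leading word-boundary marker) are all assumptions made for the example. The sketch assumes the chat template may rewrite earlier turns (for instance, dropping reasoning text from a prior assistant message) and that the prefix ends on a token boundary such as a special tag.

```python
import re
from typing import Callable, List

SP = "\u2581"  # word-boundary marker used by SentencePiece-style vocabularies


def common_prefix_len(a: str, b: str) -> int:
    """Number of leading characters that a and b share."""
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n


def overlap_len(left: str, right: str) -> int:
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for k in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:k]):
            return k
    return 0


def extract_new_chunk(prev_text: str, curr_text: str) -> str:
    """Hypothetical overlap-aware extraction of the text added by the newest message.

    If the template only appends, this reduces to curr_text[len(prev_text):].
    If the template rewrote an earlier turn, the text that survives the rewrite
    is detected as an overlap and skipped instead of being emitted twice.
    """
    shared = common_prefix_len(prev_text, curr_text)
    prev_tail, curr_tail = prev_text[shared:], curr_text[shared:]
    return curr_tail[overlap_len(prev_tail, curr_tail):]


def toy_encode(text: str) -> List[str]:
    """Toy tokenizer (illustration only): prepends a boundary marker to the
    input, keeps <tags> as single tokens, and splits the rest on the marker."""
    marked = SP + text.replace(" ", SP)
    return re.findall(rf"<[^>]+>|{SP}?[^{SP}<]+", marked)


def slice_tokens_in_context(
    encode: Callable[[str], List[str]], prefix: str, chunk: str
) -> List[str]:
    """Hypothetical context-aware slicing: tokenize prefix + chunk together and
    drop the prefix tokens, so the chunk is not treated as the start of a string."""
    prefix_tokens = encode(prefix)
    full_tokens = encode(prefix + chunk)
    return full_tokens[len(prefix_tokens):]


# A template that strips <think>...</think> from the earlier assistant turn.
prev = "<u>hi</u><a><think>r</think>hello</a>"
curr = "<u>hi</u><a>hello</a><u>more</u>"
new_chunk = extract_new_chunk(prev, curr)

isolated = toy_encode("hi there")
in_context = slice_tokens_in_context(toy_encode, "<a>", "hi there")
```

In this toy setup, `new_chunk` contains only the newly added user turn rather than repeating `hello</a>`, and `in_context` starts with `hi` while `isolated` starts with the marker-prefixed token, which is the kind of leading-space drift the description refers to. Slicing by prefix token count is only safe when the prefix tokenizes identically on its own and inside the longer string.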

What does this PR do?

Stabilizes multi-turn chat chunking and tokenization in get_formatted_message_log.

Issues

#2844
#2821

jinglinglingling requested review from a team as code owners on June 17, 2026 04:55

copy-pr-bot (Bot) commented Jun 17, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

jinglinglingling added the CI:Lfast label (Runs a fast test suite and re-use nightly `main` container (but sync dependencies to PRs version)) on Jun 17, 2026

jinglinglingling (Contributor, Author) commented

/ok to test c716f68

Linglin Jing added 2 commits June 16, 2026 22:06
Use overlap-aware chunk extraction and context-aware token slicing in get_formatted_message_log so non-monotonic reasoning templates do not duplicate prior assistant text and sentencepiece tokenizers do not produce leading-space drift on assistant targets.

Signed-off-by: Linglin Jing <linglinj@cw-dfw-cs-001-vscode-01.cm.cluster>
Signed-off-by: Linglin Jing <linglinj@cw-dfw-cs-001-vscode-01.cm.cluster>
jinglinglingling force-pushed the fix/issue-2821-2844-message-log-tokenization-main branch from aeb1875 to df2f8a2 on June 17, 2026 05:08

jinglinglingling (Contributor, Author) commented

/ok to test df2f8a2

Signed-off-by: Linglin Jing <linglinj@cw-dfw-cs-001-vscode-01.cm.cluster>
jinglinglingling requested a review from yuki-97 on June 17, 2026 05:54

jinglinglingling (Contributor, Author) commented Jun 17, 2026

Hi @yuki-97, please review this PR for #2844 and #2821.

Labels

CI:Lfast Runs a fast test suite and re-use nightly `main` container (but sync dependencies to PRs version)
