docs(xtoken): X-Token distillation guide and README updates#2854
Open
avenkateshha wants to merge 4 commits into
- Shorten the H1 to 'Cross-Tokenizer (X-Token)'.
- Trim the future-work note to a generic 'actively improving support'.
- Reframe Step 2 around tokenizer overlap (similar algorithms such as BPE).
- Describe Step 3's sparse [V_student, top_k] projection representation and why it avoids a dense [student_vocab, teacher_vocab] matrix.
- Remove the --preserve_last paragraph; the recommended recipe disables the scale trick, so it never engages.
- Drop the 'via CUDA IPC' qualifier from the P-KL loss-mode row.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Adithya Hanasoge <avenkateshha@nvidia.com>
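For reviewers who want a concrete picture of the sparse [V_student, top_k] representation mentioned above, here is a minimal NumPy sketch. It assumes each student token keeps `top_k` teacher-token indices plus matching weights; the names `idx`, `weights`, and `project_sparse` are hypothetical and not taken from the implementation. It checks that the sparse form gives the same result as an equivalent dense [student_vocab, teacher_vocab] matrix while storing far fewer entries.

```python
import numpy as np

def project_sparse(teacher_probs, idx, weights):
    """Map teacher probabilities onto the student vocabulary.

    teacher_probs: [V_teacher]
    idx, weights:  [V_student, top_k] (hypothetical layout, an assumption)
    """
    # Gather the top_k teacher entries per student token, then weight and sum.
    return (teacher_probs[idx] * weights).sum(axis=-1)

# Toy sizes; real vocabularies are orders of magnitude larger.
V_student, V_teacher, top_k = 4, 6, 2
rng = np.random.default_rng(0)
idx = rng.integers(0, V_teacher, size=(V_student, top_k))
weights = rng.random((V_student, top_k))
teacher_probs = rng.random(V_teacher)

# Equivalent dense matrix, built only to verify the sparse result.
dense = np.zeros((V_student, V_teacher))
rows = np.arange(V_student)[:, None]
np.add.at(dense, (rows, idx), weights)  # add.at accumulates repeated indices

sparse_out = project_sparse(teacher_probs, idx, weights)
dense_out = dense @ teacher_probs

# Stored entries: indices plus weights versus the full matrix.
sparse_entries = 2 * V_student * top_k
dense_entries = V_student * V_teacher
```

The storage cost grows with `V_student * top_k` rather than `V_student * V_teacher`, which is the reason given for avoiding the dense matrix.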
- Drop 'Off-Policy' from the X-Token section/subsection headings and the feature-list entry; update the matching table-of-contents anchors.
- Reword the section's first line to 'distillation' (no 'off-policy').
- Add a link to the x-token distillation paper alongside the implementation guide in the 'read about the details' line.

Real file paths (run_xtoken_off_policy_distillation.py, xtoken_off_policy_distillation.yaml, the guide filename) are left unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Adithya Hanasoge <avenkateshha@nvidia.com>
Add a paragraph and figure on pairwise tokenizer vocabulary overlap (intersection over min vocab size) to motivate why the cross-tokenizer projection is necessary.

Signed-off-by: Adithya Hanasoge <avenkateshha@nvidia.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
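The overlap metric named in this commit (intersection over min vocab size) can be sketched as follows. This is an illustration on toy vocabularies, not the script used to produce the figure; the function names and the toy token lists are hypothetical.

```python
def vocab_overlap(vocab_a, vocab_b):
    """Return |A ∩ B| / min(|A|, |B|) for two token collections."""
    a, b = set(vocab_a), set(vocab_b)
    if not a or not b:
        raise ValueError("both vocabularies must be non-empty")
    return len(a & b) / min(len(a), len(b))

def pairwise_overlap(vocabs):
    """Build the full pairwise overlap matrix as a dict keyed by name pairs."""
    names = list(vocabs)
    return {(m, n): vocab_overlap(vocabs[m], vocabs[n]) for m in names for n in names}

# Toy vocabularies (assumptions for illustration only).
toy = {
    "tok_a": ["the", "cat", "##s", "dog"],
    "tok_b": ["the", "cat", "Ġdog"],
    "tok_c": ["x", "y"],
}
matrix = pairwise_overlap(toy)
```

Dividing by the smaller vocabulary means a tokenizer whose tokens are fully contained in a larger one scores 1.0, so low values indicate token strings that do not match at all, which is the case the cross-tokenizer projection is meant to handle.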
6b0c4a0 to
439d6dd
Add a Distillation overview table comparing the MOPD (on-policy, Megatron) and xToken (off-policy, DTensor V2) recipes across multi-teacher, async, policy, loss, tokenizer, and backend, preceding the recipe sections.

Signed-off-by: Adithya Hanasoge <avenkateshha@nvidia.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
What
Documentation-only changes for X-Token (cross-tokenizer off-policy) distillation, split out from #2797 so the docs can be reviewed independently of the multi-teacher code:
- `docs/guides/xtoken-off-policy-distillation.md`: guide updates: simplified per review, multi-teacher run for results + eval table, and a tokenizer-overlap motivation paragraph + figure.
- `README.md`: rename the feature to "X-Token Distillation" and add a distillation support matrix.
- `docs/assets/`: add `tokenizer_overlap_matrix.png` and `xtoken_mt_curves.png`; drop the stale `xtoken_pkl_smoke_curves.png`.

Why a separate PR
These docs incorporate offline review feedback on the (now-merged) #2508. The same changes are also present in the multi-teacher PR #2797; this PR carries only the docs so they can land independently of the code.
🤖 Generated with Claude Code