docs(xtoken): X-Token distillation guide and README updates#2854

Open
avenkateshha wants to merge 4 commits into main from avenkateshha/xtoken-docs

@avenkateshha

Contributor

What

Documentation-only changes for X-Token (cross-tokenizer off-policy) distillation, split out from #2797 so the docs can be reviewed independently of the multi-teacher code:

  • docs/guides/xtoken-off-policy-distillation.md — guide updates: simplified per review, multi-teacher run for results + eval table, and a tokenizer-overlap motivation paragraph + figure.
  • README.md — rename the feature to "X-Token Distillation" and add a distillation support matrix.
  • docs/assets/ — add tokenizer_overlap_matrix.png and xtoken_mt_curves.png; drop the stale xtoken_pkl_smoke_curves.png.

Why a separate PR

These docs incorporate offline review feedback on the (now-merged) #2508. The same changes are also present in the multi-teacher PR #2797; this PR carries only the docs so they can land independently of the code.

🤖 Generated with Claude Code

@avenkateshha avenkateshha requested review from a team as code owners June 16, 2026 21:50
@copy-pr-bot

copy-pr-bot Bot commented Jun 16, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions Bot added the "Documentation" label (Improvements or additions to documentation) Jun 16, 2026
avenkateshha and others added 3 commits June 16, 2026 15:04
- Shorten the H1 to 'Cross-Tokenizer (X-Token)'.
- Trim the future-work note to a generic 'actively improving support'.
- Reframe Step 2 around tokenizer overlap (similar algorithms such as BPE).
- Describe Step 3's sparse [V_student, top_k] projection representation and
  why it avoids a dense [student_vocab, teacher_vocab] matrix.
- Remove the --preserve_last paragraph; the recommended recipe disables the
  scale trick, so it never engages.
- Drop the 'via CUDA IPC' qualifier from the P-KL loss-mode row.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Adithya Hanasoge <avenkateshha@nvidia.com>
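
To make the sparse-versus-dense point in the commit above concrete, here is a minimal sketch in plain Python. It is not the guide's implementation: the function names, the toy indices and weights, the vocabulary sizes, and the direction of the projection (teacher probabilities mapped onto student tokens) are all assumptions chosen for illustration. The idea is that each student token keeps only `top_k` teacher token ids and weights, so storage grows with `V_student * top_k` rather than `student_vocab * teacher_vocab`.

```python
def dense_entries(student_vocab, teacher_vocab):
    """Number of entries in a dense [student_vocab, teacher_vocab] matrix."""
    return student_vocab * teacher_vocab


def sparse_entries(student_vocab, top_k):
    """Entries in a sparse [V_student, top_k] layout: one index and one weight per slot."""
    return 2 * student_vocab * top_k


def project_teacher_probs(teacher_probs, indices, weights):
    """Hypothetical projection: for student token i, sum the weighted
    probabilities of the top_k teacher tokens listed in indices[i]."""
    return [
        sum(w * teacher_probs[j] for j, w in zip(idx_row, w_row))
        for idx_row, w_row in zip(indices, weights)
    ]


# Toy example: 3 student tokens, 4 teacher tokens, top_k = 2 (all values made up).
indices = [[0, 1], [2, 3], [1, 2]]
weights = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
teacher_probs = [0.1, 0.2, 0.3, 0.4]
student_scores = project_teacher_probs(teacher_probs, indices, weights)

# Illustrative sizes only, not taken from the guide.
dense = dense_entries(128_000, 150_000)
sparse = sparse_entries(128_000, 8)
```

With these made-up sizes the dense layout needs 19.2 billion entries while the sparse layout needs about 2 million, which is the kind of gap the commit message refers to.
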
- Drop 'Off-Policy' from the X-Token section/subsection headings and the
  feature-list entry; update the matching table-of-contents anchors.
- Reword the section's first line to 'distillation' (no 'off-policy').
- Add a link to the x-token distillation paper alongside the implementation
  guide in the 'read about the details' line.

Real file paths (run_xtoken_off_policy_distillation.py,
xtoken_off_policy_distillation.yaml, the guide filename) are left unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Adithya Hanasoge <avenkateshha@nvidia.com>
Add a paragraph and figure on pairwise tokenizer vocabulary overlap
(intersection over min vocab size) to motivate why the cross-tokenizer
projection is necessary.

Signed-off-by: Adithya Hanasoge <avenkateshha@nvidia.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
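
The overlap metric named in the commit above (intersection over min vocab size) can be sketched as follows. This is an illustrative reading of that one-line description, not the script used to produce the figure; the function names and the toy vocabularies are hypothetical.

```python
def vocab_overlap(vocab_a, vocab_b):
    """Shared tokens divided by the size of the smaller vocabulary."""
    a, b = set(vocab_a), set(vocab_b)
    if not a or not b:
        raise ValueError("both vocabularies must be non-empty")
    return len(a & b) / min(len(a), len(b))


def overlap_matrix(vocabs):
    """Pairwise overlap for a dict mapping tokenizer name -> iterable of tokens."""
    names = list(vocabs)
    return {(x, y): vocab_overlap(vocabs[x], vocabs[y]) for x in names for y in names}


# Toy vocabularies (made up); real tokenizers have tens of thousands of tokens.
toy = {
    "tok_a": ["the", "cat", "##s", "sat"],
    "tok_b": ["the", "cat", "Ġsat"],
    "tok_c": ["le", "chat"],
}
matrix = overlap_matrix(toy)
```

Dividing by the smaller vocabulary means a tokenizer whose vocabulary is fully contained in a larger one scores 1.0, so the metric reflects how much of the smaller vocabulary is shared rather than penalizing size differences.
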
@avenkateshha avenkateshha force-pushed the avenkateshha/xtoken-docs branch from 6b0c4a0 to 439d6dd Compare June 16, 2026 22:05
Add a Distillation overview table comparing the MOPD (on-policy,
Megatron) and xToken (off-policy, DTensor V2) recipes across
multi-teacher, async, policy, loss, tokenizer, and backend, preceding
the recipe sections.

Signed-off-by: Adithya Hanasoge <avenkateshha@nvidia.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>