Paper page - Training-Free Tokenizer Transplantation via Orthogonal Matching Pursuit

Papers
arxiv:2506.06607

Training-Free Tokenizer Transplantation via Orthogonal Matching Pursuit

Published on Jun 7, 2025
· Submitted by
Charles Goddard
on Jun 10, 2025

Abstract

A training-free method using Orthogonal Matching Pursuit (OMP) effectively transplants tokenizers in pretrained large language models, preserving performance across different tokenizers without gradient updates.

AI-generated summary

We present a training-free method to transplant tokenizers in pretrained large language models (LLMs) by reconstructing unseen token embeddings via Orthogonal Matching Pursuit (OMP). Specifically, we approximate each out-of-vocabulary token as a sparse linear combination of shared tokens, in two phases: first, compute each new token's representation in the donor embedding space with a small dictionary of shared anchor tokens, then transfer these same sparse coefficients back into the base model's embedding space. On two challenging cross-tokenizer tasks -- Llama to Mistral NeMo (12B) and Qwen to Llama (1B) -- we show that OMP achieves the best zero-shot preservation of the base model's performance across multiple benchmarks, while other zero-shot approaches degrade significantly. Compared to baselines (zero-init, mean-init, and existing approaches like WECHSEL, FOCUS, ZETT), OMP consistently achieves the best overall performance, effectively bridging large tokenizer discrepancies without gradient updates. Our analysis further identifies mismatched numerical tokenization schemes as a critical challenge for preserving mathematical reasoning capabilities. This technique enables direct reuse of pretrained model weights with new tokenizers, facilitating cross-tokenizer knowledge distillation, speculative decoding, ensembling, merging, and domain-specific vocabulary adaptations. We integrate our method into the open-source mergekit-tokensurgeon tool for post hoc vocabulary realignment.
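The two-phase procedure described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (omp_coefficients, transplant_embedding), the default sparsity k, and the plain least-squares refit are illustrative assumptions. The sketch assumes you have already built row-aligned anchor embedding matrices for the donor and base models from the tokens shared by both vocabularies.

```python
# Minimal sketch of training-free tokenizer transplantation via OMP.
# Assumptions (not from the paper's code): anchor embeddings are row-aligned
# across donor and base models, and k anchors per token is a tunable budget.
import numpy as np

def omp_coefficients(dictionary, target, k):
    """Greedy Orthogonal Matching Pursuit.

    dictionary: (n_anchors, d_donor) donor-space embeddings of shared anchor tokens.
    target:     (d_donor,) donor-space embedding of the out-of-vocabulary token.
    Returns (selected_indices, coefficients) such that
    dictionary[selected].T @ coefficients approximates target.
    """
    residual = target.copy()
    selected = []
    coeffs = np.zeros(0)
    for _ in range(k):
        # Pick the anchor most correlated with the current residual.
        scores = dictionary @ residual
        if selected:
            scores[selected] = 0.0  # never reselect an anchor
        idx = int(np.argmax(np.abs(scores)))
        selected.append(idx)
        # Re-fit all coefficients over the selected anchors (least squares).
        sub = dictionary[selected]                          # (|S|, d_donor)
        coeffs, *_ = np.linalg.lstsq(sub.T, target, rcond=None)
        residual = target - sub.T @ coeffs
    return selected, coeffs

def transplant_embedding(donor_anchor_emb, base_anchor_emb, donor_new_emb, k=8):
    """Phase 1: express the new token over shared anchors in the donor space.
    Phase 2: reuse the same sparse coefficients over the base model's
    embeddings of those anchors to synthesize the token's base-space embedding.
    """
    selected, coeffs = omp_coefficients(donor_anchor_emb, donor_new_emb, k)
    return base_anchor_emb[selected].T @ coeffs             # (d_base,)
```

In practice the same reconstruction would be repeated for every token of the new tokenizer that is missing from the base vocabulary (and, presumably, for both the input embeddings and the output head); the paper's mergekit-tokensurgeon integration handles that full post hoc vocabulary realignment, whereas the sketch above covers a single token.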

Community

Paper author Paper submitter

A training-free method to transplant tokenizers between pretrained language models.

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

Sign up or log in to comment

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2506.06607 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2506.06607 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2506.06607 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.