arxiv:2601.21968

OVD: On-policy Verbal Distillation

Published on Jan 29 · Submitted by Jing Xiong on Feb 3
Authors: Jing Xiong, Hui Shen, Shansan Gong, Yuxin Cheng, Jianghan Shen, Chaofan Tao, Haochen Tan, Haoli Bai, Lifeng Shang, Ngai Wong
Abstract

Knowledge distillation offers a promising path to transfer reasoning capabilities from large teacher models to efficient student models; however, existing token-level on-policy distillation methods require token-level alignment between the student and teacher models, which restricts the student model's exploration ability, prevents effective use of interactive environment feedback, and suffers from severe memory bottlenecks in reinforcement learning. We introduce On-policy Verbal Distillation (OVD), a memory-efficient framework that replaces token-level probability matching with trajectory matching using discrete verbal scores (0–9) from teacher models. OVD dramatically reduces memory consumption while enabling on-policy distillation from teacher models with verbal feedback, and avoids token-level alignment, allowing the student model to freely explore the output space. Extensive experiments on Web question answering and mathematical reasoning tasks show that OVD substantially outperforms existing methods, delivering up to a +12.9% absolute improvement in average EM on Web Q&A tasks and a gain of up to +25.7% on math benchmarks (when trained with only one random sample), while also exhibiting superior training efficiency. Our project page is available at https://OVD.github.io

AI-generated summary

On-policy Verbal Distillation (OVD) enables efficient knowledge transfer from teacher to student models by replacing token-level probability matching with trajectory matching using discrete verbal scores, reducing memory consumption and enabling free exploration without token alignment constraints.
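To make the trajectory-level idea concrete, here is a minimal sketch of what an on-policy verbal-distillation training step could look like, based only on the abstract above: the student samples rollouts on-policy, the teacher returns a discrete verbal score in 0–9 per rollout, and that score (rather than token-level probability matching) drives a policy-gradient-style update. Everything below is a hypothetical reconstruction; `sample_trajectories`, `teacher_verbal_score`, `policy_gradient_step`, and the group-relative score normalization are placeholder assumptions, not the authors' implementation.

```python
# Hedged sketch of an on-policy verbal-distillation step, assuming the recipe
# described in the abstract. All helper names are hypothetical placeholders.
from statistics import mean, pstdev

def sample_trajectories(student, prompt, n=4):
    """Placeholder: draw n on-policy rollouts from the student for one prompt."""
    return [student(prompt) for _ in range(n)]

def teacher_verbal_score(teacher, prompt, trajectory):
    """Placeholder: ask the teacher to grade one trajectory with an integer in 0-9."""
    reply = teacher(f"Rate this answer from 0 to 9:\nQ: {prompt}\nA: {trajectory}")
    digits = [c for c in str(reply) if c.isdigit()]
    return int(digits[0]) if digits else 0  # fall back to 0 on unparsable replies

def ovd_step(student, teacher, prompt, policy_gradient_step, n=4):
    """One on-policy update: sample, score verbally, turn scores into advantages."""
    trajectories = sample_trajectories(student, prompt, n)
    scores = [teacher_verbal_score(teacher, prompt, t) for t in trajectories]
    # Normalize the 0-9 scores within the group so they behave like advantages.
    # (Group-relative normalization is an assumption here, not a claim about OVD.)
    mu = mean(scores)
    sigma = pstdev(scores) or 1.0
    advantages = [(s - mu) / sigma for s in scores]
    for trajectory, advantage in zip(trajectories, advantages):
        # The caller supplies the actual optimizer step, e.g. a REINFORCE-style
        # update that weights the log-probabilities of the sampled tokens by
        # `advantage`; no teacher log-probabilities are ever required.
        policy_gradient_step(student, prompt, trajectory, advantage)
```

Because the teacher only emits a single digit per trajectory, no teacher logits or vocabulary alignment are needed, which is where the memory savings and tokenizer-independence described in the abstract would come from.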

Community

Paper submitter

Knowledge distillation offers a promising path to transfer reasoning capabilities from large teacher models to efficient student models; however, existing token-level on-policy distillation methods require token-level alignment between the student and teacher models, which restricts the student model’s exploration ability, prevents effective use of interactive environment feedback, and suffers from severe memory bottlenecks in reinforcement learning. We introduce On-policy Verbal Distillation (OVD), a memory-efficient framework that replaces token-level probability matching with trajectory matching using discrete verbal scores (0–9) from teacher models. OVD dramatically reduces memory consumption while enabling on-policy distillation from teacher models with verbal feedback, and avoids token-level alignment, allowing the student model to freely explore the output space. Extensive experiments on Web question answering and mathematical reasoning tasks show that OVD substantially outperforms existing methods, delivering up to a +12.9% absolute improvement in average EM on Web Q&A tasks and a gain of up to +25.7% on math benchmarks (when trained with only one random sample), while also exhibiting superior training efficiency. Our project page is available at https://OVD.github.io.

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

- Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models (2026): https://huggingface.co/papers/2601.18734
- CORD: Bridging the Audio-Text Reasoning Gap via Weighted On-policy Cross-modal Distillation (2026): https://huggingface.co/papers/2601.16547
- CTPD: Cross Tokenizer Preference Distillation (2026): https://huggingface.co/papers/2601.11865
- Positive-Unlabeled Reinforcement Learning Distillation for On-Premise Small Models (2026): https://huggingface.co/papers/2601.20687
- Enhancing Agentic RL with Progressive Reward Shaping and Value-based Sampling Policy Optimization (2025): https://huggingface.co/papers/2512.07478
- Reinforcement Learning via Self-Distillation (2026): https://huggingface.co/papers/2601.20802
- ProRAG: Process-Supervised Reinforcement Learning for Retrieval-Augmented Generation (2026): https://huggingface.co/papers/2601.21912

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend


Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2601.21968 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2601.21968 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2601.21968 in a Space README.md to link it from this page.

Collections including this paper 1