Training-Free Group Relative Policy Optimization
arXiv: 2510.08191 | Published: October 9, 2025 | Tencent

Authors: Yuzheng Cai, Siqi Cai, Yuchen Shi, Zihan Xu, Lichao Chen, Yulei Qin, Xiaoyu Tan, Gang Li, Zongyi Li, Haojia Lin, Yong Mao, Ke Li, Xing Sun
AI-generated summary

Training-Free GRPO enhances LLM agent performance in specialized domains by learning experiential knowledge as a token prior without parameter updates, improving out-of-domain tasks with minimal data.

Abstract
Recent advances in Large Language Model (LLM) agents have demonstrated their
promising general capabilities. However, their performance in specialized
real-world domains often degrades due to challenges in effectively integrating
external tools and specific prompting strategies. While methods like agentic
reinforcement learning have been proposed to address this, they typically rely
on costly parameter updates, for example, through a process that uses
Supervised Fine-Tuning (SFT) followed by a Reinforcement Learning (RL) phase
with Group Relative Policy Optimization (GRPO) to alter the output
distribution. However, we argue that LLMs can achieve a similar effect on the
output distribution by learning experiential knowledge as a token prior, which
is a far more lightweight approach that not only addresses practical data
scarcity but also avoids the common issue of overfitting. To this end, we
propose Training-Free Group Relative Policy Optimization (Training-Free GRPO),
a cost-effective solution that enhances LLM agent performance without any
parameter updates. Our method leverages the group relative semantic advantage,
instead of a numerical one, within each group of rollouts, iteratively distilling
high-quality experiential knowledge during multi-epoch learning on minimal
ground-truth data. Such knowledge serves as the learned token prior, which is
seamlessly integrated during LLM API calls to guide model behavior. Experiments
on mathematical reasoning and web searching tasks demonstrate that
Training-Free GRPO, when applied to DeepSeek-V3.1-Terminus, significantly
improves out-of-domain performance. With just a few dozen training samples,
Training-Free GRPO outperforms fine-tuned small LLMs with marginal training
data and cost.
Proposes Training-Free GRPO to boost LLM agent performance without parameter updates by distilling experiential knowledge as a token prior during limited-data, multi-epoch learning.
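The abstract describes the learning loop only at a high level, so below is a minimal Python sketch of how such a training-free, group-relative loop might look. It assumes an OpenAI-compatible chat API; the model name (deepseek-chat), the call_llm and reward helpers, the prompts, and the experience-library format are all illustrative assumptions rather than the paper's actual implementation.

```python
# Minimal sketch of a Training-Free GRPO-style loop (illustrative only, not the
# authors' implementation). Assumes an OpenAI-compatible chat API; the model
# name, prompts, and reward function are placeholders.
from openai import OpenAI

client = OpenAI()          # assumes credentials are configured via environment
MODEL = "deepseek-chat"    # placeholder for DeepSeek-V3.1-Terminus

def call_llm(system: str, user: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    )
    return resp.choices[0].message.content

def reward(answer: str, ground_truth: str) -> float:
    # Toy verifier: substring match on the reference answer; real tasks need
    # a proper checker (math verifier, web-search judge, etc.).
    return 1.0 if ground_truth.strip() in answer else 0.0

experiences: list[str] = []   # the distilled experiential knowledge (token prior)

def prior_prompt() -> str:
    base = "You are a careful problem solver."
    if experiences:
        base += "\nUseful experience from earlier attempts:\n"
        base += "\n".join(f"- {e}" for e in experiences)
    return base

def training_free_grpo(dataset, epochs: int = 3, group_size: int = 4) -> list[str]:
    for _ in range(epochs):
        for question, ground_truth in dataset:
            # 1) Sample a group of rollouts under the current token prior.
            group = [call_llm(prior_prompt(), question) for _ in range(group_size)]
            scores = [reward(ans, ground_truth) for ans in group]
            if len(set(scores)) < 2:
                continue  # no contrast within the group, so no learning signal
            best = group[scores.index(max(scores))]
            worst = group[scores.index(min(scores))]
            # 2) "Semantic advantage": ask the LLM to articulate, in words, what
            #    the better rollout did right, instead of computing a numeric
            #    advantage for a gradient update.
            lesson = call_llm(
                "Extract one short, general, reusable lesson.",
                f"Question:\n{question}\n\nBetter attempt:\n{best}\n\n"
                f"Worse attempt:\n{worst}\n\n"
                "State the key difference as a single actionable rule.",
            )
            # 3) Distill the lesson into the experience library (the token prior).
            experiences.append(lesson.strip())
    return experiences

# At inference time the learned prior is simply prepended to the system prompt:
#   answer = call_llm(prior_prompt(), new_question)
```

The key departure from standard GRPO is step 2: instead of turning the within-group reward contrast into a numeric advantage that drives a gradient update, the contrast is summarized in natural language and carried forward in the prompt, so the model's weights never change.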
Inference providers can detect prompt engineering when the same context is reused over and over (the KV cache already keys on that shared prefix), compare your successive context versions to settle on the best one, and learn which data is most worth extracting for the next round of continued pre-training. Everyone using this method is effectively surfacing the missing domain expertise that models (and inference providers) need for that next round of continued pre-training.
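A brief aside on the KV-cache claim above: provider-side prefix caching is typically keyed on hashes of the shared prompt prefix, which is what would make repeated contexts detectable at all. The sketch below is purely illustrative; the block size, the PrefixCacheStats class, and the statistics it keeps are assumptions for the example, not any provider's actual implementation.

```python
# Illustrative only: why repeated prompt prefixes are visible on the provider
# side when a KV cache is keyed on hashed prefix blocks. Block size, class
# names, and the statistics kept here are made up for the example.
import hashlib
from collections import Counter

BLOCK = 64  # tokens per cached block; real systems choose their own size

def prefix_keys(tokens: list[int]) -> list[str]:
    """Hash each successive prefix block, so identical prefixes share keys."""
    keys: list[str] = []
    h = hashlib.sha256()
    usable = len(tokens) - len(tokens) % BLOCK
    for i in range(0, usable, BLOCK):
        h.update(repr(tokens[i:i + BLOCK]).encode("utf-8"))
        keys.append(h.hexdigest())
    return keys

class PrefixCacheStats:
    """Counts how often each prefix block recurs across incoming requests."""
    def __init__(self) -> None:
        self.hits = Counter()

    def observe(self, tokens: list[int]) -> int:
        keys = prefix_keys(tokens)
        self.hits.update(keys)
        # A high count on a long shared prefix means the same context is being
        # resubmitted with small variations, e.g. iterative prompt engineering.
        return max((self.hits[k] for k in keys), default=0)
```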