Group Distributionally Robust Optimization-Driven Reinforcement Learning for LLM Reasoning
\n","updatedAt":"2026-01-30T01:38:53.698Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.7387544512748718},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2601.19280","authors":[{"_id":"697bbd09a67238fac88cbfd7","user":{"_id":"68266b5261ed4d89177c3612","avatarUrl":"/avatars/e41ef277047334174eca408aba2a63db.svg","isPro":false,"fullname":"Kishan Panaganti","user":"kishanpb","type":"user"},"name":"Kishan Panaganti","status":"claimed_verified","statusLastChangedAt":"2026-01-29T21:15:19.705Z","hidden":false},{"_id":"697bbd09a67238fac88cbfd8","user":{"_id":"62ffa3f8311cad266f9af236","avatarUrl":"/avatars/203dac40bc546ee25a01d8715a4b3049.svg","isPro":false,"fullname":"Zhenwen Liang","user":"invokerliang","type":"user"},"name":"Zhenwen Liang","status":"claimed_verified","statusLastChangedAt":"2026-02-02T17:00:23.099Z","hidden":false},{"_id":"697bbd09a67238fac88cbfd9","name":"Wenhao Yu","hidden":false},{"_id":"697bbd09a67238fac88cbfda","name":"Haitao Mi","hidden":false},{"_id":"697bbd09a67238fac88cbfdb","name":"Dong Yu","hidden":false}],"publishedAt":"2026-01-27T07:10:41.000Z","submittedOnDailyAt":"2026-01-29T17:39:08.045Z","title":"Group Distributionally Robust Optimization-Driven Reinforcement Learning for LLM Reasoning","submittedOnDailyBy":{"_id":"68266b5261ed4d89177c3612","avatarUrl":"/avatars/e41ef277047334174eca408aba2a63db.svg","isPro":false,"fullname":"Kishan Panaganti","user":"kishanpb","type":"user"},"summary":"Recent progress in Large Language Model (LLM) reasoning is increasingly driven by the refinement of post-training loss functions and alignment strategies. However, standard Reinforcement Learning (RL) paradigms like Group Relative Policy Optimization (GRPO) remain constrained by static uniformity: uniform prompt sampling and a fixed number of rollouts per prompt. For heterogeneous, heavy-tailed reasoning data, this creates structural inefficiencies that waste compute on already-solved patterns while under-training the long tail of hard problems. To address this, we propose Multi-Adversary Group Distributionally Robust Optimization (GDRO), an optimization-first framework that moves beyond uniform reasoning models by dynamically adapting the training distribution.\n We introduce an Online Difficulty Classifier that partitions prompts into dynamic pass@k difficulty groups. We then propose two independent GDRO games for post-training: (1) Prompt-GDRO, which employs an EMA-debiased multiplicative-weights bandit sampler to target the intensive difficulty margin and upweight persistently hard groups without frequency bias; and (2) Rollout-GDRO, which uses a shadow-price controller to reallocate rollouts across groups, maximizing gradient variance reduction on hard tasks under a fixed mean budget (compute-neutral). We provide no-regret guarantees for both controllers and additionally a variance-proxy analysis motivating a square-root optimal rollout allocation for Rollout-GDRO. We validate our framework on the DAPO 14.1k dataset using Qwen3-Base models. 
Prompt-GDRO and Rollout-GDRO achieve average relative gains of +10.6% and +10.1%, respectively, in pass@8 accuracy across 1.7B, 4B, and 8B scales compared to the GRPO baseline. Qualitative analysis shows an emergent curriculum: the adversaries shift resources to the evolving reasoning frontier, enhancing the reasoning model's performance.","upvotes":9,"discussionId":"697bbd0aa67238fac88cbfdc","ai_summary":"A novel optimization framework dynamically adapts training distributions for large language models by classifying prompt difficulty and reallocating computational resources to improve reasoning performance.","ai_keywords":["Large Language Model","post-training","Reinforcement Learning","GRPO","Group Distributionally Robust Optimization","Online Difficulty Classifier","EMA-debiased multiplicative-weights bandit sampler","shadow-price controller","pass@k","variance reduction","no-regret guarantees","emergent curriculum"],"organization":{"_id":"66543b6e420092799d2f625c","name":"tencent","fullname":"Tencent","avatar":"https://cdn-uploads.huggingface.co/production/uploads/5dd96eb166059660ed1ee413/Lp3m-XLpjQGwBItlvn69q.png"}},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"64b2f97434a92b848c7e941e","avatarUrl":"/avatars/c699c50f3b43cd1641469521127753bb.svg","isPro":false,"fullname":"Nagori","user":"MohammedNaeem","type":"user"},{"_id":"63c1699e40a26dd2db32400d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/63c1699e40a26dd2db32400d/3N0-Zp8igv8-52mXAdiiq.jpeg","isPro":false,"fullname":"Chroma","user":"Chroma111","type":"user"},{"_id":"68266b5261ed4d89177c3612","avatarUrl":"/avatars/e41ef277047334174eca408aba2a63db.svg","isPro":false,"fullname":"Kishan Panaganti","user":"kishanpb","type":"user"},{"_id":"641129818573c51c0458b793","avatarUrl":"/avatars/d4bc67c160a07146cf41c614678aa36b.svg","isPro":false,"fullname":"Tianqing Fang","user":"tqfang229","type":"user"},{"_id":"65c4063740d617a14238f3df","avatarUrl":"/avatars/726b1470e46ad71c9ec233f3f0f396ec.svg","isPro":false,"fullname":"Zikun Li","user":"zikun-li","type":"user"},{"_id":"684d57f26e04c265777ead3f","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/cuOj-bQqukSZreXgUJlfm.png","isPro":false,"fullname":"Joakim Lee","user":"Reinforcement4All","type":"user"},{"_id":"62ffa3f8311cad266f9af236","avatarUrl":"/avatars/203dac40bc546ee25a01d8715a4b3049.svg","isPro":false,"fullname":"Zhenwen Liang","user":"invokerliang","type":"user"},{"_id":"5feab3a28a3201f8e554c969","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1660795228685-5feab3a28a3201f8e554c969.png","isPro":false,"fullname":"Wenhao Yu","user":"wyu1","type":"user"},{"_id":"65147a1426fbd558dbd08f1b","avatarUrl":"/avatars/86574ee2d5c22e940be1c4e50be88675.svg","isPro":false,"fullname":"Haitao Mi","user":"haitaominlp","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0,"organization":{"_id":"66543b6e420092799d2f625c","name":"tencent","fullname":"Tencent","avatar":"https://cdn-uploads.huggingface.co/production/uploads/5dd96eb166059660ed1ee413/Lp3m-XLpjQGwBItlvn69q.png"}}">
AI-generated summary
A novel optimization framework dynamically adapts training distributions for large language models by classifying prompt difficulty and reallocating computational resources to improve reasoning performance.
Recent progress in Large Language Model (LLM) reasoning is increasingly driven by the refinement of post-training loss functions and alignment strategies. However, standard Reinforcement Learning (RL) paradigms like Group Relative Policy Optimization (GRPO) remain constrained by static uniformity: uniform prompt sampling and a fixed number of rollouts per prompt. For heterogeneous, heavy-tailed reasoning data, this creates structural inefficiencies that waste compute on already-solved patterns while under-training the long tail of hard problems. To address this, we propose Multi-Adversary Group Distributionally Robust Optimization (GDRO), an optimization-first framework that moves beyond uniform reasoning models by dynamically adapting the training distribution.
We introduce an Online Difficulty Classifier that partitions prompts into dynamic pass@k difficulty groups. We then propose two independent GDRO games for post-training: (1) Prompt-GDRO, which employs an EMA-debiased multiplicative-weights bandit sampler to target the intensive difficulty margin and upweight persistently hard groups without frequency bias; and (2) Rollout-GDRO, which uses a shadow-price controller to reallocate rollouts across groups, maximizing gradient variance reduction on hard tasks under a fixed mean budget (compute-neutral). We provide no-regret guarantees for both controllers, along with a variance-proxy analysis motivating a square-root-optimal rollout allocation for Rollout-GDRO. We validate our framework on the DAPO 14.1k dataset using Qwen3-Base models. Prompt-GDRO and Rollout-GDRO achieve average relative gains of +10.6% and +10.1%, respectively, in pass@8 accuracy across 1.7B, 4B, and 8B scales compared to the GRPO baseline. Qualitative analysis shows an emergent curriculum: the adversaries shift resources to the evolving reasoning frontier, enhancing the reasoning model's performance.
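The paper page includes no code, but the control loop behind Prompt-GDRO can be pictured with a small sketch. The Python snippet below is only an illustrative guess at an EMA-debiased multiplicative-weights sampler over difficulty groups: the group count, hyperparameters, loss signal, and exact debiasing form are assumptions made for exposition, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 4        # number of pass@k difficulty groups (assumed)
eta = 0.1    # multiplicative-weights step size (assumed)
beta = 0.9   # EMA decay for per-group loss estimates (assumed)
eps = 0.05   # uniform exploration floor so no group starves (assumed)

log_w = np.zeros(K)      # adversary's log-weights over difficulty groups
ema_loss = np.zeros(K)   # EMA of each group's importance-weighted loss

def sampling_probs(log_w):
    """Softmax over the log-weights, mixed with a uniform exploration floor."""
    p = np.exp(log_w - log_w.max())
    p /= p.sum()
    return (1.0 - eps) * p + eps / K

# Stand-in "environment": harder groups have higher failure rates.
true_fail_rate = np.array([0.1, 0.4, 0.7, 0.9])

for step in range(2000):
    p = sampling_probs(log_w)
    g = rng.choice(K, p=p)                         # sample a difficulty group
    loss = rng.binomial(1, true_fail_rate[g])      # 1 = the policy failed the prompt

    # Importance-weight the observation so rarely sampled groups are not
    # systematically underestimated (the debiasing), then smooth it with an EMA.
    ema_loss[g] = beta * ema_loss[g] + (1.0 - beta) * (loss / p[g])

    # Multiplicative-weights step: persistently hard groups gain sampling mass.
    log_w[g] += eta * (1.0 - beta) * ema_loss[g]

print(np.round(sampling_probs(log_w), 3))   # mass should concentrate on hard groups
```

In the actual method the loss signal would come from rollout outcomes under the current policy, and the no-regret analysis constrains how the step size and exploration floor are chosen; this sketch only conveys the shape of the controller.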
Beyond Uniform-data LLM Reasoning: We propose an optimization-first framework, based on Group Distributionally Robust Optimization (GDRO), that moves beyond uniform reasoning models by dynamically adapting the training distribution.
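The other lever for adapting the training distribution is the rollout budget. As a hedged illustration of the square-root-optimal allocation that the variance-proxy analysis motivates, the sketch below spreads a fixed average number of rollouts across difficulty groups in proportion to the square root of a Bernoulli variance proxy. The function name, the use of pass rates as the proxy, and the rounding rule are assumptions, not the paper's exact rule.

```python
import numpy as np

def sqrt_rollout_allocation(pass_rates, group_weights, mean_budget, n_min=1):
    """Hypothetical square-root rollout allocation over difficulty groups.

    Treating per-prompt success as Bernoulli, a variance proxy is
    sigma_g^2 = p_g * (1 - p_g). Minimizing sum_g w_g * sigma_g^2 / n_g
    subject to a fixed weighted-mean budget sum_g w_g * n_g / sum_g w_g = B
    gives n_g proportional to sigma_g, i.e. the square root of the proxy.
    """
    p = np.asarray(pass_rates, dtype=float)
    w = np.asarray(group_weights, dtype=float)
    sigma = np.sqrt(p * (1.0 - p) + 1e-12)                 # sqrt of variance proxy
    alloc = sigma * mean_budget * w.sum() / np.dot(w, sigma)
    # Rounding and the per-group floor make the budget only approximately neutral.
    return np.maximum(np.rint(alloc), n_min).astype(int)

# Example: four groups of equal size, 8 rollouts per prompt on average.
# Nearly solved and nearly impossible groups get fewer rollouts; intermediate
# "frontier" groups, where the variance proxy peaks, get more.
print(sqrt_rollout_allocation([0.9, 0.6, 0.3, 0.05], [1, 1, 1, 1], mean_budget=8))
```

The design intent this illustrates is the compute-neutral reallocation described in the abstract: extra rollouts flow to groups where the reward signal is noisiest, which is where additional samples reduce gradient variance the most.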