\n","updatedAt":"2026-02-12T01:43:55.637Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.7342989444732666},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2602.08847","authors":[{"_id":"698c133d6052d3bed9630bf2","user":{"_id":"66ba29dd59e8e7a957154c5f","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/66ba29dd59e8e7a957154c5f/VvVS7IZNPUIB023GAEf5u.png","isPro":false,"fullname":"Lang Feng","user":"langfeng01","type":"user"},"name":"Lang Feng","status":"claimed_verified","statusLastChangedAt":"2026-02-11T11:13:38.406Z","hidden":false},{"_id":"698c133d6052d3bed9630bf3","user":{"_id":"63db5dc49f2687298a1547bf","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/63db5dc49f2687298a1547bf/xVFi0kRkYud191cQgma16.jpeg","isPro":false,"fullname":"Longtao Zheng","user":"ltzheng","type":"user"},"name":"Longtao Zheng","status":"claimed_verified","statusLastChangedAt":"2026-02-11T11:13:41.494Z","hidden":false},{"_id":"698c133d6052d3bed9630bf4","user":{"_id":"641d6099f9a3a9c532bd3954","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/641d6099f9a3a9c532bd3954/NBrTueSRKkxhdLvCdZ0lt.jpeg","isPro":false,"fullname":"Shuo He","user":"heshuo","type":"user"},"name":"Shuo He","status":"claimed_verified","statusLastChangedAt":"2026-02-12T13:57:44.843Z","hidden":false},{"_id":"698c133d6052d3bed9630bf5","user":{"_id":"64054d8a3d49e1e066bfa32b","avatarUrl":"/avatars/9044f937145cc5aa4bc3a5ffa751f724.svg","isPro":false,"fullname":"Fuxiang Zhang","user":"sicer","type":"user"},"name":"Fuxiang Zhang","status":"claimed_verified","statusLastChangedAt":"2026-02-11T11:13:54.465Z","hidden":false},{"_id":"698c133d6052d3bed9630bf6","name":"Bo An","hidden":false}],"publishedAt":"2026-02-09T16:13:39.000Z","submittedOnDailyAt":"2026-02-11T03:10:03.516Z","title":"Dr. MAS: Stable Reinforcement Learning for Multi-Agent LLM Systems","submittedOnDailyBy":{"_id":"66ba29dd59e8e7a957154c5f","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/66ba29dd59e8e7a957154c5f/VvVS7IZNPUIB023GAEf5u.png","isPro":false,"fullname":"Lang Feng","user":"langfeng01","type":"user"},"summary":"Multi-agent LLM systems enable advanced reasoning and tool use via role specialization, yet reliable reinforcement learning (RL) post-training for such systems remains difficult. In this work, we theoretically pinpoint a key reason for training instability when extending group-based RL to multi-agent LLM systems. We show that under GRPO-style optimization, a global normalization baseline may deviate from diverse agents' reward distributions, which ultimately leads to gradient-norm instability. Based on this finding, we propose Dr. MAS, a simple and stable RL training recipe for multi-agent LLM systems. Dr. MAS uses an agent-wise remedy: normalizing advantages per agent using each agent's own reward statistics, which calibrates gradient scales and dramatically stabilizes training, both theoretically and empirically. Beyond the algorithm, Dr. 
MAS provides an end-to-end RL training framework for multi-agent LLM systems, supporting scalable orchestration, flexible per-agent LLM serving and optimization configs, and shared resource scheduling of LLM actor backends. We evaluate Dr. MAS on multi-agent math reasoning and multi-turn search benchmarks using Qwen2.5 and Qwen3 series models. Dr. MAS achieves clear gains over vanilla GRPO (e.g., +5.6\\% avg@16 and +4.6\\% pass@16 on math, and +15.2\\% avg@16 and +13.1\\% pass@16 on search) while largely eliminating gradient spikes. Moreover, it remains highly effective under heterogeneous agent-model assignments while improving efficiency.","upvotes":24,"discussionId":"698c133d6052d3bed9630bf7","githubRepo":"https://github.com/langfengQ/DrMAS","githubRepoAddedBy":"user","ai_summary":"Multi-agent large language model systems face training instability in reinforcement learning due to global normalization mismatches, which is addressed by Dr. MAS through agent-specific advantage normalization and enhanced training stability.","ai_keywords":["multi-agent LLM systems","reinforcement learning","GRPO-style optimization","gradient-norm instability","agent-wise remedy","advantage normalization","Qwen2.5","Qwen3 series"],"githubStars":78,"organization":{"_id":"6508b28cf36bb51c50faad98","name":"NanyangTechnologicalUniversity","fullname":"Nanyang Technological University","avatar":"https://cdn-uploads.huggingface.co/production/uploads/630ca0817dacb93b33506ce7/ZPD1fvei0bcIGeDXxeSkn.png"}},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"66ba29dd59e8e7a957154c5f","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/66ba29dd59e8e7a957154c5f/VvVS7IZNPUIB023GAEf5u.png","isPro":false,"fullname":"Lang Feng","user":"langfeng01","type":"user"},{"_id":"64054d8a3d49e1e066bfa32b","avatarUrl":"/avatars/9044f937145cc5aa4bc3a5ffa751f724.svg","isPro":false,"fullname":"Fuxiang Zhang","user":"sicer","type":"user"},{"_id":"6728558fc5e9db9a8c054fff","avatarUrl":"/avatars/83635ae0fbe52331a4560b1350f69c7e.svg","isPro":false,"fullname":"Jack Chen","user":"Chen-Jack","type":"user"},{"_id":"652f7bf41ad13fee8c407247","avatarUrl":"/avatars/5c7a74a9edf748025bffeeba97a61505.svg","isPro":false,"fullname":"Shumin","user":"Mystery","type":"user"},{"_id":"63db5dc49f2687298a1547bf","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/63db5dc49f2687298a1547bf/xVFi0kRkYud191cQgma16.jpeg","isPro":false,"fullname":"Longtao Zheng","user":"ltzheng","type":"user"},{"_id":"620783f24e28382272337ba4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/620783f24e28382272337ba4/zkUveQPNiDfYjgGhuFErj.jpeg","isPro":false,"fullname":"GuoLiangTang","user":"Tommy930","type":"user"},{"_id":"6458af46f4d212d780bd7c68","avatarUrl":"/avatars/832fd34bcc041b0b7b551873a459fc3c.svg","isPro":false,"fullname":"Wei Liu","user":"PeterV09","type":"user"},{"_id":"6462def82a83863b97c0611e","avatarUrl":"/avatars/c03e9cc7d75b0266fcc56ecb6ee62148.svg","isPro":false,"fullname":"Yuzhen Huang","user":"yuzhen17","type":"user"},{"_id":"664395621b88258a527cd7d1","avatarUrl":"/avatars/8489ccebe4fd1262679ba63a5cb50bb8.svg","isPro":false,"fullname":"Kira","user":"Kira-wang","type":"user"},{"_id":"641d6099f9a3a9c532bd3954","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/641d6099f9a3a9c532bd3954/NBrTueSRKkxhdLvCdZ0lt.jpeg","isPro":false,"fullname":"Shuo 
He","user":"heshuo","type":"user"},{"_id":"66349404f2c753240d02952a","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/66349404f2c753240d02952a/xKBKicwyk7BoOITQPwBJn.png","isPro":true,"fullname":"ZhuofengLi","user":"ZhuofengLi","type":"user"},{"_id":"624ac233c04d55ec0f42b11e","avatarUrl":"/avatars/58a9abce945e71a65abc8a54085de6d7.svg","isPro":false,"fullname":"oh sehun","user":"sehun","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0,"organization":{"_id":"6508b28cf36bb51c50faad98","name":"NanyangTechnologicalUniversity","fullname":"Nanyang Technological University","avatar":"https://cdn-uploads.huggingface.co/production/uploads/630ca0817dacb93b33506ce7/ZPD1fvei0bcIGeDXxeSkn.png"}}">
Dr. MAS: Stable Reinforcement Learning for Multi-Agent LLM Systems
Abstract
Multi-agent large language model systems suffer training instability under reinforcement learning because a single global normalization baseline can mismatch the diverse reward distributions of individual agents; Dr. MAS addresses this with agent-specific advantage normalization, which stabilizes training.
Multi-agent LLM systems enable advanced reasoning and tool use via role specialization, yet reliable reinforcement learning (RL) post-training for such systems remains difficult. In this work, we theoretically pinpoint a key reason for training instability when extending group-based RL to multi-agent LLM systems. We show that under GRPO-style optimization, a global normalization baseline may deviate from diverse agents' reward distributions, which ultimately leads to gradient-norm instability. Based on this finding, we propose Dr. MAS, a simple and stable RL training recipe for multi-agent LLM systems. Dr. MAS uses an agent-wise remedy: normalizing advantages per agent using each agent's own reward statistics, which calibrates gradient scales and dramatically stabilizes training, both theoretically and empirically. Beyond the algorithm, Dr. MAS provides an end-to-end RL training framework for multi-agent LLM systems, supporting scalable orchestration, flexible per-agent LLM serving and optimization configs, and shared resource scheduling of LLM actor backends. We evaluate Dr. MAS on multi-agent math reasoning and multi-turn search benchmarks using Qwen2.5 and Qwen3 series models. Dr. MAS achieves clear gains over vanilla GRPO (e.g., +5.6% avg@16 and +4.6% pass@16 on math, and +15.2% avg@16 and +13.1% pass@16 on search) while largely eliminating gradient spikes. Moreover, it remains highly effective under heterogeneous agent-model assignments while improving efficiency.
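To make the agent-wise remedy concrete, here is a minimal Python sketch contrasting a global GRPO-style baseline with per-agent advantage normalization. The array layout, agent labels, and reward values are illustrative assumptions, not the authors' implementation (see the GitHub repository for that).

```python
import numpy as np

def grpo_advantages(rewards):
    """Global (vanilla GRPO-style) normalization: one baseline for all rollouts.
    If agents have very different reward scales, this shared baseline can sit far
    from any single agent's distribution, distorting per-agent gradient scales."""
    mean, std = rewards.mean(), rewards.std()
    return (rewards - mean) / (std + 1e-8)

def agentwise_advantages(rewards, agent_ids):
    """Per-agent normalization in the spirit of Dr. MAS: each agent's rewards are
    normalized with that agent's own mean/std, keeping gradient scales calibrated."""
    adv = np.empty_like(rewards, dtype=np.float64)
    for agent in np.unique(agent_ids):
        mask = agent_ids == agent
        mean, std = rewards[mask].mean(), rewards[mask].std()
        adv[mask] = (rewards[mask] - mean) / (std + 1e-8)
    return adv

# Illustrative rollouts: a "solver" agent with small rewards and a "verifier"
# agent with large rewards (values are made up for the example).
rewards   = np.array([0.2, 0.4, 0.3, 8.0, 9.5, 7.5])
agent_ids = np.array(["solver"] * 3 + ["verifier"] * 3)

print(grpo_advantages(rewards))                  # global baseline dominated by the verifier's scale
print(agentwise_advantages(rewards, agent_ids))  # each agent centered on its own statistics
```

Normalizing within each agent's own reward group keeps that agent's advantages zero-mean and unit-scale, which is the property the abstract ties to calibrated gradient norms and the elimination of gradient spikes.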