Paper page - Length-Unbiased Sequence Policy Optimization: Revealing and Controlling Response Length Variation in RLVR
\n","updatedAt":"2026-02-06T09:30:30.919Z","author":{"_id":"60c82ac0dfbd57a384a01127","avatarUrl":"/avatars/5422e1be5f9eb06ac396fcd2430641c5.svg","fullname":"Ji Seunghyun","name":"sorryhyun","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":2,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.6025627851486206},"editors":["sorryhyun"],"editorAvatarUrls":["/avatars/5422e1be5f9eb06ac396fcd2430641c5.svg"],"reactions":[],"isReport":false}},{"id":"698657c1c5a3101833444c1a","author":{"_id":"65243980050781c16f234f1f","avatarUrl":"/avatars/743a009681d5d554c27e04300db9f267.svg","fullname":"Avi","name":"avahal","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":3,"isUserFollowing":false},"createdAt":"2026-02-06T21:06:09.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"arXivLens breakdown of this paper 👉 https://arxivlens.com/PaperView/Details/length-unbiased-sequence-policy-optimization-revealing-and-controlling-response-length-variation-in-rlvr-6117-71c4edfe\n- Executive Summary\n- Detailed Breakdown\n- Practical Applications","html":"
\n","updatedAt":"2026-02-06T21:06:09.272Z","author":{"_id":"65243980050781c16f234f1f","avatarUrl":"/avatars/743a009681d5d554c27e04300db9f267.svg","fullname":"Avi","name":"avahal","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":3,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.7309274673461914},"editors":["avahal"],"editorAvatarUrls":["/avatars/743a009681d5d554c27e04300db9f267.svg"],"reactions":[],"isReport":false}},{"id":"6986986adfd17b2d3dacaa4a","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false},"createdAt":"2026-02-07T01:42:02.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"This is an automated message from the [Librarian Bot](https://huggingface.co/librarian-bots). I found the following papers similar to this paper. \n\nThe following papers were recommended by the Semantic Scholar API \n\n* [Well Begun, Half Done: Reinforcement Learning with Prefix Optimization for LLM Reasoning](https://huggingface.co/papers/2512.15274) (2025)\n* [Orchestrating Tokens and Sequences: Dynamic Hybrid Policy Optimization for RLVR](https://huggingface.co/papers/2601.05607) (2026)\n* [Rethinking Sample Polarity in Reinforcement Learning with Verifiable Rewards](https://huggingface.co/papers/2512.21625) (2025)\n* [DISPO: Enhancing Training Efficiency and Stability in Reinforcement Learning for Large Language Model Mathematical Reasoning](https://huggingface.co/papers/2602.00983) (2026)\n* [A Step Back: Prefix Importance Ratio Stabilizes Policy Optimization](https://huggingface.co/papers/2601.22718) (2026)\n* [Prompt Augmentation Scales up GRPO Training on Mathematical Reasoning](https://huggingface.co/papers/2602.03190) (2026)\n* [Sparse-RL: Breaking the Memory Wall in LLM Reinforcement Learning via Stable Sparse Rollouts](https://huggingface.co/papers/2601.10079) (2026)\n\n\n Please give a thumbs up to this comment if you found it helpful!\n\n If you want recommendations for any Paper on Hugging Face checkout [this](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers) Space\n\n You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`","html":"
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
\n
The following papers were recommended by the Semantic Scholar API
Please give a thumbs up to this comment if you found it helpful!
\n
If you want recommendations for any paper on Hugging Face, check out [this Space](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers).
\n
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`
\n","updatedAt":"2026-02-07T01:42:02.519Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.7457563281059265},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2602.05261","authors":[{"_id":"69855cbe4ad556f294b7eb29","user":{"_id":"6689f7a1683da3ea29b4cee5","avatarUrl":"/avatars/988641cde34a60d183208fd9a2a72392.svg","isPro":false,"fullname":"Fanfan Liu","user":"liufanfanlff","type":"user"},"name":"Fanfan Liu","status":"claimed_verified","statusLastChangedAt":"2026-02-06T18:51:59.637Z","hidden":false},{"_id":"69855cbe4ad556f294b7eb2a","name":"Youyang Yin","hidden":false},{"_id":"69855cbe4ad556f294b7eb2b","name":"Peng Shi","hidden":false},{"_id":"69855cbe4ad556f294b7eb2c","name":"Siqi Yang","hidden":false},{"_id":"69855cbe4ad556f294b7eb2d","name":"Zhixiong Zeng","hidden":false},{"_id":"69855cbe4ad556f294b7eb2e","name":"Haibo Qiu","hidden":false}],"publishedAt":"2026-02-05T03:35:38.000Z","submittedOnDailyAt":"2026-02-06T00:52:48.391Z","title":"Length-Unbiased Sequence Policy Optimization: Revealing and Controlling Response Length Variation in RLVR","submittedOnDailyBy":{"_id":"6689f7a1683da3ea29b4cee5","avatarUrl":"/avatars/988641cde34a60d183208fd9a2a72392.svg","isPro":false,"fullname":"Fanfan Liu","user":"liufanfanlff","type":"user"},"summary":"Recent applications of Reinforcement Learning with Verifiable Rewards (RLVR) to Large Language Models (LLMs) and Vision-Language Models (VLMs) have demonstrated significant success in enhancing reasoning capabilities for complex tasks. During RLVR training, an increase in response length is often regarded as a key factor contributing to the growth of reasoning ability. However, the patterns of change in response length vary significantly across different RLVR algorithms during the training process. To provide a fundamental explanation for these variations, this paper conducts an in-depth analysis of the components of mainstream RLVR algorithms. We present a theoretical analysis of the factors influencing response length and validate our theory through extensive experimentation. Building upon these theoretical findings, we propose the Length-Unbiased Sequence Policy Optimization (LUSPO) algorithm. Specifically, we rectify the length bias inherent in Group Sequence Policy Optimization (GSPO), rendering its loss function unbiased with respect to response length and thereby resolving the issue of response length collapse. We conduct extensive experiments across mathematical reasoning benchmarks and multimodal reasoning scenarios, where LUSPO consistently achieves superior performance. 
Empirical results demonstrate that LUSPO represents a novel, state-of-the-art optimization strategy compared to existing methods such as GRPO and GSPO.","upvotes":49,"discussionId":"69855cbe4ad556f294b7eb2f","githubRepo":"https://github.com/murphy4122/LUSPO","githubRepoAddedBy":"user","ai_summary":"Research analyzes RLVR algorithms' impact on response length in LLMs and VLMs, proposing LUSPO to eliminate length bias and improve reasoning performance.","ai_keywords":["Reinforcement Learning with Verifiable Rewards","LLMs","Vision-Language Models","response length","sequence policy optimization","Group Sequence Policy Optimization","Length-Unbiased Sequence Policy Optimization","mathematical reasoning","multimodal reasoning"],"githubStars":7},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"6689f7a1683da3ea29b4cee5","avatarUrl":"/avatars/988641cde34a60d183208fd9a2a72392.svg","isPro":false,"fullname":"Fanfan Liu","user":"liufanfanlff","type":"user"},{"_id":"64fc2679d75293f417f7a254","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/noauth/9lcCFu4ec2Xa3Uo3QRH_x.jpeg","isPro":false,"fullname":"Haibo Qiu","user":"haiboqiu","type":"user"},{"_id":"646f3418a6a58aa29505fd30","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/646f3418a6a58aa29505fd30/1z13rnpb6rsUgQsYumWPg.png","isPro":false,"fullname":"QINGHE WANG","user":"Qinghew","type":"user"},{"_id":"693f91d7ed7d40c019934508","avatarUrl":"/avatars/0d73f098627c9ebd2ae7d90e693a34f6.svg","isPro":false,"fullname":"Yufeng Zhong","user":"Albert-Zhong","type":"user"},{"_id":"695fa5ed209d45ebd9fbdb20","avatarUrl":"/avatars/d0bdf267da10535a207122608e8e921d.svg","isPro":false,"fullname":"UITron-hub","user":"UITron-hub","type":"user"},{"_id":"6881d354d7bd4dc474194654","avatarUrl":"/avatars/6079a4092f1be03bba2ffbf9ff566099.svg","isPro":false,"fullname":"DocTron","user":"DocTron","type":"user"},{"_id":"6593c00a674349122cac04d8","avatarUrl":"/avatars/4c1ea7a8e5f95d06a961e11cf98a43d5.svg","isPro":false,"fullname":"zlm","user":"zlm898","type":"user"},{"_id":"620783f24e28382272337ba4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/620783f24e28382272337ba4/zkUveQPNiDfYjgGhuFErj.jpeg","isPro":false,"fullname":"GuoLiangTang","user":"Tommy930","type":"user"},{"_id":"697c61b179d1cb6bcb88fe26","avatarUrl":"/avatars/49e3db4c68ebfcdc7798d7758e7d0ae6.svg","isPro":false,"fullname":"Pengyu He","user":"pengyu11","type":"user"},{"_id":"6436618aeef1f55654a9f458","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6436618aeef1f55654a9f458/OvxGtuDg2GAFG9As-2hzW.jpeg","isPro":false,"fullname":"Haoran Wei","user":"HaoranWei","type":"user"},{"_id":"648e77184cae4f6921dbb382","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/648e77184cae4f6921dbb382/zAAJRvOStC0wZplqVWrk_.jpeg","isPro":false,"fullname":"Xue Yang","user":"yangxue","type":"user"},{"_id":"65813bd3035c028f3340a12b","avatarUrl":"/avatars/488f401f1abcad4ac5ea3c18205c885c.svg","isPro":false,"fullname":"siqi yang","user":"siqiya","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0}">
AI-generated summary
Research analyzes RLVR algorithms' impact on response length in LLMs and VLMs, proposing LUSPO to eliminate length bias and improve reasoning performance.
Abstract
Recent applications of Reinforcement Learning with Verifiable Rewards (RLVR) to Large Language Models (LLMs) and Vision-Language Models (VLMs) have demonstrated significant success in enhancing reasoning capabilities for complex tasks. During RLVR training, an increase in response length is often regarded as a key factor contributing to the growth of reasoning ability. However, the patterns of change in response length vary significantly across different RLVR algorithms during the training process. To provide a fundamental explanation for these variations, this paper conducts an in-depth analysis of the components of mainstream RLVR algorithms. We present a theoretical analysis of the factors influencing response length and validate our theory through extensive experimentation. Building upon these theoretical findings, we propose the Length-Unbiased Sequence Policy Optimization (LUSPO) algorithm. Specifically, we rectify the length bias inherent in Group Sequence Policy Optimization (GSPO), rendering its loss function unbiased with respect to response length and thereby resolving the issue of response length collapse. We conduct extensive experiments across mathematical reasoning benchmarks and multimodal reasoning scenarios, where LUSPO consistently achieves superior performance. Empirical results demonstrate that LUSPO represents a novel, state-of-the-art optimization strategy compared to existing methods such as GRPO and GSPO.
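For context on where response length enters the loss that LUSPO modifies, the display below recalls the sequence-level clipped objective used by GSPO (taken from the GSPO formulation, not from this page). The length-dependent term is the 1/|y_i| exponent in the sequence importance ratio, which is the bias the abstract says LUSPO rectifies; this is background only, not the LUSPO objective itself.

```latex
% GSPO-style sequence-level clipped objective, recalled here as background.
% The exponent 1/|y_i| is where response length enters the sequence ratio s_i(theta).
\[
  s_i(\theta) \;=\; \left( \frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{\mathrm{old}}}(y_i \mid x)} \right)^{1/|y_i|},
  \qquad
  \mathcal{J}_{\mathrm{GSPO}}(\theta) \;=\;
  \mathbb{E}\!\left[ \frac{1}{G} \sum_{i=1}^{G}
    \min\!\Big( s_i(\theta)\,\hat{A}_i,\;
      \operatorname{clip}\!\big(s_i(\theta),\, 1-\varepsilon,\, 1+\varepsilon\big)\,\hat{A}_i \Big)
  \right]
\]
```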
We introduce Length-Unbiased Sequence Policy Optimization (LUSPO), a novel reinforcement learning algorithm for training large language models. LUSPO consistently outperforms GRPO and GSPO on both dense small-scale models and large-scale MoE models. GitHub: https://github.com/murphy4122/LUSPO
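To make the length dependence concrete in code, here is a minimal PyTorch sketch of a generic sequence-level clipped surrogate loss with a toggle for the 1/|y_i| normalization. It is an illustrative sketch only, not the LUSPO implementation from the linked repository; the function name, tensor layout, and the `length_normalize` flag are assumptions made for the example.

```python
# Minimal sketch of a sequence-level clipped surrogate loss (GSPO-style), with a
# toggle marking where response length enters the objective. Illustrative only;
# NOT the LUSPO implementation from https://github.com/murphy4122/LUSPO.
import torch


def sequence_surrogate_loss(logp_new, logp_old, advantages, response_mask,
                            clip_eps=0.2, length_normalize=True):
    """Compute a clipped sequence-level policy-gradient surrogate.

    logp_new:      (B, T) per-token log-probs under the current policy (requires grad).
    logp_old:      (B, T) per-token log-probs under the behavior policy (detached).
    advantages:    (B,)   per-response group-relative advantages.
    response_mask: (B, T) float mask, 1.0 on response tokens, 0.0 elsewhere.
    """
    lengths = response_mask.sum(dim=-1).clamp(min=1.0)                   # |y_i|
    seq_log_ratio = ((logp_new - logp_old) * response_mask).sum(dim=-1)  # log pi_new(y|x) - log pi_old(y|x)

    if length_normalize:
        # GSPO-style ratio: a 1/|y_i| exponent, i.e. the geometric mean of token ratios.
        seq_log_ratio = seq_log_ratio / lengths

    ratio = torch.exp(seq_log_ratio)                                     # s_i(theta)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    surrogate = torch.minimum(ratio * advantages, clipped * advantages)  # PPO-style clipping
    return -surrogate.mean()                                             # maximize surrogate -> minimize negative


# Tiny smoke test with random tensors (shapes are illustrative).
if __name__ == "__main__":
    B, T = 4, 16
    logp_new = torch.randn(B, T, requires_grad=True)
    logp_old = torch.randn(B, T)
    advantages = torch.randn(B)
    mask = torch.ones(B, T)
    loss = sequence_surrogate_loss(logp_new, logp_old, advantages, mask)
    loss.backward()
    print(float(loss))
```

Whether and how to remove this length dependence is exactly what the paper analyzes; the toggle above only marks the term in question, and readers should consult the paper and repository for the actual LUSPO loss.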