Paper page - The Art of Scaling Reinforcement Learning Compute for LLMs
Comment from librarian-bot:

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* [Scaling Behaviors of LLM Reinforcement Learning Post-Training: An Empirical Study in Mathematical Reasoning](https://huggingface.co/papers/2509.25300) (2025)
* [Reinforce-Ada: An Adaptive Sampling Framework for Reinforce-Style LLM Training](https://huggingface.co/papers/2510.04996) (2025)
* [BroRL: Scaling Reinforcement Learning via Broadened Exploration](https://huggingface.co/papers/2510.01180) (2025)
* [Rediscovering Entropy Regularization: Adaptive Coefficient Unlocks Its Potential for LLM Reinforcement Learning](https://huggingface.co/papers/2510.10959) (2025)
* [Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks](https://huggingface.co/papers/2508.18672) (2025)
* [Prompt Curriculum Learning for Efficient LLM Post-Training](https://huggingface.co/papers/2510.01135) (2025)
* [QeRL: Beyond Efficiency -- Quantization-enhanced Reinforcement Learning for LLMs](https://huggingface.co/papers/2510.11696) (2025)

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out [this Space](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers).

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend
\n","updatedAt":"2025-10-17T01:38:04.345Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.7303285002708435},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2510.13786","authors":[{"_id":"68f0e30636f8b025381e1b7e","user":{"_id":"65c320631bcc84e918fd4294","avatarUrl":"/avatars/a7be2c609662a4ce054163fe6e8ce4ca.svg","isPro":false,"fullname":"Devvrit","user":"devvrit","type":"user"},"name":"Devvrit Khatri","status":"claimed_verified","statusLastChangedAt":"2025-10-21T14:06:14.269Z","hidden":false},{"_id":"68f0e30636f8b025381e1b7f","name":"Lovish Madaan","hidden":false},{"_id":"68f0e30636f8b025381e1b80","user":{"_id":"66db38c38d2688295f731283","avatarUrl":"/avatars/a1f832d354a1f5d5c11593bf276b47a6.svg","isPro":false,"fullname":"Rishabh Tiwari","user":"rishabh2k1","type":"user"},"name":"Rishabh Tiwari","status":"claimed_verified","statusLastChangedAt":"2025-10-17T04:12:46.664Z","hidden":false},{"_id":"68f0e30636f8b025381e1b81","user":{"_id":"5fc9113f1a91b8cacef77502","avatarUrl":"/avatars/fa57426420ca3b874117f9424abd0066.svg","isPro":false,"fullname":"Rachit Bansal","user":"RacBan","type":"user"},"name":"Rachit Bansal","status":"claimed_verified","statusLastChangedAt":"2025-10-17T04:12:49.042Z","hidden":false},{"_id":"68f0e30636f8b025381e1b82","name":"Sai Surya Duvvuri","hidden":false},{"_id":"68f0e30636f8b025381e1b83","name":"Manzil Zaheer","hidden":false},{"_id":"68f0e30636f8b025381e1b84","name":"Inderjit S. Dhillon","hidden":false},{"_id":"68f0e30636f8b025381e1b85","name":"David Brandfonbrener","hidden":false},{"_id":"68f0e30636f8b025381e1b86","name":"Rishabh Agarwal","hidden":false}],"mediaUrls":["https://cdn-uploads.huggingface.co/production/uploads/5f1158120c833276f61f1a84/1GwLDsIbjzAmKdUuh1soU.png"],"publishedAt":"2025-10-15T17:43:03.000Z","submittedOnDailyAt":"2025-10-16T10:51:29.724Z","title":"The Art of Scaling Reinforcement Learning Compute for LLMs","submittedOnDailyBy":{"_id":"5f1158120c833276f61f1a84","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1608042047613-5f1158120c833276f61f1a84.jpeg","isPro":false,"fullname":"Niels Rogge","user":"nielsr","type":"user"},"summary":"Reinforcement learning (RL) has become central to training large language\nmodels (LLMs), yet the field lacks predictive scaling methodologies comparable\nto those established for pre-training. Despite rapidly rising compute budgets,\nthere is no principled understanding of how to evaluate algorithmic\nimprovements for scaling RL compute. We present the first large-scale\nsystematic study, amounting to more than 400,000 GPU-hours, that defines a\nprincipled framework for analyzing and predicting RL scaling in LLMs. We fit\nsigmoidal compute-performance curves for RL training and ablate a wide range of\ncommon design choices to analyze their effects on asymptotic performance and\ncompute efficiency. 
We observe: (1) Not all recipes yield similar asymptotic\nperformance, (2) Details such as loss aggregation, normalization, curriculum,\nand off-policy algorithm primarily modulate compute efficiency without\nmaterially shifting the asymptote, and (3) Stable, scalable recipes follow\npredictable scaling trajectories, enabling extrapolation from smaller-scale\nruns. Combining these insights, we propose a best-practice recipe, ScaleRL, and\ndemonstrate its effectiveness by successfully scaling and predicting validation\nperformance on a single RL run scaled up to 100,000 GPU-hours. Our work\nprovides both a scientific framework for analyzing scaling in RL and a\npractical recipe that brings RL training closer to the predictability long\nachieved in pre-training.","upvotes":32,"discussionId":"68f0e30736f8b025381e1b87","ai_summary":"A systematic study defines a framework for analyzing and predicting reinforcement learning scaling in large language models, identifying key design choices that affect compute efficiency and proposing a best-practice recipe.","ai_keywords":["reinforcement learning","large language models","sigmoidal compute-performance curves","loss aggregation","normalization","curriculum","off-policy algorithm","asymptotic performance","compute efficiency","scaling trajectories","ScaleRL"],"organization":{"_id":"5e63d8713071d5be688861b8","name":"facebook","fullname":"AI at Meta","avatar":"https://cdn-uploads.huggingface.co/production/uploads/1592839207516-noauth.png"}},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"5f0c746619cb630495b814fd","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1594651707950-noauth.jpeg","isPro":true,"fullname":"Lewis Tunstall","user":"lewtun","type":"user"},{"_id":"684d57f26e04c265777ead3f","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/cuOj-bQqukSZreXgUJlfm.png","isPro":false,"fullname":"Joakim Lee","user":"Reinforcement4All","type":"user"},{"_id":"65cbfa6c968742be942e6cba","avatarUrl":"/avatars/1a6cc0983edc28fa92178d3abc283ba1.svg","isPro":false,"fullname":"Feng","user":"Yunzhen","type":"user"},{"_id":"6527151d5606f146974d60d8","avatarUrl":"/avatars/00ac8ab005ceadae866dea5471f6aab9.svg","isPro":false,"fullname":"Nilesh Gupta","user":"quicktensor","type":"user"},{"_id":"65ea6558b985ae912c57b294","avatarUrl":"/avatars/54eeba9ce8cb5bf35edfa8fad76a813d.svg","isPro":false,"fullname":"Denis Akhiyarov","user":"dtanow","type":"user"},{"_id":"6670e31a459aa2d3b96037f1","avatarUrl":"/avatars/7b957227f72c23ac99721100d98ec2a7.svg","isPro":false,"fullname":"Rafael Coelho de Souza Krzonkalla","user":"krzonkalla","type":"user"},{"_id":"62ffa3f8311cad266f9af236","avatarUrl":"/avatars/203dac40bc546ee25a01d8715a4b3049.svg","isPro":false,"fullname":"Zhenwen Liang","user":"invokerliang","type":"user"},{"_id":"64903f017b630c141867877f","avatarUrl":"/avatars/3c7b248baed446ecc2b6adc2c444320d.svg","isPro":false,"fullname":"Umberto Cappellazzo","user":"hisoka94","type":"user"},{"_id":"6752f04c4d9c0482807e05ab","avatarUrl":"/avatars/9d66693a5540d8b07bd66162de2f369f.svg","isPro":true,"fullname":"Bela Stoyan","user":"0xbe7a","type":"user"},{"_id":"65c320631bcc84e918fd4294","avatarUrl":"/avatars/a7be2c609662a4ce054163fe6e8ce4ca.svg","isPro":false,"fullname":"Devvrit","user":"devvrit","type":"user"},{"_id":"63c5c3d76d132b995ff01948","avatarUrl":"/avatars/3253f04e00b13ec280490a43a274b4de.svg","isPro":false,"fullname":"Sai Surya 
Duvvuri","user":"dvsaisurya","type":"user"},{"_id":"620783f24e28382272337ba4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/620783f24e28382272337ba4/zkUveQPNiDfYjgGhuFErj.jpeg","isPro":false,"fullname":"GuoLiangTang","user":"Tommy930","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0,"organization":{"_id":"5e63d8713071d5be688861b8","name":"facebook","fullname":"AI at Meta","avatar":"https://cdn-uploads.huggingface.co/production/uploads/1592839207516-noauth.png"}}">
AI-generated summary: A systematic study defines a framework for analyzing and predicting reinforcement learning scaling in large language models, identifying key design choices that affect compute efficiency and proposing a best-practice recipe.

Abstract
Reinforcement learning (RL) has become central to training large language
models (LLMs), yet the field lacks predictive scaling methodologies comparable
to those established for pre-training. Despite rapidly rising compute budgets,
there is no principled understanding of how to evaluate algorithmic
improvements for scaling RL compute. We present the first large-scale
systematic study, amounting to more than 400,000 GPU-hours, that defines a
principled framework for analyzing and predicting RL scaling in LLMs. We fit
sigmoidal compute-performance curves for RL training and ablate a wide range of
common design choices to analyze their effects on asymptotic performance and
compute efficiency. We observe: (1) Not all recipes yield similar asymptotic
performance, (2) Details such as loss aggregation, normalization, curriculum,
and off-policy algorithm primarily modulate compute efficiency without
materially shifting the asymptote, and (3) Stable, scalable recipes follow
predictable scaling trajectories, enabling extrapolation from smaller-scale
runs. Combining these insights, we propose a best-practice recipe, ScaleRL, and
demonstrate its effectiveness by successfully scaling and predicting validation
performance on a single RL run scaled up to 100,000 GPU-hours. Our work
provides both a scientific framework for analyzing scaling in RL and a
practical recipe that brings RL training closer to the predictability long
achieved in pre-training.
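To make the curve-fitting and extrapolation procedure described in the abstract concrete, the sketch below fits a saturating, sigmoid-like compute-performance curve to toy (compute, performance) measurements and extrapolates it to a larger budget. This is a minimal illustration under assumed choices, not the paper's actual parameterization, data, or recipe: the `sigmoid_in_log_compute` form, all parameter names, and the numbers are placeholders for demonstration only.

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoid_in_log_compute(compute, asymptote, c_mid, slope):
    # Saturating curve: approaches `asymptote` as compute grows and
    # reaches half of it at compute = c_mid. (Assumed form, not the
    # paper's exact parameterization.)
    return asymptote / (1.0 + (c_mid / compute) ** slope)

# Toy measurements from a hypothetical smaller-scale run
# (training compute in GPU-hours, validation pass rate).
compute = np.array([100.0, 300.0, 1_000.0, 3_000.0, 10_000.0])
performance = np.array([0.22, 0.31, 0.42, 0.51, 0.57])

# Fit the curve; bounds keep the asymptote in [0, 1] and the other
# parameters positive.
params, _ = curve_fit(
    sigmoid_in_log_compute,
    compute,
    performance,
    p0=[0.6, 1_000.0, 1.0],
    bounds=([0.0, 1.0, 0.1], [1.0, 1e6, 10.0]),
)
asymptote, c_mid, slope = params
print(f"Estimated asymptotic performance: {asymptote:.3f}")

# Extrapolate the fitted curve to a much larger compute budget, in the
# spirit of predicting a 100,000 GPU-hour run from smaller-scale runs.
predicted = sigmoid_in_log_compute(100_000.0, *params)
print(f"Predicted performance at 100,000 GPU-hours: {predicted:.3f}")
```

In this framing, design choices that change the fitted asymptote alter the ceiling a recipe can reach, while choices that only shift `c_mid` or `slope` change how quickly that ceiling is approached, i.e. compute efficiency.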