arxiv:2601.10201

PRL: Process Reward Learning Improves LLMs' Reasoning Ability and Broadens the Reasoning Boundary

Published on Jan 15 · Submitted by Jiarui Yao on Jan 16
Authors: Jiarui Yao, Ruida Wang, Tong Zhang
GitHub: https://github.com/MaxwellJryao/Process-Reward-Learning

Abstract

Improving the reasoning abilities of Large Language Models (LLMs) has been a topic of sustained recent interest. However, most existing work relies on outcome rewards at the trajectory level, which provide no fine-grained supervision over the reasoning process. Training frameworks that do incorporate process signals typically depend on costly additional machinery, such as MCTS or a separately trained reward model, which hurts training efficiency. Moreover, the intuition behind these process-signal designs often lacks rigorous theoretical support, leaving the optimization mechanism opaque. In this paper, we propose Process Reward Learning (PRL), which decomposes the entropy-regularized reinforcement learning objective into intermediate steps, yielding principled process rewards that can be assigned to the model step by step. Starting from this theoretical motivation, we derive the PRL formulation, which is essentially equivalent to the objective of reward maximization with a KL-divergence penalty between the policy model and a reference model. Crucially, PRL converts the outcome reward into process supervision signals that better guide exploration during RL optimization. Our experiments show that PRL not only improves average reasoning performance, measured by average@n, but also broadens the reasoning boundary by improving pass@n. Extensive experiments verify the effectiveness of PRL and show that it generalizes.
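The abstract evaluates reasoning with average@n and pass@n. As a quick reference, below is a minimal sketch of how these two metrics are commonly computed when exactly n solutions are sampled per problem; the data layout and function names are our own assumptions for illustration, not the paper's evaluation code.

```python
# Minimal sketch (not the paper's evaluation code): average@n and pass@n from
# per-sample correctness flags. Assumes `results[problem]` is a list of n
# booleans, one per sampled solution for that problem.
from typing import Dict, List


def average_at_n(results: Dict[str, List[bool]]) -> float:
    """Mean fraction of correct samples per problem (expected single-sample accuracy)."""
    per_problem = [sum(flags) / len(flags) for flags in results.values()]
    return sum(per_problem) / len(per_problem)


def pass_at_n(results: Dict[str, List[bool]]) -> float:
    """Fraction of problems solved by at least one of the n samples."""
    solved = [any(flags) for flags in results.values()]
    return sum(solved) / len(solved)


if __name__ == "__main__":
    # Toy example: two problems, n = 4 samples each.
    toy = {
        "p1": [True, False, False, True],   # 2/4 correct
        "p2": [False, False, False, True],  # 1/4 correct
    }
    print(f"average@4 = {average_at_n(toy):.3f}")  # (0.50 + 0.25) / 2 = 0.375
    print(f"pass@4    = {pass_at_n(toy):.3f}")     # both problems solved -> 1.0
```

Intuitively, average@n tracks how reliably the model solves a problem, while pass@n tracks the reasoning boundary, i.e., whether the model can solve it at all within n attempts.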

AI-generated summary

Process Reward Learning decomposes reinforcement learning objectives into intermediate steps to provide fine-grained supervision for improving large language model reasoning abilities.
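For intuition about how an outcome-level reward can be decomposed into step-level signals under a KL/entropy-regularized objective, here is a brief sketch using standard identities. The notation (prompt x, reasoning steps y_1, ..., y_T, coefficient beta, soft value V*) is ours for illustration; the paper's exact construction may differ.

```latex
% Sketch only (requires amsmath, amssymb); our notation, not necessarily the paper's derivation.
\begin{align}
  J(\pi) &= \mathbb{E}_{y \sim \pi(\cdot\mid x)}\big[r(x,y)\big]
            \;-\; \beta\,\mathrm{KL}\big(\pi(\cdot\mid x)\,\big\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big), \\
  \pi^{*}(y\mid x) &\propto \pi_{\mathrm{ref}}(y\mid x)\,\exp\!\big(r(x,y)/\beta\big), \\
  V^{*}(y_{\le t}) &:= \beta \log \mathbb{E}_{y_{>t}\,\sim\,\pi_{\mathrm{ref}}(\cdot\mid x,\,y_{\le t})}
                       \big[\exp\!\big(r(x,y)/\beta\big)\big], \qquad V^{*}(y_{\le T}) = r(x,y), \\
  \beta \log \frac{\pi^{*}(y_t \mid x, y_{<t})}{\pi_{\mathrm{ref}}(y_t \mid x, y_{<t})}
    &= V^{*}(y_{\le t}) - V^{*}(y_{<t}), \qquad
  r(x,y) = V^{*}(\varnothing) + \sum_{t=1}^{T}\big(V^{*}(y_{\le t}) - V^{*}(y_{<t})\big).
\end{align}
```

The last identity telescopes: the step-wise increments of V* sum to the outcome reward up to the constant V*(∅), which is one standard sense in which a trajectory-level outcome reward can be redistributed as per-step process rewards.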

Community


This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* PRPO: Aligning Process Reward with Outcome Reward in Policy Optimization (https://huggingface.co/papers/2601.07182) (2026)
* Reinforced Efficient Reasoning via Semantically Diverse Exploration (https://huggingface.co/papers/2601.05053) (2026)
* From Solving to Verifying: A Unified Objective for Robust Reasoning in LLMs (https://huggingface.co/papers/2511.15137) (2025)
* Enhancing Agentic RL with Progressive Reward Shaping and Value-based Sampling Policy Optimization (https://huggingface.co/papers/2512.07478) (2025)
* Coupled Variational Reinforcement Learning for Language Model General Reasoning (https://huggingface.co/papers/2512.12576) (2025)
* Rectifying LLM Thought from Lens of Optimization (https://huggingface.co/papers/2512.01925) (2025)
* Step Potential Advantage Estimation: Harnessing Intermediate Confidence and Correctness for Efficient Mathematical Reasoning (https://huggingface.co/papers/2601.03823) (2026)

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend


Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2601.10201 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2601.10201 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2601.10201 in a Space README.md to link it from this page.

Collections including this paper 1