arxiv:2505.22653

The Climb Carves Wisdom Deeper Than the Summit: On the Noisy Rewards in Learning to Reason

Published on May 28, 2025
Submitted by AngLv on May 30, 2025
Authors: Ang Lv, Ruobing Xie, Xingwu Sun, Zhanhui Kang, Rui Yan
Abstract

LLMs exhibit robustness to reward noise during post-training and achieve high performance using reasoning pattern rewards (RPR) in conjunction with noisy reward models.

AI-generated summary

Recent studies on post-training large language models (LLMs) for reasoning through reinforcement learning (RL) typically focus on tasks that can be accurately verified and rewarded, such as solving math problems. In contrast, our research investigates the impact of reward noise, a more practical consideration for real-world scenarios involving the post-training of LLMs using reward models. We found that LLMs demonstrate strong robustness to substantial reward noise. For example, manually flipping 40% of the reward function's outputs in math tasks still allows a Qwen-2.5-7B model to achieve rapid convergence, improving its performance on math tasks from 5% to 72%, compared to the 75% accuracy achieved by a model trained with noiseless rewards. Surprisingly, by only rewarding the appearance of key reasoning phrases (namely reasoning pattern reward, RPR), such as "first, I need to", without verifying the correctness of answers, the model achieved peak downstream performance (over 70% accuracy for Qwen-2.5-7B) comparable to models trained with strict correctness verification and accurate rewards. Recognizing the importance of the reasoning process over the final results, we combined RPR with noisy reward models. RPR helped calibrate the noisy reward models, mitigating potential false negatives and enhancing the LLM's performance on open-ended tasks. These findings suggest the importance of improving models' foundational abilities during the pre-training phase while providing insights for advancing post-training techniques. Our code and scripts are available at https://github.com/trestad/Noisy-Rewards-in-Learning-to-Reason.
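To make the 40% reward-flip setting concrete, here is a minimal sketch of how such noise can be injected into a binary correctness reward. The answer-extraction logic and function names are illustrative placeholders, not the authors' implementation (see the linked repository for that).

```python
import random
import re

def extract_final_answer(response: str) -> str:
    """Naive final-answer extraction: take the last number in the response.
    (Placeholder only; real math verifiers are more careful.)"""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    return numbers[-1] if numbers else ""

def noisy_reward(response: str, ground_truth: str, flip_prob: float = 0.4) -> float:
    """Binary correctness reward whose output is inverted with
    probability `flip_prob`, mimicking the reward-flip experiment."""
    correct = extract_final_answer(response) == ground_truth.strip()
    if random.random() < flip_prob:
        correct = not correct  # inject reward noise by flipping the label
    return 1.0 if correct else 0.0
```

In the paper's setting, even flip_prob = 0.4 still allowed Qwen-2.5-7B to reach 72% on math tasks, versus 75% with noiseless rewards.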

Community

Paper submitter
  1. We found that LLMs demonstrate strong robustness to substantial reward noise. For example, manually flipping 40% of the reward function's outputs in math tasks still allows a Qwen-2.5-7B model to achieve rapid convergence, improving its performance on math tasks from 5% to 72%, compared to the 75% accuracy achieved by a model trained with noiseless rewards. We hypothesize that outputs leading to incorrect answers may still contain valuable information—specifically, useful reasoning patterns.

  2. To test this hypothesis, we reward only the appearance of key reasoning phrases (namely reasoning pattern reward, RPR), such as "first, I need to", without verifying the correctness of answers (see the sketch after this list). The model achieved peak downstream performance (over 70% accuracy for Qwen-2.5-7B), comparable to models trained with strict correctness verification and accurate rewards. This indicates that the gains from reinforcement learning may come primarily from teaching the model to adopt appropriate reasoning styles; the fundamental problem-solving abilities are largely acquired during pretraining.

  3. Recognizing the importance of the reasoning process, we combined RPR with noisy reward models. RPR helped calibrate the noisy reward models by mitigating potential false-negative rewards. With calibrated reward models, LLMs' performance on open-ended NLP tasks is enhanced, and smaller models are able to successfully acquire reasoning capabilities through RL.
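The sketch below illustrates points 2 and 3 under simplifying assumptions: a phrase-matching RPR and a rule for blending it with a noisy reward model's score to soften likely false negatives. The phrase list, weighting, and blending rule are hypothetical choices for illustration; the authors' exact RPR and calibration procedure are in the linked repository.

```python
# Key reasoning phrases to reward (illustrative list, not the paper's).
REASONING_PHRASES = [
    "first, i need to",
    "let me break this down",
    "next,",
    "therefore,",
]

def reasoning_pattern_reward(response: str, weight: float = 0.25) -> float:
    """Reasoning pattern reward (RPR): score the presence of key reasoning
    phrases without checking whether the final answer is correct."""
    text = response.lower()
    hits = sum(phrase in text for phrase in REASONING_PHRASES)
    return min(1.0, weight * hits)

def calibrated_reward(response: str, rm_score: float, threshold: float = 0.5) -> float:
    """Use RPR to soften likely false negatives from a noisy reward model:
    if the RM rejects a response that nevertheless shows strong reasoning
    patterns, fall back to the reasoning-pattern signal."""
    rpr = reasoning_pattern_reward(response)
    if rm_score < threshold:
        # Potential false negative: keep whichever signal is higher.
        return max(rm_score, rpr)
    return rm_score
```

The key point is that answer correctness never enters reasoning_pattern_reward; only the presence of reasoning patterns is scored, and the reward-model score is adjusted only when it looks like a false negative.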

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space.

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend


Models citing this paper 1

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2505.22653 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2505.22653 in a Space README.md to link it from this page.

Collections including this paper 2