Paper page - Self-rewarding correction for mathematical reasoning
arxiv:2502.19613

Self-rewarding correction for mathematical reasoning

Published on Feb 26, 2025 · Submitted by Wei Xiong on Feb 28, 2025 · #1 Paper of the day

Authors: Wei Xiong, Hanning Zhang, Chenlu Ye, Lichang Chen, Nan Jiang, Tong Zhang

Abstract

We study self-rewarding reasoning large language models (LLMs), which can simultaneously generate step-by-step reasoning and evaluate the correctness of their outputs during inference, without external feedback. This integrated approach allows a single model to independently guide its reasoning process, offering computational advantages for model deployment. We particularly focus on the representative task of self-correction, where models autonomously detect errors in their responses, revise outputs, and decide when to terminate iterative refinement loops. To enable this, we propose a two-stage algorithmic framework for constructing self-rewarding reasoning models using only self-generated data. In the first stage, we employ sequential rejection sampling to synthesize long chain-of-thought trajectories that incorporate both self-rewarding and self-correction mechanisms. Fine-tuning models on these curated data allows them to learn the patterns of self-rewarding and self-correction. In the second stage, we further enhance the models' ability to assess response accuracy and refine outputs through reinforcement learning with rule-based signals. Experiments with Llama-3 and Qwen-2.5 demonstrate that our approach surpasses intrinsic self-correction capabilities and achieves performance comparable to systems that rely on external reward models.

AI-generated summary

Self-rewarding reasoning large language models independently generate and correct their outputs during inference using a two-stage algorithmic framework, enhancing performance without external feedback.
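To make the first stage concrete, below is a minimal illustrative sketch of how sequential rejection sampling over self-generated attempts, self-evaluations, and revisions could be organized. The helper names (`sample_k`, `check_answer`) and prompt templates are assumptions for illustration, not the authors' released code (see the linked GitHub repository for the actual implementation).

```python
import random
from typing import Callable, List, Optional

def build_trajectory(
    question: str,
    gold_answer: str,
    sample_k: Callable[[str, int], List[str]],  # hypothetical sampler: prompt -> k completions
    check_answer: Callable[[str, str], bool],   # rule-based checker against the gold answer
    k: int = 8,
) -> Optional[str]:
    """Sequentially rejection-sample one self-rewarding / self-correction trajectory.

    Stage A: keep a first attempt that is actually wrong (so a correction is needed).
    Stage B: keep a self-evaluation that correctly labels that attempt as wrong.
    Stage C: keep a revision that reaches the gold answer.
    If any stage fails, the question is discarded.
    """
    # A) sample first attempts, keep an incorrect one
    attempts = sample_k(f"Question: {question}\nAnswer step by step.", k)
    wrong = [a for a in attempts if not check_answer(a, gold_answer)]
    if not wrong:
        return None
    attempt = random.choice(wrong)

    # B) sample self-evaluations, keep one that (correctly) judges the attempt incorrect
    evals = sample_k(
        f"Question: {question}\nAttempt: {attempt}\nIs the attempt correct? Verify.", k
    )
    critical = [e for e in evals if "incorrect" in e.lower()]
    if not critical:
        return None
    evaluation = random.choice(critical)

    # C) sample revisions conditioned on the critique, keep a correct one
    revisions = sample_k(
        f"Question: {question}\nAttempt: {attempt}\n"
        f"Verification: {evaluation}\nRevise the solution.", k
    )
    fixed = [r for r in revisions if check_answer(r, gold_answer)]
    if not fixed:
        return None

    # stitch the surviving pieces into one long chain-of-thought training example
    return "\n\n".join([attempt, evaluation, random.choice(fixed)])
```

In practice one would presumably also keep trajectories whose first attempt is already correct and is verified as such, so that fine-tuning also teaches the model when to terminate the refinement loop.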

Community

The general idea is to unify the generative reward model and reasoning model into a single LLM. This integrated approach allows a single model to independently guide its reasoning process, offering computational advantages for model deployment.
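For intuition, here is a minimal sketch of what this unified inference-time loop could look like, assuming a single `generate` callable plays both roles (solver and verifier). The prompt wording and the "correct"/"incorrect" verdict parsing are illustrative assumptions, not the paper's exact format.

```python
from typing import Callable

def self_rewarding_answer(
    question: str,
    generate: Callable[[str], str],  # one LLM used for both solving and self-evaluation
    max_rounds: int = 3,
) -> str:
    """Generate, self-evaluate, and self-correct with a single model, without an external reward model."""
    answer = generate(f"Solve step by step:\n{question}")
    for _ in range(max_rounds):
        # the same model judges its own output (self-rewarding step)
        verdict = generate(
            f"Problem:\n{question}\nProposed solution:\n{answer}\n"
            "Check the solution and reply with 'correct' or 'incorrect'."
        )
        if "incorrect" not in verdict.lower():
            break  # model judges its answer correct: terminate the refinement loop
        # otherwise revise, conditioning on the self-critique (self-correction step)
        answer = generate(
            f"Problem:\n{question}\nPrevious solution:\n{answer}\n"
            f"Critique:\n{verdict}\nWrite a corrected solution."
        )
    return answer
```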

To enable this, we first use sequential rejection sampling to synthesize long chain-of-thought trajectories that incorporate both self-rewarding and self-correction mechanisms. Fine-tuning models on these curated data allows them to learn the patterns of self-rewarding and self-correction. In the second stage, we further enhance the models' ability to assess response accuracy and refine outputs through reinforcement learning with rule-based signals.
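A minimal sketch of what such a rule-based signal could look like in the second stage, assuming the final answer and the model's own verdict can be extracted from the trajectory with simple string rules; the exact reward values and the self-evaluation bonus below are illustrative assumptions, not the paper's specification.

```python
def rule_based_reward(final_answer: str, gold_answer: str, self_verdict: str) -> float:
    """Toy rule-based reward: correctness of the final answer, plus a small bonus
    when the model's own verdict agrees with the ground-truth check."""
    answer_correct = final_answer.strip() == gold_answer.strip()
    said_correct = "incorrect" not in self_verdict.lower()

    reward = 1.0 if answer_correct else -1.0  # main signal: final-answer correctness
    if said_correct == answer_correct:
        reward += 0.5                          # bonus: self-evaluation matched reality
    return reward
```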

Paper submitter

[Image: 461740638119_.pic.jpg]

Paper submitter

[Image: 451740638069_.pic.jpg]

Great work! We made a deep dive video for this paper: https://www.youtube.com/watch?v=4U3oUIWyVTI. Happy learning together!
[Image: TitleImage.png]

Librarian Bot

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search (https://huggingface.co/papers/2502.02508) (2025)
* Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling (https://huggingface.co/papers/2501.11651) (2025)
* S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning (https://huggingface.co/papers/2502.12853) (2025)
* ARIES: Stimulating Self-Refinement of Large Language Models by Iterative Preference Optimization (https://huggingface.co/papers/2502.05605) (2025)
* AURORA: Automated Training Framework of Universal Process Reward Models via Ensemble Prompting and Reverse Verification (https://huggingface.co/papers/2502.11520) (2025)
* Self-Consistency of the Internal Reward Models Improves Self-Rewarding Language Models (https://huggingface.co/papers/2502.08922) (2025)
* rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking (https://huggingface.co/papers/2501.04519) (2025)

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`

Paper submitter

Twitter post:
https://x.com/weixiong_1/status/1895498017446171019

Models citing this paper 0

No model linking this paper


Datasets citing this paper 2

Spaces citing this paper 0

No Space linking this paper


Collections including this paper 12