Self-rewarding correction for mathematical reasoning
\n","updatedAt":"2025-02-28T08:06:12.776Z","author":{"_id":"643e59806db6ba8c5ee123f3","avatarUrl":"/avatars/4052f2a250107f43b3634c3ee3cc30a1.svg","fullname":"Wei Xiong","name":"weqweasdas","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":20,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.330601304769516},"editors":["weqweasdas"],"editorAvatarUrls":["/avatars/4052f2a250107f43b3634c3ee3cc30a1.svg"],"reactions":[],"isReport":false}},{"id":"67c20b6f7be4b9efaf411770","author":{"_id":"67818b1fa6b75c5dc3cf430c","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/67818b1fa6b75c5dc3cf430c/5aA0gP8ZvIkMndNA7CqqE.png","fullname":"Ribbit Ribbit","name":"ribbitribbit365","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":1,"isUserFollowing":false},"createdAt":"2025-02-28T19:15:59.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"Great work! We made a deep dive video for this paper: https://www.youtube.com/watch?v=4U3oUIWyVTI. Happy learning together!\n\n","html":"
\n","updatedAt":"2025-02-28T19:15:59.376Z","author":{"_id":"67818b1fa6b75c5dc3cf430c","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/67818b1fa6b75c5dc3cf430c/5aA0gP8ZvIkMndNA7CqqE.png","fullname":"Ribbit Ribbit","name":"ribbitribbit365","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":1,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.5195221900939941},"editors":["ribbitribbit365"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/67818b1fa6b75c5dc3cf430c/5aA0gP8ZvIkMndNA7CqqE.png"],"reactions":[],"isReport":false}},{"id":"67c2646a9e95defcfb24cb0a","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":317,"isUserFollowing":false},"createdAt":"2025-03-01T01:35:38.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"This is an automated message from the [Librarian Bot](https://huggingface.co/librarian-bots). I found the following papers similar to this paper. \n\nThe following papers were recommended by the Semantic Scholar API \n\n* [Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search](https://huggingface.co/papers/2502.02508) (2025)\n* [Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling](https://huggingface.co/papers/2501.11651) (2025)\n* [S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning](https://huggingface.co/papers/2502.12853) (2025)\n* [ARIES: Stimulating Self-Refinement of Large Language Models by Iterative Preference Optimization](https://huggingface.co/papers/2502.05605) (2025)\n* [AURORA:Automated Training Framework of Universal Process Reward Models via Ensemble Prompting and Reverse Verification](https://huggingface.co/papers/2502.11520) (2025)\n* [Self-Consistency of the Internal Reward Models Improves Self-Rewarding Language Models](https://huggingface.co/papers/2502.08922) (2025)\n* [rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking](https://huggingface.co/papers/2501.04519) (2025)\n\n\n Please give a thumbs up to this comment if you found it helpful!\n\n If you want recommendations for any Paper on Hugging Face checkout [this](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers) Space\n\n You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`","html":"
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
\n
The following papers were recommended by the Semantic Scholar API
\n","updatedAt":"2025-03-01T04:46:38.823Z","author":{"_id":"643e59806db6ba8c5ee123f3","avatarUrl":"/avatars/4052f2a250107f43b3634c3ee3cc30a1.svg","fullname":"Wei Xiong","name":"weqweasdas","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":20,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.3013138175010681},"editors":["weqweasdas"],"editorAvatarUrls":["/avatars/4052f2a250107f43b3634c3ee3cc30a1.svg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2502.19613","authors":[{"_id":"67c12987505a88e4a185e0d7","name":"Wei Xiong","hidden":false},{"_id":"67c12987505a88e4a185e0d8","user":{"_id":"6470e0f1cfd57849519033a5","avatarUrl":"/avatars/7ffefee3e36a4e37b9f4510bc6b689d1.svg","isPro":false,"fullname":"Hanning Zhang","user":"HanningZhang","type":"user"},"name":"Hanning Zhang","status":"admin_assigned","statusLastChangedAt":"2025-02-28T12:22:33.128Z","hidden":false},{"_id":"67c12987505a88e4a185e0d9","user":{"_id":"65eec5c1d7d63c2ed0615421","avatarUrl":"/avatars/8c32f5e7d4b1940088bdec73c0b86fab.svg","isPro":false,"fullname":"Chenlu Ye","user":"Chenlu123","type":"user"},"name":"Chenlu Ye","status":"admin_assigned","statusLastChangedAt":"2025-02-28T12:22:38.981Z","hidden":false},{"_id":"67c12987505a88e4a185e0da","user":{"_id":"62323bb408bcea92917e42ee","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/62323bb408bcea92917e42ee/2vHxkv-oSROtLteOnqa8P.jpeg","isPro":false,"fullname":"Lichang Chen","user":"Lichang-Chen","type":"user"},"name":"Lichang Chen","status":"claimed_verified","statusLastChangedAt":"2025-02-28T12:14:29.479Z","hidden":false},{"_id":"67c12987505a88e4a185e0db","user":{"_id":"64b8922ca1827cc8d04ae919","avatarUrl":"/avatars/0aaa83e3d09a82434e1d6af724aaa485.svg","isPro":false,"fullname":"Nan Jiang","user":"nanjiang","type":"user"},"name":"Nan Jiang","status":"admin_assigned","statusLastChangedAt":"2025-02-28T12:23:02.992Z","hidden":false},{"_id":"67c12987505a88e4a185e0dc","name":"Tong Zhang","hidden":false}],"publishedAt":"2025-02-26T23:01:16.000Z","submittedOnDailyAt":"2025-02-28T00:45:54.222Z","title":"Self-rewarding correction for mathematical reasoning","submittedOnDailyBy":{"_id":"643e59806db6ba8c5ee123f3","avatarUrl":"/avatars/4052f2a250107f43b3634c3ee3cc30a1.svg","isPro":false,"fullname":"Wei Xiong","user":"weqweasdas","type":"user"},"summary":"We study self-rewarding reasoning large language models (LLMs), which can\nsimultaneously generate step-by-step reasoning and evaluate the correctness of\ntheir outputs during the inference time-without external feedback. This\nintegrated approach allows a single model to independently guide its reasoning\nprocess, offering computational advantages for model deployment. We\nparticularly focus on the representative task of self-correction, where models\nautonomously detect errors in their responses, revise outputs, and decide when\nto terminate iterative refinement loops. To enable this, we propose a\ntwo-staged algorithmic framework for constructing self-rewarding reasoning\nmodels using only self-generated data. In the first stage, we employ sequential\nrejection sampling to synthesize long chain-of-thought trajectories that\nincorporate both self-rewarding and self-correction mechanisms. Fine-tuning\nmodels on these curated data allows them to learn the patterns of\nself-rewarding and self-correction. 
In the second stage, we further enhance the\nmodels' ability to assess response accuracy and refine outputs through\nreinforcement learning with rule-based signals. Experiments with Llama-3 and\nQwen-2.5 demonstrate that our approach surpasses intrinsic self-correction\ncapabilities and achieves performance comparable to systems that rely on\nexternal reward models.","upvotes":82,"discussionId":"67c12989505a88e4a185e115","githubRepo":"https://github.com/rlhflow/self-rewarding-reasoning-llm","githubRepoAddedBy":"auto","ai_summary":"Self-rewarding reasoning large language models independently generate and correct their outputs during inference using a two-stage algorithmic framework, enhancing performance without external feedback.","ai_keywords":["self-rewarding reasoning","large language models","chain-of-thought trajectories","sequential rejection sampling","intrinsic self-correction","reinforcement learning","rule-based signals"],"githubStars":231},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"643e59806db6ba8c5ee123f3","avatarUrl":"/avatars/4052f2a250107f43b3634c3ee3cc30a1.svg","isPro":false,"fullname":"Wei Xiong","user":"weqweasdas","type":"user"},{"_id":"666204677f4d9731a724aa0f","avatarUrl":"/avatars/73c4161cc6c2665b0784e4ff924f014e.svg","isPro":false,"fullname":"dasqw","user":"1231czx","type":"user"},{"_id":"65d41143406bdae630826d87","avatarUrl":"/avatars/ed8259ed64a13b1ed43c004b741e1401.svg","isPro":false,"fullname":"Zihao Li","user":"Violet24K","type":"user"},{"_id":"6667dbf4d837c23f9ce44648","avatarUrl":"/avatars/511e62e50d55b1921d4ae77dfb435d31.svg","isPro":false,"fullname":"Chuxuan Hu","user":"chuxuan","type":"user"},{"_id":"67c12c07e5c070664e195b49","avatarUrl":"/avatars/54e1d99b172dbda0522dc76cc845efca.svg","isPro":false,"fullname":"fsfggmet","user":"fsfggmet","type":"user"},{"_id":"63a3ff69f91ad3ea5703841d","avatarUrl":"/avatars/69227c4bce01d33747c1377b6f9672db.svg","isPro":false,"fullname":"Hanze Dong","user":"hendrydong","type":"user"},{"_id":"646def60df618b303b419323","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/646def60df618b303b419323/JLJGYen4-5M8ivsLsSk0w.jpeg","isPro":false,"fullname":"Lei Wang","user":"demolei","type":"user"},{"_id":"64ee4cb1d5f2899811155639","avatarUrl":"/avatars/ee6b65614db00f22cbe21f8cae69a15a.svg","isPro":false,"fullname":"qiusi zhan","user":"qiusi1943","type":"user"},{"_id":"64d45451c34a346181b130dd","avatarUrl":"/avatars/9bb8205b889337df5d321539c9b5d69d.svg","isPro":true,"fullname":"Rui Yang","user":"Ray2333","type":"user"},{"_id":"6726688788f2f9df271830ab","avatarUrl":"/avatars/1b7db87f241b0caa3e5d08298d5ff0c0.svg","isPro":false,"fullname":"Yuxing Liu","user":"yuxing6","type":"user"},{"_id":"66f8689725464a7989b75845","avatarUrl":"/avatars/43a61a528c5779103eaf5687ba44ee14.svg","isPro":false,"fullname":"Jiarui Yao","user":"FlippyDora","type":"user"},{"_id":"64cb1ad1667f4f80852f6050","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/64cb1ad1667f4f80852f6050/iOn5q_RyyBS99tObrO5Tc.png","isPro":false,"fullname":"Rui Pan","user":"research4pan","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":1}">
AI-generated summary

Self-rewarding reasoning large language models independently generate and correct their outputs during inference using a two-stage algorithmic framework, enhancing performance without external feedback.

Abstract
We study self-rewarding reasoning large language models (LLMs), which can
simultaneously generate step-by-step reasoning and evaluate the correctness of
their outputs during inference, without external feedback. This
integrated approach allows a single model to independently guide its reasoning
process, offering computational advantages for model deployment. We
particularly focus on the representative task of self-correction, where models
autonomously detect errors in their responses, revise outputs, and decide when
to terminate iterative refinement loops. To enable this, we propose a
two-stage algorithmic framework for constructing self-rewarding reasoning
models using only self-generated data. In the first stage, we employ sequential
rejection sampling to synthesize long chain-of-thought trajectories that
incorporate both self-rewarding and self-correction mechanisms. Fine-tuning
models on these curated data allows them to learn the patterns of
self-rewarding and self-correction. In the second stage, we further enhance the
models' ability to assess response accuracy and refine outputs through
reinforcement learning with rule-based signals. Experiments with Llama-3 and
Qwen-2.5 demonstrate that our approach surpasses intrinsic self-correction
capabilities and achieves performance comparable to systems that rely on
external reward models.
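
As a concrete illustration of the self-correction behavior described in the abstract, below is a minimal Python sketch of the inference-time loop: the model answers, scores its own attempt, revises if it judges the attempt wrong, and decides when to stop. The `generate` callable, the prompt wording, and the `[VERIFY] correct` / `[VERIFY] wrong` convention are illustrative assumptions, not the paper's exact templates.

```python
# Minimal sketch of the self-rewarding self-correction loop, assuming a
# chat-style `generate(prompt) -> str` callable (e.g. a local Llama-3 or
# Qwen-2.5 endpoint). Prompts and the "[VERIFY]" tag are illustrative only.

def self_rewarding_solve(problem: str, generate, max_rounds: int = 3) -> str:
    """Answer, self-evaluate, and revise until the model accepts its own
    attempt or the round budget runs out."""
    attempt = generate(f"Solve step by step:\n{problem}")
    for _ in range(max_rounds):
        # Self-rewarding step: the same model scores its own attempt.
        verdict = generate(
            f"Problem:\n{problem}\n\nProposed solution:\n{attempt}\n\n"
            "Is this solution correct? Reply with [VERIFY] correct or [VERIFY] wrong."
        )
        if "[VERIFY] correct" in verdict:
            break  # the model terminates the refinement loop itself
        # Self-correction step: revise conditioned on the earlier attempt.
        attempt = generate(
            f"Problem:\n{problem}\n\nPrevious attempt (judged wrong):\n{attempt}\n\n"
            "Identify the error and write a corrected step-by-step solution."
        )
    return attempt
```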
The general idea is to unify the generative reward model and the reasoning model into a single LLM. This integrated approach allows a single model to independently guide its reasoning process, offering computational advantages for model deployment.
To enable this, in the first stage we use sequential rejection sampling to synthesize long chain-of-thought trajectories that incorporate both self-rewarding and self-correction mechanisms; fine-tuning models on these curated data allows them to learn the patterns of self-rewarding and self-correction. In the second stage, we further enhance the models' ability to assess response accuracy and refine outputs through reinforcement learning with rule-based signals.
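
A minimal sketch, under assumptions, of how the second stage's rule-based signals could be computed: the final answer is checked against the ground truth, and the model's self-evaluation is rewarded only when it agrees with that check. The `extract_final_answer` helper, the `[VERIFY]` convention, and the specific reward values are hypothetical, not the paper's exact reward scheme.

```python
import re

def extract_final_answer(solution: str) -> str:
    """Hypothetical helper: take the content of the last \\boxed{...}, if any."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", solution)
    return matches[-1].strip() if matches else ""

def rule_based_reward(solution: str, self_verdict: str, gold_answer: str) -> float:
    """Score a trajectory with simple rules: answer correctness plus a bonus
    when the model's own verdict agrees with the ground-truth check."""
    answer_correct = extract_final_answer(solution) == gold_answer.strip()
    verdict_says_correct = "[VERIFY] correct" in self_verdict
    reward = 1.0 if answer_correct else -1.0   # correctness of the final answer
    if verdict_says_correct == answer_correct:
        reward += 0.5                          # self-evaluation matched reality
    return reward
```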