Paper page - S^2R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning
https://github.com/NineAbyss/S2R

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* [Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search](https://huggingface.co/papers/2502.02508) (2025)
* [Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling](https://huggingface.co/papers/2501.11651) (2025)
* [rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking](https://huggingface.co/papers/2501.04519) (2025)
* [SETS: Leveraging Self-Verification and Self-Correction for Improved Test-Time Scaling](https://huggingface.co/papers/2501.19306) (2025)
* [AURORA: Automated Training Framework of Universal Process Reward Models via Ensemble Prompting and Reverse Verification](https://huggingface.co/papers/2502.11520) (2025)
* [Process Reinforcement through Implicit Rewards](https://huggingface.co/papers/2502.01456) (2025)
* [HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs](https://huggingface.co/papers/2412.18925) (2024)

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out the recommend_similar_papers Space (https://huggingface.co/spaces/librarian-bots/recommend_similar_papers).

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend
\n","updatedAt":"2025-02-22T01:33:50.858Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":317,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.7597923278808594},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2502.12853","authors":[{"_id":"67b69b6717ccb022c6a95b38","user":{"_id":"648294b2eb4befee378951c1","avatarUrl":"/avatars/da5d8bf9d8662cc2ffa2c0de49bd66a3.svg","isPro":false,"fullname":"Ruotian Ma","user":"vvibt","type":"user"},"name":"Ruotian Ma","status":"admin_assigned","statusLastChangedAt":"2025-02-21T14:58:55.028Z","hidden":false},{"_id":"67b69b6717ccb022c6a95b39","user":{"_id":"626f98528a894872cfbf620c","avatarUrl":"/avatars/fe31d20313e6ca85e96bc249424c5383.svg","isPro":false,"fullname":"Peisong Wang","user":"duke1852022","type":"user"},"name":"Peisong Wang","status":"admin_assigned","statusLastChangedAt":"2025-02-21T14:59:00.273Z","hidden":false},{"_id":"67b69b6717ccb022c6a95b3a","user":{"_id":"6500234be0c94282ab38cd00","avatarUrl":"/avatars/90fc160919cdbb28cfa82becf720b062.svg","isPro":false,"fullname":"soso","user":"chengliu","type":"user"},"name":"Cheng Liu","status":"claimed_verified","statusLastChangedAt":"2025-02-24T09:25:27.094Z","hidden":false},{"_id":"67b69b6717ccb022c6a95b3b","name":"Xingyan Liu","hidden":false},{"_id":"67b69b6717ccb022c6a95b3c","user":{"_id":"61b859ddbdf1fac5ed499992","avatarUrl":"/avatars/2387fb9b8a46840bfc75248462f0a410.svg","isPro":false,"fullname":"Jiaqi Chen","user":"judge","type":"user"},"name":"Jiaqi Chen","status":"claimed_verified","statusLastChangedAt":"2025-04-29T07:59:24.130Z","hidden":false},{"_id":"67b69b6717ccb022c6a95b3d","name":"Bang Zhang","hidden":false},{"_id":"67b69b6717ccb022c6a95b3e","name":"Xin Zhou","hidden":false},{"_id":"67b69b6717ccb022c6a95b3f","name":"Nan Du","hidden":false},{"_id":"67b69b6717ccb022c6a95b40","name":"Jia Li","hidden":false}],"publishedAt":"2025-02-18T13:40:22.000Z","submittedOnDailyAt":"2025-02-21T07:30:18.645Z","title":"S^2R: Teaching LLMs to Self-verify and Self-correct via Reinforcement\n Learning","submittedOnDailyBy":{"_id":"648294b2eb4befee378951c1","avatarUrl":"/avatars/da5d8bf9d8662cc2ffa2c0de49bd66a3.svg","isPro":false,"fullname":"Ruotian Ma","user":"vvibt","type":"user"},"summary":"Recent studies have demonstrated the effectiveness of LLM test-time scaling.\nHowever, existing approaches to incentivize LLMs' deep thinking abilities\ngenerally require large-scale data or significant training efforts. Meanwhile,\nit remains unclear how to improve the thinking abilities of less powerful base\nmodels. In this work, we introduce S^2R, an efficient framework that enhances\nLLM reasoning by teaching models to self-verify and self-correct during\ninference. Specifically, we first initialize LLMs with iterative\nself-verification and self-correction behaviors through supervised fine-tuning\non carefully curated data. 
The self-verification and self-correction skills are\nthen further strengthened by both outcome-level and process-level reinforcement\nlearning, with minimized resource requirements, enabling the model to\nadaptively refine its reasoning process during inference. Our results\ndemonstrate that, with only 3.1k self-verifying and self-correcting behavior\ninitialization samples, Qwen2.5-math-7B achieves an accuracy improvement from\n51.0\\% to 81.6\\%, outperforming models trained on an equivalent amount of\nlong-CoT distilled data. Extensive experiments and analysis based on three base\nmodels across both in-domain and out-of-domain benchmarks validate the\neffectiveness of S^2R. Our code and data are available at\nhttps://github.com/NineAbyss/S2R.","upvotes":29,"discussionId":"67b69b6817ccb022c6a95b6e","githubRepo":"https://github.com/nineabyss/s2r","githubRepoAddedBy":"auto","ai_summary":"S$^2$R is a framework that enhances LLM reasoning through iterative self-verification and self-correction using supervised fine-tuning and reinforcement learning with minimal resources.","ai_keywords":["LLM","self-verification","self-correction","supervised fine-tuning","reinforcement learning","Qwen2.5-math-7B","long-CoT","in-domain","out-of-domain benchmarks"],"githubStars":74},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"648294b2eb4befee378951c1","avatarUrl":"/avatars/da5d8bf9d8662cc2ffa2c0de49bd66a3.svg","isPro":false,"fullname":"Ruotian Ma","user":"vvibt","type":"user"},{"_id":"64be4408c05a0df0d2b6012e","avatarUrl":"/avatars/09d8427505a418090391dc5a3f8bfef2.svg","isPro":false,"fullname":"PSWang","user":"CedarWang","type":"user"},{"_id":"65185161a1a5e5d6179c209f","avatarUrl":"/avatars/e654e89fcfe0d921ada8ce3ab3b4beb8.svg","isPro":false,"fullname":"Jotion Joestar","user":"Jotion","type":"user"},{"_id":"6500234be0c94282ab38cd00","avatarUrl":"/avatars/90fc160919cdbb28cfa82becf720b062.svg","isPro":false,"fullname":"soso","user":"chengliu","type":"user"},{"_id":"67b8524cda0def68c5f80510","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/bOAWQrejli0FwtdDM3wWW.png","isPro":false,"fullname":"Xiu Bu","user":"Gua3","type":"user"},{"_id":"650e5c0ebe0fdd6ffe5b999a","avatarUrl":"/avatars/3e13cb129494fe8ddda8e3a265bf672c.svg","isPro":false,"fullname":"xiaoleiWang","user":"xiaoleiWang","type":"user"},{"_id":"67b856428ff8782a980fd605","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/67b856428ff8782a980fd605/733rYZUuHO3dhU8ad9nl-.jpeg","isPro":false,"fullname":"Tannnnntan","user":"Tantanneverstudy","type":"user"},{"_id":"631ca44d6e623740a78e2b30","avatarUrl":"/avatars/aa2a11c72dbf8d200dba029226b281a2.svg","isPro":false,"fullname":"zhang","user":"prvmax","type":"user"},{"_id":"67b85ea5fedfe971273c3469","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/tM9fri5k2j_GfzERGFmmt.png","isPro":false,"fullname":"ll204205","user":"jack204205","type":"user"},{"_id":"64fe780614636d417af95e10","avatarUrl":"/avatars/f172b2ad3f66d38205cf9589f3e43585.svg","isPro":false,"fullname":"gerald hewes","user":"gerald29","type":"user"},{"_id":"61b859ddbdf1fac5ed499992","avatarUrl":"/avatars/2387fb9b8a46840bfc75248462f0a410.svg","isPro":false,"fullname":"Jiaqi Chen","user":"judge","type":"user"},{"_id":"651c80a26ba9ab9b9582c273","avatarUrl":"/avatars/e963452eafd21f517d800f2e58e0f918.svg","isPro":false,"fullname":"siyeng 
feng","user":"siyengfeng","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0}">
AI-generated summary

S^2R is a framework that enhances LLM reasoning through iterative self-verification and self-correction using supervised fine-tuning and reinforcement learning with minimal resources.

Abstract
Recent studies have demonstrated the effectiveness of LLM test-time scaling.
However, existing approaches to incentivize LLMs' deep thinking abilities
generally require large-scale data or significant training efforts. Meanwhile,
it remains unclear how to improve the thinking abilities of less powerful base
models. In this work, we introduce S^2R, an efficient framework that enhances
LLM reasoning by teaching models to self-verify and self-correct during
inference. Specifically, we first initialize LLMs with iterative
self-verification and self-correction behaviors through supervised fine-tuning
on carefully curated data. The self-verification and self-correction skills are
then further strengthened by both outcome-level and process-level reinforcement
learning, with minimized resource requirements, enabling the model to
adaptively refine its reasoning process during inference. Our results
demonstrate that, with only 3.1k self-verifying and self-correcting behavior
initialization samples, Qwen2.5-math-7B achieves an accuracy improvement from
51.0% to 81.6%, outperforming models trained on an equivalent amount of
long-CoT distilled data. Extensive experiments and analysis based on three base
models across both in-domain and out-of-domain benchmarks validate the
effectiveness of S^2R. Our code and data are available at
https://github.com/NineAbyss/S2R.
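
The abstract describes an inference-time behavior in which the model alternates between proposing a solution, verifying it, and correcting it. The Python sketch below illustrates what such an iterative self-verify / self-correct loop could look like; the `generate` callable, the prompt wording, and the VERIFIED/FLAWED convention are illustrative assumptions, not the paper's actual prompts or implementation.

```python
# Minimal sketch of an iterative self-verify / self-correct inference loop
# in the spirit of S^2R. All names and prompt formats here are assumptions
# for illustration only.

from typing import Callable


def s2r_style_inference(
    question: str,
    generate: Callable[[str], str],  # wraps the fine-tuned LLM (assumed interface)
    max_rounds: int = 4,
) -> str:
    """Alternate between attempting a solution, verifying it, and revising it."""
    transcript = f"Question: {question}\n"
    answer = ""

    for round_idx in range(max_rounds):
        # 1. Produce (or revise) a candidate solution.
        answer = generate(transcript + "Solution attempt:\n")
        transcript += f"Solution attempt {round_idx + 1}:\n{answer}\n"

        # 2. Self-verify: the same model critiques its own answer.
        verdict = generate(
            transcript + "Verification (reply VERIFIED or FLAWED with reasons):\n"
        )
        transcript += f"Verification:\n{verdict}\n"

        # 3. Stop if the model judges its answer correct; otherwise self-correct.
        if "VERIFIED" in verdict.upper():
            break
        transcript += "The verification found issues; revise the solution.\n"

    return answer
```

In this sketch the stopping decision comes from the model's own verification step, which is what lets the number of correction rounds adapt per question rather than being fixed in advance.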
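The abstract also distinguishes outcome-level from process-level reinforcement learning. The toy sketch below illustrates that distinction on a single trajectory: the outcome-level signal scores only the final answer, while the process-level signal scores every intermediate solution and verification step. The reward values, the `Step` structure, and the `is_correct` labeling are assumptions for illustration, not the paper's reward design or RL objective.

```python
# Illustrative contrast between outcome-level and process-level rewards for a
# self-verify / self-correct trajectory. Hypothetical data structures only.

from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    kind: str          # "solution" or "verification"
    text: str
    is_correct: bool   # solution matches ground truth / verdict matches reality


def outcome_reward(steps: List[Step]) -> float:
    """Outcome-level: one scalar reward based only on the final solution."""
    solutions = [s for s in steps if s.kind == "solution"]
    return 1.0 if solutions and solutions[-1].is_correct else -1.0


def process_rewards(steps: List[Step]) -> List[float]:
    """Process-level: one reward per step, crediting correct intermediate
    solutions and accurate self-verification verdicts."""
    return [1.0 if step.is_correct else -1.0 for step in steps]


# Example: wrong first attempt, verification catches it, revision is correct.
trajectory = [
    Step("solution", "x = 3", is_correct=False),
    Step("verification", "Substituting back fails; the answer is wrong.", is_correct=True),
    Step("solution", "x = 5", is_correct=True),
]
print(outcome_reward(trajectory))   # 1.0  (final answer is right)
print(process_rewards(trajectory))  # [-1.0, 1.0, 1.0]
```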