Paper page - Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning
Papers
arxiv:2502.14768

Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning

Published on Feb 20, 2025
Submitted by AK on Feb 21, 2025
Authors: Tian Xie, Zitian Gao, Qingnan Ren, Haoming Luo, Yuqian Hong, Bryan Dai, Joey Zhou, Kai Qiu, Zhirong Wu, Chong Luo

Abstract

AI-generated summary

A rule-based reinforcement learning system for large reasoning models uses synthetic logic puzzles to develop advanced reasoning skills and demonstrates generalization on complex math benchmarks.

Inspired by the success of DeepSeek-R1, we explore the potential of rule-based reinforcement learning (RL) in large reasoning models. To analyze reasoning dynamics, we use synthetic logic puzzles as training data due to their controllable complexity and straightforward answer verification. We make several key technical contributions that lead to effective and stable RL training: a system prompt that emphasizes the thinking and answering process, a stringent format reward function that penalizes outputs for taking shortcuts, and a straightforward training recipe that achieves stable convergence. Our 7B model develops advanced reasoning skills, such as reflection, verification, and summarization, that are absent from the logic corpus. Remarkably, after training on just 5K logic problems, it demonstrates generalization to the challenging math benchmarks AIME and AMC.
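
To make the reward design concrete, the sketch below shows one way a rule-based reward of this kind could look: a strict check that the completion follows a <think>...</think><answer>...</answer> template, with the answer reward granted only when the format is respected. The tag names, score values, and the normalization are illustrative assumptions, not the paper's exact implementation.

import re

# Hypothetical rule-based reward: strict format check plus exact-match answer scoring.
# The <think>/<answer> tags, score values, and normalization are illustrative assumptions.
TEMPLATE = re.compile(
    r"^<think>(?P<think>.+?)</think>\s*<answer>(?P<answer>.+?)</answer>\s*$",
    re.DOTALL,
)

def format_reward(completion: str) -> float:
    # Penalize outputs that skip or shortcut the thinking/answering template.
    return 1.0 if TEMPLATE.match(completion.strip()) else -1.0

def answer_reward(completion: str, gold: str) -> float:
    # Grant answer credit only when the required format is present.
    match = TEMPLATE.match(completion.strip())
    if match is None:
        return 0.0
    predicted = " ".join(match.group("answer").split()).lower()
    return 2.0 if predicted == " ".join(gold.split()).lower() else 0.0

def total_reward(completion: str, gold: str) -> float:
    return format_reward(completion) + answer_reward(completion, gold)

For synthetic logic puzzles the gold answer can be produced directly by the puzzle generator, which is what makes this kind of rule-based reward straightforward to compute.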

Community

Paper submitter

[Screenshot attachment: Screenshot 2025-02-20 at 10.18.21 PM.png]

Generating a 50,000 point dataset of lambda calculus at this very moment.

We made a deep dive video for this paper: https://www.youtube.com/watch?v=IsfG3r1car0. DeepSeek R1 Reproduced & Upgraded!

Thanks for the amazing work! May I ask a quick question about the global batch size and total training steps? The paper mentions that the training set has about 5k samples, with a training batch size of 8 and a rollout of 8. How did you get 3,600 training steps with this setup? Did you use additional gradient accumulation? Many thanks.
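
As a back-of-the-envelope check (not an answer from the authors), here is the arithmetic under the assumption that one optimizer step consumes 8 prompts, each with 8 sampled rollouts; under that reading, 3,600 steps would correspond to roughly 5.8 passes over the ~5k-sample training set.

# Rough step arithmetic under assumed interpretations; only the ~5k samples,
# batch size 8, and rollout 8 come from the discussion above.
num_samples = 5000
prompts_per_step = 8          # assumed: "training batch size of 8" means 8 prompts per step
rollouts_per_prompt = 8       # responses sampled per prompt

steps_per_epoch = num_samples // prompts_per_step     # 625
epochs_for_3600_steps = 3600 / steps_per_epoch         # about 5.76
print(steps_per_epoch, round(epochs_for_3600_steps, 2))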


Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2502.14768 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2502.14768 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2502.14768 in a Space README.md to link it from this page.

Collections including this paper 15