arxiv:2506.14965

Revisiting Reinforcement Learning for LLM Reasoning from A Cross-Domain Perspective

Published on Jun 17, 2025 · Submitted by cheng on Jun 20, 2025
#1 Paper of the day
Authors: Zhoujun Cheng, Shibo Hao, Tianyang Liu, Fan Zhou, Yutao Xie, Feng Yao, Yuexin Bian, Yonghao Zhuang, Nilabjo Dey, Yuheng Zha, Yi Gu, Kun Zhou, Yuqi Wang, Yuan Li, Richard Fan, Jianshu She, Chengqian Gao, Abulhair Saparov, Haonan Li, Taylor W. Killian, Mikhail Yurochkin, Zhengzhong Liu, Eric P. Xing, Zhiting Hu

Abstract

AI-generated summary: Guru, a diverse RL reasoning corpus, highlights domain-specific training needs and demonstrates improved performance in complex tasks for RL-enhanced LLMs.

Reinforcement learning (RL) has emerged as a promising approach to improve large language model (LLM) reasoning, yet most open efforts focus narrowly on math and code, limiting our understanding of its broader applicability to general reasoning. A key challenge lies in the lack of reliable, scalable RL reward signals across diverse reasoning domains. We introduce Guru, a curated RL reasoning corpus of 92K verifiable examples spanning six reasoning domains--Math, Code, Science, Logic, Simulation, and Tabular--each built through domain-specific reward design, deduplication, and filtering to ensure reliability and effectiveness for RL training. Based on Guru, we systematically revisit established findings in RL for LLM reasoning and observe significant variation across domains. For example, while prior work suggests that RL primarily elicits existing knowledge from pretrained models, our results reveal a more nuanced pattern: domains frequently seen during pretraining (Math, Code, Science) easily benefit from cross-domain RL training, while domains with limited pretraining exposure (Logic, Simulation, and Tabular) require in-domain training to achieve meaningful performance gains, suggesting that RL is likely to facilitate genuine skill acquisition. Finally, we present Guru-7B and Guru-32B, two models that achieve state-of-the-art performance among open models RL-trained with publicly available data, outperforming best baselines by 7.9% and 6.7% on our 17-task evaluation suite across six reasoning domains. We also show that our models effectively improve the Pass@k performance of their base models, particularly on complex tasks less likely to appear in pretraining data. We release data, models, training and evaluation code to facilitate general-purpose reasoning at: https://github.com/LLM360/Reasoning360
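The Pass@k numbers mentioned above are typically computed with the standard unbiased estimator of Chen et al. (2021). The paper page does not show the authors' evaluation code, so the sketch below is illustrative only, assuming n sampled generations per problem with c of them correct:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator (Chen et al., 2021): the probability
    that at least one of k samples, drawn without replacement from n
    generations of which c are correct, is correct."""
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: with 16 generations per problem and 4 correct ones,
# the estimated Pass@8 is 1 - C(12,8)/C(16,8) ≈ 0.962.
print(pass_at_k(n=16, c=4, k=8))
```

Averaging this quantity over all problems in a benchmark gives the reported Pass@k score; higher values at large k indicate that the model can solve a problem at least occasionally, even if not reliably.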

Community


This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face, check out this Space.

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend


Models citing this paper: 2

Datasets citing this paper: 5


Spaces citing this paper: 0

No Spaces link this paper yet.

Cite arxiv.org/abs/2506.14965 in a Space README.md to link it from this page.
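For example, a Space README.md needs only a Markdown link containing the paper's arXiv URL for Hugging Face to pick it up (the surrounding wording here is hypothetical): `This demo uses models from [Guru (arXiv:2506.14965)](https://arxiv.org/abs/2506.14965).`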

Collections including this paper: 6