Papers
arxiv:2504.00050

JudgeLRM: Large Reasoning Models as a Judge

Published on Mar 31, 2025 · Submitted by Zhiyuan Hu on Apr 2, 2025
#2 Paper of the day
Authors: Nuo Chen, Zhiyuan Hu, Qingyun Zou, Jiaying Wu, Qian Wang, Bryan Hooi, Bingsheng He

Abstract

JudgeLRM models trained with reinforcement learning outperform existing models, especially in tasks requiring deep reasoning.

AI-generated summary

The rise of Large Language Models (LLMs) as evaluators offers a scalable alternative to human annotation, yet existing Supervised Fine-Tuning (SFT) approaches for judges often fall short in domains requiring complex reasoning. In this work, we investigate whether LLM judges truly benefit from enhanced reasoning capabilities. Through a detailed analysis of reasoning requirements across evaluation tasks, we reveal a negative correlation between SFT performance gains and the proportion of reasoning-demanding samples, highlighting the limitations of SFT in such scenarios. To address this, we introduce JudgeLRM, a family of judgment-oriented LLMs trained using reinforcement learning (RL) with judge-wise, outcome-driven rewards. JudgeLRM models consistently outperform both SFT-tuned and state-of-the-art reasoning models. Notably, JudgeLRM-3B surpasses GPT-4, and JudgeLRM-7B outperforms DeepSeek-R1 by 2.79% in F1 score, particularly excelling in judge tasks requiring deep reasoning.
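
To make "judge-wise, outcome-driven rewards" concrete, here is a minimal sketch of what such a reward could look like for a pairwise judge: the model is rewarded only when the preference implied by its emitted scores matches the ground-truth preference, with a penalty for malformed output. The output format, function names, and scoring scheme below are illustrative assumptions, not the authors' implementation.

```python
import re

def outcome_reward(completion: str, preferred: int) -> float:
    """Illustrative outcome-driven reward for a pairwise judge.

    `completion` is the judge model's output; `preferred` is the
    ground-truth winner (1 or 2). The expected output format and the
    reward values are assumptions for illustration only.
    """
    # Assume the judge ends its output with a line like "Scores: 8 5".
    match = re.search(r"Scores:\s*(\d+)\s+(\d+)", completion)
    if match is None:
        return -1.0  # format penalty: judgment could not be parsed
    s1, s2 = int(match.group(1)), int(match.group(2))
    if s1 == s2:
        return 0.0   # no preference expressed
    predicted = 1 if s1 > s2 else 2
    return 1.0 if predicted == preferred else 0.0
```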

Community

Paper author · Paper submitter

Large Reasoning Models as a Judge

Paper author

Welcome to JudgeLRM! Compare any Hugging Face language models by asking your own questions, and explore JudgeLRM's reasoning and detailed comparisons!
Demo: https://huggingface.co/spaces/nuojohnchen/JudgeLRMDemo
Model: https://huggingface.co/nuojohnchen/JudgeLRM-7B
Code: https://github.com/NuoJohnChen/JudgeLRM
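
If you would rather query the model directly than go through the demo, a minimal sketch using the standard transformers API is below. The prompt template is an assumption for illustration; the repository linked above documents the exact judging format.

```python
# Minimal sketch: querying JudgeLRM-7B with Hugging Face transformers.
# The prompt below is an illustrative assumption, not the official template.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nuojohnchen/JudgeLRM-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = (
    "Question: What causes tides?\n"
    "Answer 1: The Moon's gravity.\n"
    "Answer 2: Ocean wind patterns.\n"
    "Compare the two answers, reason step by step, then score each one."
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```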

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* [Improve LLM-as-a-Judge Ability as a General Ability](https://huggingface.co/papers/2502.11689) (2025)
* [Reasoning-SQL: Reinforcement Learning with SQL Tailored Partial Rewards for Reasoning-Enhanced Text-to-SQL](https://huggingface.co/papers/2503.23157) (2025)
* [Fin-R1: A Large Language Model for Financial Reasoning through Reinforcement Learning](https://huggingface.co/papers/2503.16252) (2025)
* [Fino1: On the Transferability of Reasoning Enhanced LLMs to Finance](https://huggingface.co/papers/2502.08127) (2025)
* [ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning](https://huggingface.co/papers/2503.19470) (2025)
* [Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't](https://huggingface.co/papers/2503.16219) (2025)
* [Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning](https://huggingface.co/papers/2503.09516) (2025)

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out [this Space](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers).

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

How about trying a conditional length reward instead of an absolute length reward? Increase the rewarded reasoning length for pairs with a lower |s1 - s2|, and decrease it otherwise.
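
One way to make this suggestion concrete: a length bonus gated by the score gap, so that close calls (small |s1 - s2|) reward longer reasoning while clear-cut pairs do not. A minimal sketch, with all names and the gating rule as illustrative assumptions:

```python
def conditional_length_reward(reasoning_len: int, s1: float, s2: float,
                              max_len: int = 1024) -> float:
    """Illustrative conditional length reward (not from the paper).

    The closer the two judged scores are (small |s1 - s2|), the harder
    the comparison, so longer reasoning earns a larger bonus; for
    clear-cut pairs (large gap) the length bonus fades toward zero.
    """
    gap = abs(s1 - s2)
    difficulty = 1.0 / (1.0 + gap)           # 1.0 when tied, ~0 for large gaps
    length_frac = min(reasoning_len, max_len) / max_len
    return difficulty * length_frac           # bonus in [0, 1]
```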

Paper author

Thanks for the suggestion and your interest! We’ll further explore reward design.

Amazing work! How about testing your model on recent benchmarks like JudgeBench, RM-Bench, or RewardBench?
I believe it would bring more insights.


Models citing this paper 1

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2504.00050 in a dataset README.md to link it from this page.

Spaces citing this paper 1

Collections including this paper 10