arxiv:2504.02495

Inference-Time Scaling for Generalist Reward Modeling

Published on Apr 3, 2025 · Submitted by YSH on Apr 4, 2025
Authors: Zijun Liu, Peiyi Wang, Runxin Xu, Shirong Ma, Chong Ruan, Peng Li, Yang Liu, Yu Wu
Abstract

AI-generated summary: Self-Principled Critique Tuning enhances pointwise generative reward modeling for large language models, improving scalability and quality compared to existing methods.

Reinforcement learning (RL) has been widely adopted in post-training for large language models (LLMs) at scale. Recently, the incentivization of reasoning capabilities in LLMs through RL suggests that proper learning methods can enable effective inference-time scalability. A key challenge of RL is obtaining accurate reward signals for LLMs across domains beyond verifiable questions or artificial rules. In this work, we investigate how to improve reward modeling (RM) with more inference compute for general queries, i.e., the inference-time scalability of generalist RM, and further, how to improve the effectiveness of performance-compute scaling with proper learning methods. For the RM approach, we adopt pointwise generative reward modeling (GRM) to enable flexibility for different input types and potential for inference-time scaling. For the learning method, we propose Self-Principled Critique Tuning (SPCT) to foster scalable reward-generation behaviors in GRMs through online RL, generating principles adaptively and critiques accurately, resulting in the DeepSeek-GRM models. Furthermore, for effective inference-time scaling, we use parallel sampling to expand compute usage and introduce a meta RM to guide the voting process for better scaling performance. Empirically, we show that SPCT significantly improves the quality and scalability of GRMs, outperforming existing methods and models on various RM benchmarks without severe biases, and can achieve better performance than training-time scaling. DeepSeek-GRM still faces challenges on some tasks, which we believe can be addressed by future efforts on generalist reward systems. The models will be released and open-sourced.
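The scaling recipe in the abstract (parallel sampling plus a meta RM that guides voting) is concrete enough to sketch. Below is a minimal, hypothetical Python sketch, not the released models or any real API: `grm_generate` and `meta_rm_score` are placeholder stubs standing in for the GRM and the meta RM, and the score ranges and thresholds are assumptions.

```python
import random
from collections import defaultdict

def grm_generate(query: str, responses: list[str], seed: int) -> dict[str, int]:
    """Placeholder for one sampled GRM generation: the real model writes
    principles and critiques, then emits a pointwise score per response."""
    rng = random.Random(seed)
    return {r: rng.randint(1, 10) for r in responses}  # stub scores for illustration

def meta_rm_score(query: str, scores: dict[str, int]) -> float:
    """Placeholder for the meta RM: rates how reliable one sampled
    reward generation is (higher means more trustworthy)."""
    return random.random()  # stub confidence for illustration

def scaled_reward(query: str, responses: list[str], k: int = 8, keep: int = 4) -> str:
    """Parallel sampling + meta-RM-guided voting over pointwise rewards."""
    # 1) Expand inference compute: sample k independent reward generations.
    samples = [grm_generate(query, responses, seed=i) for i in range(k)]
    # 2) Let the meta RM rank the k generations and keep the most reliable ones.
    ranked = sorted(samples, key=lambda s: meta_rm_score(query, s), reverse=True)
    # 3) Vote: sum the surviving pointwise scores per response.
    votes: defaultdict[str, int] = defaultdict(int)
    for scores in ranked[:keep]:
        for response, score in scores.items():
            votes[response] += score
    return max(votes, key=votes.get)  # response with the highest summed reward

print(scaled_reward("Which reply is more helpful?", ["reply A", "reply B"]))
```

The design point the abstract emphasizes is step 2: the meta RM filters out unreliable reward generations before the vote, so additional samples improve the final reward rather than adding noise.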

Community

Paper submitter

https://arxiv.org/abs/2504.02495

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems (2025): https://huggingface.co/papers/2502.19328
* GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning (2025): https://huggingface.co/papers/2504.00891
* Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 (2025): https://huggingface.co/papers/2503.24376
* UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning (2025): https://huggingface.co/papers/2503.21620
* AURORA: Automated Training Framework of Universal Process Reward Models via Ensemble Prompting and Reverse Verification (2025): https://huggingface.co/papers/2502.11520
* IPO: Your Language Model is Secretly a Preference Classifier (2025): https://huggingface.co/papers/2502.16182
* VersaPRM: Multi-Domain Process Reward Model via Synthetic Reasoning Data (2025): https://huggingface.co/papers/2502.06737

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

Awesome.

Do check out the NotebookLM AI-generated podcast of this paper (prompted to focus on an AI/ML audience and strictly instructed not to make jokes):

https://youtu.be/AUPkMDlQ8ZM?si=nEkvapG6xeVyVKEn

Great work! Are there any timelines for opening the code?


I would like to open the code, but the main workhorse is NotebookLM, which converts a document into a podcast. The rest of the code I've developed creates the video overlay using free videos on Pexels. The main script is an ffmpeg command that does most of the work. A lot of the code is to manage files, convert Pexels videos into a standard format, remove background noise, find the length of each file to match the audio input, and so on.
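Purely as an illustration of the length-matching step described above (the actual script is not public, so the file names, codecs, and flags here are assumptions), the core of such a pipeline might look like this in Python:

```python
import subprocess

def media_duration(path: str) -> float:
    """Read a media file's duration in seconds with ffprobe."""
    out = subprocess.run(
        ["ffprobe", "-v", "error",
         "-show_entries", "format=duration",
         "-of", "default=noprint_wrappers=1:nokey=1", path],
        capture_output=True, text=True, check=True)
    return float(out.stdout.strip())

def overlay_video_on_audio(video: str, audio: str, out: str) -> None:
    """Loop a stock clip and trim it to the podcast audio's length."""
    duration = media_duration(audio)
    subprocess.run(
        ["ffmpeg", "-y",
         "-stream_loop", "-1", "-i", video,  # repeat the clip as long as needed
         "-i", audio,
         "-map", "0:v", "-map", "1:a",       # video from input 0, audio from input 1
         "-c:v", "libx264", "-c:a", "aac",
         "-t", f"{duration:.3f}",            # cut the output at the audio's length
         out],
        check=True)

# Hypothetical file names for illustration only.
overlay_video_on_audio("pexels_clip.mp4", "podcast_audio.mp3", "episode.mp4")
```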

The code will be open-sourced when I can find the time to clean up the codebase. Until then, I'm focused on creating podcasts that I'd love to listen to. I'm currently creating podcasts on interesting codebases; check out the one on bleve, a Go search-index library: https://youtu.be/Fq60EK6c_H0


Models citing this paper 5

Browse 5 models citing this paper

Datasets citing this paper 0

No datasets link to this paper yet.

Cite arxiv.org/abs/2504.02495 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Spaces link to this paper yet.

Cite arxiv.org/abs/2504.02495 in a Space README.md to link it from this page.

Collections including this paper 23