arxiv:2504.02495

Inference-Time Scaling for Generalist Reward Modeling

Published on Apr 3, 2025 · Submitted by YSH on Apr 4, 2025
Authors: Zijun Liu, Peiyi Wang, Runxin Xu, Shirong Ma, Chong Ruan, Peng Li, Yang Liu, Yu Wu
Abstract

AI-generated summary: Self-Principled Critique Tuning enhances pointwise generative reward modeling for large language models, improving scalability and quality compared to existing methods.

Reinforcement learning (RL) has been widely adopted in post-training for large language models (LLMs) at scale. Recently, the incentivization of reasoning capabilities in LLMs through RL suggests that proper learning methods can enable effective inference-time scalability. A key challenge of RL is obtaining accurate reward signals for LLMs across domains beyond verifiable questions or artificial rules. In this work, we investigate how to improve reward modeling (RM) with more inference compute for general queries, i.e., the inference-time scalability of generalist RM, and further, how to improve the effectiveness of performance-compute scaling with proper learning methods. For the RM approach, we adopt pointwise generative reward modeling (GRM) to enable flexibility for different input types and potential for inference-time scaling. For the learning method, we propose Self-Principled Critique Tuning (SPCT) to foster scalable reward-generation behaviors in GRMs through online RL, generating principles adaptively and critiques accurately, resulting in the DeepSeek-GRM models. Furthermore, for effective inference-time scaling, we use parallel sampling to expand compute usage and introduce a meta RM to guide the voting process for better scaling performance. Empirically, we show that SPCT significantly improves the quality and scalability of GRMs, outperforming existing methods and models on various RM benchmarks without severe biases, and can achieve better performance than training-time scaling. DeepSeek-GRM still faces challenges on some tasks, which we believe can be addressed by future efforts on generalist reward systems. The models will be released and open-sourced.
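The scaling recipe in the abstract (parallel sampling plus a meta RM that guides voting) is concrete enough to sketch. Below is a minimal, hypothetical Python sketch, not the released models or any real API: `grm_generate` and `meta_rm_score` are placeholder stubs standing in for the GRM and the meta RM, and the score ranges and thresholds are assumptions.

```python
import random
from collections import defaultdict

def grm_generate(query: str, responses: list[str], seed: int) -> dict[str, int]:
    """Placeholder for one sampled GRM generation: the real model writes
    principles and critiques, then emits a pointwise score per response."""
    rng = random.Random(seed)
    return {r: rng.randint(1, 10) for r in responses}  # stub scores for illustration

def meta_rm_score(query: str, scores: dict[str, int]) -> float:
    """Placeholder for the meta RM: rates how reliable one sampled
    reward generation is (higher means more trustworthy)."""
    return random.random()  # stub confidence for illustration

def scaled_reward(query: str, responses: list[str], k: int = 8, keep: int = 4) -> str:
    """Parallel sampling + meta-RM-guided voting over pointwise rewards."""
    # 1) Expand inference compute: sample k independent reward generations.
    samples = [grm_generate(query, responses, seed=i) for i in range(k)]
    # 2) Let the meta RM rank the k generations and keep the most reliable ones.
    ranked = sorted(samples, key=lambda s: meta_rm_score(query, s), reverse=True)
    # 3) Vote: sum the surviving pointwise scores per response.
    votes: defaultdict[str, int] = defaultdict(int)
    for scores in ranked[:keep]:
        for response, score in scores.items():
            votes[response] += score
    return max(votes, key=votes.get)  # response with the highest summed reward

print(scaled_reward("Which reply is more helpful?", ["reply A", "reply B"]))
```

The design point the abstract emphasizes is step 2: the meta RM filters out unreliable reward generations before the vote, so additional samples improve the final reward rather than adding noise.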

Community

Paper submitter

https://arxiv.org/abs/2504.02495

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems (2025): https://huggingface.co/papers/2502.19328
* GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning (2025): https://huggingface.co/papers/2504.00891
* Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 (2025): https://huggingface.co/papers/2503.24376
* UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning (2025): https://huggingface.co/papers/2503.21620
* AURORA: Automated Training Framework of Universal Process Reward Models via Ensemble Prompting and Reverse Verification (2025): https://huggingface.co/papers/2502.11520
* IPO: Your Language Model is Secretly a Preference Classifier (2025): https://huggingface.co/papers/2502.16182
* VersaPRM: Multi-Domain Process Reward Model via Synthetic Reasoning Data (2025): https://huggingface.co/papers/2502.06737

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

Awesome.

Do check out the NotebookLM AI-generated podcast of this paper (prompted to focus on an AI/ML audience and strictly instructed not to make jokes):

https://youtu.be/AUPkMDlQ8ZM?si=nEkvapG6xeVyVKEn

Great work! Are there any timelines for opening the code?


I would like to open the code, but the main workhorse is NotebookLM, which converts a document into a podcast. The rest of the code I've developed creates the video overlay using free videos on Pexels. The main script is an ffmpeg command that does most of the work. A lot of the code is to manage files, convert Pexels videos into a standard format, remove background noise, find the length of each file to match the audio input, and so on.
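Purely as an illustration of the length-matching step described above (the actual script is not public, so the file names, codecs, and flags here are assumptions), the core of such a pipeline might look like this in Python:

```python
import subprocess

def media_duration(path: str) -> float:
    """Read a media file's duration in seconds with ffprobe."""
    out = subprocess.run(
        ["ffprobe", "-v", "error",
         "-show_entries", "format=duration",
         "-of", "default=noprint_wrappers=1:nokey=1", path],
        capture_output=True, text=True, check=True)
    return float(out.stdout.strip())

def overlay_video_on_audio(video: str, audio: str, out: str) -> None:
    """Loop a stock clip and trim it to the podcast audio's length."""
    duration = media_duration(audio)
    subprocess.run(
        ["ffmpeg", "-y",
         "-stream_loop", "-1", "-i", video,  # repeat the clip as long as needed
         "-i", audio,
         "-map", "0:v", "-map", "1:a",       # video from input 0, audio from input 1
         "-c:v", "libx264", "-c:a", "aac",
         "-t", f"{duration:.3f}",            # cut the output at the audio's length
         out],
        check=True)

# Hypothetical file names for illustration only.
overlay_video_on_audio("pexels_clip.mp4", "podcast_audio.mp3", "episode.mp4")
```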

The code will be open-sourced when I can find the time to clean up the codebase. Until then, I'm focused on creating podcasts that I'd love to listen to. I'm currently creating podcasts on interesting codebases; check out the one on bleve, a Go search-index library: https://youtu.be/Fq60EK6c_H0


Models citing this paper 5

Browse 5 models citing this paper

Datasets citing this paper 0

No datasets link to this paper yet.

Cite arxiv.org/abs/2504.02495 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Spaces link to this paper yet.

Cite arxiv.org/abs/2504.02495 in a Space README.md to link it from this page.

Collections including this paper 23