
arxiv:2602.01511

Alternating Reinforcement Learning for Rubric-Based Reward Modeling in Non-Verifiable LLM Post-Training

Published on Feb 2 · Submitted by Tianci Liu on Feb 3

Authors: Ran Xu, Tianci Liu, Zihan Dong, Tony You, Ilgee Hong, Carl Yang, Linjun Zhang, Tao Zhao, Haoyu Wang

Abstract

AI-generated summary

Rubric-ARM framework jointly optimizes rubric generation and judging through reinforcement learning to improve response quality assessment in creative and open-ended tasks.

Standard reward models typically predict scalar scores that fail to capture the multifaceted nature of response quality in non-verifiable domains, such as creative writing or open-ended instruction following. To address this limitation, we propose Rubric-ARM, a framework that jointly optimizes a rubric generator and a judge using reinforcement learning from preference feedback. Unlike existing methods that rely on static rubrics or disjoint training pipelines, our approach treats rubric generation as a latent action learned to maximize judgment accuracy. We introduce an alternating optimization strategy to mitigate the non-stationarity of simultaneous updates, providing theoretical analysis that demonstrates how this schedule reduces gradient variance during training. Extensive experiments show that Rubric-ARM achieves state-of-the-art performance among baselines on multiple benchmarks and significantly improves downstream policy alignment in both offline and online reinforcement learning settings.
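
The abstract describes an alternating schedule: the judge and the rubric generator are updated in turn, each rewarded by how accurately the resulting judgments reproduce the preference labels, which avoids the non-stationarity of updating both at once. The sketch below is only a rough, hypothetical illustration of that loop, not the authors' implementation; RubricGenerator, Judge, rl_update, and the toy preference data are all placeholder names invented here, and the policy-gradient updates are left as stubs.

```python
# Hypothetical sketch of the alternating optimization described in the abstract.
# All names below (RubricGenerator, Judge, rl_update, ...) are illustrative
# placeholders, not the paper's actual code.
import random


class RubricGenerator:
    """Placeholder policy that emits a rubric (a list of criteria) for a prompt."""

    def generate(self, prompt):
        return [f"criterion {i} for: {prompt[:30]}" for i in range(3)]

    def rl_update(self, prompt, rubric, reward):
        # Stand-in for a policy-gradient step on the rubric tokens, treating the
        # rubric as a latent action rewarded by downstream judgment accuracy.
        pass


class Judge:
    """Placeholder judge that compares two responses under a given rubric."""

    def prefers_first(self, prompt, rubric, resp_a, resp_b):
        return random.random() < 0.5  # stand-in for an LLM judgment

    def rl_update(self, prompt, rubric, resp_a, resp_b, reward):
        # Stand-in for a policy-gradient step on the judge, with the rubric frozen.
        pass


def alternating_train(pref_data, generator, judge, rounds=4, steps_per_phase=64):
    """Alternate phases: hold one module fixed, update the other, then swap.
    Both modules are rewarded by accuracy on (prompt, chosen, rejected) pairs."""
    for _ in range(rounds):
        # Phase A: update the judge against rubrics from the frozen generator.
        for prompt, chosen, rejected in random.choices(pref_data, k=steps_per_phase):
            rubric = generator.generate(prompt)
            correct = judge.prefers_first(prompt, rubric, chosen, rejected)
            judge.rl_update(prompt, rubric, chosen, rejected, reward=float(correct))
        # Phase B: update the generator, rewarding rubrics that let the frozen
        # judge recover the preference label.
        for prompt, chosen, rejected in random.choices(pref_data, k=steps_per_phase):
            rubric = generator.generate(prompt)
            correct = judge.prefers_first(prompt, rubric, chosen, rejected)
            generator.rl_update(prompt, rubric, reward=float(correct))


if __name__ == "__main__":
    toy_data = [("Write a short poem about rain.", "chosen response", "rejected response")]
    alternating_train(toy_data, RubricGenerator(), Judge())
```

In the paper, this alternating schedule is what the theoretical analysis ties to reduced gradient variance; in a real system the stubs above would be replaced by actual RL updates over LLM-generated rubrics and judgments.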


Models citing this paper 2

Datasets citing this paper 0

No dataset linking this paper


Spaces citing this paper 0

No Space linking this paper


Collections including this paper 0

No Collection including this paper
