Papers
arxiv:2501.02790

Segmenting Text and Learning Their Rewards for Improved RLHF in Language Model

Published on Jan 6, 2025 · Submitted by yueqin yin on Jan 8, 2025

Authors: Yueqin Yin, Shentao Yang, Yujia Xie, Ziyi Yang, Yuting Sun, Hany Awadalla, Weizhu Chen, Mingyuan Zhou

Abstract

AI-generated summary: A segment-level reward model for reinforcement learning from human feedback addresses sparse reward issues by assigning rewards to semantically complete text segments during language model training.

Reinforcement learning from human feedback (RLHF) has been widely adopted to align language models (LMs) with human preferences. Prior RLHF works typically take a bandit formulation, which, though intuitive, ignores the sequential nature of LM generation and can suffer from the sparse reward issue. While recent works propose dense token-level RLHF, treating each token as an action may be too fine-grained for proper reward assignment. In this paper, we seek to get the best of both by training and utilizing a segment-level reward model, which assigns a reward to each semantically complete text segment that spans a short sequence of tokens. For reward learning, our method allows dynamic text segmentation and is compatible with standard sequence-preference datasets. For effective RL-based LM training against the segment reward, we generalize the classical scalar bandit reward normalizers into location-aware normalizer functions and interpolate the segment reward for further densification. With these designs, our method performs competitively on three popular RLHF benchmarks for LM policy: AlpacaEval 2.0, Arena-Hard, and MT-Bench. Ablation studies further demonstrate the effectiveness of our method.
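As a rough illustration of the densification step described in the abstract, the sketch below turns segment-level rewards into a per-token signal: each segment reward is normalized with position-indexed statistics (a toy stand-in for the paper's location-aware normalizer functions) and then spread uniformly over the segment's tokens. The function names, the uniform-spread interpolation, and the example numbers are illustrative assumptions, not the authors' method; their actual implementation is in the DenseRewardRLHF-PPO repository linked below.

```python
# Illustrative sketch only (not the authors' implementation): turning
# segment-level rewards into a dense per-token reward signal.
from typing import List, Tuple


def normalize_segment_rewards(
    segment_rewards: List[float],
    location_means: List[float],
    location_stds: List[float],
) -> List[float]:
    """Normalize each segment reward with statistics indexed by its position,
    a simple stand-in for location-aware normalizer functions that generalize
    the single scalar mean/std used for bandit (sequence-level) rewards."""
    return [
        (r - mu) / max(sigma, 1e-6)
        for r, mu, sigma in zip(segment_rewards, location_means, location_stds)
    ]


def densify_to_tokens(
    segments: List[Tuple[int, int]],  # (start_token, end_token) per segment
    segment_rewards: List[float],
    num_tokens: int,
) -> List[float]:
    """Spread each segment's normalized reward uniformly over its tokens,
    one simple way to interpolate segment rewards into a denser signal."""
    token_rewards = [0.0] * num_tokens
    for (start, end), reward in zip(segments, segment_rewards):
        length = max(end - start, 1)
        for t in range(start, end):
            token_rewards[t] = reward / length
    return token_rewards


if __name__ == "__main__":
    # A 9-token response split into three segments, with made-up rewards/stats.
    segments = [(0, 3), (3, 6), (6, 9)]
    raw_rewards = [1.2, -0.4, 0.8]
    normalized = normalize_segment_rewards(raw_rewards, [0.5, 0.0, 0.3], [1.0, 1.0, 1.0])
    print(densify_to_tokens(segments, normalized, num_tokens=9))
```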

Community

Paper author Paper submitter
Code release: https://github.com/yinyueqin/DenseRewardRLHF-PPO

Librarian Bot
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* [Approximated Variational Bayesian Inverse Reinforcement Learning for Large Language Model Alignment](https://huggingface.co/papers/2411.09341) (2024)
* [Self-Generated Critiques Boost Reward Modeling for Language Models](https://huggingface.co/papers/2411.16646) (2024)
* [T-REG: Preference Optimization with Token-Level Reward Regularization](https://huggingface.co/papers/2412.02685) (2024)
* [Reinforcement Learning Enhanced LLMs: A Survey](https://huggingface.co/papers/2412.10400) (2024)
* [Multimodal Preference Data Synthetic Alignment with Reward Model](https://huggingface.co/papers/2412.17417) (2024)
* [RAG-RewardBench: Benchmarking Reward Models in Retrieval Augmented Generation for Preference Alignment](https://huggingface.co/papers/2412.13746) (2024)
* [Preference-Oriented Supervised Fine-Tuning: Favoring Target Model Over Aligned Large Language Models](https://huggingface.co/papers/2412.12865) (2024)

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out [this Space](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers).

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`


Models citing this paper: 18


Datasets citing this paper: 0

No dataset links to this paper.

Cite arxiv.org/abs/2501.02790 in a dataset README.md to link it from this page.

Spaces citing this paper: 0

No Space links to this paper.

Cite arxiv.org/abs/2501.02790 in a Space README.md to link it from this page.

Collections including this paper: 2