Paper page - Segmenting Text and Learning Their Rewards for Improved RLHF in Language Model
Authors: Yueqin Yin, Shentao Yang, Yujia Xie, Ziyi Yang, Yuting Sun, Hany Awadalla, Weizhu Chen, Mingyuan Zhou
Published: 2025-01-06 (arXiv: 2501.02790)
Code: https://github.com/yinyueqin/DenseRewardRLHF-PPO

Comments

Librarian Bot:
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:

* Approximated Variational Bayesian Inverse Reinforcement Learning for Large Language Model Alignment (2024) - https://huggingface.co/papers/2411.09341
* Self-Generated Critiques Boost Reward Modeling for Language Models (2024) - https://huggingface.co/papers/2411.16646
* T-REG: Preference Optimization with Token-Level Reward Regularization (2024) - https://huggingface.co/papers/2412.02685
* Reinforcement Learning Enhanced LLMs: A Survey (2024) - https://huggingface.co/papers/2412.10400
* Multimodal Preference Data Synthetic Alignment with Reward Model (2024) - https://huggingface.co/papers/2412.17417
* RAG-RewardBench: Benchmarking Reward Models in Retrieval Augmented Generation for Preference Alignment (2024) - https://huggingface.co/papers/2412.13746
* Preference-Oriented Supervised Fine-Tuning: Favoring Target Model Over Aligned Large Language Models (2024) - https://huggingface.co/papers/2412.12865
Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend
\n","updatedAt":"2025-01-09T01:52:50.000Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.7142888903617859},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2501.02790","authors":[{"_id":"677dea23e86d0754dc6e3f09","user":{"_id":"605e8dfd5abeb13e714c4c18","avatarUrl":"/avatars/bc27a0ed17b2bd4311e89d3028fa327b.svg","isPro":true,"fullname":"yueqin yin","user":"yyqoni","type":"user"},"name":"Yueqin Yin","status":"claimed_verified","statusLastChangedAt":"2025-01-08T09:44:41.339Z","hidden":false},{"_id":"677dea23e86d0754dc6e3f0a","user":{"_id":"677ebaf3fcaae73ddd6c8475","avatarUrl":"/avatars/98db20fe3bbbf7caffb5821a7242dc54.svg","isPro":false,"fullname":"Shentao Yang","user":"shentaoyang","type":"user"},"name":"Shentao Yang","status":"claimed_verified","statusLastChangedAt":"2025-01-08T21:08:25.725Z","hidden":false},{"_id":"677dea23e86d0754dc6e3f0b","name":"Yujia Xie","hidden":false},{"_id":"677dea23e86d0754dc6e3f0c","name":"Ziyi Yang","hidden":false},{"_id":"677dea23e86d0754dc6e3f0d","name":"Yuting Sun","hidden":false},{"_id":"677dea23e86d0754dc6e3f0e","name":"Hany Awadalla","hidden":false},{"_id":"677dea23e86d0754dc6e3f0f","name":"Weizhu Chen","hidden":false},{"_id":"677dea23e86d0754dc6e3f10","name":"Mingyuan Zhou","hidden":false}],"publishedAt":"2025-01-06T06:17:56.000Z","submittedOnDailyAt":"2025-01-08T13:32:00.381Z","title":"Segmenting Text and Learning Their Rewards for Improved RLHF in Language\n Model","submittedOnDailyBy":{"_id":"605e8dfd5abeb13e714c4c18","avatarUrl":"/avatars/bc27a0ed17b2bd4311e89d3028fa327b.svg","isPro":true,"fullname":"yueqin yin","user":"yyqoni","type":"user"},"summary":"Reinforcement learning from human feedback (RLHF) has been widely adopted to\nalign language models (LMs) with human preference. Prior RLHF works typically\ntake a bandit formulation, which, though intuitive, ignores the sequential\nnature of LM generation and can suffer from the sparse reward issue. While\nrecent works propose dense token-level RLHF, treating each token as an action\nmay be oversubtle to proper reward assignment. In this paper, we seek to get\nthe best of both by training and utilizing a segment-level reward model, which\nassigns a reward to each semantically complete text segment that spans over a\nshort sequence of tokens. For reward learning, our method allows dynamic text\nsegmentation and compatibility with standard sequence-preference datasets. For\neffective RL-based LM training against segment reward, we generalize the\nclassical scalar bandit reward normalizers into location-aware normalizer\nfunctions and interpolate the segment reward for further densification. With\nthese designs, our method performs competitively on three popular RLHF\nbenchmarks for LM policy: AlpacaEval 2.0, Arena-Hard, and MT-Bench. 
Ablation\nstudies are conducted to further demonstrate our method.","upvotes":8,"discussionId":"677dea24e86d0754dc6e3f49","githubRepo":"https://github.com/yinyueqin/DenseRewardRLHF-PPO","githubRepoAddedBy":"user","ai_summary":"A segment-level reward model for reinforcement learning from human feedback addresses sparse reward issues by assigning rewards to semantically complete text segments during language model training.","ai_keywords":["reinforcement learning from human feedback","bandit formulation","sparse reward","dense token-level RLHF","segment-level reward model","dynamic text segmentation","sequence-preference datasets","scalar bandit reward normalizers","location-aware normalizer functions","segment reward","AlpacaEval 2.0","Arena-Hard","MT-Bench","ablation studies"],"githubStars":19},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"648eb1eb59c4e5c87dc116e0","avatarUrl":"/avatars/c636cea39c2c0937f01398c94ead5dad.svg","isPro":false,"fullname":"fdsqefsgergd","user":"T-representer","type":"user"},{"_id":"677ebaf3fcaae73ddd6c8475","avatarUrl":"/avatars/98db20fe3bbbf7caffb5821a7242dc54.svg","isPro":false,"fullname":"Shentao Yang","user":"shentaoyang","type":"user"},{"_id":"645c22967d655680b57cb304","avatarUrl":"/avatars/2dda79ac689b8dd21d15e8b780db16f8.svg","isPro":false,"fullname":"Klein Morrow","user":"kmorrow1","type":"user"},{"_id":"64aac6984135aae75f3b99c1","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/noauth/yVR88SbW9TlJIrRHlJby-.jpeg","isPro":false,"fullname":"Yumin Kim","user":"YuminKim","type":"user"},{"_id":"64c1c77c245c55a21c6f5a13","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/64c1c77c245c55a21c6f5a13/d9zlSksf3TxWpBbb-r0fd.jpeg","isPro":false,"fullname":"Reza Sayar","user":"Reza2kn","type":"user"},{"_id":"605e8dfd5abeb13e714c4c18","avatarUrl":"/avatars/bc27a0ed17b2bd4311e89d3028fa327b.svg","isPro":true,"fullname":"yueqin yin","user":"yyqoni","type":"user"},{"_id":"620783f24e28382272337ba4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/620783f24e28382272337ba4/zkUveQPNiDfYjgGhuFErj.jpeg","isPro":false,"fullname":"GuoLiangTang","user":"Tommy930","type":"user"},{"_id":"663ccbff3a74a20189d4aa2e","avatarUrl":"/avatars/83a54455e0157480f65c498cd9057cf2.svg","isPro":false,"fullname":"Nguyen Van Thanh","user":"NguyenVanThanhHust","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0}">
AI-generated summary: A segment-level reward model for reinforcement learning from human feedback addresses sparse reward issues by assigning rewards to semantically complete text segments during language model training.

Abstract
Reinforcement learning from human feedback (RLHF) has been widely adopted to
align language models (LMs) with human preferences. Prior RLHF works typically
take a bandit formulation, which, though intuitive, ignores the sequential
nature of LM generation and can suffer from the sparse reward issue. While
recent works propose dense token-level RLHF, treating each token as an action
may be too fine-grained for proper reward assignment. In this paper, we seek to get
the best of both by training and utilizing a segment-level reward model, which
assigns a reward to each semantically complete text segment that spans over a
short sequence of tokens. For reward learning, our method allows dynamic text
segmentation and compatibility with standard sequence-preference datasets. For
effective RL-based LM training against segment reward, we generalize the
classical scalar bandit reward normalizers into location-aware normalizer
functions and interpolate the segment reward for further densification. With
these designs, our method performs competitively on three popular RLHF
benchmarks for LM policy: AlpacaEval 2.0, Arena-Hard, and MT-Bench. Ablation
studies further demonstrate the effectiveness of our method.
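
To make the segment-level reward learning described in the abstract concrete, here is a minimal sketch in PyTorch. It assumes an entropy-threshold heuristic for dynamic text segmentation and a small MLP head over mean-pooled hidden states as the segment reward model; the names `segment_boundaries` and `SegmentRewardModel` are illustrative, not the authors' released API. The point it shows is that an ordinary sequence-level preference pair can supervise segment rewards through a Bradley-Terry loss on their sum.

```python
# Minimal sketch of segment-level reward learning from sequence-preference data.
# Assumptions (illustrative, not the released implementation): segments are cut
# whenever the policy's per-token predictive entropy exceeds a threshold, and the
# segment reward model is a small MLP over mean-pooled token hidden states.

import torch
import torch.nn as nn
import torch.nn.functional as F


def segment_boundaries(token_entropies: torch.Tensor, threshold: float = 2.0):
    """Start a new segment after any token whose entropy exceeds `threshold`.

    Returns a list of (start, end) index pairs covering the whole sequence.
    """
    cuts = (token_entropies > threshold).nonzero(as_tuple=True)[0].tolist()
    bounds, start = [], 0
    for c in cuts:
        bounds.append((start, c + 1))
        start = c + 1
    if start < token_entropies.numel():
        bounds.append((start, token_entropies.numel()))
    return bounds


class SegmentRewardModel(nn.Module):
    """Scores each text segment from the mean-pooled hidden states of its tokens."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.Tanh(), nn.Linear(hidden_dim, 1)
        )

    def forward(self, hidden_states: torch.Tensor, bounds) -> torch.Tensor:
        # hidden_states: (seq_len, hidden_dim) -> one scalar reward per segment.
        pooled = torch.stack([hidden_states[s:e].mean(dim=0) for s, e in bounds])
        return self.head(pooled).squeeze(-1)


def preference_loss(rm, chosen_h, chosen_ent, rejected_h, rejected_ent):
    """Bradley-Terry loss on the *sum* of segment rewards, so that ordinary
    sequence-level preference pairs can supervise segment-level rewards."""
    r_chosen = rm(chosen_h, segment_boundaries(chosen_ent)).sum()
    r_rejected = rm(rejected_h, segment_boundaries(rejected_ent)).sum()
    return -F.logsigmoid(r_chosen - r_rejected)


if __name__ == "__main__":
    torch.manual_seed(0)
    rm = SegmentRewardModel(hidden_dim=16)
    # Stand-ins for LM hidden states and per-token entropies of a preference pair.
    chosen_h, rejected_h = torch.randn(12, 16), torch.randn(9, 16)
    chosen_ent, rejected_ent = torch.rand(12) * 4, torch.rand(9) * 4
    loss = preference_loss(rm, chosen_h, chosen_ent, rejected_h, rejected_ent)
    loss.backward()
    print(f"preference loss: {loss.item():.4f}")
```

In practice the hidden states would come from an LM backbone rather than random tensors, and the segmentation signal would be computed from the policy's own token distributions.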
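
The abstract's location-aware normalizers and segment-reward interpolation could look roughly like the sketch below. It assumes running mean/variance statistics kept separately per segment position as the "location-aware" generalization of a single scalar normalizer, and spreads each normalized segment reward evenly over the segment's tokens as one simple interpolation choice; the paper's exact functional forms may differ.

```python
# Minimal sketch of converting segment rewards into dense per-token rewards for PPO,
# with (i) a location-aware normalizer, assumed here to be running mean/std statistics
# kept separately per segment position, and (ii) interpolation that spreads each
# normalized segment reward evenly over the segment's tokens. Both concrete choices
# are assumptions for illustration, not the paper's exact formulation.

import torch


class LocationAwareNormalizer:
    """Running mean/std per segment position instead of a single scalar normalizer."""

    def __init__(self, max_positions: int = 64, eps: float = 1e-6):
        self.mean = torch.zeros(max_positions)
        self.m2 = torch.zeros(max_positions)     # sum of squared deviations (Welford)
        self.count = torch.zeros(max_positions)
        self.eps = eps

    def update(self, position: int, value: float) -> None:
        # Welford-style online update of the statistics at this segment position.
        self.count[position] += 1
        delta = value - self.mean[position]
        self.mean[position] += delta / self.count[position]
        self.m2[position] += delta * (value - self.mean[position])

    def normalize(self, position: int, value: float) -> float:
        var = self.m2[position] / max(self.count[position].item(), 1.0)
        return (value - self.mean[position].item()) / (var.sqrt().item() + self.eps)


def densify(segment_rewards, bounds, normalizer, seq_len: int) -> torch.Tensor:
    """Normalize each segment reward by its position's statistics, then spread it
    evenly across the segment's tokens (one simple interpolation choice)."""
    token_rewards = torch.zeros(seq_len)
    for pos, (r, (s, e)) in enumerate(zip(segment_rewards, bounds)):
        token_rewards[s:e] = normalizer.normalize(pos, float(r)) / (e - s)
    return token_rewards


if __name__ == "__main__":
    torch.manual_seed(0)
    norm = LocationAwareNormalizer()
    # Warm up per-position statistics with a few simulated rollouts' segment rewards.
    for _ in range(16):
        for pos in range(3):
            norm.update(pos, float(torch.randn(())))
    bounds = [(0, 3), (3, 7), (7, 12)]          # segment spans over a 12-token response
    segment_rewards = torch.tensor([0.4, -1.2, 2.0])
    dense = densify(segment_rewards, bounds, norm, seq_len=12)
    print(dense)  # per-token rewards to plug into a PPO advantage estimate
```

The resulting per-token rewards can then replace the usual single end-of-sequence reward in a standard PPO objective.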