arxiv:2405.07863

RLHF Workflow: From Reward Modeling to Online RLHF

Published on May 13, 2024 Β· Submitted by AK on May 14, 2024
#2 Paper of the day

Abstract

AI-generated summary

Online iterative reinforcement learning from human feedback achieves state-of-the-art performance in large language models using open-source datasets and proxy preference models.

We present the workflow of Online Iterative Reinforcement Learning from Human Feedback (RLHF) in this technical report, which is widely reported to outperform its offline counterpart by a large margin in the recent large language model (LLM) literature. However, existing open-source RLHF projects are still largely confined to the offline learning setting. In this technical report, we aim to fill in this gap and provide a detailed recipe that is easy to reproduce for online iterative RLHF. In particular, since online human feedback is usually infeasible for open-source communities with limited resources, we start by constructing preference models using a diverse set of open-source datasets and use the constructed proxy preference model to approximate human feedback. Then, we discuss the theoretical insights and algorithmic principles behind online iterative RLHF, followed by a detailed practical implementation. Our trained LLM, SFR-Iterative-DPO-LLaMA-3-8B-R, achieves impressive performance on LLM chatbot benchmarks, including AlpacaEval-2, Arena-Hard, and MT-Bench, as well as other academic benchmarks such as HumanEval and TruthfulQA. We have shown that supervised fine-tuning (SFT) and iterative RLHF can obtain state-of-the-art performance with fully open-source datasets. Further, we have made our models, curated datasets, and comprehensive step-by-step code guidebooks publicly available. Please refer to https://github.com/RLHFlow/RLHF-Reward-Modeling and https://github.com/RLHFlow/Online-RLHF for more detailed information.
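
A sketch may help make the recipe concrete. The loop below is illustrative only (names like `sample_responses` and `proxy_reward` are hypothetical stand-ins, not the authors' code; the actual implementation is in the RLHFlow/Online-RLHF repository): each round samples several responses per prompt from the current policy, scores them with the proxy preference model, and keeps the best/worst pair as (chosen, rejected) data for the next DPO update.

```python
# Illustrative sketch of one round of online iterative DPO as described
# in the abstract. All names here are hypothetical stand-ins, not the
# authors' code; see https://github.com/RLHFlow/Online-RLHF for the
# actual implementation.
from typing import Callable, List, Tuple

def best_worst_pairs(
    prompts: List[str],
    sample_responses: Callable[[str, int], List[str]],  # current policy (assumed)
    proxy_reward: Callable[[str, str], float],          # proxy preference model (assumed)
    n_samples: int = 8,
) -> List[Tuple[str, str, str]]:
    """Keep each prompt's highest- and lowest-scoring responses as the
    (chosen, rejected) pair for the next DPO update."""
    pairs = []
    for prompt in prompts:
        responses = sample_responses(prompt, n_samples)
        ranked = sorted(responses, key=lambda r: proxy_reward(prompt, r))
        pairs.append((prompt, ranked[-1], ranked[0]))  # (prompt, chosen, rejected)
    return pairs

# Toy stand-ins so the sketch runs end to end:
demo_policy = lambda prompt, n: [f"{prompt} -> draft {i}" for i in range(n)]
demo_reward = lambda prompt, response: float(len(response))  # placeholder scorer

for prompt, chosen, rejected in best_worst_pairs(["Hello"], demo_policy, demo_reward):
    print("chosen:", chosen, "| rejected:", rejected)

# In the full recipe, each round trains the policy with DPO on these pairs,
# then regenerates responses from the updated policy and repeats.
```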

Community

The aligned LLM is officially released at:
https://huggingface.co/Salesforce/SFR-Iterative-DPO-LLaMA-3-8B-R

Why was the repository deleted?

Hi! You may use RLHFlow/LLaMA3-iterative-DPO-final instead.
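
For anyone swapping in that checkpoint, here is a minimal loading sketch using the standard transformers chat pattern (the prompt and generation settings are illustrative, not the authors' recommended configuration):

```python
# Minimal sketch: load RLHFlow/LLaMA3-iterative-DPO-final with Hugging Face
# transformers and run one chat turn. Settings are illustrative defaults.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "RLHFlow/LLaMA3-iterative-DPO-final"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Summarize online iterative RLHF in two sentences."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```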

Nice repoπŸ‘ https://huggingface.co/RLHFlow.

Here's a plain-English summary of the paper; feedback from the authors is welcome!

https://www.aimodels.fyi/papers/arxiv/rlhf-workflow-from-reward-modeling-to-online

Models citing this paper 44

Datasets citing this paper 1

Spaces citing this paper 13

Collections including this paper 22