RLHF Workflow: From Reward Modeling to Online RLHF

Published on May 13, 2024

Authors: Hanze Dong, Wei Xiong, Bo Pang, Haoxiang Wang, Han Zhao, Yingbo Zhou, Nan Jiang, Doyen Sahoo, Caiming Xiong, Tong Zhang

Abstract
Online iterative reinforcement learning from human feedback achieves state-of-the-art performance in large language models using open-source datasets and proxy preference models.
We present the workflow of Online Iterative Reinforcement Learning from Human Feedback (RLHF) in this technical report, which is widely reported to outperform its offline counterpart by a large margin in the recent large language model (LLM) literature. However, existing open-source RLHF projects are still largely confined to the offline learning setting. In this technical report, we aim to fill in this gap and provide a detailed recipe that is easy to reproduce for online iterative RLHF. In particular, since online human feedback is usually infeasible for open-source communities with limited resources, we start by constructing preference models using a diverse set of open-source datasets and use the constructed proxy preference model to approximate human feedback. Then, we discuss the theoretical insights and algorithmic principles behind online iterative RLHF, followed by a detailed practical implementation. Our trained LLM, SFR-Iterative-DPO-LLaMA-3-8B-R, achieves impressive performance on LLM chatbot benchmarks, including AlpacaEval-2, Arena-Hard, and MT-Bench, as well as other academic benchmarks such as HumanEval and TruthfulQA. We have shown that supervised fine-tuning (SFT) and iterative RLHF can obtain state-of-the-art performance with fully open-source datasets. Further, we have made our models, curated datasets, and comprehensive step-by-step code guidebooks publicly available. Please refer to https://github.com/RLHFlow/RLHF-Reward-Modeling and https://github.com/RLHFlow/Online-RLHF for more detailed information.
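The abstract outlines the core loop of online iterative RLHF: sample responses from the current policy, rank them with the proxy preference model that stands in for human feedback, and update the policy on the resulting (chosen, rejected) pairs with the DPO objective. The sketch below illustrates one such round. It is not the authors' released code; the helper callables generate, proxy_reward, policy_logprob, and ref_logprob are hypothetical stand-ins for a real policy, preference model, and likelihood computation, and the default beta value is only a placeholder.

```python
# Minimal sketch (assumptions, not the authors' implementation) of one round of
# online iterative DPO with a proxy preference model approximating human feedback.
import torch
import torch.nn.functional as F
from typing import Callable, List


def dpo_loss(policy_logratio: torch.Tensor,
             ref_logratio: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective:
    -log sigmoid(beta * [(log pi(y_w|x) - log pi(y_l|x)) - (log ref(y_w|x) - log ref(y_l|x))])."""
    return -F.logsigmoid(beta * (policy_logratio - ref_logratio)).mean()


def one_online_iteration(prompts: List[str],
                         generate: Callable[[str, int], List[str]],        # hypothetical: sample n responses
                         proxy_reward: Callable[[str, str], float],        # hypothetical: proxy preference score
                         policy_logprob: Callable[[str, str], torch.Tensor],
                         ref_logprob: Callable[[str, str], torch.Tensor],
                         n_samples: int = 8,
                         beta: float = 0.1) -> torch.Tensor:
    """Sample on-policy responses, rank them with the proxy preference model,
    keep the (best, worst) pair per prompt, and return the averaged DPO loss."""
    losses = []
    for x in prompts:
        candidates = generate(x, n_samples)                      # on-policy sampling
        ranked = sorted(candidates, key=lambda y: proxy_reward(x, y))
        y_l, y_w = ranked[0], ranked[-1]                         # rejected / chosen responses
        policy_logratio = policy_logprob(x, y_w) - policy_logprob(x, y_l)
        ref_logratio = ref_logprob(x, y_w) - ref_logprob(x, y_l)
        losses.append(dpo_loss(policy_logratio, ref_logratio, beta))
    return torch.stack(losses).mean()                            # backpropagate through policy_logprob
```

In an iterative run, this round would be repeated several times, with the policy refreshed between rounds so that new candidate responses are always drawn from the latest model; the reference model and proxy preference model would typically stay fixed.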
Community
The aligned LLM is officially released at:
https://huggingface.co/Salesforce/SFR-Iterative-DPO-LLaMA-3-8B-R
Why was the repository deleted?
Hi! You may use RLHFlow/LLaMA3-iterative-DPO-final instead.
Nice repo https://huggingface.co/RLHFlow.
Here's a plain-English summary of the paper - feedback from the authors is welcome!
https://www.aimodels.fyi/papers/arxiv/rlhf-workflow-from-reward-modeling-to-online