arxiv:2407.14622

BOND: Aligning LLMs with Best-of-N Distillation

Published on Jul 19, 2024 · Submitted by Pier Giuseppe Sessa on Jul 23, 2024
Authors:
Pier Giuseppe Sessa, Robert Dadashi, Léonard Hussenot, Johan Ferret, Nino Vieillard, Alexandre Ramé, Bobak Shariari, Sarah Perrin, Abe Friesen, Geoffrey Cideron, Sertan Girgin, Piotr Stanczyk, Andrea Michi, Danila Sinopalnikov, Sabela Ramos, Amélie Héliou, Aliaksei Severyn, Matt Hoffman, Nikola Momchev, Olivier Bachem

Abstract

AI-generated summary:
BOND, a novel RLHF algorithm, distills the Best-of-N sampling strategy without significant computational overhead, enhancing the quality of generative language models.

Reinforcement learning from human feedback (RLHF) is a key driver of quality and safety in state-of-the-art large language models. Yet, a surprisingly simple and strong inference-time strategy is Best-of-N sampling that selects the best generation among N candidates. In this paper, we propose Best-of-N Distillation (BOND), a novel RLHF algorithm that seeks to emulate Best-of-N but without its significant computational overhead at inference time. Specifically, BOND is a distribution matching algorithm that forces the distribution of generations from the policy to get closer to the Best-of-N distribution. We use the Jeffreys divergence (a linear combination of forward and backward KL) to balance between mode-covering and mode-seeking behavior, and derive an iterative formulation that utilizes a moving anchor for efficiency. We demonstrate the effectiveness of our approach and several design choices through experiments on abstractive summarization and Gemma models. Aligning Gemma policies with BOND outperforms other RLHF algorithms by improving results on several benchmarks.
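
To make the abstract's objective concrete, here is a short LaTeX sketch of the two quantities involved: the Best-of-N distribution being matched and a Jeffreys-divergence objective that mixes forward and backward KL. The reference policy π_ref, the reward CDF F, and the weight β are notational assumptions for illustration; the paper's exact formulation may differ.

```latex
% Best-of-N distribution: draw N candidates from the reference policy
% \pi_{\mathrm{ref}} and keep the highest-reward one. Assuming a continuous
% reward with no ties, standard order statistics give
\pi_{\mathrm{BoN}}(y \mid x)
  = N \, \pi_{\mathrm{ref}}(y \mid x) \, \bigl[ F(r(x, y) \mid x) \bigr]^{N-1},
\qquad
F(r \mid x) = \Pr_{y' \sim \pi_{\mathrm{ref}}(\cdot \mid x)} \bigl[ r(x, y') \le r \bigr].

% Jeffreys-divergence objective for the trained policy \pi_\theta: a weighted
% combination of the forward (mode-covering) and backward (mode-seeking) KL.
J_\beta(\pi_\theta)
  = (1 - \beta) \, \mathrm{KL}\bigl( \pi_{\mathrm{BoN}} \,\|\, \pi_\theta \bigr)
  + \beta \, \mathrm{KL}\bigl( \pi_\theta \,\|\, \pi_{\mathrm{BoN}} \bigr).
```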

Community

Paper author · Paper submitter

We present J-BOND 🕴️, a novel alignment method that steers the LLM towards the Best-of-N distribution via online distillation. This allows inheriting the strong properties of Best-of-N sampling, while requiring only a single sample at inference time.

To achieve this, J-BOND minimizes the Jeffreys divergence between the training policy and the Best-of-N distribution, trading off mode covering (forward KL) against mode seeking (backward KL) to get the best of both divergences. Moreover, it implements an iterative distillation approach that distills the Best-of-N version of an Exponential Moving Average (EMA) anchor policy. This keeps sample complexity low and optimization stable while the policy continuously improves its performance.
We demonstrate our design choices and overall approach on an abstractive summarization task and on the fine-tuning of Gemma. Aligning Gemma policies with J-BOND led to better performance than standard RLHF baselines, with improvements on several benchmarks.
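
As a rough illustration of the iterative scheme described in this post (not the authors' implementation), the sketch below distills Best-of-N samples drawn from an exponential-moving-average anchor into the policy, then updates the anchor as an EMA of the policy. The model interfaces (`policy.sample`, `policy.log_prob`, `reward_model`) and all hyperparameters are hypothetical placeholders, and only a forward-KL-style likelihood term is shown; the backward-KL part of the Jeffreys objective is omitted for brevity.

```python
import copy
import torch

def best_of_n(policy, reward_model, prompts, n=4):
    """Sample n candidates per prompt from `policy` and keep the highest-reward one."""
    candidates = [policy.sample(prompts) for _ in range(n)]                 # n lists of completions
    rewards = torch.stack([reward_model(prompts, c) for c in candidates])   # shape (n, batch)
    best = rewards.argmax(dim=0)                                            # best candidate index per prompt
    return [candidates[int(best[i])][i] for i in range(len(prompts))]

def train_bond_style(policy, reward_model, prompt_loader, steps=1000,
                     n=4, ema_decay=0.99, lr=1e-6):
    """Hypothetical J-BOND-style loop: distill Best-of-N of an EMA anchor into the policy."""
    anchor = copy.deepcopy(policy)                  # moving anchor, kept as an EMA of the policy
    for p in anchor.parameters():
        p.requires_grad_(False)
    opt = torch.optim.Adam(policy.parameters(), lr=lr)

    for _, prompts in zip(range(steps), prompt_loader):
        # Distillation targets come from the anchor's Best-of-N distribution,
        # not from the live policy.
        with torch.no_grad():
            targets = best_of_n(anchor, reward_model, prompts, n=n)

        # Forward-KL surrogate: maximize the policy's likelihood of the anchor's
        # Best-of-N samples (the backward-KL term of the Jeffreys objective is
        # omitted in this sketch).
        loss = -policy.log_prob(prompts, targets).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

        # EMA update of the anchor, so the distillation target keeps improving
        # as the policy improves while staying stable between steps.
        with torch.no_grad():
            for p_anchor, p_policy in zip(anchor.parameters(), policy.parameters()):
                p_anchor.mul_(ema_decay).add_(p_policy, alpha=1 - ema_decay)
```

The EMA anchor is what keeps both the sampling cost and the optimization stable: the target distribution changes slowly between steps, yet it still tracks the policy's improvements over time.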


Hi @piergs,

Congrats on this new work! It would be cool to have it implemented in TRL (similar to DPO and other human preference tuning algorithms): https://github.com/huggingface/trl.

Let me know if I need to connect you with the team!

Cheers,
Niels
Open-source @ HF

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

- Adversarial Moment-Matching Distillation of Large Language Models (2024): https://huggingface.co/papers/2406.02959
- WARP: On the Benefits of Weight Averaged Rewarded Policies (2024): https://huggingface.co/papers/2406.16768
- SAIL: Self-Improving Efficient Online Alignment of Large Language Models (2024): https://huggingface.co/papers/2406.15567
- BoNBoN Alignment for Large Language Models and the Sweetness of Best-of-n Sampling (2024): https://huggingface.co/papers/2406.00832
- Robust Preference Optimization through Reward Model Distillation (2024): https://huggingface.co/papers/2405.19316

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend


Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2407.14622 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2407.14622 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2407.14622 in a Space README.md to link it from this page.

Collections including this paper 3