
\n","updatedAt":"2024-06-25T06:37:51.168Z","author":{"_id":"63c94ede00104ea998de19a6","avatarUrl":"/avatars/273959d87f0c67747588cf0700d64039.svg","fullname":"Alexandre Rame","name":"alexrame","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":8,"isUserFollowing":false}},"numEdits":2,"identifiedLanguage":{"language":"en","probability":0.9302998185157776},"editors":["alexrame"],"editorAvatarUrls":["/avatars/273959d87f0c67747588cf0700d64039.svg"],"reactions":[{"reaction":"🔥","users":["ArthurDouillard","piergs","AdinaY","chcorbi","bachem"],"count":5},{"reaction":"🚀","users":["ArthurDouillard","piergs","AdinaY","bachem"],"count":4},{"reaction":"❤️","users":["ArthurDouillard","piergs","bachem"],"count":3},{"reaction":"👀","users":["ArthurDouillard","piergs","bachem"],"count":3}],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2406.16768","authors":[{"_id":"667a598613c37a0fe4a8b06e","user":{"_id":"63c94ede00104ea998de19a6","avatarUrl":"/avatars/273959d87f0c67747588cf0700d64039.svg","isPro":false,"fullname":"Alexandre Rame","user":"alexrame","type":"user"},"name":"Alexandre Ramé","status":"admin_assigned","statusLastChangedAt":"2024-06-25T08:17:55.519Z","hidden":false},{"_id":"667a598613c37a0fe4a8b06f","user":{"_id":"65afb7dbdd6bdfd73cd8e609","avatarUrl":"/avatars/b21069bc2d7ee4cc1508008e3c8ade64.svg","isPro":false,"fullname":"Johan Ferret","user":"ferretj","type":"user"},"name":"Johan Ferret","status":"admin_assigned","statusLastChangedAt":"2024-06-25T08:18:03.169Z","hidden":false},{"_id":"667a598613c37a0fe4a8b070","name":"Nino Vieillard","hidden":false},{"_id":"667a598613c37a0fe4a8b071","user":{"_id":"65fc7c0da57b88a765aea493","avatarUrl":"/avatars/440f88769da1113d6158fe7e0514ead3.svg","isPro":false,"fullname":"Robert Dadashi","user":"ddsh","type":"user"},"name":"Robert Dadashi","status":"admin_assigned","statusLastChangedAt":"2024-06-25T08:18:17.076Z","hidden":false},{"_id":"667a598613c37a0fe4a8b072","user":{"_id":"667a7b2e1764931987803b46","avatarUrl":"/avatars/da2ee162f33c681e3d1c4d1e6d44dcbd.svg","isPro":false,"fullname":"Léonard Hussenot","user":"leonardhussenot","type":"user"},"name":"Léonard Hussenot","status":"admin_assigned","statusLastChangedAt":"2024-06-25T08:18:24.900Z","hidden":false},{"_id":"667a598613c37a0fe4a8b073","user":{"_id":"64dbdd39e7bc8544f9ac4c1a","avatarUrl":"/avatars/7aface0db75500a2b0ede75179be35ce.svg","isPro":false,"fullname":"Pierre-Louis Cedoz","user":"plcedoz","type":"user"},"name":"Pierre-Louis Cedoz","status":"admin_assigned","statusLastChangedAt":"2024-06-25T08:18:30.765Z","hidden":false},{"_id":"667a598613c37a0fe4a8b074","user":{"_id":"667a7876becec8fc5112cf1f","avatarUrl":"/avatars/6874f0e5a7edca4145c57d7908f89840.svg","isPro":false,"fullname":"Pier Giuseppe Sessa","user":"piergs","type":"user"},"name":"Pier Giuseppe Sessa","status":"admin_assigned","statusLastChangedAt":"2024-06-25T08:18:37.543Z","hidden":false},{"_id":"667a598613c37a0fe4a8b075","name":"Sertan Girgin","hidden":false},{"_id":"667a598613c37a0fe4a8b076","user":{"_id":"622792366303bf1dc304f49f","avatarUrl":"/avatars/975c1cc3eb2f97cf8e848162056d5bea.svg","isPro":false,"fullname":"Arthur Douillard","user":"ArthurDouillard","type":"user"},"name":"Arthur Douillard","status":"claimed_verified","statusLastChangedAt":"2024-06-25T07:32:35.814Z","hidden":false},{"_id":"667a598613c37a0fe4a8b077","name":"Olivier 
Bachem","hidden":false}],"mediaUrls":["https://cdn-uploads.huggingface.co/production/uploads/63c94ede00104ea998de19a6/bSbC8D0TMhQT__RYpfM04.jpeg"],"publishedAt":"2024-06-24T16:24:34.000Z","submittedOnDailyAt":"2024-06-25T04:17:41.085Z","title":"WARP: On the Benefits of Weight Averaged Rewarded Policies","submittedOnDailyBy":{"_id":"63c94ede00104ea998de19a6","avatarUrl":"/avatars/273959d87f0c67747588cf0700d64039.svg","isPro":false,"fullname":"Alexandre Rame","user":"alexrame","type":"user"},"summary":"Reinforcement learning from human feedback (RLHF) aligns large language\nmodels (LLMs) by encouraging their generations to have high rewards, using a\nreward model trained on human preferences. To prevent the forgetting of\npre-trained knowledge, RLHF usually incorporates a KL regularization; this\nforces the policy to remain close to its supervised fine-tuned initialization,\nthough it hinders the reward optimization. To tackle the trade-off between KL\nand reward, in this paper we introduce a novel alignment strategy named Weight\nAveraged Rewarded Policies (WARP). WARP merges policies in the weight space at\nthree distinct stages. First, it uses the exponential moving average of the\npolicy as a dynamic anchor in the KL regularization. Second, it applies\nspherical interpolation to merge independently fine-tuned policies into a new\nenhanced one. Third, it linearly interpolates between this merged model and the\ninitialization, to recover features from pre-training. This procedure is then\napplied iteratively, with each iteration's final model used as an advanced\ninitialization for the next, progressively refining the KL-reward Pareto front,\nachieving superior rewards at fixed KL. Experiments with GEMMA policies\nvalidate that WARP improves their quality and alignment, outperforming other\nopen-source LLMs.","upvotes":23,"discussionId":"667a598713c37a0fe4a8b0e4","githubRepo":"https://github.com/zokost/warp_implementation","githubRepoAddedBy":"auto","ai_summary":"A new alignment strategy named Weight Averaged Rewarded Policies (WARP) enhances reinforcement learning from human feedback by merging policies at multiple stages to balance KL regularization and reward optimization, improving large language model quality and alignment.","ai_keywords":["reinforcement learning from human feedback","large language models","reward model","KL regularization","exponential moving average","spherical interpolation","linear interpolation","KL-reward Pareto front","LLMs","GEMMA policies"],"githubStars":1},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"620783f24e28382272337ba4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/620783f24e28382272337ba4/zkUveQPNiDfYjgGhuFErj.jpeg","isPro":false,"fullname":"GuoLiangTang","user":"Tommy930","type":"user"},{"_id":"6093a02dc4a92d63a91c5236","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6093a02dc4a92d63a91c5236/yUte6V0FU0BvVFAbON-9n.jpeg","isPro":true,"fullname":"Diwank Tomer","user":"diwank","type":"user"},{"_id":"63c94ede00104ea998de19a6","avatarUrl":"/avatars/273959d87f0c67747588cf0700d64039.svg","isPro":false,"fullname":"Alexandre Rame","user":"alexrame","type":"user"},{"_id":"622792366303bf1dc304f49f","avatarUrl":"/avatars/975c1cc3eb2f97cf8e848162056d5bea.svg","isPro":false,"fullname":"Arthur 
Douillard","user":"ArthurDouillard","type":"user"},{"_id":"667a6ddc400b4b8da62488f1","avatarUrl":"/avatars/65772bd92bbbe832eeb51298a231f57a.svg","isPro":false,"fullname":"Danila Sinopalnikov","user":"sinopalnikov","type":"user"},{"_id":"667a7876becec8fc5112cf1f","avatarUrl":"/avatars/6874f0e5a7edca4145c57d7908f89840.svg","isPro":false,"fullname":"Pier Giuseppe Sessa","user":"piergs","type":"user"},{"_id":"667a7b2e1764931987803b46","avatarUrl":"/avatars/da2ee162f33c681e3d1c4d1e6d44dcbd.svg","isPro":false,"fullname":"Léonard Hussenot","user":"leonardhussenot","type":"user"},{"_id":"65afb7dbdd6bdfd73cd8e609","avatarUrl":"/avatars/b21069bc2d7ee4cc1508008e3c8ade64.svg","isPro":false,"fullname":"Johan Ferret","user":"ferretj","type":"user"},{"_id":"667a81a161a163396e1d31ef","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/noauth/5cWDkc35LFs7ExcpyD0oL.png","isPro":false,"fullname":"Charles Corbière","user":"chcorbi","type":"user"},{"_id":"648eb1eb59c4e5c87dc116e0","avatarUrl":"/avatars/c636cea39c2c0937f01398c94ead5dad.svg","isPro":false,"fullname":"fdsqefsgergd","user":"T-representer","type":"user"},{"_id":"6177322d37f32ecb1e2d4cdf","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1635201569275-noauth.jpeg","isPro":false,"fullname":"Hugo Laurençon","user":"HugoLaurencon","type":"user"},{"_id":"667aac871bffb68706b4f62c","avatarUrl":"/avatars/2e2f2431f03a368f604c992d0d7ca57a.svg","isPro":false,"fullname":"Olivier Bachem","user":"bachem","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0}">
Papers
arxiv:2406.16768

WARP: On the Benefits of Weight Averaged Rewarded Policies

Published on Jun 24, 2024 · Submitted by Alexandre Rame on Jun 25, 2024
Authors: Alexandre Ramé, Johan Ferret, Nino Vieillard, Robert Dadashi, Léonard Hussenot, Pierre-Louis Cedoz, Pier Giuseppe Sessa, Sertan Girgin, Arthur Douillard, Olivier Bachem

Abstract

Reinforcement learning from human feedback (RLHF) aligns large language models (LLMs) by encouraging their generations to have high rewards, using a reward model trained on human preferences. To prevent the forgetting of pre-trained knowledge, RLHF usually incorporates a KL regularization; this forces the policy to remain close to its supervised fine-tuned initialization, though it hinders the reward optimization. To tackle the trade-off between KL and reward, in this paper we introduce a novel alignment strategy named Weight Averaged Rewarded Policies (WARP). WARP merges policies in the weight space at three distinct stages. First, it uses the exponential moving average of the policy as a dynamic anchor in the KL regularization. Second, it applies spherical interpolation to merge independently fine-tuned policies into a new enhanced one. Third, it linearly interpolates between this merged model and the initialization, to recover features from pre-training. This procedure is then applied iteratively, with each iteration's final model used as an advanced initialization for the next, progressively refining the KL-reward Pareto front, achieving superior rewards at fixed KL. Experiments with GEMMA policies validate that WARP improves their quality and alignment, outperforming other open-source LLMs.

AI-generated summary

A new alignment strategy named Weight Averaged Rewarded Policies (WARP) enhances reinforcement learning from human feedback by merging policies at multiple stages to balance KL regularization and reward optimization, improving large language model quality and alignment.
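To make the three merging operations described in the abstract concrete, here is a minimal PyTorch sketch of the weight-space primitives involved: the EMA anchor update, spherical interpolation (SLERP) of two policies, and linear interpolation towards the initialization. It operates on plain state dicts of floating-point parameters; the function names, the LERP fallback for nearly-parallel weights, and the option to interpolate task vectors (deltas from the init) are illustrative assumptions, not the paper's implementation.

```python
import torch

def lerp_weights(a, b, t):
    """Linear interpolation of two state dicts: (1 - t) * a + t * b."""
    return {k: (1.0 - t) * a[k] + t * b[k] for k in a}

def slerp_weights(a, b, t, init=None, eps=1e-8):
    """Spherical interpolation of two state dicts, layer by layer.

    If `init` is given, interpolation is applied to the task vectors
    (a - init, b - init), one plausible reading of the paper; otherwise
    it is applied to the raw weights.
    """
    out = {}
    for k in a:
        va, vb = a[k].flatten().float(), b[k].flatten().float()
        if init is not None:
            v0 = init[k].flatten().float()
            va, vb = va - v0, vb - v0
        cos = torch.dot(va, vb) / (va.norm() * vb.norm() + eps)
        omega = torch.arccos(cos.clamp(-1.0, 1.0))
        if omega.abs() < 1e-4:  # nearly parallel: SLERP degenerates to LERP
            merged = (1.0 - t) * va + t * vb
        else:
            merged = (torch.sin((1.0 - t) * omega) * va
                      + torch.sin(t * omega) * vb) / torch.sin(omega)
        if init is not None:
            merged = merged + v0
        out[k] = merged.reshape(a[k].shape).to(a[k].dtype)
    return out

@torch.no_grad()
def ema_update(anchor, policy, rate):
    """EMA anchor used as the KL target: anchor <- (1 - rate) * anchor + rate * policy.

    Assumes floating-point parameter tensors.
    """
    for k in anchor:
        anchor[k].mul_(1.0 - rate).add_(policy[k], alpha=rate)
```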

Community

Paper author · Paper submitter · edited Jun 25, 2024

Introducing Google DeepMind's WARP (Weight Averaged Rewarded Policies), a novel alignment procedure that uses model merging to optimize the reward while mitigating forgetting and reward hacking. WARP boosts RLHF and enables training a Gemma LLM that surpasses all previous releases.

Following WARM (Weight Averaged Reward Models, https://arxiv.org/abs/2401.12187), we now use three variants of weight averaging at three different stages of the policy optimization procedure. First, we use the exponential moving average of the policy as the anchor in the KL regularization; then we merge the independently fine-tuned policies via spherical interpolation; finally, we linearly interpolate the merged model towards the initialization. Applying this procedure iteratively consistently improves the KL-reward Pareto front of solutions.
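Putting the three stages together, one iteration of the procedure described above might look roughly like the following sketch. Here `rl_finetune_with_ema_anchor` stands in for the actual KL-regularized RL fine-tuning (not shown), `slerp_weights` and `lerp_weights` are the helpers sketched earlier on this page, and the hyperparameter names and defaults are assumptions rather than the paper's values.

```python
import copy

def warp_iteration(init_weights, rl_finetune_with_ema_anchor,
                   num_runs=2, slerp_t=0.5, eta=0.3):
    """One WARP iteration, as a rough sketch of the three stages above.

    `rl_finetune_with_ema_anchor(weights)` is assumed to run RL with a KL
    penalty towards an EMA of the trained policy and return the fine-tuned
    state dict; `slerp_weights` / `lerp_weights` are defined in the earlier sketch.
    """
    # Stage 1: independent RL fine-tunings, each anchored to its own EMA.
    runs = [rl_finetune_with_ema_anchor(copy.deepcopy(init_weights))
            for _ in range(num_runs)]

    # Stage 2: merge the fine-tuned policies by spherical interpolation
    # of their task vectors (deltas from the shared init).
    merged = runs[0]
    for other in runs[1:]:
        merged = slerp_weights(merged, other, t=slerp_t, init=init_weights)

    # Stage 3: interpolate linearly towards the init to recover pre-trained
    # features; eta controls how far we move from the init towards the
    # merged model.
    return lerp_weights(init_weights, merged, t=eta)


# The full procedure applies this iteratively, each output becoming the
# next iteration's initialization:
#   weights = sft_weights
#   for _ in range(num_iterations):
#       weights = warp_iteration(weights, rl_finetune_with_ema_anchor)
```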


Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2406.16768 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2406.16768 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2406.16768 in a Space README.md to link it from this page.

Collections including this paper 3