Paper page - Model Surgery: Modulating LLM's Behavior Via Simple Parameter Editing
Papers
arxiv:2407.08770

Model Surgery: Modulating LLM's Behavior Via Simple Parameter Editing

Published on Jul 11, 2024 · Submitted by Lucy Wang on Jul 15, 2024
Authors: Huanqian Wang, Yang Yue, Rui Lu, Jingxin Shi, Andrew Zhao, Shenzhi Wang, Shiji Song, Gao Huang

AI-generated summary

Directly editing a subset of LLM parameters can effectively modulate specific behaviors like detoxification and jailbreak resistance with minimal computational cost while maintaining general LLM capabilities.

Abstract

Large Language Models (LLMs) have demonstrated great potential as generalist assistants, showcasing powerful task understanding and problem-solving capabilities. To deploy LLMs as AI assistants, it is crucial that these models exhibit desirable behavioral traits, such as non-toxicity and resilience against jailbreak attempts. Current methods for detoxification or preventing jailbreaking usually involve Supervised Fine-Tuning (SFT) or Reinforcement Learning from Human Feedback (RLHF), which requires finetuning billions of parameters through gradient descent with substantial computation cost. Furthermore, models modified through SFT and RLHF may deviate from the pretrained models, potentially leading to a degradation in foundational LLM capabilities. In this paper, we observe that surprisingly, directly editing a small subset of parameters can effectively modulate specific behaviors of LLMs, such as detoxification and resistance to jailbreaking. Specifically, for a behavior that we aim to avoid, we employ a linear classifier, which we term the behavior probe, to classify binary behavior labels within the hidden state space of the LLM. Using this probe, we introduce an algorithm to identify a critical subset of LLM parameters that significantly influence this targeted behavior. Then we directly edit these selected parameters by shifting them towards the behavior probe. Such a direct parameter editing method necessitates only inference-level computational resources. Experiments demonstrate that in the representative detoxification task, our approach achieves reductions of up to 90.0% in toxicity on the RealToxicityPrompts dataset and 49.2% on ToxiGen, while maintaining the LLM's general capabilities in areas such as common sense, question answering, and mathematics. Our code is available at https://github.com/lucywang720/model-surgery.
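
The three steps in the abstract (fit a linear behavior probe on hidden states, pick a subset of parameters, shift them towards the probe) can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the synthetic hidden states, the cosine-similarity selection rule, the function names, and the values of `k` and `alpha` are all assumptions made for illustration. The repository linked above is the place to look for the actual procedure.

```python
import numpy as np


def train_probe(hidden, labels, lr=0.1, steps=500):
    """Fit a logistic-regression probe on hidden states with plain gradient descent."""
    n, d = hidden.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(hidden @ w + b)))
        grad = probs - labels              # gradient of the logistic loss w.r.t. logits
        w -= lr * hidden.T @ grad / n
        b -= lr * grad.mean()
    return w, b


def select_rows(weight, probe, k):
    """Assumed selection rule: the k rows most cosine-similar to the probe."""
    norms = np.linalg.norm(weight, axis=1) * np.linalg.norm(probe)
    cos = weight @ probe / np.maximum(norms, 1e-12)
    return np.argsort(-cos)[:k]


def edit_rows(weight, probe, rows, alpha):
    """Shift only the selected rows along the unit probe direction."""
    edited = weight.copy()
    edited[rows] += alpha * probe / np.linalg.norm(probe)
    return edited


# Synthetic stand-ins: real hidden states and weights would come from an LLM.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(200, 16))
labels = (hidden[:, 0] > 0).astype(float)   # toy binary "behavior" label

probe, bias = train_probe(hidden, labels)
accuracy = ((hidden @ probe + bias > 0) == labels.astype(bool)).mean()

W = rng.normal(size=(64, 16))               # hypothetical weight matrix
rows = select_rows(W, probe, k=4)
W_edited = edit_rows(W, probe, rows, alpha=0.5)
```

No gradient passes through the model in this sketch: the edit is a single in-place addition to a few rows, which is the sense in which the cost stays at inference level. The sign and size of `alpha` determine whether the probed behavior is pushed up or down and would need tuning on a real model.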

Community

Paper author Paper submitter

We propose a novel approach to modulate LLM behaviors through direct parameter editing, offering an alternative to traditional alignment methods. It achieves up to 90% detoxification at inference-level computational cost!


Hi @Lucywang720 congrats on this work!

Wondering if it's possible to make the probes (or the modified checkpoints) available on the hub?

Let me know if you need any help!

Cheers,

Niels

Hi @nielsr,

Thank you! We have provided probe examples on GitHub. We are also conducting some follow-up research, so we may upload modified checkpoints on the hub in the future.

Lucy

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

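The following papers were recommended by the Semantic Scholar API:

* Decoupled Alignment for Robust Plug-and-Play Adaptation (2024): https://huggingface.co/papers/2406.01514
* LoFiT: Localized Fine-tuning on LLM Representations (2024): https://huggingface.co/papers/2406.01563
* Regularizing Hidden States Enables Learning Generalizable Reward Model for LLMs (2024): https://huggingface.co/papers/2406.10216
* Aligning Large Language Models with Representation Editing: A Control Perspective (2024): https://huggingface.co/papers/2406.05954
* Interpretable Catastrophic Forgetting of Large Language Model Fine-tuning via Instruction Vector (2024): https://huggingface.co/papers/2406.12227

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers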

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend


Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2407.08770 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2407.08770 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2407.08770 in a Space README.md to link it from this page.

Collections including this paper 2