
Papers
arxiv:2401.15024

SliceGPT: Compress Large Language Models by Deleting Rows and Columns

Published on Jan 26, 2024 · Submitted by AK on Jan 29, 2024
#1 Paper of the day

Abstract

SliceGPT, a novel post-training sparsification method, reduces model parameters by up to 25% without a significant drop in performance, improving inference speed and GPU usage for large language models.

AI-generated summary

Large language models have become the cornerstone of natural language processing, but their use comes with substantial costs in terms of compute and memory resources. Sparsification provides a solution to alleviate these resource constraints, and recent works have shown that trained models can be sparsified post-hoc. Existing sparsification techniques face challenges as they need additional data structures and offer constrained speedup with current hardware. In this paper we present SliceGPT, a new post-training sparsification scheme which replaces each weight matrix with a smaller (dense) matrix, reducing the embedding dimension of the network. Through extensive experimentation, we show that SliceGPT can remove up to 25% of the model parameters (including embeddings) for LLAMA2-70B, OPT 66B and Phi-2 models while maintaining 99%, 99% and 90% zero-shot task performance of the dense model respectively. Our sliced models run on fewer GPUs and run faster without any additional code optimization: on 24GB consumer GPUs we reduce the total compute for inference on LLAMA2-70B to 64% of that of the dense model; on 40GB A100 GPUs we reduce it to 66%. We offer a new insight, computational invariance in transformer networks, which enables SliceGPT and we hope it will inspire and enable future avenues to reduce memory and computation demands for pre-trained models. Code is available at: https://github.com/microsoft/TransformerCompression
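For intuition, here is a minimal NumPy sketch of the slicing idea described in the abstract; the shapes, keep-ratio, and random data are illustrative stand-ins, not the authors' implementation (see the linked repo for that):

```python
import numpy as np

# Illustrative sizes: d = hidden size, k = sliced size (here ~75% of d).
d, k = 512, 384
rng = np.random.default_rng(0)

# Stand-in calibration activations X (n_tokens x d) at a block boundary.
X = rng.standard_normal((2048, d))

# PCA of the activations: eigenvectors of X^T X, sorted by eigenvalue.
eigvals, Q = np.linalg.eigh(X.T @ X)
Q = Q[:, np.argsort(eigvals)[::-1]]  # columns = principal directions

# An orthogonal rotation of the residual stream leaves the network's
# function unchanged (computational invariance); slicing then keeps
# only the top-k directions.
Q_k = Q[:, :k]                       # d x k rotate-and-delete matrix

# A weight writing into the residual stream loses columns...
W_out = rng.standard_normal((d, d))
W_out_sliced = W_out @ Q_k           # d x k

# ...and a weight reading from it loses rows.
W_in = rng.standard_normal((d, d))
W_in_sliced = Q_k.T @ W_in           # k x d

# Signals passed between blocks shrink from d to k dimensions.
x_sliced = X[0] @ Q_k                # shape (k,)
```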

Community

Not April Fools, rad!

Code is 404

It's there now!

On the calibration set: it looks like this runs a set of samples from WikiText-2 and Alpaca, truncated to the maximum sequence length of the LLM.

I'm guessing the intuition here is to have a representative set of samples that were either seen during pretraining or are likely to be encountered in day-to-day use.
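(A rough sketch of how such a calibration set could be assembled; the dataset, model name, sample count, and sequence length below are assumptions for illustration, not the paper's exact recipe:)

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Assumed model and context length, purely for illustration.
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
max_len = 2048

# WikiText-2 as the calibration corpus; ~2k non-empty samples.
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
texts = [t for t in dataset["text"] if t.strip()][:2000]

# Truncate each sample to the model's maximum sequence length.
calib = [
    tokenizer(t, truncation=True, max_length=max_len, return_tensors="pt")
    for t in texts
]
# Each encoded sample is then run through the model once so that the
# activations needed for the PCA can be recorded.
```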

Have you explored the idea of using this for feature-activation steering? E.g. prime the model (maybe even using contrastive pairs to get positives and negatives) with calibration prompts that steer behavior to, for example, avoid sycophancy (vs. emulate typical interactions).

Admittedly, it's a heavyweight approach for what it achieves, but it'd be interesting to see whether the PCA picks up anything interesting (e.g. do the components decompose into basic feature directions? Can they be interpreted? Can they be manipulated to amplify or suppress certain behaviors? Do the components become sparser as steering becomes more directed?)

Paper author

Hi @leegao19,

Your intuition is correct. I think about it as trying to perturb all the directions in the model that are going to be important. I've been pleasantly surprised by how few samples are needed to get a representative set. Just 2k samples from Alpaca seems to be enough to preserve llm-eval metrics.

We have not experimented with steering! Feel free to check out the code and pull out the principal components to see what they each do. I'd be interested to hear what happens.
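(For anyone who wants to try that, a hypothetical sketch of pulling out the principal components from recorded activations and checking which ones separate a contrastive prompt pair; all names and tensors below are stand-ins, and the repo's actual hooks will differ:)

```python
import torch

# Stand-in for activations recorded during a calibration forward pass:
# shape (n_tokens, hidden_size).
acts = torch.randn(2048, 4096)

# Eigendecomposition of the activation covariance gives the PCA basis.
cov = acts.T @ acts / acts.shape[0]
eigvals, Q = torch.linalg.eigh(cov)                # ascending order
Q = Q[:, torch.argsort(eigvals, descending=True)]  # columns = components

# Project a contrastive pair onto the basis and see which components
# most separate the two behaviors.
pos, neg = acts[0], acts[1]                        # stand-in steering pair
loadings = (pos - neg) @ Q
top = torch.topk(loadings.abs(), k=10).indices
print("components most separating the pair:", top.tolist())
```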

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* SPT: Fine-Tuning Transformer-based Language Models Efficiently with Sparsification (2023): https://huggingface.co/papers/2312.10365
* Rethinking Compression: Reduced Order Modelling of Latent Features in Large Language Models (2023): https://huggingface.co/papers/2312.07046
* The LLM Surgeon (2023): https://huggingface.co/papers/2312.17244
* PRILoRA: Pruned and Rank-Increasing Low-Rank Adaptation (2024): https://huggingface.co/papers/2401.11316
* DSFormer: Effective Compression of Text-Transformers by Dense-Sparse Weight Factorization (2023): https://huggingface.co/papers/2312.13211

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend


Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2401.15024 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2401.15024 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2401.15024 in a Space README.md to link it from this page.

Collections including this paper 32