arxiv:2402.04291

BiLLM: Pushing the Limit of Post-Training Quantization for LLMs

Published on Feb 6, 2024 · Submitted by AK on Feb 7, 2024
#2 Paper of the day
Authors: Wei Huang, Yangdong Liu, Haotong Qin, Ying Li, Shiming Zhang, Xianglong Liu, Michele Magno, Xiaojuan Qi
Abstract

Pretrained large language models (LLMs) exhibit exceptional general language processing capabilities but come with significant demands on memory and computational resources. As a powerful compression technology, binarization can reduce model weights to a mere 1 bit, lowering the expensive computation and memory requirements. However, existing quantization techniques fall short of maintaining LLM performance under ultra-low bit-widths. In response to this challenge, we present BiLLM, a groundbreaking 1-bit post-training quantization scheme tailored for pretrained LLMs. Based on the weight distribution of LLMs, BiLLM first identifies and structurally selects salient weights, and minimizes the compression loss through an effective binary residual approximation strategy. Moreover, considering the bell-shaped distribution of the non-salient weights, we propose an optimal splitting search to group and binarize them accurately. BiLLM achieves, for the first time, high-accuracy inference (e.g., 8.41 perplexity on LLaMA2-70B) with only 1.08-bit weights across various LLM families and evaluation metrics, outperforming SOTA quantization methods for LLMs by significant margins. Moreover, BiLLM can binarize an LLM with 7 billion weights within 0.5 hours on a single GPU, demonstrating satisfactory time efficiency.

AI-generated summary

BiLLM, a 1-bit post-training quantization scheme for pretrained LLMs, achieves high-accuracy inference with reduced computational and memory requirements.
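To make the abstract's two core ideas concrete, here is a minimal, illustrative PyTorch sketch of (a) the binary residual approximation applied to salient weights and (b) a break-point search that splits the bell-shaped non-salient weights into two groups before binarizing each. The function names, the grid search, and the random test matrix are assumptions for illustration only; saliency identification and BiLLM's actual optimal splitting search are not reproduced here.

```python
import torch

def binarize(w: torch.Tensor):
    # 1-bit approximation w ≈ alpha * sign(w): for B = sign(w),
    # alpha = mean(|w|) minimizes ||w - alpha * B||^2.
    return w.abs().mean(), torch.sign(w)

def residual_binarize(w: torch.Tensor) -> torch.Tensor:
    # Salient weights: binarize w, then binarize the residual,
    # giving w ≈ a1*B1 + a2*B2 (the binary residual approximation).
    a1, b1 = binarize(w)
    a2, b2 = binarize(w - a1 * b1)
    return a1 * b1 + a2 * b2

def split_binarize(w: torch.Tensor, p: float) -> torch.Tensor:
    # Non-salient weights: split at |w| = p into a concentrated group
    # and a sparse-tail group; binarize each with its own scale.
    out = torch.zeros_like(w)
    for mask in (w.abs() <= p, w.abs() > p):
        if mask.any():
            out[mask] = w[mask].abs().mean() * torch.sign(w[mask])
    return out

def search_break_point(w: torch.Tensor, n_grid: int = 64) -> float:
    # Assumed stand-in for the paper's splitting search: grid-search
    # the break point p that minimizes the binarization error.
    grid = torch.linspace(0.0, w.abs().max().item(), n_grid)
    errs = torch.stack([(w - split_binarize(w, p.item())).pow(2).sum()
                        for p in grid])
    return grid[errs.argmin()].item()

# Toy demo on a random matrix standing in for one weight block.
w = torch.randn(256, 256)
p = search_break_point(w)
err_split = ((w - split_binarize(w, p)).norm() / w.norm()).item()
err_resid = ((w - residual_binarize(w)).norm() / w.norm()).item()
print(f"split at {p:.3f}: rel. error {err_split:.3f}; "
      f"residual binarization rel. error {err_resid:.3f}")
```

The mean-of-absolute-values scale appears throughout because, for a fixed sign matrix B = sign(W), the scale minimizing ||W - aB||^2 is a = mean(|W|); giving each group its own scale is what lets the split track the bell-shaped weight distribution.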

Community

Can't wait to read this; those are some impressive numbers.

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

BiLLM: Supercharge LLMs with 1-Bit Quantization! 🚀

Links 🔗:

👉 Subscribe: https://www.youtube.com/@Arxflix
👉 Twitter: https://x.com/arxflix
👉 LMNT (Partner): https://lmnt.com/

By Arxflix


Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2402.04291 in a model README.md to link it from this page.
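For illustration, a hypothetical README.md sentence like the one below would create the link; the wording around the URL is made up, only the arXiv reference matters:

```markdown
Weights were binarized with BiLLM ([arXiv:2402.04291](https://arxiv.org/abs/2402.04291)).
```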

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2402.04291 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2402.04291 in a Space README.md to link it from this page.

Collections including this paper 21