BiLLM: Pushing the Limit of Post-Training Quantization for LLMs
\n","updatedAt":"2024-06-09T07:17:36.110Z","author":{"_id":"6186ddf6a7717cb375090c01","avatarUrl":"/avatars/716b6a7d1094c8036b2a8a7b9063e8aa.svg","fullname":"Julien BLANCHON","name":"blanchon","type":"user","isPro":true,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":176,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.5152819752693176},"editors":["blanchon"],"editorAvatarUrls":["/avatars/716b6a7d1094c8036b2a8a7b9063e8aa.svg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2402.04291","authors":[{"_id":"65c43a4ff19b126a3cd92d10","user":{"_id":"656db3f53dc1d277e5a64410","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/656db3f53dc1d277e5a64410/9kiY2K3MCRcBDk7MrkTBK.png","isPro":false,"fullname":"Wei Huang","user":"AaronHuangWei","type":"user"},"name":"Wei Huang","status":"claimed_verified","statusLastChangedAt":"2024-04-22T06:55:21.980Z","hidden":false},{"_id":"65c43a4ff19b126a3cd92d11","name":"Yangdong Liu","hidden":false},{"_id":"65c43a4ff19b126a3cd92d12","user":{"_id":"65c49589c0b1921e19260a8d","avatarUrl":"/avatars/7ce9af8c627f2a0c3db6bde82290ee1f.svg","isPro":false,"fullname":"Haotong Qin","user":"HaotongQin","type":"user"},"name":"Haotong Qin","status":"extracted_confirmed","statusLastChangedAt":"2025-01-03T00:46:32.543Z","hidden":false},{"_id":"65c43a4ff19b126a3cd92d13","name":"Ying Li","hidden":false},{"_id":"65c43a4ff19b126a3cd92d14","name":"Shiming Zhang","hidden":false},{"_id":"65c43a4ff19b126a3cd92d15","name":"Xianglong Liu","hidden":false},{"_id":"65c43a4ff19b126a3cd92d16","name":"Michele Magno","hidden":false},{"_id":"65c43a4ff19b126a3cd92d17","user":{"_id":"6875266f9cd3191dfddc7071","avatarUrl":"/avatars/64c581910833b111e9a7bae5b8740229.svg","isPro":false,"fullname":"xiaojuan qi","user":"xjqi","type":"user"},"name":"Xiaojuan Qi","status":"claimed_verified","statusLastChangedAt":"2025-07-15T19:15:42.987Z","hidden":false}],"publishedAt":"2024-02-06T09:26:34.000Z","submittedOnDailyAt":"2024-02-07T23:50:01.017Z","title":"BiLLM: Pushing the Limit of Post-Training Quantization for LLMs","submittedOnDailyBy":{"_id":"60f1abe7544c2adfd699860c","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674929746905-60f1abe7544c2adfd699860c.jpeg","isPro":false,"fullname":"AK","user":"akhaliq","type":"user"},"summary":"Pretrained large language models (LLMs) exhibit exceptional general language\nprocessing capabilities but come with significant demands on memory and\ncomputational resources. As a powerful compression technology, binarization can\nextremely reduce model weights to a mere 1 bit, lowering the expensive\ncomputation and memory requirements. However, existing quantization techniques\nfall short of maintaining LLM performance under ultra-low bit-widths. In\nresponse to this challenge, we present BiLLM, a groundbreaking 1-bit\npost-training quantization scheme tailored for pretrained LLMs. Based on the\nweight distribution of LLMs, BiLLM first identifies and structurally selects\nsalient weights, and minimizes the compression loss through an effective binary\nresidual approximation strategy. Moreover, considering the bell-shaped\ndistribution of the non-salient weights, we propose an optimal splitting search\nto group and binarize them accurately. BiLLM achieving for the first time\nhigh-accuracy inference (e.g. 
8.41 perplexity on LLaMA2-70B) with only 1.08-bit\nweights across various LLMs families and evaluation metrics, outperforms SOTA\nquantization methods of LLM by significant margins. Moreover, BiLLM enables the\nbinarization process of the LLM with 7 billion weights within 0.5 hours on a\nsingle GPU, demonstrating satisfactory time efficiency.","upvotes":50,"discussionId":"65c43a51f19b126a3cd92d63","githubRepo":"https://github.com/aaronhuang-778/billm","githubRepoAddedBy":"auto","ai_summary":"BiLLM, a 1-bit post-training quantization scheme for pretrained LLMs, achieves high-accuracy inference with reduced computational and memory requirements.","ai_keywords":["binarization","post-training quantization","salient weights","binary residual approximation","optimal splitting search","bell-shaped distribution","perplexity","LLaMA2-70B","SOTA"],"githubStars":228},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"6538119803519fddb4a17e10","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6538119803519fddb4a17e10/ffJMkdx-rM7VvLTCM6ri_.jpeg","isPro":false,"fullname":"samusenps","user":"samusenps","type":"user"},{"_id":"64395f66b9ac1d55f41e5cc4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/64395f66b9ac1d55f41e5cc4/qhzWbKjN0zyRIlwpg8JRe.png","isPro":false,"fullname":"gunasekar","user":"GunA-SD","type":"user"},{"_id":"644e1b1d9b4e87c31bab0a14","avatarUrl":"/avatars/88bb4c4a67dc8958069e9014f5e73a0b.svg","isPro":false,"fullname":"Michael Barry","user":"MichaelBarryUK","type":"user"},{"_id":"620783f24e28382272337ba4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/620783f24e28382272337ba4/zkUveQPNiDfYjgGhuFErj.jpeg","isPro":false,"fullname":"GuoLiangTang","user":"Tommy930","type":"user"},{"_id":"65c49589c0b1921e19260a8d","avatarUrl":"/avatars/7ce9af8c627f2a0c3db6bde82290ee1f.svg","isPro":false,"fullname":"Haotong Qin","user":"HaotongQin","type":"user"},{"_id":"656db3f53dc1d277e5a64410","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/656db3f53dc1d277e5a64410/9kiY2K3MCRcBDk7MrkTBK.png","isPro":false,"fullname":"Wei Huang","user":"AaronHuangWei","type":"user"},{"_id":"65c49c515cc8843f4fef1b45","avatarUrl":"/avatars/60293bc1dd9a4c77d3880d283bf81f5f.svg","isPro":false,"fullname":"Tan zhangyao","user":"TanZhangyao","type":"user"},{"_id":"648855a86469f6c5639fb48b","avatarUrl":"/avatars/946c15ade2557d9d931d62ee741bddd4.svg","isPro":false,"fullname":"earsaxcs","user":"earsax","type":"user"},{"_id":"65c4a120fd704b3af2ac609b","avatarUrl":"/avatars/a518e08856ef3e4ce9b507b6907a154e.svg","isPro":false,"fullname":"Chris Wen","user":"chrisleff","type":"user"},{"_id":"65c49fb93b957da6c19618d8","avatarUrl":"/avatars/998daed14b12c2455129a07ac29219f2.svg","isPro":false,"fullname":"hulk","user":"hulk7610","type":"user"},{"_id":"6554b8dcfe564c494fa28a43","avatarUrl":"/avatars/f7e9bfe1544d0a5b01cb43f4f5c3cd4f.svg","isPro":false,"fullname":"lirunyang","user":"lryoung","type":"user"},{"_id":"65c4ad4a885203c0053c3483","avatarUrl":"/avatars/635f7c1b4b4ad04d422384a2b9633807.svg","isPro":false,"fullname":"Ivo Pang","user":"ivopang","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":2}">
AI-generated summary
BiLLM, a 1-bit post-training quantization scheme for pretrained LLMs, achieves high-accuracy inference with reduced computational and memory requirements.
Pretrained large language models (LLMs) exhibit exceptional general language
processing capabilities but come with significant demands on memory and
computational resources. As a powerful compression technology, binarization can
reduce model weights to a mere 1 bit, dramatically lowering computation and
memory requirements. However, existing quantization techniques
fall short of maintaining LLM performance under ultra-low bit-widths. In
response to this challenge, we present BiLLM, a groundbreaking 1-bit
post-training quantization scheme tailored for pretrained LLMs. Based on the
weight distribution of LLMs, BiLLM first identifies and structurally selects
salient weights, and minimizes the compression loss through an effective binary
residual approximation strategy. Moreover, considering the bell-shaped
distribution of the non-salient weights, we propose an optimal splitting search
to group and binarize them accurately. For the first time, BiLLM achieves
high-accuracy inference (e.g., 8.41 perplexity on LLaMA2-70B) with only 1.08-bit
weights across various LLM families and evaluation metrics, outperforming SOTA
LLM quantization methods by significant margins. Moreover, BiLLM can binarize
an LLM with 7 billion weights within 0.5 hours on a single GPU, demonstrating
satisfactory time efficiency.
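
The binary residual approximation for salient weights mentioned in the abstract can be illustrated with a short sketch. This is a minimal PyTorch-style example under assumed details (a per-row scale alpha = mean(|w|); the names binarize and residual_binarize are invented here), not the authors' implementation:

import torch

def binarize(w: torch.Tensor):
    # One-bit approximation of w: sign(w) scaled per row by the mean
    # absolute value; alpha = mean(|w|) minimizes ||w - alpha * sign(w)||^2.
    alpha = w.abs().mean(dim=-1, keepdim=True)
    return alpha * torch.sign(w)

def residual_binarize(w: torch.Tensor):
    # Two-stage binarization for salient weights: binarize w, then binarize
    # the leftover residual, so the block is represented by two binary maps.
    b1 = binarize(w)
    b2 = binarize(w - b1)
    return b1 + b2

# Toy usage on a random "salient" weight block (hypothetical shape).
w = torch.randn(4, 8)
w_hat = residual_binarize(w)
print((w - w_hat).norm() / w.norm())  # relative reconstruction error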
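
The optimal splitting search for the bell-shaped non-salient weights can likewise be sketched as a scan over candidate break points: split the magnitudes into a concentrated group and a sparse group, binarize each group with its own scale, and keep the split with the lowest reconstruction error. The candidate grid and error metric below are illustrative assumptions, not the paper's exact procedure:

import torch

def binarize_group(w, mask):
    # Binarize only the entries selected by mask, using a group-specific scale.
    out = torch.zeros_like(w)
    if mask.any():
        alpha = w[mask].abs().mean()
        out[mask] = alpha * torch.sign(w[mask])
    return out

def split_search(w, num_candidates=32):
    # Scan thresholds p; weights with |w| <= p form the concentrated group,
    # the rest form the sparse group. Return the split minimizing the error.
    best_err, best_p = float("inf"), 0.0
    for p in torch.linspace(0.0, w.abs().max().item(), num_candidates):
        concentrated = w.abs() <= p
        w_hat = binarize_group(w, concentrated) + binarize_group(w, ~concentrated)
        err = (w - w_hat).norm().item()
        if err < best_err:
            best_err, best_p = err, p.item()
    return best_p, best_err

p, err = split_search(torch.randn(4096))  # toy bell-shaped weight vector
print(f"best split {p:.3f}, reconstruction error {err:.3f}")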
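
As a rough sense of the memory implication of the reported bit-width: at an average of 1.08 bits per weight, a 7-billion-parameter model needs about 7e9 × 1.08 / 8 ≈ 0.95 GB of weight storage, versus roughly 14 GB in FP16. The exact footprint depends on how the binary maps, scales, and salient-weight metadata are stored, which the abstract does not specify.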