\n","updatedAt":"2024-07-03T09:36:08.901Z","author":{"_id":"6278bd42541f3d2dfa77ea70","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6278bd42541f3d2dfa77ea70/ejn49eapnB3UXQckAYdTd.jpeg","fullname":"Huiqiang Jiang","name":"iofu728","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":12,"isUserFollowing":false}},"numEdits":1,"identifiedLanguage":{"language":"en","probability":0.7828110456466675},"editors":["iofu728"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/6278bd42541f3d2dfa77ea70/ejn49eapnB3UXQckAYdTd.jpeg"],"reactions":[],"isReport":false}},{"id":"66856117b3ee06e28083c30a","author":{"_id":"62d1ddfac58f969c1528f1b5","avatarUrl":"/avatars/75c372a831cde3c7c6dce3bc875488a7.svg","fullname":"Kalle Hilsenbek","name":"Bachstelze","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":8,"isUserFollowing":false},"createdAt":"2024-07-03T14:32:55.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"What is the method used to obtain the graphical attention patterns?","html":"What is the method used to obtain the graphical attention patterns?
\n","updatedAt":"2024-07-03T14:32:55.037Z","author":{"_id":"62d1ddfac58f969c1528f1b5","avatarUrl":"/avatars/75c372a831cde3c7c6dce3bc875488a7.svg","fullname":"Kalle Hilsenbek","name":"Bachstelze","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":8,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.9171003103256226},"editors":["Bachstelze"],"editorAvatarUrls":["/avatars/75c372a831cde3c7c6dce3bc875488a7.svg"],"reactions":[],"isReport":false},"replies":[{"id":"66856a56236794514b0bca02","author":{"_id":"6278bd42541f3d2dfa77ea70","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6278bd42541f3d2dfa77ea70/ejn49eapnB3UXQckAYdTd.jpeg","fullname":"Huiqiang Jiang","name":"iofu728","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":12,"isUserFollowing":false},"createdAt":"2024-07-03T15:12:22.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"Hi @Bachstelze, first, we identified three sparse patterns in attention heads through observation. We determined the optimal sparse pattern for each head using offline search, as described in Section 3.2.1. Subsequently, we utilized online approximate dynamic sparse indexing and sparse calculations to accelerate LLMs inference.","html":"Hi \n\n@Bachstelze\n\t, first, we identified three sparse patterns in attention heads through observation. We determined the optimal sparse pattern for each head using offline search, as described in Section 3.2.1. Subsequently, we utilized online approximate dynamic sparse indexing and sparse calculations to accelerate LLMs inference.
\n","updatedAt":"2024-07-03T15:12:22.433Z","author":{"_id":"6278bd42541f3d2dfa77ea70","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6278bd42541f3d2dfa77ea70/ejn49eapnB3UXQckAYdTd.jpeg","fullname":"Huiqiang Jiang","name":"iofu728","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":12,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.8374412655830383},"editors":["iofu728"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/6278bd42541f3d2dfa77ea70/ejn49eapnB3UXQckAYdTd.jpeg"],"reactions":[],"isReport":false,"parentCommentId":"66856117b3ee06e28083c30a"}}]}],"primaryEmailConfirmed":false,"paper":{"id":"2407.02490","authors":[{"_id":"6684aff23780e7f96dc29274","user":{"_id":"6278bd42541f3d2dfa77ea70","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6278bd42541f3d2dfa77ea70/ejn49eapnB3UXQckAYdTd.jpeg","isPro":false,"fullname":"Huiqiang Jiang","user":"iofu728","type":"user"},"name":"Huiqiang Jiang","status":"claimed_verified","statusLastChangedAt":"2024-07-03T07:37:55.043Z","hidden":false},{"_id":"6684aff23780e7f96dc29275","user":{"_id":"63d00710645dd8d34ea9bcc6","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/63d00710645dd8d34ea9bcc6/E6YsIsXH57OACL-NZ52fB.jpeg","isPro":false,"fullname":"Yucheng","user":"liyucheng","type":"user"},"name":"Yucheng Li","status":"claimed_verified","statusLastChangedAt":"2024-07-03T07:41:33.975Z","hidden":false},{"_id":"6684aff23780e7f96dc29276","user":{"_id":"64646896884f2e3e1ced3cd5","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/64646896884f2e3e1ced3cd5/86-t8V8LGMNaPQRXnADiD.png","isPro":false,"fullname":"Zhang","user":"Chengruidong","type":"user"},"name":"Chengruidong Zhang","status":"admin_assigned","statusLastChangedAt":"2024-07-03T08:22:52.134Z","hidden":false},{"_id":"6684aff23780e7f96dc29277","user":{"_id":"63ef330b1e695b35aa484e11","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/63ef330b1e695b35aa484e11/bXwpGy0dl8JXeJwJ--ilr.jpeg","isPro":false,"fullname":"Qianhui WU","user":"qianhuiwu","type":"user"},"name":"Qianhui Wu","status":"admin_assigned","statusLastChangedAt":"2024-07-03T08:22:59.496Z","hidden":false},{"_id":"6684aff23780e7f96dc29278","user":{"_id":"64b750a2fdb702b3d8619514","avatarUrl":"/avatars/f09181c0825763dff692c4bc65effc4c.svg","isPro":false,"fullname":"Xufang Luo","user":"luoxufang","type":"user"},"name":"Xufang Luo","status":"admin_assigned","statusLastChangedAt":"2024-07-03T08:23:06.765Z","hidden":false},{"_id":"6684aff23780e7f96dc29279","name":"Surin Ahn","hidden":false},{"_id":"6684aff23780e7f96dc2927a","user":{"_id":"64cc9b96e60d2cddfadca2c8","avatarUrl":"/avatars/a07755847ec8d05052221d351a3ae20f.svg","isPro":false,"fullname":"Zhenhua Han","user":"hzhua","type":"user"},"name":"Zhenhua Han","status":"admin_assigned","statusLastChangedAt":"2024-07-03T08:23:18.553Z","hidden":false},{"_id":"6684aff23780e7f96dc2927b","name":"Amir H. 
Abdi","hidden":false},{"_id":"6684aff23780e7f96dc2927c","name":"Dongsheng Li","hidden":false},{"_id":"6684aff23780e7f96dc2927d","name":"Chin-Yew Lin","hidden":false},{"_id":"6684aff23780e7f96dc2927e","name":"Yuqing Yang","hidden":false},{"_id":"6684aff23780e7f96dc2927f","name":"Lili Qiu","hidden":false}],"mediaUrls":["https://cdn-uploads.huggingface.co/production/uploads/6278bd42541f3d2dfa77ea70/Pmx510-_703vitXUtP-K4.png","https://cdn-uploads.huggingface.co/production/uploads/6278bd42541f3d2dfa77ea70/hHEfqM5awgWk8r3nkjxwJ.mp4"],"publishedAt":"2024-07-02T17:59:56.000Z","submittedOnDailyAt":"2024-07-03T03:53:47.851Z","title":"MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via\n Dynamic Sparse Attention","submittedOnDailyBy":{"_id":"6278bd42541f3d2dfa77ea70","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6278bd42541f3d2dfa77ea70/ejn49eapnB3UXQckAYdTd.jpeg","isPro":false,"fullname":"Huiqiang Jiang","user":"iofu728","type":"user"},"summary":"The computational challenges of Large Language Model (LLM) inference remain a\nsignificant barrier to their widespread deployment, especially as prompt\nlengths continue to increase. Due to the quadratic complexity of the attention\ncomputation, it takes 30 minutes for an 8B LLM to process a prompt of 1M tokens\n(i.e., the pre-filling stage) on a single A100 GPU. Existing methods for\nspeeding up prefilling often fail to maintain acceptable accuracy or efficiency\nwhen applied to long-context LLMs. To address this gap, we introduce MInference\n(Milliontokens Inference), a sparse calculation method designed to accelerate\npre-filling of long-sequence processing. Specifically, we identify three unique\npatterns in long-context attention matrices-the A-shape, Vertical-Slash, and\nBlock-Sparsethat can be leveraged for efficient sparse computation on GPUs. We\ndetermine the optimal pattern for each attention head offline and dynamically\nbuild sparse indices based on the assigned pattern during inference. With the\npattern and sparse indices, we perform efficient sparse attention calculations\nvia our optimized GPU kernels to significantly reduce the latency in the\npre-filling stage of long-context LLMs. Our proposed technique can be directly\napplied to existing LLMs without any modifications to the pre-training setup or\nadditional fine-tuning. By evaluating on a wide range of downstream tasks,\nincluding InfiniteBench, RULER, PG-19, and Needle In A Haystack, and models\nincluding LLaMA-3-1M, GLM4-1M, Yi-200K, Phi-3-128K, and Qwen2-128K, we\ndemonstrate that MInference effectively reduces inference latency by up to 10x\nfor pre-filling on an A100, while maintaining accuracy. 
Our code is available\nat https://aka.ms/MInference.","upvotes":26,"discussionId":"6684aff33780e7f96dc292ea","ai_summary":"MInference, a sparse calculation method, accelerates pre-filling of long-context LLMs by leveraging specific patterns in attention matrices, reducing latency significantly without compromising accuracy.","ai_keywords":["Large Language Model (LLM)","inference","attention computation","pattern recognition","sparse calculation","GPU kernels","InfiniteBench","RULER","PG-19","Needle In A Haystack","LLaMA-3-1M","GLM4-1M","Yi-200K","Phi-3-128K","Qwen2-128K"]},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"6278bd42541f3d2dfa77ea70","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6278bd42541f3d2dfa77ea70/ejn49eapnB3UXQckAYdTd.jpeg","isPro":false,"fullname":"Huiqiang Jiang","user":"iofu728","type":"user"},{"_id":"63d00710645dd8d34ea9bcc6","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/63d00710645dd8d34ea9bcc6/E6YsIsXH57OACL-NZ52fB.jpeg","isPro":false,"fullname":"Yucheng","user":"liyucheng","type":"user"},{"_id":"655ac762cb17ec19ef82719b","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/655ac762cb17ec19ef82719b/1kDncYrGLYS_2SR8cNdAL.png","isPro":false,"fullname":"Welcome to matlok","user":"matlok","type":"user"},{"_id":"6457885a75f8f7d26aa5bc44","avatarUrl":"/avatars/8ce57c4d60a1f1b5afa2c592207a8335.svg","isPro":false,"fullname":"allthingsdisaggregated","user":"lastweek","type":"user"},{"_id":"6374c494958cd71fa7ea0a9d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6374c494958cd71fa7ea0a9d/b2SjfvbjYqPCW38LzkzWl.jpeg","isPro":false,"fullname":"yuyijiong","user":"yuyijiong","type":"user"},{"_id":"63c0e2503bdc86f8108da51b","avatarUrl":"/avatars/7d47f11992f030b3d831e45102581d1f.svg","isPro":false,"fullname":"Minsoo Kim","user":"minsoo2333","type":"user"},{"_id":"64e6125705d4773cc89df0f4","avatarUrl":"/avatars/57fe42ed13fe214302dfb91d63c8f560.svg","isPro":false,"fullname":"suyeol lee","user":"95suyeol","type":"user"},{"_id":"64a84de2eb47b3552285ef74","avatarUrl":"/avatars/114e0cc393d0aea9680f3af6d84d6f46.svg","isPro":false,"fullname":"Eni Grand","user":"Enigrand","type":"user"},{"_id":"64b525c865a7e15eac12fcd6","avatarUrl":"/avatars/ee11dabfca63fcf47588a75832509f8e.svg","isPro":false,"fullname":"Fulop Botond","user":"floppster","type":"user"},{"_id":"63ddc7b80f6d2d6c3efe3600","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/63ddc7b80f6d2d6c3efe3600/RX5q9T80Jl3tn6z03ls0l.jpeg","isPro":false,"fullname":"J","user":"dashfunnydashdash","type":"user"},{"_id":"644fac0ce1d7a97f3b653ab1","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/644fac0ce1d7a97f3b653ab1/qOpqx_bcy7RKkIrExtKrN.jpeg","isPro":true,"fullname":"Michael","user":"michaelfeil","type":"user"},{"_id":"644e1b1d9b4e87c31bab0a14","avatarUrl":"/avatars/88bb4c4a67dc8958069e9014f5e73a0b.svg","isPro":false,"fullname":"Michael Barry","user":"MichaelBarryUK","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0}">MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention
Abstract
MInference, a sparse calculation method, accelerates pre-filling of long-context LLMs by leveraging specific patterns in attention matrices, reducing latency significantly without compromising accuracy.
The computational challenges of Large Language Model (LLM) inference remain a significant barrier to their widespread deployment, especially as prompt lengths continue to increase. Due to the quadratic complexity of the attention computation, it takes 30 minutes for an 8B LLM to process a prompt of 1M tokens (i.e., the pre-filling stage) on a single A100 GPU. Existing methods for speeding up pre-filling often fail to maintain acceptable accuracy or efficiency when applied to long-context LLMs. To address this gap, we introduce MInference (Million-tokens Inference), a sparse calculation method designed to accelerate pre-filling of long-sequence processing. Specifically, we identify three unique patterns in long-context attention matrices (the A-shape, Vertical-Slash, and Block-Sparse) that can be leveraged for efficient sparse computation on GPUs. We determine the optimal pattern for each attention head offline and dynamically build sparse indices based on the assigned pattern during inference. With the pattern and sparse indices, we perform efficient sparse attention calculations via our optimized GPU kernels to significantly reduce the latency in the pre-filling stage of long-context LLMs. Our proposed technique can be directly applied to existing LLMs without any modifications to the pre-training setup or additional fine-tuning. By evaluating on a wide range of downstream tasks, including InfiniteBench, RULER, PG-19, and Needle In A Haystack, and models including LLaMA-3-1M, GLM4-1M, Yi-200K, Phi-3-128K, and Qwen2-128K, we demonstrate that MInference effectively reduces inference latency by up to 10x for pre-filling on an A100, while maintaining accuracy. Our code is available at https://aka.ms/MInference.
Community
MInference 1.0 leverages the dynamic sparse nature of LLMs' attention, which exhibits some static patterns, to speed up the pre-filling for long-context LLMs. It first determines offline which sparse pattern each head belongs to, then approximates the sparse index online and dynamically computes attention with the optimal custom kernels. This approach achieves up to a 10x speedup for pre-filling on an A100 while maintaining accuracy with 1M tokens.
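As a rough illustration of that flow, here is a minimal sketch. The names (HEAD_PATTERN, a_shape_mask, prefill_head) are hypothetical, and a dense boolean mask stands in for the optimized sparse GPU kernels; the point is only that the pattern assigned offline decides how each head's mask is built before attention is computed.

```python
# Minimal sketch of per-head sparse pre-filling, not the official MInference API.
import torch

# Offline search assigns one of three sparse patterns to each attention head.
HEAD_PATTERN = {0: "a_shape", 1: "vertical_slash", 2: "block_sparse"}

def a_shape_mask(n, sink=64, local=512):
    """Attend to the first `sink` tokens plus a local causal window (the 'A-shape')."""
    i = torch.arange(n)[:, None]
    j = torch.arange(n)[None, :]
    return (j < sink) | ((i - j >= 0) & (i - j < local))

def prefill_head(q, k, v, head_id):
    """q, k, v: [seq_len, head_dim] tensors for a single attention head."""
    n, d = q.shape
    causal = torch.tril(torch.ones(n, n, dtype=torch.bool))
    if HEAD_PATTERN[head_id] == "a_shape":
        mask = a_shape_mask(n) & causal   # static pattern: no online index needed
    else:
        # Vertical-Slash and Block-Sparse heads build their masks online from a
        # cheap estimate (see the paper); a plain causal mask stands in here.
        mask = causal
    scores = (q @ k.T) / d**0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```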
For more details, please check our project page at aka.ms/MInference and the code.
Due to an issue with arXiv, the PDF is currently unavailable there. You can find the paper at this link.
What is the method used to obtain the graphical attention patterns?
Hi @Bachstelze, first, we identified three sparse patterns in attention heads through observation. We determined the optimal sparse pattern for each head using offline search, as described in Section 3.2.1. Subsequently, we utilized online approximate dynamic sparse indexing and sparse calculations to accelerate LLM inference.
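The online step mentioned in the reply can be sketched roughly as below. This is a minimal illustration rather than the paper's kernels: the function name estimate_vs_index, the default budgets, and the exact scoring are assumptions. The idea is that attention between the last few queries and all keys is used to pick the highest-scoring vertical lines (key positions) and slash lines (diagonals) for a Vertical-Slash head.

```python
# Hypothetical sketch of online approximate Vertical-Slash index estimation.
import torch

def estimate_vs_index(q, k, last_q=64, top_vertical=1000, top_slash=800):
    """q, k: [seq_len, head_dim] for one head.

    Returns indices of high-scoring key positions (vertical lines) and
    query-key offsets (slash lines, i.e. diagonals).
    """
    seq_len, head_dim = q.shape
    last_q = min(last_q, seq_len)
    # Query-key offsets for the last `last_q` queries: offset = key_pos - query_pos.
    offsets = torch.arange(seq_len)[None, :] - torch.arange(seq_len - last_q, seq_len)[:, None]
    # Cheap approximation: score only the last `last_q` queries against all keys.
    scores = (q[-last_q:] @ k.T) / head_dim**0.5              # [last_q, seq_len]
    scores = scores.masked_fill(offsets > 0, float("-inf"))   # keep it causal
    probs = torch.softmax(scores, dim=-1)
    # Vertical lines: key positions that draw high attention across these queries.
    vertical_idx = probs.sum(dim=0).topk(min(top_vertical, seq_len)).indices
    # Slash lines: diagonals (fixed offsets) that draw high attention in total.
    slash_scores = torch.zeros(2 * seq_len, dtype=probs.dtype)
    slash_scores.scatter_add_(0, (offsets + seq_len).reshape(-1), probs.reshape(-1))
    slash_idx = slash_scores.topk(min(top_slash, 2 * seq_len)).indices - seq_len
    return vertical_idx, slash_idx
```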