Paper page - Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling
Comments

stereoplegic (2024-09-24): @librarian-bot recommend

librarian-bot (2024-09-24) replied:
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

- Learning Harmonized Representations for Speculative Sampling (2024): https://huggingface.co/papers/2408.15766
- Clover-2: Accurate Inference for Regressive Lightweight Speculative Decoding (2024): https://huggingface.co/papers/2408.00264
- Speculative Diffusion Decoding: Accelerating Language Generation through Diffusion (2024): https://huggingface.co/papers/2408.05636
- Predicting Rewards Alongside Tokens: Non-disruptive Parameter Insertion for Efficient Inference Intervention in Large Language Model (2024): https://huggingface.co/papers/2408.10764
- MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding (2024): https://huggingface.co/papers/2408.11049

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`
\n","updatedAt":"2024-09-24T17:35:50.128Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.756786584854126},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false,"parentCommentId":"66f2f87026ca679b3ff24658"}}]}],"primaryEmailConfirmed":false,"paper":{"id":"2408.08696","authors":[{"_id":"66f2f80c090c382a2e292051","name":"Xianzhen Luo","hidden":false},{"_id":"66f2f80c090c382a2e292052","name":"Yixuan Wang","hidden":false},{"_id":"66f2f80c090c382a2e292053","name":"Qingfu Zhu","hidden":false},{"_id":"66f2f80c090c382a2e292054","name":"Zhiming Zhang","hidden":false},{"_id":"66f2f80c090c382a2e292055","name":"Xuanyu Zhang","hidden":false},{"_id":"66f2f80c090c382a2e292056","name":"Qing Yang","hidden":false},{"_id":"66f2f80c090c382a2e292057","name":"Dongliang Xu","hidden":false},{"_id":"66f2f80c090c382a2e292058","name":"Wanxiang Che","hidden":false}],"publishedAt":"2024-08-16T12:20:56.000Z","title":"Turning Trash into Treasure: Accelerating Inference of Large Language\n Models with Token Recycling","summary":"The rapid growth in the parameters of large language models (LLMs) has made\ninference latency a fundamental bottleneck, limiting broader application of\nLLMs. Speculative decoding represents a lossless approach to accelerate\ninference through a guess-and-verify paradigm, leveraging the parallel\ncapabilities of modern hardware. Some speculative decoding methods rely on\nadditional structures to guess draft tokens, such as small models or\nparameter-efficient architectures, which need extra training before use.\nAlternatively, retrieval-based train-free techniques build libraries from\npre-existing corpora or by n-gram generation. However, they face challenges\nlike large storage requirements, time-consuming retrieval, and limited\nadaptability. Observing that candidate tokens generated during the decoding\nprocess are likely to reoccur in future sequences, we propose Token Recycling.\nThis approach stores candidate tokens in an adjacency matrix and employs a\nbreadth-first search (BFS)-like algorithm on the matrix to construct a draft\ntree. The tree is then validated through tree attention. New candidate tokens\nfrom the decoding process are then used to update the matrix. Token Recycling\nrequires \\textless2MB of additional storage and achieves approximately 2x\nspeedup across all sizes of LLMs. It significantly outperforms existing\ntrain-free methods by 30\\% and even a training method by 25\\%. 
It can be\ndirectly applied to any existing LLMs and tasks without the need for\nadaptation.","upvotes":0,"discussionId":"66f2f80d090c382a2e2920b0","ai_summary":"Token Recycling accelerates inference of large language models by reusing candidate tokens in an adjacency matrix with minimal storage and improved performance.","ai_keywords":["speculative decoding","draft tokens","small models","parameter-efficient architectures","retrieval-based techniques","adjacency matrix","breadth-first search","tree attention","Token Recycling"]},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[],"acceptLanguages":["*"]}">
AI-generated summary
Token Recycling accelerates inference of large language models by reusing candidate tokens in an adjacency matrix with minimal storage and improved performance.

Abstract
The rapid growth in the parameters of large language models (LLMs) has made
inference latency a fundamental bottleneck, limiting broader application of
LLMs. Speculative decoding represents a lossless approach to accelerate
inference through a guess-and-verify paradigm, leveraging the parallel
capabilities of modern hardware. Some speculative decoding methods rely on
additional structures to guess draft tokens, such as small models or
parameter-efficient architectures, which need extra training before use.
Alternatively, retrieval-based train-free techniques build libraries from
pre-existing corpora or by n-gram generation. However, they face challenges
like large storage requirements, time-consuming retrieval, and limited
adaptability. Observing that candidate tokens generated during the decoding
process are likely to reoccur in future sequences, we propose Token Recycling.
This approach stores candidate tokens in an adjacency matrix and employs a
breadth-first search (BFS)-like algorithm on the matrix to construct a draft
tree. The tree is then validated through tree attention. New candidate tokens
from the decoding process are then used to update the matrix. Token Recycling
requires less than 2 MB of additional storage and achieves approximately 2x
speedup across all sizes of LLMs. It significantly outperforms existing
train-free methods by 30% and even a training method by 25%. It can be
directly applied to any existing LLMs and tasks without the need for
adaptation.
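
To make the mechanism described in the abstract concrete, the sketch below illustrates the core idea as stated there: a small adjacency matrix keyed by token id stores the most recent top-k candidate successors of each token, a BFS-like walk over that matrix drafts a token tree from the last accepted token with no extra model calls, and the candidates produced during verification are recycled back into the matrix. Everything here is an illustrative assumption rather than the authors' implementation: the names (TokenRecycler, draft_children, toy_topk), the branching widths, and the toy verification loop, which only mimics the accept/extend bookkeeping, whereas the paper verifies the whole draft tree in a single forward pass using tree attention.

```python
# Minimal sketch of the Token Recycling idea (illustrative names, not the
# authors' code). A real implementation would use an LLM's top-k logits and
# verify the whole draft tree in one forward pass with tree attention.
import numpy as np


class TokenRecycler:
    def __init__(self, vocab_size: int, k: int = 8, seed: int = 0):
        # Adjacency matrix: row t holds the k most recently seen candidate
        # successors of token t ("recycled" from earlier decoding steps).
        rng = np.random.default_rng(seed)
        self.matrix = rng.integers(0, vocab_size, size=(vocab_size, k))
        self.k = k

    def update(self, token: int, candidates: list[int]) -> None:
        # Recycle the newest top-k candidates proposed after `token`.
        c = np.asarray(candidates[: self.k])
        self.matrix[token, : len(c)] = c

    def draft_children(self, token: int, width: int) -> list[int]:
        # Children of `token` in the draft tree: its top-`width` recycled
        # candidates, looked up in O(1) with no model call.
        return self.matrix[token, :width].tolist()


def toy_topk(prefix: list[int], k: int, vocab: int) -> list[int]:
    # Stand-in for one LLM step's top-k next-token candidates (a deterministic
    # hash keeps the demo self-contained and runnable; it is not a real model).
    h = hash(tuple(prefix[-4:]))
    return [(h + i) % vocab for i in range(1, k + 1)]


vocab = 1000
recycler = TokenRecycler(vocab_size=vocab, k=8)
prefix = [1, 2, 3]
branching = [4, 2, 1]  # BFS layer widths of the draft tree (assumed values)

for _ in range(5):  # a few decoding rounds
    node = prefix[-1]
    for width in branching:
        children = recycler.draft_children(node, width)  # cheap draft step
        cands = toy_topk(prefix, recycler.k, vocab)      # "model" candidates
        recycler.update(node, cands)                     # recycle them
        target = cands[0]                                # greedy target token
        prefix.append(target)                            # output stays lossless
        if target in children:                           # draft hit: descend
            node = target
        else:                                            # draft miss: stop round
            break

print(prefix)
```

As a rough sanity check on the storage figure quoted above: assuming a 32K-token vocabulary and k = 8 candidates stored as 2-byte ids, the matrix occupies about 32,000 x 8 x 2 bytes, roughly 0.5 MB, comfortably under the 2 MB reported in the abstract (the exact size depends on the vocabulary size and the chosen k).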