arxiv:2405.04065

FlashBack: Efficient Retrieval-Augmented Language Modeling for Long Context Inference

Published on May 7, 2024

Authors: Runheng Liu, Xingchen Xiao, Heyan Huang, Zewen Chi, Zhijing Wu

Abstract

AI-generated summary: FlashBack, a modular Retrieval-Augmented Language Model, improves inference efficiency by appending retrieved documents to the context instead of prepending them, achieving up to 4x faster inference on a 7B LLM while maintaining knowledge integrity.

Retrieval-Augmented Language Modeling (RALM), which integrates a large language model (LLM) with relevant documents from an external corpus, is a proven method for enabling the LLM to generate information beyond the scope of its pre-training corpus. Previous work uses retrieved content by simply prepending it to the input, which incurs a high runtime cost and degrades the inference efficiency of the LLM because the Key-Value (KV) cache cannot be reused efficiently. In this paper, we propose FlashBack, a modular RALM designed to improve inference efficiency with an appending-context pattern, while maintaining decent performance after specific fine-tuning and without heavily disrupting the knowledge integrity of the LLM. FlashBack appends retrieved documents at the end of the context instead of prepending them, so the KV cache can be utilized efficiently. Our experiments show that the inference speed of FlashBack is up to 4x faster than the prepending method on a 7B LLM (Llama 2). By bypassing unnecessary re-computation, FlashBack achieves significantly faster inference, and this heightened efficiency will substantially reduce inference cost. Our code will be publicly available.
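
The speed-up hinges on how the retrieved text interacts with the KV cache: with an appended layout the already-encoded prefix remains valid, whereas a prepended document changes the prefix and forces re-encoding. The sketch below illustrates this reuse pattern with the Hugging Face transformers API; it is a minimal illustration under assumed names (the checkpoint, strings, and variables are placeholders), not the authors' released implementation.

```python
# Minimal sketch of the KV-cache reuse idea behind appending retrieved documents.
# Assumption: standard Hugging Face transformers API; model name and strings are
# illustrative placeholders, not the paper's code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # any causal LM works for this sketch
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
model.eval()

context = "The user query and any previously generated text ..."
retrieved_doc = "A document returned by the retriever ..."

with torch.no_grad():
    # 1) Encode the long context once and keep its KV cache.
    ctx_ids = tok(context, return_tensors="pt").input_ids
    ctx_out = model(ctx_ids, use_cache=True)
    kv_cache = ctx_out.past_key_values  # valid as long as the prefix is unchanged

    # 2) Appending: only the newly retrieved tokens are run through the model;
    #    the cached keys/values for the context are reused as-is.
    doc_ids = tok(retrieved_doc, return_tensors="pt").input_ids
    doc_out = model(doc_ids, past_key_values=kv_cache, use_cache=True)

    # 3) Prepending the document instead would change the prefix and invalidate
    #    the cache, so the whole (document + context) sequence would have to be
    #    re-encoded from scratch; that is the re-computation FlashBack avoids.
```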

Community

@librarian-bot recommend

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* LLoCO: Learning Long Contexts Offline (2024): https://huggingface.co/papers/2404.07979
* Improving Retrieval Augmented Open-Domain Question-Answering with Vectorized Contexts (2024): https://huggingface.co/papers/2404.02022
* Superposition Prompting: Improving and Accelerating Retrieval-Augmented Generation (2024): https://huggingface.co/papers/2404.06910
* XC-Cache: Cross-Attending to Cached Context for Efficient LLM Inference (2024): https://huggingface.co/papers/2404.15420
* Imagination Augmented Generation: Learning to Imagine Richer Context for Question Answering over Large Language Models (2024): https://huggingface.co/papers/2403.15268

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

where's the public code?

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2405.04065 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2405.04065 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2405.04065 in a Space README.md to link it from this page.

Collections including this paper 1