\n","updatedAt":"2024-03-02T21:24:09.728Z","author":{"_id":"6594b0a1f0152a21fca9c05f","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6594b0a1f0152a21fca9c05f/813o7avlQ2FLpO7HsRhBO.jpeg","fullname":"Hossein Ahmadi","name":"HosseinAhmadi","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.9807602167129517},"editors":["HosseinAhmadi"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/6594b0a1f0152a21fca9c05f/813o7avlQ2FLpO7HsRhBO.jpeg"],"reactions":[],"isReport":false}},{"id":"65e6bc23c8fdc776454084e4","author":{"_id":"61f4d468587c793cdf55b4dd","avatarUrl":"/avatars/ce597d8d2640c726473dd85ae8c5cdc7.svg","fullname":"Lee Gao","name":"leegao19","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":12,"isUserFollowing":false},"createdAt":"2024-03-05T06:30:59.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"Hi folks, I see the \"attention\" pattern is 5:5:17 or 7:7:22 for global-linear:64-SWA:BaseConv layers. How are these different layers organized together? Are they stacked (global:SWA:Conv, or another permutation?), are they interleaved? Is the optimal pattern of global/local/conv analyzed?\n\n\n","html":"Hi folks, I see the \"attention\" pattern is 5:5:17 or 7:7:22 for global-linear:64-SWA:BaseConv layers. How are these different layers organized together? Are they stacked (global:SWA:Conv, or another permutation?), are they interleaved? Is the optimal pattern of global/local/conv analyzed?
\n\n","updatedAt":"2024-03-05T06:30:59.317Z","author":{"_id":"61f4d468587c793cdf55b4dd","avatarUrl":"/avatars/ce597d8d2640c726473dd85ae8c5cdc7.svg","fullname":"Lee Gao","name":"leegao19","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":12,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.7682012319564819},"editors":["leegao19"],"editorAvatarUrls":["/avatars/ce597d8d2640c726473dd85ae8c5cdc7.svg"],"reactions":[],"isReport":false},"replies":[{"id":"65f0899b4553c3b1a7c913fd","author":{"_id":"62703f4bbd9c82ff64c2f99f","avatarUrl":"/avatars/8ed16c6b38a06fd009c39d26f279a6a9.svg","fullname":"Simran","name":"simarora","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":10,"isUserFollowing":false},"createdAt":"2024-03-12T16:58:03.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"Hi! The layer mixtures and orders are specified in the reference configs provided here: https://github.com/HazyResearch/based/blob/e2834d89d1b23d4b3beb13389881b84601a95db6/train/configs/experiment/reference/based-360m.yaml#L53 They are stacked layers","html":"Hi! The layer mixtures and orders are specified in the reference configs provided here: https://github.com/HazyResearch/based/blob/e2834d89d1b23d4b3beb13389881b84601a95db6/train/configs/experiment/reference/based-360m.yaml#L53 They are stacked layers
\n","updatedAt":"2024-03-12T16:58:03.736Z","author":{"_id":"62703f4bbd9c82ff64c2f99f","avatarUrl":"/avatars/8ed16c6b38a06fd009c39d26f279a6a9.svg","fullname":"Simran","name":"simarora","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":10,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.6844226717948914},"editors":["simarora"],"editorAvatarUrls":["/avatars/8ed16c6b38a06fd009c39d26f279a6a9.svg"],"reactions":[{"reaction":"👍","users":["leegao19"],"count":1},{"reaction":"🤗","users":["leegao19"],"count":1}],"isReport":false,"parentCommentId":"65e6bc23c8fdc776454084e4"}},{"id":"65f0ab6ae4c7dddbca787ca6","author":{"_id":"61f4d468587c793cdf55b4dd","avatarUrl":"/avatars/ce597d8d2640c726473dd85ae8c5cdc7.svg","fullname":"Lee Gao","name":"leegao19","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":12,"isUserFollowing":false},"createdAt":"2024-03-12T19:22:18.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"Thank you for checking! Did you folks look into an ablation study on how different mixtures of the layers impact performance?","html":"Thank you for checking! Did you folks look into an ablation study on how different mixtures of the layers impact performance?
\n","updatedAt":"2024-03-12T19:22:18.633Z","author":{"_id":"61f4d468587c793cdf55b4dd","avatarUrl":"/avatars/ce597d8d2640c726473dd85ae8c5cdc7.svg","fullname":"Lee Gao","name":"leegao19","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":12,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.9057085514068604},"editors":["leegao19"],"editorAvatarUrls":["/avatars/ce597d8d2640c726473dd85ae8c5cdc7.svg"],"reactions":[],"isReport":false,"parentCommentId":"65e6bc23c8fdc776454084e4"}},{"id":"65f0ae603e29a622f051b1e5","author":{"_id":"6337537b267cee4d068f604d","avatarUrl":"/avatars/15267f0759a6570c98ee6a150558fcc0.svg","fullname":"Sabri Eyuboglu","name":"sabrieyuboglu","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":4,"isUserFollowing":false},"createdAt":"2024-03-12T19:34:56.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"In Table 4, we ablate the use of different sub-layers. However, we don't include ablation studies comparing proportions of each layer type","html":"In Table 4, we ablate the use of different sub-layers. However, we don't include ablation studies comparing proportions of each layer type
\n","updatedAt":"2024-03-12T19:34:56.529Z","author":{"_id":"6337537b267cee4d068f604d","avatarUrl":"/avatars/15267f0759a6570c98ee6a150558fcc0.svg","fullname":"Sabri Eyuboglu","name":"sabrieyuboglu","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":4,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.9410621523857117},"editors":["sabrieyuboglu"],"editorAvatarUrls":["/avatars/15267f0759a6570c98ee6a150558fcc0.svg"],"reactions":[],"isReport":false,"parentCommentId":"65e6bc23c8fdc776454084e4"}},{"id":"65f0dcda2dfde0475e36732d","author":{"_id":"61f4d468587c793cdf55b4dd","avatarUrl":"/avatars/ce597d8d2640c726473dd85ae8c5cdc7.svg","fullname":"Lee Gao","name":"leegao19","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":12,"isUserFollowing":false},"createdAt":"2024-03-12T22:53:14.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"Gotcha, I ask because there seems to be some interesting research around how to best stack or interleave in these global:local layered attention mechanisms, and that the placement of the layers introduces interesting outcomes on the model in terms of long-range modeling. \n\nSome that I am aware of include:\n\n1. Rae and Raavi, 2020 (https://huggingface.co/papers/2007.03356) - while their work is on long/short range memory modules (effectively activation caches fed into attention) in Transformer XL, the spirit is similar. They find that either interleaving local:global or stacking the global layers at the final layers (top) is most effective\n\n\n2. Song et al. 2023 (https://huggingface.co/papers/2312.08618) - found that interleaving local(SWA):global(full) attention layers at a density of 3:1 to be effective.\n\n\n3. Chen and Lee, 2023 (https://huggingface.co/papers/2305.12689) - proposes interleaving local:global attention using what's effectively blocked window local attention and special \"latent tokens\" purely to propagate global information in global layers.\n\n \n4. Zhang et al. 2023 (https://huggingface.co/papers/2310.12442) - also analyzes placements of local(block window):global(full) attention layers. In contrast to Rae and Raavi (which was for Transformer-XL), they found the optimal placement to be to stack global attention layers at the lower layers (bottom). They found ~ similar performance for interleaved layers, but found that placing the global attention layer at the very top layers is unequivocally worse.\n\n\nBased on https://github.com/HazyResearch/based/blob/e2834d89d1b23d4b3beb13389881b84601a95db6/train/configs/experiment/reference/based-360m.yaml#L53, it looks like Based's pattern is an alternating pattern of {Global(Linear), Local(SWA), Local(Conv), Local(Conv), Local(Conv)} repeated, which is a very interesting pattern.\n\nHave you guys thought about fixing this attention pattern {Global, Local(SWA), Local(Conv), ...}, but replacing the Global layer with a full attention, and then comparing results with Based's Global Linear attention? That would actually seem to be the most apples-to-apples comparison of long-range modeling performance.\n","html":"Gotcha, I ask because there seems to be some interesting research around how to best stack or interleave in these global:local layered attention mechanisms, and that the placement of the layers introduces interesting outcomes on the model in terms of long-range modeling.
\nSome that I am aware of include:
\n- \n
Rae and Raavi, 2020 (https://huggingface.co/papers/2007.03356) - while their work is on long/short range memory modules (effectively activation caches fed into attention) in Transformer XL, the spirit is similar. They find that either interleaving local:global or stacking the global layers at the final layers (top) is most effective
\n \nSong et al. 2023 (https://huggingface.co/papers/2312.08618) - found that interleaving local(SWA):global(full) attention layers at a density of 3:1 to be effective.
\n \nChen and Lee, 2023 (https://huggingface.co/papers/2305.12689) - proposes interleaving local:global attention using what's effectively blocked window local attention and special \"latent tokens\" purely to propagate global information in global layers.
\n \nZhang et al. 2023 (https://huggingface.co/papers/2310.12442) - also analyzes placements of local(block window):global(full) attention layers. In contrast to Rae and Raavi (which was for Transformer-XL), they found the optimal placement to be to stack global attention layers at the lower layers (bottom). They found ~ similar performance for interleaved layers, but found that placing the global attention layer at the very top layers is unequivocally worse.
\n \n
Based on https://github.com/HazyResearch/based/blob/e2834d89d1b23d4b3beb13389881b84601a95db6/train/configs/experiment/reference/based-360m.yaml#L53, it looks like Based's pattern is an alternating pattern of {Global(Linear), Local(SWA), Local(Conv), Local(Conv), Local(Conv)} repeated, which is a very interesting pattern.
\nHave you guys thought about fixing this attention pattern {Global, Local(SWA), Local(Conv), ...}, but replacing the Global layer with a full attention, and then comparing results with Based's Global Linear attention? That would actually seem to be the most apples-to-apples comparison of long-range modeling performance.
\n","updatedAt":"2024-03-12T22:53:14.871Z","author":{"_id":"61f4d468587c793cdf55b4dd","avatarUrl":"/avatars/ce597d8d2640c726473dd85ae8c5cdc7.svg","fullname":"Lee Gao","name":"leegao19","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":12,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.8057447671890259},"editors":["leegao19"],"editorAvatarUrls":["/avatars/ce597d8d2640c726473dd85ae8c5cdc7.svg"],"reactions":[],"isReport":false,"parentCommentId":"65e6bc23c8fdc776454084e4"}},{"id":"65f0f5f8de069cd5c55f1dd2","author":{"_id":"62703f4bbd9c82ff64c2f99f","avatarUrl":"/avatars/8ed16c6b38a06fd009c39d26f279a6a9.svg","fullname":"Simran","name":"simarora","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":10,"isUserFollowing":false},"createdAt":"2024-03-13T00:40:24.000Z","type":"comment","data":{"edited":true,"hidden":false,"latest":{"raw":"Thanks for sending this note! Regarding your questions, a few more informal observations from Based beyond paper discussion are as follows. \nRegarding your questions on combining the layers with full-attention, we tried:\n- Transformer ++ (i.e. full attention, with Rotary, with SwiGLU) plus Local(Conv) --- this was the same quality as pure Transformer ++. \n- Transformer (GPT-2, full attention, no Rotary, no SwiGLU) was worse than Transformer++ by a decent margin, but adding Local(Conv) closed almost all the gap. \n\nTakeaways: Essentially, we see Local(Conv) is okay instead of Rotary. We did not see additional benefit to combining the Local(Conv) plus full Transformer++ attention layers.\n\nOrderings: In the following, consider 12 layer hybrid architectures with 3 linear attention (LA, Taylor) and 9 local conv layers at 150M parameters. Again these are just for informal sharing, not paper-quality numbers :) \n - 3 LA first, then 9 Local(Conv): was about 25 ppl after 1.5Bn tokens of training on the Pile\n- 9 Local(Conv)first, then 3 LA: was about 16 ppl, \"\"\n- 3 Local(Conv) + 1 LA, repeated: was about 15 ppl, \"\"\nOn MQAR synthetics in our Zoology repo (https://github.com/HazyResearch/zoology), you can try combining a short-conv plus LA or SWA plus LA layer in different orders (look at 2 or 4 layer networks). We found that it was helpful to put the short mixer first. \n\nOverall in Based we showed that combining complementary operations allowed extending the recall-throughput pareto frontier. We focused on combining the layers to balance both efficiency and quality. Beyond extremes like my comment above about putting all the global mixers (LA) first, we didn't find the trends and key takeaways to be particularly sensitive to how we hybridized","html":"Thanks for sending this note! Regarding your questions, a few more informal observations from Based beyond paper discussion are as follows.
Regarding your questions on combining the layers with full-attention, we tried:
- \n
- Transformer ++ (i.e. full attention, with Rotary, with SwiGLU) plus Local(Conv) --- this was the same quality as pure Transformer ++. \n
- Transformer (GPT-2, full attention, no Rotary, no SwiGLU) was worse than Transformer++ by a decent margin, but adding Local(Conv) closed almost all the gap. \n
Takeaways: Essentially, we see Local(Conv) is okay instead of Rotary. We did not see additional benefit to combining the Local(Conv) plus full Transformer++ attention layers.
\nOrderings: In the following, consider 12 layer hybrid architectures with 3 linear attention (LA, Taylor) and 9 local conv layers at 150M parameters. Again these are just for informal sharing, not paper-quality numbers :)
\n- \n
- 3 LA first, then 9 Local(Conv): was about 25 ppl after 1.5Bn tokens of training on the Pile \n
- 9 Local(Conv)first, then 3 LA: was about 16 ppl, \"\" \n
- 3 Local(Conv) + 1 LA, repeated: was about 15 ppl, \"\"
On MQAR synthetics in our Zoology repo (https://github.com/HazyResearch/zoology), you can try combining a short-conv plus LA or SWA plus LA layer in different orders (look at 2 or 4 layer networks). We found that it was helpful to put the short mixer first. \n
Overall in Based we showed that combining complementary operations allowed extending the recall-throughput pareto frontier. We focused on combining the layers to balance both efficiency and quality. Beyond extremes like my comment above about putting all the global mixers (LA) first, we didn't find the trends and key takeaways to be particularly sensitive to how we hybridized
\n","updatedAt":"2024-03-13T00:43:22.667Z","author":{"_id":"62703f4bbd9c82ff64c2f99f","avatarUrl":"/avatars/8ed16c6b38a06fd009c39d26f279a6a9.svg","fullname":"Simran","name":"simarora","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":10,"isUserFollowing":false}},"numEdits":4,"identifiedLanguage":{"language":"en","probability":0.9195837378501892},"editors":["simarora"],"editorAvatarUrls":["/avatars/8ed16c6b38a06fd009c39d26f279a6a9.svg"],"reactions":[{"reaction":"❤️","users":["leegao19"],"count":1}],"isReport":false,"parentCommentId":"65e6bc23c8fdc776454084e4"}},{"id":"65f1323d6ad9dd2e062ea0ac","author":{"_id":"61f4d468587c793cdf55b4dd","avatarUrl":"/avatars/ce597d8d2640c726473dd85ae8c5cdc7.svg","fullname":"Lee Gao","name":"leegao19","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":12,"isUserFollowing":false},"createdAt":"2024-03-13T04:57:33.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"This is an amazing response! Thank you!\n\n> Takeaways: Essentially, we see Local(Conv) is okay instead of Rotary. We did not see additional benefit to combining the Local(Conv) plus full Transformer++ attention layers.\n\nWow, that's quite a finding!\n\nIs the total number of layers (including both transformer and baseconv layers) still the same in this setup? It's so interesting that alternating No-PE transformer layers with BaseConv layers seems to work just as well as having full-attention transformer with PE, even if half the layers are subquadratic (and not even attention based). \n\nThis also reminds me of something that I've heard from a quite a few folks now. There's likely something in RoPE that introduces some inductive bias that favors short-range modeling while making it difficult to generalize to longer range modeling tasks (without, for e.g., extending its frequency via ABF to something like 10M). On the other hand, NoPE (or non-trig additive bias methods, like T5) seem to work well at long-range modeling but perform poorly at short-range. It makes sense that combining NoPE with something that models short-range well (sounds like BaseConv does) would help overcome both challenges.\n\n> In the following, consider 12 layer hybrid architectures with 3 linear attention (LA, Taylor) and 9 local conv layers at 150M parameters.\n\nNice! It seems like y'all also found that global (LA) at the top (later) layers outperforms local at the top, and interleaving is ~ similar (slightly better) as global-at-the-top approach, that's another great data-point.","html":"This is an amazing response! Thank you!
\n\n\nTakeaways: Essentially, we see Local(Conv) is okay instead of Rotary. We did not see additional benefit to combining the Local(Conv) plus full Transformer++ attention layers.
\n
Wow, that's quite a finding!
\nIs the total number of layers (including both transformer and baseconv layers) still the same in this setup? It's so interesting that alternating No-PE transformer layers with BaseConv layers seems to work just as well as having full-attention transformer with PE, even if half the layers are subquadratic (and not even attention based).
\nThis also reminds me of something that I've heard from a quite a few folks now. There's likely something in RoPE that introduces some inductive bias that favors short-range modeling while making it difficult to generalize to longer range modeling tasks (without, for e.g., extending its frequency via ABF to something like 10M). On the other hand, NoPE (or non-trig additive bias methods, like T5) seem to work well at long-range modeling but perform poorly at short-range. It makes sense that combining NoPE with something that models short-range well (sounds like BaseConv does) would help overcome both challenges.
\n\n\nIn the following, consider 12 layer hybrid architectures with 3 linear attention (LA, Taylor) and 9 local conv layers at 150M parameters.
\n
Nice! It seems like y'all also found that global (LA) at the top (later) layers outperforms local at the top, and interleaving is ~ similar (slightly better) as global-at-the-top approach, that's another great data-point.
\n","updatedAt":"2024-03-13T04:57:33.011Z","author":{"_id":"61f4d468587c793cdf55b4dd","avatarUrl":"/avatars/ce597d8d2640c726473dd85ae8c5cdc7.svg","fullname":"Lee Gao","name":"leegao19","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":12,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.9532641172409058},"editors":["leegao19"],"editorAvatarUrls":["/avatars/ce597d8d2640c726473dd85ae8c5cdc7.svg"],"reactions":[{"reaction":"👍","users":["simarora"],"count":1}],"isReport":false,"parentCommentId":"65e6bc23c8fdc776454084e4"}},{"id":"65f232e41cc87ef0baf918e5","author":{"_id":"61f4d468587c793cdf55b4dd","avatarUrl":"/avatars/ce597d8d2640c726473dd85ae8c5cdc7.svg","fullname":"Lee Gao","name":"leegao19","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":12,"isUserFollowing":false},"createdAt":"2024-03-13T23:12:36.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"I also saw your other amazing paper (https://huggingface.co/papers/2312.04927) on convolution-only models failing mqAR as well (also proposing a hybrid approach to address the mqAR gap). Recent results on, for e.g., SSM models show similar gaps and it's generally accepted that this is one of (if not the main) reason behind their lackluster performance on ICL\n\nDid you guys also look into/analyze why linear attention for e.g. (especially with an approximate-softmax kernel like Taylor) fall short here as well? I believe the favorite argument (Olsson) right now is from the mechanistic interpretability lens, attributing lack of AR and ICL to lack of robust induction circuits to facilitate recall and induction.","html":"I also saw your other amazing paper (https://huggingface.co/papers/2312.04927) on convolution-only models failing mqAR as well (also proposing a hybrid approach to address the mqAR gap). Recent results on, for e.g., SSM models show similar gaps and it's generally accepted that this is one of (if not the main) reason behind their lackluster performance on ICL
\nDid you guys also look into/analyze why linear attention for e.g. (especially with an approximate-softmax kernel like Taylor) fall short here as well? I believe the favorite argument (Olsson) right now is from the mechanistic interpretability lens, attributing lack of AR and ICL to lack of robust induction circuits to facilitate recall and induction.
\n","updatedAt":"2024-03-13T23:12:36.775Z","author":{"_id":"61f4d468587c793cdf55b4dd","avatarUrl":"/avatars/ce597d8d2640c726473dd85ae8c5cdc7.svg","fullname":"Lee Gao","name":"leegao19","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":12,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.9252729415893555},"editors":["leegao19"],"editorAvatarUrls":["/avatars/ce597d8d2640c726473dd85ae8c5cdc7.svg"],"reactions":[{"reaction":"👍","users":["simarora"],"count":1}],"isReport":false,"parentCommentId":"65e6bc23c8fdc776454084e4"}}]},{"id":"65f0897f449045621a0daaf0","author":{"_id":"62703f4bbd9c82ff64c2f99f","avatarUrl":"/avatars/8ed16c6b38a06fd009c39d26f279a6a9.svg","fullname":"Simran","name":"simarora","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":10,"isUserFollowing":false},"createdAt":"2024-03-12T16:57:35.000Z","type":"comment","data":{"edited":true,"hidden":true,"hiddenBy":"","latest":{"raw":"This comment has been hidden","html":"This comment has been hidden","updatedAt":"2024-03-12T16:58:00.521Z","author":{"_id":"62703f4bbd9c82ff64c2f99f","avatarUrl":"/avatars/8ed16c6b38a06fd009c39d26f279a6a9.svg","fullname":"Simran","name":"simarora","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":10,"isUserFollowing":false}},"numEdits":0,"editors":[],"editorAvatarUrls":[],"reactions":[]}}],"primaryEmailConfirmed":false,"paper":{"id":"2402.18668","authors":[{"_id":"65e157690890daf03e88f25a","user":{"_id":"62703f4bbd9c82ff64c2f99f","avatarUrl":"/avatars/8ed16c6b38a06fd009c39d26f279a6a9.svg","isPro":false,"fullname":"Simran","user":"simarora","type":"user"},"name":"Simran Arora","status":"claimed_verified","statusLastChangedAt":"2024-10-21T07:54:44.122Z","hidden":false},{"_id":"65e157690890daf03e88f25b","user":{"_id":"6337537b267cee4d068f604d","avatarUrl":"/avatars/15267f0759a6570c98ee6a150558fcc0.svg","isPro":false,"fullname":"Sabri Eyuboglu","user":"sabrieyuboglu","type":"user"},"name":"Sabri Eyuboglu","status":"admin_assigned","statusLastChangedAt":"2024-03-01T10:46:21.661Z","hidden":false},{"_id":"65e157690890daf03e88f25c","name":"Michael Zhang","hidden":false},{"_id":"65e157690890daf03e88f25d","user":{"_id":"65e1bb842b435c2dcb5e2199","avatarUrl":"/avatars/5d4c91401fb989a8005ce43bc3fc2779.svg","isPro":false,"fullname":"Aman Timalsina","user":"amantimalsina","type":"user"},"name":"Aman Timalsina","status":"claimed_verified","statusLastChangedAt":"2024-03-01T12:44:44.616Z","hidden":false},{"_id":"65e157690890daf03e88f25e","user":{"_id":"63378a6cba7895577bcd53c3","avatarUrl":"/avatars/b504bb2088e996325c93f2b56de37c81.svg","isPro":false,"fullname":"Silas Alberti","user":"alberti","type":"user"},"name":"Silas Alberti","status":"admin_assigned","statusLastChangedAt":"2024-03-01T10:47:25.387Z","hidden":false},{"_id":"65e157690890daf03e88f25f","name":"Dylan Zinsley","hidden":false},{"_id":"65e157690890daf03e88f260","user":{"_id":"648a769003fc4a3938bb7943","avatarUrl":"/avatars/7647f99abdcca4251fcac7783b6fcc8d.svg","isPro":false,"fullname":"zou","user":"jameszou707","type":"user"},"name":"James Zou","status":"admin_assigned","statusLastChangedAt":"2024-03-01T10:48:00.307Z","hidden":false},{"_id":"65e157690890daf03e88f261","name":"Atri Rudra","hidden":false},{"_id":"65e157690890daf03e88f262","name":"Christopher Ré","hidden":false}],"publishedAt":"2024-02-28T19:28:27.000Z","submittedOnDailyAt":"2024-03-01T01:49:53.990Z","title":"Simple linear attention 
language models balance the recall-throughput\n tradeoff","submittedOnDailyBy":{"_id":"60f1abe7544c2adfd699860c","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674929746905-60f1abe7544c2adfd699860c.jpeg","isPro":false,"fullname":"AK","user":"akhaliq","type":"user"},"summary":"Recent work has shown that attention-based language models excel at recall,\nthe ability to ground generations in tokens previously seen in context.\nHowever, the efficiency of attention-based models is bottle-necked during\ninference by the KV-cache's aggressive memory consumption. In this work, we\nexplore whether we can improve language model efficiency (e.g. by reducing\nmemory consumption) without compromising on recall. By applying experiments and\ntheory to a broad set of architectures, we identify a key tradeoff between a\nmodel's state size and recall ability. We show that efficient alternatives to\nattention (e.g. H3, Mamba, RWKV) maintain a fixed-size recurrent state, but\nstruggle at recall. We propose BASED a simple architecture combining linear and\nsliding window attention. By varying BASED window size and linear attention\nfeature dimension, we can dial the state size and traverse the pareto frontier\nof the recall-memory tradeoff curve, recovering the full quality of attention\non one end and the small state size of attention-alternatives on the other. We\ntrain language models up to 1.3b parameters and show that BASED matches the\nstrongest sub-quadratic models (e.g. Mamba) in perplexity and outperforms them\non real-world recall-intensive tasks by 6.22 accuracy points. Implementations\nof linear attention are often less efficient than optimized standard attention\nimplementations. To make BASED competitive, we develop IO-aware algorithms that\nenable 24x higher throughput on language generation than FlashAttention-2, when\ngenerating 1024 tokens using 1.3b parameter models. 
Code for this work is\nprovided at: https://github.com/HazyResearch/based.","upvotes":20,"discussionId":"65e157690890daf03e88f279","githubRepo":"https://github.com/hazyresearch/based","githubRepoAddedBy":"auto","ai_summary":"BASED, a hybrid architecture combining linear and sliding window attention, improves language model efficiency without sacrificing recall quality.","ai_keywords":["attention-based language models","KV-cache","recall","H3","Mamba","RWKV","BASED","linear attention","sliding window attention","state size","perplexity","real-world recall-intensive tasks","IO-aware algorithms","FlashAttention-2"],"githubStars":248},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"620783f24e28382272337ba4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/620783f24e28382272337ba4/zkUveQPNiDfYjgGhuFErj.jpeg","isPro":false,"fullname":"GuoLiangTang","user":"Tommy930","type":"user"},{"_id":"6538119803519fddb4a17e10","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6538119803519fddb4a17e10/ffJMkdx-rM7VvLTCM6ri_.jpeg","isPro":false,"fullname":"samusenps","user":"samusenps","type":"user"},{"_id":"63119cc5af10c9efa1e9b620","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/63119cc5af10c9efa1e9b620/RA-UgDNTPsF6j5uDnG3-N.jpeg","isPro":false,"fullname":"Akarshan Biswas","user":"qnixsynapse","type":"user"},{"_id":"6311bca0ae8896941da24e66","avatarUrl":"/avatars/48de64894fc3c9397e26e4d6da3ff537.svg","isPro":false,"fullname":"Fynn Kröger","user":"fynnkroeger","type":"user"},{"_id":"631313e1b46fc4e2432ebe56","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/631313e1b46fc4e2432ebe56/r2sDFz8uwmqPZq_0JO_eY.jpeg","isPro":false,"fullname":"Rishabh Singh","user":"lulzx","type":"user"},{"_id":"653a18000dfecb4b26dd2876","avatarUrl":"/avatars/fcf8a2ea58f6eca0a6196299c68fc8ad.svg","isPro":false,"fullname":"James Chang","user":"strategist922","type":"user"},{"_id":"62d19a4b1e36881a57f31c6a","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/62d19a4b1e36881a57f31c6a/C-tAc0uXvpIggh0nWB2Dy.jpeg","isPro":false,"fullname":"Hugo Pitorro","user":"twigs","type":"user"},{"_id":"6337537b267cee4d068f604d","avatarUrl":"/avatars/15267f0759a6570c98ee6a150558fcc0.svg","isPro":false,"fullname":"Sabri Eyuboglu","user":"sabrieyuboglu","type":"user"},{"_id":"6495d5e8f1d3ee1d68de7721","avatarUrl":"/avatars/8d57ec468df68d1d1eea9f9b8eacac72.svg","isPro":false,"fullname":"Muhammad Maxalmina Magnum","user":"Maxyro33354","type":"user"},{"_id":"65a28ecb80e2523eea721cda","avatarUrl":"/avatars/c116a64e741aa88d4e1fac3bf62d4382.svg","isPro":false,"fullname":"Jonah Turner","user":"drexalt","type":"user"},{"_id":"61f4d468587c793cdf55b4dd","avatarUrl":"/avatars/ce597d8d2640c726473dd85ae8c5cdc7.svg","isPro":false,"fullname":"Lee Gao","user":"leegao19","type":"user"},{"_id":"65025370b6595dc45c397340","avatarUrl":"/avatars/9469599b176034548042922c0afa7051.svg","isPro":false,"fullname":"J C","user":"dark-pen","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0}">Simple linear attention language models balance the recall-throughput tradeoff
Abstract
BASED, a hybrid architecture combining linear and sliding window attention, improves language model efficiency without sacrificing recall quality.
Recent work has shown that attention-based language models excel at recall, the ability to ground generations in tokens previously seen in context. However, the efficiency of attention-based models is bottlenecked during inference by the KV-cache's aggressive memory consumption. In this work, we explore whether we can improve language model efficiency (e.g. by reducing memory consumption) without compromising on recall. By applying experiments and theory to a broad set of architectures, we identify a key tradeoff between a model's state size and recall ability. We show that efficient alternatives to attention (e.g. H3, Mamba, RWKV) maintain a fixed-size recurrent state, but struggle at recall. We propose BASED, a simple architecture combining linear and sliding window attention. By varying BASED window size and linear attention feature dimension, we can dial the state size and traverse the Pareto frontier of the recall-memory tradeoff curve, recovering the full quality of attention on one end and the small state size of attention-alternatives on the other. We train language models up to 1.3b parameters and show that BASED matches the strongest sub-quadratic models (e.g. Mamba) in perplexity and outperforms them on real-world recall-intensive tasks by 6.22 accuracy points. Implementations of linear attention are often less efficient than optimized standard attention implementations. To make BASED competitive, we develop IO-aware algorithms that enable 24x higher throughput on language generation than FlashAttention-2, when generating 1024 tokens using 1.3b parameter models. Code for this work is provided at: https://github.com/HazyResearch/based.
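For readers who want intuition for the two mixers the abstract combines, the following is a minimal, unofficial sketch of global linear attention with a 2nd-order Taylor feature map and local sliding-window attention. The single-head simplification, the naive cumulative-sum implementation, and the function names are assumptions for illustration only; the repository linked above contains the actual multi-head, IO-aware kernels.

```python
# Minimal, single-head sketch of the two mixers described in the abstract.
# Illustrative approximation, NOT the official BASED implementation.
import torch
import torch.nn.functional as F


def taylor_feature_map(x: torch.Tensor) -> torch.Tensor:
    """2nd-order Taylor approximation of exp(<q,k>): phi(x) = [1, x, (x outer x)/sqrt(2)]."""
    x2 = (x.unsqueeze(-1) * x.unsqueeze(-2)).flatten(-2) / (2 ** 0.5)
    return torch.cat([torch.ones_like(x[..., :1]), x, x2], dim=-1)


def linear_attention(q, k, v):
    """Global linear attention: the (d_phi x d_v) cumulative sums act as a fixed-size recurrent state."""
    q, k = taylor_feature_map(q), taylor_feature_map(k)
    kv_state = torch.cumsum(k.unsqueeze(-1) * v.unsqueeze(-2), dim=0)  # (n, d_phi, d_v)
    k_state = torch.cumsum(k, dim=0)                                   # (n, d_phi)
    num = torch.einsum("nf,nfd->nd", q, kv_state)
    den = torch.einsum("nf,nf->n", q, k_state).clamp(min=1e-6)
    return num / den.unsqueeze(-1)


def sliding_window_attention(q, k, v, window: int = 64):
    """Local softmax attention over the most recent `window` tokens (exact recall locally)."""
    n, d = q.shape
    scores = (q @ k.T) / d ** 0.5
    idx = torch.arange(n)
    mask = (idx[None, :] <= idx[:, None]) & (idx[None, :] > idx[:, None] - window)
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v


if __name__ == "__main__":
    n, d = 128, 16
    q, k, v = (torch.randn(n, d) for _ in range(3))
    print(linear_attention(q, k, v).shape)          # torch.Size([128, 16])
    print(sliding_window_attention(q, k, v).shape)  # torch.Size([128, 16])
```

Dialing the window size and the feature dimension of the Taylor map is what trades recall quality against recurrent-state size in the abstract's framing.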
Community
Could it be that such an attention mechanism mostly works because instruction-following GPTs use attention as a redundant helper pattern for their feed-forward nets?
A visualization of this pattern: https://github.com/jessevig/bertviz/issues/128
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- On the Efficacy of Eviction Policy for Key-Value Constrained Generative Language Model Inference (2024)
- Scaling Sparse Fine-Tuning to Large Language Models (2024)
- Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models (2024)
- Why Lift so Heavy? Slimming Large Language Models by Cutting Off the Layers (2024)
- Long-Context Language Modeling with Parallel Context Encoding (2024)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot
recommend
Hello my friend, I read this paper and it was really great. Can I ask for your help to finish my paper?
Hi folks, I see the "attention" pattern is 5:5:17 or 7:7:22 for global-linear:64-SWA:BaseConv layers. How are these different layers organized together? Are they stacked (global:SWA:Conv, or another permutation?), are they interleaved? Is the optimal pattern of global/local/conv analyzed?
Hi! The layer mixtures and orders are specified in the reference config provided here: https://github.com/HazyResearch/based/blob/e2834d89d1b23d4b3beb13389881b84601a95db6/train/configs/experiment/reference/based-360m.yaml#L53 (the layers are stacked).

Thank you for checking! Did you folks look into an ablation study on how different mixtures of the layers impact performance?

In Table 4, we ablate the use of different sub-layers. However, we don't include ablation studies comparing proportions of each layer type.

Gotcha, I ask because there seems to be some interesting research around how best to stack or interleave these global:local layered attention mechanisms, and the placement of the layers leads to interesting outcomes in terms of long-range modeling.

Some that I am aware of include:

1. Rae and Raavi, 2020 (https://huggingface.co/papers/2007.03356) - while their work is on long/short-range memory modules (effectively activation caches fed into attention) in Transformer-XL, the spirit is similar. They find that either interleaving local:global or stacking the global layers at the final layers (top) is most effective.

2. Song et al., 2023 (https://huggingface.co/papers/2312.08618) - found interleaving local (SWA):global (full) attention layers at a density of 3:1 to be effective.

3. Chen and Lee, 2023 (https://huggingface.co/papers/2305.12689) - proposes interleaving local:global attention using what is effectively blocked-window local attention and special "latent tokens" purely to propagate global information in the global layers.

4. Zhang et al., 2023 (https://huggingface.co/papers/2310.12442) - also analyzes placements of local (block window):global (full) attention layers. In contrast to Rae and Raavi (which was for Transformer-XL), they found the optimal placement to be stacking the global attention layers at the lower layers (bottom). They found roughly similar performance for interleaved layers, but placing the global attention layers at the very top was unequivocally worse.

Based on https://github.com/HazyResearch/based/blob/e2834d89d1b23d4b3beb13389881b84601a95db6/train/configs/experiment/reference/based-360m.yaml#L53, it looks like Based's pattern is an alternating pattern of {Global(Linear), Local(SWA), Local(Conv), Local(Conv), Local(Conv)} repeated, which is very interesting.

Have you thought about fixing this attention pattern {Global, Local(SWA), Local(Conv), ...} but replacing the Global layer with full attention, and then comparing results against Based's global linear attention? That would seem to be the most apples-to-apples comparison of long-range modeling performance.

Thanks for sending this note! Regarding your questions, a few more informal observations from Based beyond the paper discussion are as follows.

On combining the layers with full attention, we tried:
- Transformer++ (i.e., full attention, with Rotary, with SwiGLU) plus Local(Conv): this was the same quality as pure Transformer++.
- Transformer (GPT-2, full attention, no Rotary, no SwiGLU) was worse than Transformer++ by a decent margin, but adding Local(Conv) closed almost all of the gap.

Takeaways: Essentially, we see that Local(Conv) is okay instead of Rotary. We did not see additional benefit from combining Local(Conv) with full Transformer++ attention layers.

Orderings: In the following, consider 12-layer hybrid architectures with 3 linear attention (LA, Taylor) and 9 local conv layers at 150M parameters. Again, these are just for informal sharing, not paper-quality numbers :)
- 3 LA first, then 9 Local(Conv): about 25 ppl after 1.5B tokens of training on the Pile
- 9 Local(Conv) first, then 3 LA: about 16 ppl, same setup
- 3 Local(Conv) + 1 LA, repeated: about 15 ppl, same setup

On MQAR synthetics in our Zoology repo (https://github.com/HazyResearch/zoology), you can try combining a short-conv plus LA or an SWA plus LA layer in different orders (look at 2- or 4-layer networks). We found that it was helpful to put the short mixer first.

Overall, in Based we showed that combining complementary operations allowed extending the recall-throughput Pareto frontier. We focused on combining the layers to balance both efficiency and quality. Beyond extremes like my comment above about putting all the global mixers (LA) first, we didn't find the trends and key takeaways to be particularly sensitive to how we hybridized.

This is an amazing response! Thank you!

> Takeaways: Essentially, we see that Local(Conv) is okay instead of Rotary. We did not see additional benefit from combining Local(Conv) with full Transformer++ attention layers.

Wow, that's quite a finding!

Is the total number of layers (including both transformer and BaseConv layers) still the same in this setup? It's so interesting that alternating no-PE transformer layers with BaseConv layers seems to work just as well as a full-attention transformer with PE, even if half the layers are subquadratic (and not even attention-based).

This also reminds me of something I've heard from quite a few folks now. There is likely something in RoPE that introduces an inductive bias favoring short-range modeling while making it difficult to generalize to longer-range modeling tasks (without, e.g., extending its frequency via ABF to something like 10M). On the other hand, NoPE (or non-trigonometric additive-bias methods, like T5) seems to work well at long-range modeling but performs poorly at short range. It makes sense that combining NoPE with something that models short range well (it sounds like BaseConv does) would help overcome both challenges.

> In the following, consider 12-layer hybrid architectures with 3 linear attention (LA, Taylor) and 9 local conv layers at 150M parameters.

Nice! It seems like y'all also found that global (LA) at the top (later) layers outperforms local at the top, and that interleaving is roughly similar to (slightly better than) the global-at-the-top approach; that's another great data point.

I also saw your other amazing paper (https://huggingface.co/papers/2312.04927) on convolution-only models failing MQAR as well (and also proposing a hybrid approach to address the MQAR gap). Recent results on, e.g., SSM models show similar gaps, and it is generally accepted that this is one of (if not the main) reasons behind their lackluster performance on ICL.

Did you also look into or analyze why linear attention (especially with an approximate-softmax kernel like Taylor) falls short here as well? I believe the favorite argument (Olsson) right now comes from the mechanistic interpretability lens, attributing the lack of AR and ICL to a lack of robust induction circuits to facilitate recall and induction.
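To make the layer-ordering discussion in this thread concrete, here is a small, hypothetical sketch of the three 12-layer hybrids described above (3 linear-attention layers plus 9 local-conv layers). The `ShortConvMixer` and `LinearAttnMixer` classes are illustrative stand-ins (an elu+1 feature map rather than the Taylor map, no MLPs or norms), the ordering names are mine, and the perplexity annotations simply echo the informal numbers quoted above; this is not the BASED training code, which is configured through Hydra YAML in the repo.

```python
# Hypothetical sketch of the layer orderings discussed in this thread.
# Stand-in mixers only; not the BASED implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ShortConvMixer(nn.Module):
    """Local(Conv) stand-in: short causal depthwise convolution over the sequence."""
    def __init__(self, d_model: int, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(d_model, d_model, kernel_size,
                              padding=kernel_size - 1, groups=d_model)

    def forward(self, x):                                    # x: (batch, seq, d_model)
        y = self.conv(x.transpose(1, 2))[..., : x.size(1)]   # trim right side to stay causal
        return x + y.transpose(1, 2)


class LinearAttnMixer(nn.Module):
    """Global linear-attention stand-in (elu+1 feature map, causal cumulative state)."""
    def __init__(self, d_model: int):
        super().__init__()
        self.q, self.k, self.v = (nn.Linear(d_model, d_model) for _ in range(3))

    def forward(self, x):
        q = F.elu(self.q(x)) + 1
        k = F.elu(self.k(x)) + 1
        v = self.v(x)
        kv = torch.cumsum(k.unsqueeze(-1) * v.unsqueeze(-2), dim=1)  # (b, n, d, d) running state
        z = torch.cumsum(k, dim=1)                                   # (b, n, d) normalizer state
        num = torch.einsum("bnd,bnde->bne", q, kv)
        den = torch.einsum("bnd,bnd->bn", q, z).clamp(min=1e-6)
        return x + num / den.unsqueeze(-1)


ORDERINGS = {
    "la_first":    ["L"] * 3 + ["C"] * 9,        # ~25 ppl in the informal run quoted above
    "conv_first":  ["C"] * 9 + ["L"] * 3,        # ~16 ppl
    "interleaved": ["C", "C", "C", "L"] * 3,     # ~15 ppl
}


def build(order: str, d_model: int = 64) -> nn.Sequential:
    """Expand an ordering into a stacked 12-layer model."""
    layers = [ShortConvMixer(d_model) if kind == "C" else LinearAttnMixer(d_model)
              for kind in ORDERINGS[order]]
    return nn.Sequential(*layers)


if __name__ == "__main__":
    model = build("interleaved")
    print(model(torch.randn(2, 32, 64)).shape)   # torch.Size([2, 32, 64])
```

The point of the sketch is only that the ordering is a one-line list change: swapping `ORDERINGS["la_first"]` for `ORDERINGS["interleaved"]` keeps parameter count fixed while changing where the global mixers sit, which is exactly the variable the informal comparison above varies.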