arxiv:2408.08274

BAM! Just Like That: Simple and Efficient Parameter Upcycling for Mixture of Experts

Published on Aug 15, 2024 · Submitted by AK on Aug 16, 2024
Authors: Qizhen Zhang, Nikolas Gritsch, Dwaraknath Gnaneshwar, Simon Guo, David Cairuz, Bharat Venkitesh, Jakob Foerster, Phil Blunsom, Sebastian Ruder, Ahmet Ustun, Acyr Locatelli

Abstract

AI-generated summary

BAM enhances the Mixture of Experts framework by fully recycling dense model parameters, including attention layers, leading to better performance and efficiency than previous methods.

The Mixture of Experts (MoE) framework has become a popular architecture for large language models due to its superior performance over dense models. However, training MoEs from scratch in a large-scale regime is prohibitively expensive. Existing methods mitigate this by pre-training multiple dense expert models independently and using them to initialize an MoE. This is done by using the experts' feed-forward network (FFN) weights to initialize the MoE's experts while merging the other parameters. However, this approach limits the reuse of dense model parameters to only the FFN layers, thereby constraining the advantages of "upcycling" these models into MoEs. We propose BAM (Branch-Attend-Mix), a simple yet effective method that addresses this shortcoming. BAM makes full use of specialized dense models by not only using their FFNs to initialize the MoE layers but also leveraging the experts' attention parameters fully by initializing them into a soft variant of Mixture of Attention (MoA) layers. We explore two methods for upcycling attention parameters: 1) initializing separate attention experts from the dense models, including all attention parameters, for the best model performance; and 2) sharing key and value parameters across all experts to facilitate better inference efficiency. To further improve efficiency, we adopt a parallel attention transformer architecture for MoEs, which allows the attention experts and FFN experts to be computed concurrently. Our experiments on seed models ranging from 590 million to 2 billion parameters demonstrate that BAM surpasses baselines in both perplexity and downstream task performance, within the same computational and data constraints.
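
To make the upcycling idea concrete, here is a minimal, hypothetical PyTorch sketch of a BAM-style block: FFN and attention weights from two pre-trained dense "branch" models are copied in as experts, combined with a soft (dense) router, and evaluated in a parallel attention-and-FFN layout. Module and parameter names are illustrative, a single router is shared across attention and FFN experts for brevity, and this is not the authors' implementation.

```python
# Hypothetical sketch of BAM-style parameter upcycling (not the authors' code).
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


class DenseBlock(nn.Module):
    """Stand-in for one pre-trained dense transformer block (a "branch")."""
    def __init__(self, d_model=64, n_heads=4, d_ff=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                 nn.Linear(d_ff, d_model))


class BAMBlock(nn.Module):
    """MoE block whose FFN *and* attention experts are upcycled from dense models.

    Simplifications: a soft mixture (all experts evaluated, outputs weighted by
    router probabilities) approximates the soft MoA variant, and attention and
    FFN experts read the same input, mimicking the parallel attention layout.
    """
    def __init__(self, dense_blocks, d_model=64):
        super().__init__()
        self.attn_experts = nn.ModuleList(
            copy.deepcopy(b.attn) for b in dense_blocks)      # upcycled attention
        self.ffn_experts = nn.ModuleList(
            copy.deepcopy(b.ffn) for b in dense_blocks)       # upcycled FFNs
        self.router = nn.Linear(d_model, len(dense_blocks))   # trained from scratch

    def forward(self, x):
        gates = F.softmax(self.router(x), dim=-1)              # (B, T, n_experts)
        attn_out = sum(g.unsqueeze(-1) * e(x, x, x, need_weights=False)[0]
                       for g, e in zip(gates.unbind(-1), self.attn_experts))
        ffn_out = sum(g.unsqueeze(-1) * e(x)
                      for g, e in zip(gates.unbind(-1), self.ffn_experts))
        return x + attn_out + ffn_out                          # parallel residual


if __name__ == "__main__":
    branches = [DenseBlock(), DenseBlock()]   # pre-trained dense seed models
    moe = BAMBlock(branches)
    print(moe(torch.randn(2, 8, 64)).shape)   # torch.Size([2, 8, 64])
```

The shared-KV variant described in the abstract would correspond to keeping per-expert query/output projections while pointing all attention experts at one set of key/value projections; that detail is omitted here for brevity.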

Community

Paper submitter

[Screenshot attachment]

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* [LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training](https://huggingface.co/papers/2406.16554) (2024)
* [A Survey on Mixture of Experts](https://huggingface.co/papers/2407.06204) (2024)
* [Layerwise Recurrent Router for Mixture-of-Experts](https://huggingface.co/papers/2408.06793) (2024)
* [AdaMoE: Token-Adaptive Routing with Null Experts for Mixture-of-Experts Language Models](https://huggingface.co/papers/2406.13233) (2024)
* [MaskMoE: Boosting Token-Level Learning via Routing Mask in Mixture-of-Experts](https://huggingface.co/papers/2407.09816) (2024)

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

With titles like this getting more popular, I sometimes wish we had a downvote button on Hugging Face :)


Models citing this paper (0)

No model linking this paper

Cite arxiv.org/abs/2408.08274 in a model README.md to link it from this page.

Datasets citing this paper (0)

No dataset linking this paper

Cite arxiv.org/abs/2408.08274 in a dataset README.md to link it from this page.

Spaces citing this paper (0)

No Space linking this paper

Cite arxiv.org/abs/2408.08274 in a Space README.md to link it from this page.

Collections including this paper (2)