Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts
Published on Aug 28, 2024
by Lean Wang, Huazuo Gao, Chenggang Zhao, Xu Sun, Damai Dai
Abstract
Loss-Free Balancing improves performance and load balance in Mixture-of-Experts models by dynamically adjusting expert biases without introducing auxiliary loss gradients.
For Mixture-of-Experts (MoE) models, an unbalanced expert load leads to routing collapse or increased computational overhead. Existing methods commonly employ an auxiliary loss to encourage load balance, but a large auxiliary loss introduces non-negligible interference gradients into training and thus impairs model performance. To control load balance without producing undesired gradients during training, we propose Loss-Free Balancing, an auxiliary-loss-free load balancing strategy. Specifically, before the top-K routing decision, Loss-Free Balancing first applies an expert-wise bias to the routing score of each expert. By dynamically updating each expert's bias according to its recent load, Loss-Free Balancing consistently maintains a balanced distribution of expert load. In addition, since Loss-Free Balancing produces no interference gradients, it also raises the upper bound of model performance attainable from MoE training. We validate Loss-Free Balancing on MoE models with up to 3B parameters trained on up to 200B tokens. Experimental results show that Loss-Free Balancing achieves both better performance and better load balance than traditional auxiliary-loss-controlled load balancing strategies.
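The abstract pins down the mechanism well enough to sketch: the bias affects only which experts are selected, never the gate values used to combine expert outputs, and each bias is nudged toward the mean load after every step. Below is a minimal PyTorch sketch under those assumptions; the function names, the sign-based update, and the update rate `u` are illustrative choices for this sketch, not the authors' code, and the paper's exact gating function and update rule may differ.

```python
import torch

def biased_topk_routing(scores: torch.Tensor, bias: torch.Tensor, k: int):
    """Pick top-k experts per token with biased scores, but keep unbiased gates.

    scores: [num_tokens, num_experts] gating scores (e.g. sigmoid outputs)
    bias:   [num_experts] expert-wise bias used ONLY for expert selection
    """
    # The bias shifts which experts win the top-k comparison...
    _, topk_idx = torch.topk(scores + bias, k, dim=-1)
    # ...but the combine weights come from the original, unbiased scores,
    # so the bias contributes no gradient to the model.
    topk_gate = torch.gather(scores, dim=-1, index=topk_idx)
    return topk_idx, topk_gate

@torch.no_grad()
def update_bias(bias: torch.Tensor, topk_idx: torch.Tensor,
                num_experts: int, u: float = 1e-3):
    """After each step, nudge every expert's bias toward the mean load."""
    load = torch.bincount(topk_idx.flatten(), minlength=num_experts).float()
    error = load.mean() - load        # > 0 means the expert is under-loaded
    bias += u * torch.sign(error)     # raise bias of under-loaded experts

# Hypothetical usage: 16 experts, top-2 routing, one call per training step.
scores = torch.rand(1024, 16)         # stand-in for a router's gating outputs
bias = torch.zeros(16)
idx, gate = biased_topk_routing(scores, bias, k=2)
update_bias(bias, idx, num_experts=16)
```

Because the bias update happens outside the computation graph (`no_grad`) and the gate values are gathered from the unbiased scores, the balancing mechanism introduces none of the interference gradients that an auxiliary loss would.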