AquilaMoE: Efficient Training for MoE Models with Scale-Up and Scale-Out Strategies
Code: https://github.com/FlagAI-Open/Aquila-MoE

Authors: Bo-Wen Zhang, Liangdong Wang, Ye Yuan, Jijie Li, Shuhao Gu, Mengdi Zhao, Xinya Wu, Guang Liu, Chengwei Wu, Hanyu Zhao, Li Du, Yiming Ju, Quanyue Ma, Yulong Ao, Yingli Zhao, Songhe Zhu, Zhou Cao, Dong Liang, Yonghua Lin, Ming Zhang, Shunfei Wang, Yanxin Zhou, Min Ye, Xuekai Chen, Xinyang Yu, Xiangjun Huang, Jian Yang

Published: 2024-08-13
AI-generated summary

AquilaMoE, a large-scale bilingual Mixture of Experts (MoE) language model, uses the EfficientScale methodology to efficiently scale up from smaller models while maintaining performance.

Abstract
In recent years, as large language models have been rapidly applied across various fields, their scale has grown steadily and the resources required for pre-training have increased exponentially. Training an LLM from scratch consumes substantial compute, whereas scaling up from a smaller model is a more efficient approach and has therefore attracted significant attention. In this paper, we present AquilaMoE, a bilingual 8*16B Mixture of Experts (MoE) language model with 8 experts of 16 billion parameters each, developed using a training methodology called EfficientScale. This approach optimizes performance while minimizing data requirements through a two-stage process. The first stage, termed Scale-Up, initializes the larger model with weights from a pre-trained smaller model, enabling substantial knowledge transfer and continuous pretraining with significantly less data. The second stage, Scale-Out, uses a pre-trained dense model to initialize the MoE experts, further enhancing knowledge transfer and performance (both stages are sketched in the code below). Extensive validation experiments on 1.8B and 7B models compared various initialization schemes, yielding models whose loss is preserved at initialization and continues to decrease during continuous pretraining. Using the optimal scheme, we successfully trained a 16B dense model and subsequently the 8*16B AquilaMoE model, demonstrating significant improvements in performance and training efficiency.
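
The Scale-Up stage initializes a larger dense model from a smaller pre-trained one; the abstract does not specify which initialization scheme was ultimately selected. The snippet below is only a minimal sketch of the general idea, using a simple tile-and-rescale width expansion in PyTorch. The function name `expand_linear_weight` and the rescaling heuristic are illustrative assumptions, not the scheme chosen in the paper.

```python
import torch

def expand_linear_weight(w_small: torch.Tensor, d_out: int, d_in: int) -> torch.Tensor:
    """Expand a small weight matrix into a larger one by tiling rows/columns.

    Hypothetical width-expansion heuristic for illustration only; it is not
    necessarily the initialization scheme used for AquilaMoE.
    """
    out_small, in_small = w_small.shape
    # Repeat the matrix enough times in each direction, then crop to the target size.
    reps_out = -(-d_out // out_small)  # ceil division
    reps_in = -(-d_in // in_small)
    w_large = w_small.repeat(reps_out, reps_in)[:d_out, :d_in].clone()
    # Rescale so duplicated input columns keep the pre-activation magnitude
    # roughly comparable to the original model (a common function-preserving trick).
    w_large *= in_small / d_in
    return w_large


# Example: grow a 1024x1024 projection to 2048x2048 before continuous pretraining.
small = torch.randn(1024, 1024)
large = expand_linear_weight(small, 2048, 2048)
print(large.shape)  # torch.Size([2048, 2048])
```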
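The Scale-Out stage initializes the MoE experts from a pre-trained dense model. The toy `MoELayer` below sketches one way that could look: every expert starts as a copy of the dense feed-forward block, while a freshly initialized top-k router mixes their outputs. The class name, the top-2 routing, the stand-in FFN, and the optional symmetry-breaking noise are assumptions for illustration; they are not taken from the AquilaMoE implementation.

```python
import copy
import torch
import torch.nn as nn

class MoELayer(nn.Module):
    """Toy MoE layer whose experts are all initialized from one dense FFN (illustrative)."""

    def __init__(self, dense_ffn: nn.Module, num_experts: int = 8, d_model: int = 4096,
                 top_k: int = 2, noise_std: float = 0.0):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts, bias=False)  # trained from scratch
        self.experts = nn.ModuleList()
        for _ in range(num_experts):
            expert = copy.deepcopy(dense_ffn)  # each expert starts as the dense FFN
            if noise_std > 0:
                with torch.no_grad():
                    for p in expert.parameters():
                        p.add_(noise_std * torch.randn_like(p))  # optional symmetry breaking
            self.experts.append(expert)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model)
        scores = torch.softmax(self.router(x), dim=-1)           # (tokens, num_experts)
        topk_scores, topk_idx = scores.topk(self.top_k, dim=-1)  # (tokens, top_k)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            idx = topk_idx[:, slot]
            weight = topk_scores[:, slot].unsqueeze(-1)
            for e, expert in enumerate(self.experts):
                mask = idx == e
                if mask.any():
                    out[mask] += weight[mask] * expert(x[mask])
        return out


# Usage: wrap a pre-trained dense FFN (here a stand-in MLP) into an 8-expert layer.
dense_ffn = nn.Sequential(nn.Linear(4096, 11008), nn.SiLU(), nn.Linear(11008, 4096))
moe = MoELayer(dense_ffn, num_experts=8, d_model=4096)
tokens = torch.randn(4, 4096)
print(moe(tokens).shape)  # torch.Size([4, 4096])
```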