arxiv:2406.06563

Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models

Published on Jun 3, 2024 · Submitted by AK on Jun 12, 2024

Authors: Tianwen Wei, Bo Zhu, Liang Zhao, Cheng Cheng, Biye Li, Weiwei Lü, Peng Cheng, Jianhao Zhang, Xiaoyu Zhang, Liang Zeng, Xiaokun Wang, Yutuan Ma, Rui Hu, Shuicheng Yan, Han Fang, Yahui Zhou
Abstract

AI-generated summary: Skywork-MoE, a high-performance mixture-of-experts language model, uses upcycling and innovative techniques like gating logit normalization and adaptive auxiliary loss coefficients to achieve strong performance across benchmarks.

In this technical report, we introduce the training methodologies implemented in the development of Skywork-MoE, a high-performance mixture-of-experts (MoE) large language model (LLM) with 146 billion parameters and 16 experts. It is initialized from the pre-existing dense checkpoints of our Skywork-13B model. We explore the comparative effectiveness of upcycling versus training from scratch initializations. Our findings suggest that the choice between these two approaches should consider both the performance of the existing dense checkpoints and the MoE training budget. We highlight two innovative techniques: gating logit normalization, which improves expert diversification, and adaptive auxiliary loss coefficients, allowing for layer-specific adjustment of auxiliary loss coefficients. Our experimental results validate the effectiveness of these methods. Leveraging these techniques and insights, we trained our upcycled Skywork-MoE on a condensed subset of our SkyPile corpus. The evaluation results demonstrate that our model delivers strong performance across a wide range of benchmarks.
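The two techniques highlighted above lend themselves to a short illustration. The sketch below is an editorial example, not the authors' code: it assumes that gating logit normalization means standardizing the router logits and rescaling them by a factor lambda before the softmax, and that an adaptive auxiliary loss coefficient is a per-layer value nudged up when that layer's routing is imbalanced and down otherwise. All function names, shapes, and thresholds are hypothetical.

```python
# Illustrative sketch of gating logit normalization and a per-layer adaptive
# auxiliary loss coefficient for a top-2 MoE router. Names and values are
# assumptions, not the paper's implementation.
import torch
import torch.nn.functional as F


def normalized_gating(logits: torch.Tensor, lam: float = 1.0, eps: float = 1e-6):
    """Standardize gating logits per token, then rescale by lam before softmax."""
    mean = logits.mean(dim=-1, keepdim=True)
    std = logits.std(dim=-1, keepdim=True)
    normed = lam * (logits - mean) / (std + eps)   # sharper, better-separated gates
    return F.softmax(normed, dim=-1)               # (num_tokens, num_experts)


def load_balancing_aux_loss(probs: torch.Tensor, top_k: int = 2):
    """Switch/GShard-style auxiliary loss that encourages uniform expert use."""
    num_experts = probs.size(-1)
    top_idx = probs.topk(top_k, dim=-1).indices
    # Hard routing counts per expert, averaged over tokens (sums to top_k).
    dispatch = F.one_hot(top_idx, num_experts).float().sum(dim=1).mean(dim=0)
    # Mean gate probability per expert (soft assignment).
    importance = probs.mean(dim=0)
    return num_experts * torch.dot(dispatch, importance)


def update_aux_coefficient(alpha: float, imbalance: float,
                           target: float = 0.1, step: float = 1.2):
    """Per-layer adaptive coefficient: grow alpha when the layer is imbalanced,
    shrink it once the layer is already balanced (multiplicative update)."""
    return alpha * step if imbalance > target else alpha / step


# Toy usage: 8 tokens routed over 16 experts in a single layer.
logits = torch.randn(8, 16)
probs = normalized_gating(logits, lam=2.0)
aux = load_balancing_aux_loss(probs)
alpha = update_aux_coefficient(alpha=0.01, imbalance=aux.item() - 1.0)
```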

Community

Congrats on the release and paper! Really cool to see a new MoE from the Chinese community! 🔥

What I would do to be as on top of the Chinese community research as you lol

https://huggingface.co/zh-ai-community 👀

Sure, sent in a request. Going to get my phone so I can join the WeChat group.

The WeChat QR link seems to link to the HF blog post.

Added ✅ Feel free to contribute and share your advice : )
For WeChat, it's our public account; we usually share blogs and some news on it.

I misread it as a group chat link, haha; it would be a decent idea to have a group though.

I wonder if it would be a good idea to mirror some ModelScope versions of current Chinese LLMs; some of them have slight differences (especially the instruction models).

Great job.

In section 3, "Upcycling vs. From Scratch", training the MoE model from scratch comes out better than upcycling.

But in section 5, "Skywork-MoE", the model is initialized from the in-house pre-trained Skywork-13B.

Could you give more detail on this?
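For readers following this thread: "upcycling" in the paper means initializing the MoE model from a dense checkpoint, typically by copying the dense FFN weights into every expert and adding a freshly initialized router. Per the abstract, which initialization wins depends on the quality of the dense checkpoint and the MoE training budget, and Skywork-MoE itself uses the upcycled initialization from Skywork-13B. Below is a minimal sketch of the upcycling step under assumed module names and sizes, not the authors' implementation.

```python
# Hypothetical upcycling of one dense FFN block into an MoE layer.
import copy
import torch.nn as nn


def upcycle_ffn_to_moe(dense_ffn: nn.Module, num_experts: int = 16,
                       hidden_size: int = 4096) -> nn.ModuleDict:
    """Clone a dense FFN into each expert and attach a freshly initialized router.

    Because every expert starts as an identical copy of the dense FFN, the
    upcycled model initially behaves like the dense one; training then lets the
    experts diverge.
    """
    experts = nn.ModuleList([copy.deepcopy(dense_ffn) for _ in range(num_experts)])
    gate = nn.Linear(hidden_size, num_experts, bias=False)
    nn.init.normal_(gate.weight, std=0.02)  # small-scale init for the new router
    return nn.ModuleDict({"experts": experts, "gate": gate})


# Toy usage with a hypothetical two-layer FFN of width 4096.
dense_ffn = nn.Sequential(nn.Linear(4096, 16384), nn.GELU(), nn.Linear(16384, 4096))
moe_layer = upcycle_ffn_to_moe(dense_ffn, num_experts=16, hidden_size=4096)
```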


Models citing this paper 2

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2406.06563 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2406.06563 in a Space README.md to link it from this page.

Collections including this paper 8