Paper page - Scaling Embeddings Outperforms Scaling Experts in Language Models
\n","updatedAt":"2026-01-30T22:22:14.812Z","author":{"_id":"65243980050781c16f234f1f","avatarUrl":"/avatars/743a009681d5d554c27e04300db9f267.svg","fullname":"Avi","name":"avahal","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":3,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.6680307388305664},"editors":["avahal"],"editorAvatarUrls":["/avatars/743a009681d5d554c27e04300db9f267.svg"],"reactions":[],"isReport":false}},{"id":"697d5d01b0a90b2bc7d776c2","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false},"createdAt":"2026-01-31T01:38:09.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"This is an automated message from the [Librarian Bot](https://huggingface.co/librarian-bots). I found the following papers similar to this paper. \n\nThe following papers were recommended by the Semantic Scholar API \n\n* [Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models](https://huggingface.co/papers/2601.07372) (2026)\n* [Sigma-MoE-Tiny Technical Report](https://huggingface.co/papers/2512.16248) (2025)\n* [VersatileFFN: Achieving Parameter Efficiency in LLMs via Adaptive Wide-and-Deep Reuse](https://huggingface.co/papers/2512.14531) (2025)\n* [STEM: Scaling Transformers with Embedding Modules](https://huggingface.co/papers/2601.10639) (2026)\n* [LatentMoE: Toward Optimal Accuracy per FLOP and Parameter in Mixture of Experts](https://huggingface.co/papers/2601.18089) (2026)\n* [MIDUS: Memory-Infused Depth Up-Scaling](https://huggingface.co/papers/2512.13751) (2025)\n* [A.X K1 Technical Report](https://huggingface.co/papers/2601.09200) (2026)\n\n\n Please give a thumbs up to this comment if you found it helpful!\n\n If you want recommendations for any Paper on Hugging Face checkout [this](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers) Space\n\n You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`","html":"
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
\n
The following papers were recommended by the Semantic Scholar API
Please give a thumbs up to this comment if you found it helpful!
\n
If you want recommendations for any Paper on Hugging Face checkout this Space
\n
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: \n\n@librarian-bot\n\t recommend
\n","updatedAt":"2026-01-31T01:38:09.564Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.700796365737915},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2601.21204","authors":[{"_id":"697c3801a67238fac88cc1b1","name":"Hong Liu","hidden":false},{"_id":"697c3801a67238fac88cc1b2","name":"Jiaqi Zhang","hidden":false},{"_id":"697c3801a67238fac88cc1b3","name":"Chao Wang","hidden":false},{"_id":"697c3801a67238fac88cc1b4","name":"Xing Hu","hidden":false},{"_id":"697c3801a67238fac88cc1b5","user":{"_id":"64cc5eb3942890af937473c4","avatarUrl":"/avatars/38b71030d6fc5592565a0d286a0c2ec4.svg","isPro":false,"fullname":"Linkun Lyu","user":"llk2why","type":"user"},"name":"Linkun Lyu","status":"claimed_verified","statusLastChangedAt":"2026-02-02T16:59:30.907Z","hidden":false},{"_id":"697c3801a67238fac88cc1b6","name":"Jiaqi Sun","hidden":false},{"_id":"697c3801a67238fac88cc1b7","name":"Xurui Yang","hidden":false},{"_id":"697c3801a67238fac88cc1b8","name":"Bo Wang","hidden":false},{"_id":"697c3801a67238fac88cc1b9","name":"Fengcun Li","hidden":false},{"_id":"697c3801a67238fac88cc1ba","name":"Yulei Qian","hidden":false},{"_id":"697c3801a67238fac88cc1bb","name":"Lingtong Si","hidden":false},{"_id":"697c3801a67238fac88cc1bc","name":"Yerui Sun","hidden":false},{"_id":"697c3801a67238fac88cc1bd","name":"Rumei Li","hidden":false},{"_id":"697c3801a67238fac88cc1be","name":"Peng Pei","hidden":false},{"_id":"697c3801a67238fac88cc1bf","name":"Yuchen Xie","hidden":false},{"_id":"697c3801a67238fac88cc1c0","name":"Xunliang Cai","hidden":false}],"publishedAt":"2026-01-29T03:11:19.000Z","submittedOnDailyAt":"2026-01-30T02:18:11.112Z","title":"Scaling Embeddings Outperforms Scaling Experts in Language Models","submittedOnDailyBy":{"_id":"6039478ab3ecf716b1a5fd4d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6039478ab3ecf716b1a5fd4d/_Thy4E7taiSYBLKxEKJbT.jpeg","isPro":true,"fullname":"taesiri","user":"taesiri","type":"user"},"summary":"While Mixture-of-Experts (MoE) architectures have become the standard for sparsity scaling in large language models, they increasingly face diminishing returns and system-level bottlenecks. In this work, we explore embedding scaling as a potent, orthogonal dimension for scaling sparsity. Through a comprehensive analysis and experiments, we identify specific regimes where embedding scaling achieves a superior Pareto frontier compared to expert scaling. We systematically characterize the critical architectural factors governing this efficacy -- ranging from parameter budgeting to the interplay with model width and depth. Moreover, by integrating tailored system optimizations and speculative decoding, we effectively convert this sparsity into tangible inference speedups. Guided by these insights, we introduce LongCat-Flash-Lite, a 68.5B parameter model with ~3B activated trained from scratch. 
Despite allocating over 30B parameters to embeddings, LongCat-Flash-Lite not only surpasses parameter-equivalent MoE baselines but also exhibits exceptional competitiveness against existing models of comparable scale, particularly in agentic and coding domains.","upvotes":99,"discussionId":"697c3801a67238fac88cc1c1","ai_summary":"Embedding scaling offers superior sparsity scaling compared to expert scaling in large language models, enabling efficient inference through system optimizations and speculative decoding.","ai_keywords":["Mixture-of-Experts","sparsity scaling","embedding scaling","Pareto frontier","parameter budgeting","model width","model depth","system optimizations","speculative decoding","LongCat-Flash-Lite"],"organization":{"_id":"68b28d79a176a9beb30d2049","name":"meituan-longcat","fullname":"LongCat","avatar":"https://cdn-uploads.huggingface.co/production/uploads/68a2a29ab9d4c5698e02c747/CDCAx7X7rXDt7xjI-DoxG.png"}},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"6039478ab3ecf716b1a5fd4d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6039478ab3ecf716b1a5fd4d/_Thy4E7taiSYBLKxEKJbT.jpeg","isPro":true,"fullname":"taesiri","user":"taesiri","type":"user"},{"_id":"697c61037e2142d01a928324","avatarUrl":"/avatars/540e8696cb50398e7f4f58624008181b.svg","isPro":false,"fullname":"Li K8s","user":"kubernetes66","type":"user"},{"_id":"6443ae786cea0db46dca8fc3","avatarUrl":"/avatars/8bebd7b3aa3e081cf7bc1246bac57b12.svg","isPro":false,"fullname":"Jackie Zhang","user":"jiaqi61","type":"user"},{"_id":"651581635b041d575d08e549","avatarUrl":"/avatars/45a4d6f8257d1808c7452eefb6b20720.svg","isPro":false,"fullname":"linsen GUO","user":"sooonas","type":"user"},{"_id":"64f814cbdcd7b028c16e0db2","avatarUrl":"/avatars/85d83c54873e7e46e3d49b856500f25d.svg","isPro":false,"fullname":"chao wang","user":"chaowang1139","type":"user"},{"_id":"65dbef7c80bafdfb4b48e99b","avatarUrl":"/avatars/9470bb425b642d938f4ab5cedbf0f8d0.svg","isPro":false,"fullname":"wu","user":"visionw5","type":"user"},{"_id":"697c676514e87c4cc60fc296","avatarUrl":"/avatars/b23904984e2747728306cde954a8ef09.svg","isPro":false,"fullname":"Hong Liu","user":"Redther","type":"user"},{"_id":"68a2a29ab9d4c5698e02c747","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/68a2a29ab9d4c5698e02c747/_Kbeh8VJ1pzao8GWAa6Xj.png","isPro":false,"fullname":"LongCat","user":"LongCat0830","type":"user"},{"_id":"64bef7bc1363b5c799de6d44","avatarUrl":"/avatars/a9947c6d7ca98d7385efa5ee7f2fb9a8.svg","isPro":false,"fullname":"hassenhamdi","user":"hassenhamdi","type":"user"},{"_id":"6978564e7754c316dbdec49f","avatarUrl":"/avatars/e111b1f8b7b5dc4aa538cf4e10c2c7d4.svg","isPro":false,"fullname":"Fengcun Li","user":"RobertLexis","type":"user"},{"_id":"65a11db3dce7f9ec81ffd48a","avatarUrl":"/avatars/049ab3c56dd5d43cf44589def07fa61e.svg","isPro":false,"fullname":"ray young","user":"Sagacity666","type":"user"},{"_id":"62e0a0f727c22962710ca8d0","avatarUrl":"/avatars/db2f61ff5a3393522e092deacbec89f2.svg","isPro":false,"fullname":"han","user":"zhix","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":2,"organization":{"_id":"68b28d79a176a9beb30d2049","name":"meituan-longcat","fullname":"LongCat","avatar":"https://cdn-uploads.huggingface.co/production/uploads/68a2a29ab9d4c5698e02c747/CDCAx7X7rXDt7xjI-DoxG.png"}}">
AI-generated summary

Embedding scaling offers superior sparsity scaling compared to expert scaling in large language models, enabling efficient inference through system optimizations and speculative decoding.
Abstract

While Mixture-of-Experts (MoE) architectures have become the standard for sparsity scaling in large language models, they increasingly face diminishing returns and system-level bottlenecks. In this work, we explore embedding scaling as a potent, orthogonal dimension for scaling sparsity. Through comprehensive analysis and experiments, we identify specific regimes where embedding scaling achieves a superior Pareto frontier compared to expert scaling. We systematically characterize the critical architectural factors governing this efficacy, ranging from parameter budgeting to the interplay with model width and depth. Moreover, by integrating tailored system optimizations and speculative decoding, we effectively convert this sparsity into tangible inference speedups. Guided by these insights, we introduce LongCat-Flash-Lite, a 68.5B-parameter model with ~3B activated parameters, trained from scratch. Despite allocating over 30B parameters to embeddings, LongCat-Flash-Lite not only surpasses parameter-equivalent MoE baselines but also remains highly competitive with existing models of comparable scale, particularly in agentic and coding domains.
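To make the sparsity argument concrete, here is a minimal Python sketch of the two scaling axes the abstract contrasts. It is illustrative only, not LongCat-Flash-Lite's actual architecture; the dimensions (`hidden`, `ffn`), vocabulary sizes, expert counts, and `top_k` below are hypothetical placeholders. The point it shows: enlarging an embedding table grows total parameters while each token still activates only one row, and enlarging an MoE layer's expert pool grows total parameters while top-k routing keeps the activated slice fixed, so both are sparsity axes, with the embedding axis adding no extra per-token FFN compute.

```python
# Illustrative sketch only: not LongCat-Flash-Lite's implementation.
# It contrasts the two sparsity axes named in the abstract: scaling the
# embedding table versus scaling the number of MoE experts. All sizes
# (hidden, ffn, vocab sizes, expert counts, top_k) are hypothetical placeholders.

def embedding_params(vocab_size: int, hidden_dim: int) -> int:
    """Total parameters in a lookup table. Each token reads one row,
    so the activated parameters per token stay at hidden_dim."""
    return vocab_size * hidden_dim

def moe_ffn_params(num_experts: int, hidden_dim: int, ffn_dim: int) -> int:
    """Total parameters across all experts (two projections per expert)."""
    return num_experts * 2 * hidden_dim * ffn_dim

def moe_ffn_activated(top_k: int, hidden_dim: int, ffn_dim: int) -> int:
    """Parameters a single token actually touches: only the routed top-k experts."""
    return top_k * 2 * hidden_dim * ffn_dim

if __name__ == "__main__":
    hidden, ffn = 2048, 8192

    # Embedding axis: 4x the vocabulary rows -> 4x total parameters,
    # while the per-token activated slice is still one row of size `hidden`.
    for vocab in (128_000, 512_000):
        print(f"embedding: total={embedding_params(vocab, hidden):,} "
              f"activated/token={hidden:,}")

    # Expert axis: 4x the experts -> 4x total parameters,
    # while top_k=2 keeps the activated parameters per token fixed.
    for experts in (16, 64):
        print(f"moe ffn:   total={moe_ffn_params(experts, hidden, ffn):,} "
              f"activated/token={moe_ffn_activated(2, hidden, ffn):,}")
```

This is the budget pattern the abstract describes at scale: over 30B of LongCat-Flash-Lite's 68.5B parameters sit in embeddings, yet only about 3B parameters are activated per token.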
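The abstract also credits tailored system optimizations and speculative decoding with converting this sparsity into tangible inference speedups. The toy below sketches the generic greedy draft-and-verify loop behind speculative decoding; it says nothing about the paper's specific integration, and `target`, `draft`, and the stub models in the usage example are hypothetical stand-ins.

```python
# Toy of generic greedy speculative decoding (draft-and-verify). This is not
# the paper's tailored integration; `target` and `draft` are hypothetical
# stand-in callables that return the next token greedily. In a real system the
# verification loop below is a single batched forward pass of the target model,
# not one call per proposed token.
from typing import Callable, List

Token = int
NextTokenModel = Callable[[List[Token]], Token]

def speculative_step(target: NextTokenModel, draft: NextTokenModel,
                     context: List[Token], k: int = 4) -> List[Token]:
    # 1) The cheap draft model proposes k tokens autoregressively.
    proposed: List[Token] = []
    ctx = list(context)
    for _ in range(k):
        t = draft(ctx)
        proposed.append(t)
        ctx.append(t)

    # 2) The large target model verifies the proposals: keep each draft token
    #    that matches the target's own greedy choice; at the first
    #    disagreement, emit the target's token instead and stop.
    accepted: List[Token] = []
    ctx = list(context)
    for t in proposed:
        expected = target(ctx)
        if expected != t:
            accepted.append(expected)
            break
        accepted.append(t)
        ctx.append(t)
    return accepted

if __name__ == "__main__":
    # Hypothetical stand-ins, not real language models.
    target = lambda ctx: (sum(ctx) + 1) % 7
    draft = lambda ctx: (sum(ctx) + 1) % 7 if len(ctx) % 3 else 1  # sometimes wrong
    print(speculative_step(target, draft, [1, 2, 3], k=4))
```

The speedup comes from the draft model being cheap to run while the target model verifies several proposed tokens for roughly the cost of one of its own decoding steps.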