MiniCPM-SALA: Hybridizing Sparse and Linear Attention for Efficient Long-Context Modeling
\n","updatedAt":"2026-02-14T01:40:34.670Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.7098859548568726},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2602.11761","authors":[{"_id":"698e93e8cace060ff123acbe","name":"MiniCPM Team","hidden":false},{"_id":"698e93e8cace060ff123acbf","name":"Wenhao An","hidden":false},{"_id":"698e93e8cace060ff123acc0","name":"Yingfa Chen","hidden":false},{"_id":"698e93e8cace060ff123acc1","name":"Yewei Fang","hidden":false},{"_id":"698e93e8cace060ff123acc2","name":"Jiayi Li","hidden":false},{"_id":"698e93e8cace060ff123acc3","name":"Xin Li","hidden":false},{"_id":"698e93e8cace060ff123acc4","name":"Yaohui Li","hidden":false},{"_id":"698e93e8cace060ff123acc5","name":"Yishan Li","hidden":false},{"_id":"698e93e8cace060ff123acc6","name":"Yuxuan Li","hidden":false},{"_id":"698e93e8cace060ff123acc7","name":"Biyuan Lin","hidden":false},{"_id":"698e93e8cace060ff123acc8","name":"Chuan Liu","hidden":false},{"_id":"698e93e8cace060ff123acc9","name":"Hezi Liu","hidden":false},{"_id":"698e93e8cace060ff123acca","name":"Siyuan Liu","hidden":false},{"_id":"698e93e8cace060ff123accb","name":"Hongya Lyu","hidden":false},{"_id":"698e93e8cace060ff123accc","name":"Yinxu Pan","hidden":false},{"_id":"698e93e8cace060ff123accd","name":"Shixin Ren","hidden":false},{"_id":"698e93e8cace060ff123acce","name":"Xingyu Shen","hidden":false},{"_id":"698e93e8cace060ff123accf","name":"Zhou Su","hidden":false},{"_id":"698e93e8cace060ff123acd0","name":"Haojun Sun","hidden":false},{"_id":"698e93e8cace060ff123acd1","name":"Yangang Sun","hidden":false},{"_id":"698e93e8cace060ff123acd2","name":"Zhen Leng Thai","hidden":false},{"_id":"698e93e8cace060ff123acd3","name":"Xin Tian","hidden":false},{"_id":"698e93e8cace060ff123acd4","name":"Rui Wang","hidden":false},{"_id":"698e93e8cace060ff123acd5","name":"Xiaorong Wang","hidden":false},{"_id":"698e93e8cace060ff123acd6","name":"Yudong Wang","hidden":false},{"_id":"698e93e8cace060ff123acd7","name":"Bo Wu","hidden":false},{"_id":"698e93e8cace060ff123acd8","name":"Xiaoyue Xu","hidden":false},{"_id":"698e93e8cace060ff123acd9","name":"Dong Xu","hidden":false},{"_id":"698e93e8cace060ff123acda","name":"Shuaikang Xue","hidden":false},{"_id":"698e93e8cace060ff123acdb","name":"Jiawei Yang","hidden":false},{"_id":"698e93e8cace060ff123acdc","name":"Bowen Zhang","hidden":false},{"_id":"698e93e8cace060ff123acdd","name":"Jinqian Zhang","hidden":false},{"_id":"698e93e8cace060ff123acde","name":"Letian Zhang","hidden":false},{"_id":"698e93e8cace060ff123acdf","name":"Shengnan Zhang","hidden":false},{"_id":"698e93e8cace060ff123ace0","name":"Xinyu Zhang","hidden":false},{"_id":"698e93e8cace060ff123ace1","name":"Xinyuan Zhang","hidden":false},{"_id":"698e93e8cace060ff123ace2","name":"Zhu Zhang","hidden":false},{"_id":"698e93e8cace060ff123ace3","name":"Hengyu Zhao","hidden":false},{"_id":"698e93e8cace060ff123ace4","name":"Jiacheng Zhao","hidden":false},{"_id":"698e93e8cace060ff123ace5","name":"Jie 
Zhou","hidden":false},{"_id":"698e93e8cace060ff123ace6","name":"Zihan Zhou","hidden":false},{"_id":"698e93e8cace060ff123ace7","name":"Shuo Wang","hidden":false},{"_id":"698e93e8cace060ff123ace8","name":"Chaojun Xiao","hidden":false},{"_id":"698e93e8cace060ff123ace9","name":"Xu Han","hidden":false},{"_id":"698e93e8cace060ff123acea","name":"Zhiyuan Liu","hidden":false},{"_id":"698e93e8cace060ff123aceb","name":"Maosong Sun","hidden":false}],"publishedAt":"2026-02-12T09:37:05.000Z","submittedOnDailyAt":"2026-02-13T00:30:57.943Z","title":"MiniCPM-SALA: Hybridizing Sparse and Linear Attention for Efficient Long-Context Modeling","submittedOnDailyBy":{"_id":"6039478ab3ecf716b1a5fd4d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6039478ab3ecf716b1a5fd4d/_Thy4E7taiSYBLKxEKJbT.jpeg","isPro":true,"fullname":"taesiri","user":"taesiri","type":"user"},"summary":"The evolution of large language models (LLMs) towards applications with ultra-long contexts faces challenges posed by the high computational and memory costs of the Transformer architecture. While existing sparse and linear attention mechanisms attempt to mitigate these issues, they typically involve a trade-off between memory efficiency and model performance. This paper introduces MiniCPM-SALA, a 9B-parameter hybrid architecture that integrates the high-fidelity long-context modeling of sparse attention (InfLLM-V2) with the global efficiency of linear attention (Lightning Attention). By employing a layer selection algorithm to integrate these mechanisms in a 1:3 ratio and utilizing a hybrid positional encoding (HyPE), the model maintains efficiency and performance for long-context tasks. Furthermore, we introduce a cost-effective continual training framework that transforms pre-trained Transformer-based models into hybrid models, which reduces training costs by approximately 75% compared to training from scratch. Extensive experiments show that MiniCPM-SALA maintains general capabilities comparable to full-attention models while offering improved efficiency. 
On a single NVIDIA A6000D GPU, the model achieves up to 3.5x the inference speed of the full-attention model at the sequence length of 256K tokens and supports context lengths of up to 1M tokens, a scale where traditional full-attention 8B models fail because of memory constraints.","upvotes":6,"discussionId":"698e93e9cace060ff123acec","ai_summary":"MiniCPM-SALA combines sparse and linear attention mechanisms in a hybrid architecture to enable efficient processing of ultra-long contexts while maintaining model performance and reducing training costs.","ai_keywords":["large language models","Transformer architecture","sparse attention","linear attention","hybrid architecture","layer selection algorithm","hybrid positional encoding","continual training framework","inference speed","sequence length","token context"],"organization":{"_id":"633fe81429b5a95f6e16e34a","name":"openbmb","fullname":"OpenBMB","avatar":"https://cdn-uploads.huggingface.co/production/uploads/1670387859384-633fe7784b362488336bbfad.png"}},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"64f5abc2e8f27f20a067a596","avatarUrl":"/avatars/d0eac39488fac0c9c08d76109cabaa9f.svg","isPro":false,"fullname":"cwt","user":"yiye2023","type":"user"},{"_id":"620783f24e28382272337ba4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/620783f24e28382272337ba4/zkUveQPNiDfYjgGhuFErj.jpeg","isPro":false,"fullname":"GuoLiangTang","user":"Tommy930","type":"user"},{"_id":"69555842ca93f97f4129bf30","avatarUrl":"/avatars/5b0bb5a430d2487f91ded60d66f9c069.svg","isPro":false,"fullname":"Wael Antar","user":"Wael-Antar","type":"user"},{"_id":"66ea72fc1cbdd141c287ef22","avatarUrl":"/avatars/fe7f3281f7aade3045787fbb56f086c6.svg","isPro":false,"fullname":"BoyceYi","user":"DeadFishhh","type":"user"},{"_id":"6270324ebecab9e2dcf245de","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6270324ebecab9e2dcf245de/cMbtWSasyNlYc9hvsEEzt.jpeg","isPro":false,"fullname":"Kye Gomez","user":"kye","type":"user"},{"_id":"63c1699e40a26dd2db32400d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/63c1699e40a26dd2db32400d/3N0-Zp8igv8-52mXAdiiq.jpeg","isPro":false,"fullname":"Chroma","user":"Chroma111","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0,"organization":{"_id":"633fe81429b5a95f6e16e34a","name":"openbmb","fullname":"OpenBMB","avatar":"https://cdn-uploads.huggingface.co/production/uploads/1670387859384-633fe7784b362488336bbfad.png"}}">
AI-generated summary
MiniCPM-SALA combines sparse and linear attention mechanisms in a hybrid architecture to enable efficient processing of ultra-long contexts while maintaining model performance and reducing training costs.
Abstract
The evolution of large language models (LLMs) towards applications with ultra-long contexts faces challenges posed by the high computational and memory costs of the Transformer architecture. While existing sparse and linear attention mechanisms attempt to mitigate these issues, they typically involve a trade-off between memory efficiency and model performance. This paper introduces MiniCPM-SALA, a 9B-parameter hybrid architecture that integrates the high-fidelity long-context modeling of sparse attention (InfLLM-V2) with the global efficiency of linear attention (Lightning Attention). By employing a layer selection algorithm to integrate these mechanisms in a 1:3 ratio and utilizing a hybrid positional encoding (HyPE), the model maintains efficiency and performance for long-context tasks. Furthermore, we introduce a cost-effective continual training framework that transforms pre-trained Transformer-based models into hybrid models, reducing training costs by approximately 75% compared to training from scratch. Extensive experiments show that MiniCPM-SALA maintains general capabilities comparable to full-attention models while offering improved efficiency. On a single NVIDIA A6000D GPU, the model achieves up to 3.5x the inference speed of the full-attention model at a sequence length of 256K tokens and supports context lengths of up to 1M tokens, a scale at which traditional full-attention 8B models fail because of memory constraints.
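The abstract describes a 1:3 mix of sparse-attention and linear-attention layers chosen by a layer selection algorithm. As a rough illustration only, the sketch below lays out a fixed 1:3 interleaving over a hypothetical 32-layer stack; the function name, the layer count, and the assumption that sparse layers are the minority (one per group of four) are a reading of the stated ratio, not details taken from the paper, whose actual layer placement comes from its selection procedure rather than a fixed pattern.

```python
# Illustrative sketch of a 1:3 sparse/linear layer layout (hypothetical; the
# paper uses a dedicated layer-selection algorithm, not a fixed interleaving).

def hybrid_layer_plan(num_layers: int, group_size: int = 4) -> list[str]:
    """Mark one layer per group of `group_size` as sparse attention, the rest as linear."""
    plan = []
    for i in range(num_layers):
        if i % group_size == group_size - 1:
            plan.append("sparse")   # e.g. an InfLLM-V2-style sparse-attention layer
        else:
            plan.append("linear")   # e.g. a Lightning-Attention-style linear layer
    return plan

if __name__ == "__main__":
    plan = hybrid_layer_plan(32)
    print(plan)
    # 8 sparse vs. 24 linear layers -> the 1:3 ratio quoted in the abstract
    print("sparse:", plan.count("sparse"), "linear:", plan.count("linear"))
```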
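The final claim, that full-attention 8B-class models run out of memory near 1M tokens, follows from KV-cache growth, which is linear in sequence length for full attention. The back-of-the-envelope calculation below uses hypothetical but typical 8B-scale settings (32 layers, 8 grouped-query KV heads, head dimension 128, fp16); these values are illustrative assumptions, not the paper's configurations.

```python
# Rough KV-cache sizing for a full-attention decoder (illustrative settings only,
# not MiniCPM-SALA's or any baseline's actual configuration).

def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to cache keys and values for one sequence (factor 2 = K and V)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

if __name__ == "__main__":
    gib = 1024 ** 3
    for seq_len in (256_000, 1_000_000):
        size = kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128)
        print(f"{seq_len:>9,} tokens -> {size / gib:6.1f} GiB of fp16 KV cache")
    # Roughly 31 GiB at 256K and 122 GiB at 1M tokens, before counting model
    # weights, which illustrates why a single-GPU full-attention baseline
    # breaks down at 1M tokens in the abstract's comparison.
```

By contrast, linear-attention layers keep a fixed-size recurrent state regardless of sequence length, and sparse-attention layers attend only to selected blocks, which is the efficiency argument the abstract makes for the hybrid design.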