MSign: An Optimizer Preventing Training Instability in Large Language Models via Stable Rank Restoration
\n","updatedAt":"2026-02-09T15:11:07.390Z","author":{"_id":"65243980050781c16f234f1f","avatarUrl":"/avatars/743a009681d5d554c27e04300db9f267.svg","fullname":"Avi","name":"avahal","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":3,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.6721627712249756},"editors":["avahal"],"editorAvatarUrls":["/avatars/743a009681d5d554c27e04300db9f267.svg"],"reactions":[],"isReport":false}},{"id":"698a8d0c9d03ad21c16fa8e1","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false},"createdAt":"2026-02-10T01:42:36.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"This is an automated message from the [Librarian Bot](https://huggingface.co/librarian-bots). I found the following papers similar to this paper. \n\nThe following papers were recommended by the Semantic Scholar API \n\n* [SimpleGPT: Improving GPT via A Simple Normalization Strategy](https://huggingface.co/papers/2602.01212) (2026)\n* [Controlled LLM Training on Spectral Sphere](https://huggingface.co/papers/2601.08393) (2026)\n* [Bounded Hyperbolic Tangent: A Stable and Efficient Alternative to Pre-Layer Normalization in Large Language Models](https://huggingface.co/papers/2601.09719) (2025)\n* [State Rank Dynamics in Linear Attention LLMs](https://huggingface.co/papers/2602.02195) (2026)\n* [SPARKLING: Balancing Signal Preservation and Symmetry Breaking for Width-Progressive Learning](https://huggingface.co/papers/2602.02472) (2026)\n* [TEON: Tensorized Orthonormalization Beyond Layer-Wise Muon for Large Language Model Pre-Training](https://huggingface.co/papers/2601.23261) (2026)\n* [Dynamic Rank Reinforcement Learning for Adaptive Low-Rank Multi-Head Self Attention in Large Language Models](https://huggingface.co/papers/2512.15973) (2025)\n\n\n Please give a thumbs up to this comment if you found it helpful!\n\n If you want recommendations for any Paper on Hugging Face checkout [this](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers) Space\n\n You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`","html":"
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
\n
The following papers were recommended by the Semantic Scholar API
Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out [this Space](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers).

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`
\n","updatedAt":"2026-02-10T01:42:36.078Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.735058069229126},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2602.01734","authors":[{"_id":"6987609dbeecc443208d2375","name":"Lianhai Ren","hidden":false},{"_id":"6987609dbeecc443208d2376","name":"Yucheng Ding","hidden":false},{"_id":"6987609dbeecc443208d2377","user":{"_id":"63fb6e281b4b1bd4e7ffc5be","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/63fb6e281b4b1bd4e7ffc5be/aiRu_bulgnxvEMrjipGoQ.jpeg","isPro":false,"fullname":"Xiao Liu","user":"lx865712528","type":"user"},"name":"Xiao Liu","status":"claimed_verified","statusLastChangedAt":"2026-02-09T08:31:02.000Z","hidden":false},{"_id":"6987609dbeecc443208d2378","name":"Qianxiao Li","hidden":false},{"_id":"6987609dbeecc443208d2379","name":"Peng Cheng","hidden":false},{"_id":"6987609dbeecc443208d237a","name":"Yeyun Gong","hidden":false}],"publishedAt":"2026-02-02T07:18:45.000Z","submittedOnDailyAt":"2026-02-09T02:17:53.967Z","title":"MSign: An Optimizer Preventing Training Instability in Large Language Models via Stable Rank Restoration","submittedOnDailyBy":{"_id":"63fb6e281b4b1bd4e7ffc5be","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/63fb6e281b4b1bd4e7ffc5be/aiRu_bulgnxvEMrjipGoQ.jpeg","isPro":false,"fullname":"Xiao Liu","user":"lx865712528","type":"user"},"summary":"Training instability remains a critical challenge in large language model (LLM) pretraining, often manifesting as sudden gradient explosions that waste significant computational resources. We study training failures in a 5M-parameter NanoGPT model scaled via μP, identifying two key phenomena preceding collapse: (1) rapid decline in weight matrix stable rank (ratio of squared Frobenius norm to squared spectral norm), and (2) increasing alignment between adjacent layer Jacobians. We prove theoretically that these two conditions jointly cause exponential gradient norm growth with network depth. To break this instability mechanism, we propose MSign, a new optimizer that periodically applies matrix sign operations to restore stable rank. 
Experiments on models from 5M to 3B parameters demonstrate that MSign effectively prevents training failures with a computational overhead of less than 7.0%.","upvotes":32,"discussionId":"6987609dbeecc443208d237b","ai_summary":"Training instability in large language models is linked to weight matrix stable rank decline and Jacobian alignment, which MSign addresses through matrix sign operations to prevent gradient explosions.","ai_keywords":["large language model","pretraining","gradient explosions","weight matrix stable rank","Frobenius norm","spectral norm","Jacobian","matrix sign operations","optimizer"],"organization":{"_id":"5e6485f787403103f9f1055e","name":"microsoft","fullname":"Microsoft","avatar":"https://cdn-uploads.huggingface.co/production/uploads/1583646260758-5e64858c87403103f9f1055d.png"}},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"63fb6e281b4b1bd4e7ffc5be","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/63fb6e281b4b1bd4e7ffc5be/aiRu_bulgnxvEMrjipGoQ.jpeg","isPro":false,"fullname":"Xiao Liu","user":"lx865712528","type":"user"},{"_id":"6560763e152b659e623865ae","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6560763e152b659e623865ae/cTT2jGnPU_8XMrUTvqZ2h.jpeg","isPro":false,"fullname":"Xiao Liang","user":"MasterVito","type":"user"},{"_id":"65eaf755ab0a6a90da55ab58","avatarUrl":"/avatars/a46890a9d067a913513edf3759f12c85.svg","isPro":false,"fullname":"Cunxiang Wang","user":"wangcunxiang","type":"user"},{"_id":"67e95c8d2b124840d0cb8d7f","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/UCt1iaTveaIXA-NBEqX3A.png","isPro":false,"fullname":"shawnxzhu","user":"shawnxzhu","type":"user"},{"_id":"63b6af3accebeadccc868efd","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/63b6af3accebeadccc868efd/cFTHKggMpsoaPe_46gcy9.webp","isPro":false,"fullname":"Zhijiang","user":"Zeee","type":"user"},{"_id":"642b9861bb77f8456634b048","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/642b9861bb77f8456634b048/VrNmmcdgX7FufQmdP5YaG.jpeg","isPro":false,"fullname":"Zichen Ding","user":"heroding77","type":"user"},{"_id":"61669c456916c52acd5a1aa3","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/61669c456916c52acd5a1aa3/HnZTwRaXgTeTG3ljO3ITb.jpeg","isPro":false,"fullname":"jianbo dai","user":"jbd","type":"user"},{"_id":"6064a0eeb1703ddba0d458b9","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1617207525789-noauth.png","isPro":false,"fullname":"Qiushi","user":"QiushiSun","type":"user"},{"_id":"64b7cd74ff6d81ae297feded","avatarUrl":"/avatars/880fbc96cc093f5e901ce84f32a1d21d.svg","isPro":false,"fullname":"ZHANG HAO","user":"26hzhang","type":"user"},{"_id":"63776f1806241efce1e7aae6","avatarUrl":"/avatars/d67d9dcd932934c630f407ac152f2ce6.svg","isPro":false,"fullname":"Zhenghao Lin","user":"Lin0","type":"user"},{"_id":"6379cc5b736e2989332641eb","avatarUrl":"/avatars/acd97953329a30a27855bb79d3576312.svg","isPro":false,"fullname":"Yu Xia","user":"cheesewafer","type":"user"},{"_id":"679864bcdcd96345105b0e35","avatarUrl":"/avatars/4139099afea29c222297b7228c1511bd.svg","isPro":false,"fullname":"REN Lianhai","user":"hajiao","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0,"organization":{"_id":"5e6485f787403103f9f1055e","name":"microsoft","fullname":"Microsoft","avatar":"https://cdn-uploads.huggingface.co/production/uploads/1583646260758-5e64858c87403103f9f1055d.png"}}">
AI-generated summary

Training instability in large language models is linked to weight matrix stable rank decline and Jacobian alignment, which MSign addresses through matrix sign operations to prevent gradient explosions.

Abstract
Training instability remains a critical challenge in large language model (LLM) pretraining, often manifesting as sudden gradient explosions that waste significant computational resources. We study training failures in a 5M-parameter NanoGPT model scaled via μP, identifying two key phenomena preceding collapse: (1) rapid decline in weight matrix stable rank (ratio of squared Frobenius norm to squared spectral norm), and (2) increasing alignment between adjacent layer Jacobians. We prove theoretically that these two conditions jointly cause exponential gradient norm growth with network depth. To break this instability mechanism, we propose MSign, a new optimizer that periodically applies matrix sign operations to restore stable rank. Experiments on models from 5M to 3B parameters demonstrate that MSign effectively prevents training failures with a computational overhead of less than 7.0%.
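Since only the abstract is available on this page, the following is a minimal sketch, not the paper's MSign implementation: it illustrates the stable rank quantity the abstract defines (squared Frobenius norm over squared spectral norm) and a matrix-sign-style operation (the orthogonal polar factor from an SVD), which flattens the singular value spectrum and therefore raises stable rank. The blending coefficient `alpha`, the norm-matching step, and the function names are illustrative assumptions, not details from the paper.

```python
# Illustrative sketch only: shows stable rank and a matrix-sign (polar factor)
# operation; the actual MSign update rule is not specified in the abstract.
import torch


def stable_rank(W: torch.Tensor) -> torch.Tensor:
    """Stable rank = squared Frobenius norm / squared spectral norm."""
    fro_sq = W.pow(2).sum()
    spec = torch.linalg.matrix_norm(W, ord=2)  # largest singular value
    return fro_sq / spec.pow(2)


def matrix_sign(W: torch.Tensor) -> torch.Tensor:
    """Orthogonal polar factor U @ V^T: every singular value becomes 1,
    which maximizes the stable rank for a matrix of this shape."""
    U, _, Vh = torch.linalg.svd(W, full_matrices=False)
    return U @ Vh


def restore_stable_rank(W: torch.Tensor, alpha: float = 0.1) -> torch.Tensor:
    """Hypothetical restoration step: blend W toward its norm-matched sign
    factor. `alpha` is an assumed interpolation strength, not a paper value."""
    S = matrix_sign(W)
    S = S * (W.norm() / S.norm())  # keep the Frobenius norm unchanged
    return (1 - alpha) * W + alpha * S


if __name__ == "__main__":
    W = torch.randn(256, 512)
    # Artificially collapse the spectrum to mimic stable rank decline.
    W_low = W @ torch.diag(torch.logspace(0, -4, 512))
    print("stable rank before:", stable_rank(W_low).item())
    print("stable rank after: ", stable_rank(restore_stable_rank(W_low, alpha=0.5)).item())
```

In practice such a correction would presumably be applied only periodically, as the abstract indicates, but which matrices it targets and on what schedule is not stated here.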