Learnable Multipliers: Freeing the Scale of Language Model Matrix Layers
\n","updatedAt":"2026-01-17T19:46:25.470Z","author":{"_id":"65243980050781c16f234f1f","avatarUrl":"/avatars/743a009681d5d554c27e04300db9f267.svg","fullname":"Avi","name":"avahal","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":3,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.6883251667022705},"editors":["avahal"],"editorAvatarUrls":["/avatars/743a009681d5d554c27e04300db9f267.svg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2601.04890","authors":[{"_id":"69608e7c5b7998385e639583","user":{"_id":"64670db15993aa7666cc6022","avatarUrl":"/avatars/b68caad7e987c095b0cab4d9035aac25.svg","isPro":false,"fullname":"Maksim Velikanov","user":"yellowvm","type":"user"},"name":"Maksim Velikanov","status":"admin_assigned","statusLastChangedAt":"2026-01-09T15:49:44.975Z","hidden":false},{"_id":"69608e7c5b7998385e639584","user":{"_id":"6697a9fb6d173ec7382e0392","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6697a9fb6d173ec7382e0392/Q_myrIBbdWI3RmEvEHrtQ.jpeg","isPro":false,"fullname":"Ilyas Chahed","user":"IChahed","type":"user"},"name":"Ilyas Chahed","status":"claimed_verified","statusLastChangedAt":"2026-01-09T15:46:04.314Z","hidden":false},{"_id":"69608e7c5b7998385e639585","user":{"_id":"6460c3811db65f878513bcaf","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6460c3811db65f878513bcaf/CRdJ8lXixDku3k8Rm5Stn.jpeg","isPro":false,"fullname":"Jingwei Zuo","user":"JingweiZuo","type":"user"},"name":"Jingwei Zuo","status":"claimed_verified","statusLastChangedAt":"2026-01-09T08:34:49.874Z","hidden":false},{"_id":"69608e7c5b7998385e639586","name":"Dhia Eddine Rhaiem","hidden":false},{"_id":"69608e7c5b7998385e639587","user":{"_id":"62441d1d9fdefb55a0b7d12c","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1648631057413-noauth.png","isPro":false,"fullname":"Younes B","user":"ybelkada","type":"user"},"name":"Younes Belkada","status":"claimed_verified","statusLastChangedAt":"2026-01-09T08:34:47.914Z","hidden":false},{"_id":"69608e7c5b7998385e639588","user":{"_id":"6471d727a2b0a376b8b6a4ed","avatarUrl":"/avatars/aedda547f2ca40dfa898e76be787952f.svg","isPro":false,"fullname":"Hakim Hacid","user":"HakimHacid","type":"user"},"name":"Hakim Hacid","status":"admin_assigned","statusLastChangedAt":"2026-01-09T15:49:56.651Z","hidden":false}],"publishedAt":"2026-01-08T12:41:49.000Z","submittedOnDailyAt":"2026-01-09T02:55:50.938Z","title":"Learnable Multipliers: Freeing the Scale of Language Model Matrix Layers","submittedOnDailyBy":{"_id":"6460c3811db65f878513bcaf","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6460c3811db65f878513bcaf/CRdJ8lXixDku3k8Rm5Stn.jpeg","isPro":false,"fullname":"Jingwei Zuo","user":"JingweiZuo","type":"user"},"summary":"Applying weight decay (WD) to matrix layers is standard practice in large-language-model pretraining. Prior work suggests that stochastic gradient noise induces a Brownian-like expansion of the weight matrices W, whose growth is counteracted by WD, leading to a WD-noise equilibrium with a certain weight norm ||W||. In this work, we view the equilibrium norm as a harmful artifact of the training procedure, and address it by introducing learnable multipliers to learn the optimal scale. First, we attach a learnable scalar multiplier to W and confirm that the WD-noise equilibrium norm is suboptimal: the learned scale adapts to data and improves performance. 
We then argue that individual row and column norms are similarly constrained, and free their scale by introducing learnable per-row and per-column multipliers. Our method can be viewed as a learnable, more expressive generalization of muP multipliers. It outperforms a well-tuned muP baseline, reduces the computational overhead of multiplier tuning, and surfaces practical questions such as forward-pass symmetries and the width-scaling of the learned multipliers. Finally, we validate learnable multipliers with both Adam and Muon optimizers, where it shows improvement in downstream evaluations matching the improvement of the switching from Adam to Muon.","upvotes":42,"discussionId":"69608e7c5b7998385e639589","projectPage":"https://tiiuae.github.io/Falcon-H1/","ai_summary":"Learnable multipliers are introduced to address weight decay-induced normalization artifacts in large language model training, outperforming traditional methods while reducing computational overhead.","ai_keywords":["weight decay","stochastic gradient noise","Brownian-like expansion","WD-noise equilibrium","learnable multipliers","matrix layers","weight norm","muP multipliers","Adam optimizer","Muon optimizer"],"organization":{"_id":"6448cad23adf50d86406b0a3","name":"tiiuae","fullname":"Technology Innovation Institute","avatar":"https://cdn-uploads.huggingface.co/production/uploads/61a8d1aac664736898ffc84f/AT6cAB5ZNwCcqFMal71WD.jpeg"}},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"6460c3811db65f878513bcaf","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6460c3811db65f878513bcaf/CRdJ8lXixDku3k8Rm5Stn.jpeg","isPro":false,"fullname":"Jingwei Zuo","user":"JingweiZuo","type":"user"},{"_id":"62441d1d9fdefb55a0b7d12c","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1648631057413-noauth.png","isPro":false,"fullname":"Younes B","user":"ybelkada","type":"user"},{"_id":"620783f24e28382272337ba4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/620783f24e28382272337ba4/zkUveQPNiDfYjgGhuFErj.jpeg","isPro":false,"fullname":"GuoLiangTang","user":"Tommy930","type":"user"},{"_id":"62cd4b03c5cc157be82f0b56","avatarUrl":"/avatars/351e963c1c763d507ae78cbcd62966a3.svg","isPro":false,"fullname":"Abhay kumar","user":"akanyaani","type":"user"},{"_id":"6697a9fb6d173ec7382e0392","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6697a9fb6d173ec7382e0392/Q_myrIBbdWI3RmEvEHrtQ.jpeg","isPro":false,"fullname":"Ilyas Chahed","user":"IChahed","type":"user"},{"_id":"664c3101d1ba9237d34ae972","avatarUrl":"/avatars/a830c0ee4c95571419770f1ffb41ef11.svg","isPro":false,"fullname":"BrahimFarhat","user":"ifarhatTII","type":"user"},{"_id":"6770c739a32d6abf514a0684","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6770c739a32d6abf514a0684/BzPMY9ddivZSN73JbBppQ.jpeg","isPro":false,"fullname":"Suhail M Shah","user":"SMSHAH","type":"user"},{"_id":"664ddb00da286d1a60c298fc","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/664ddb00da286d1a60c298fc/VchH_2fFjZxnBD_Ht10ED.jpeg","isPro":false,"fullname":"Rhaiem","user":"DhiyaEddine","type":"user"},{"_id":"62fe441427c98b09b503a4e3","avatarUrl":"/avatars/d908a3842a49283f7e8de11017a6a17b.svg","isPro":false,"fullname":"wamiq 
para","user":"wamreyaz","type":"user"},{"_id":"66db041dd04c920ebf3198ff","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/66db041dd04c920ebf3198ff/yfkeoBpp6ztKi27GglP1z.jpeg","isPro":false,"fullname":"Slim Frikha","user":"slimfrikha","type":"user"},{"_id":"6761588916e385fe51781d9a","avatarUrl":"/avatars/5d43402788afb5da9904f008d5fb29bb.svg","isPro":false,"fullname":"Saarah Abdulla","user":"saarah-a","type":"user"},{"_id":"68383f909deef11aa6f2b5ca","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/68383f909deef11aa6f2b5ca/3qGaWwA5plcQTyix_ZUvT.jpeg","isPro":false,"fullname":"Pasquale","user":"pbalsebre","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":3,"organization":{"_id":"6448cad23adf50d86406b0a3","name":"tiiuae","fullname":"Technology Innovation Institute","avatar":"https://cdn-uploads.huggingface.co/production/uploads/61a8d1aac664736898ffc84f/AT6cAB5ZNwCcqFMal71WD.jpeg"}}">
AI-generated summary

Learnable multipliers are introduced to address weight decay-induced normalization artifacts in large language model training, outperforming traditional methods while reducing computational overhead.
Abstract

Applying weight decay (WD) to matrix layers is standard practice in large-language-model pretraining. Prior work suggests that stochastic gradient noise induces a Brownian-like expansion of the weight matrices W, whose growth is counteracted by WD, leading to a WD-noise equilibrium with a certain weight norm ||W||. In this work, we view the equilibrium norm as a harmful artifact of the training procedure and address it by introducing learnable multipliers that learn the optimal scale. First, we attach a learnable scalar multiplier to W and confirm that the WD-noise equilibrium norm is suboptimal: the learned scale adapts to the data and improves performance. We then argue that individual row and column norms are similarly constrained, and free their scale by introducing learnable per-row and per-column multipliers. Our method can be viewed as a learnable, more expressive generalization of μP multipliers. It outperforms a well-tuned μP baseline, reduces the computational overhead of multiplier tuning, and surfaces practical questions such as forward-pass symmetries and the width-scaling of the learned multipliers. Finally, we validate learnable multipliers with both Adam and Muon optimizers, where they yield downstream-evaluation improvements comparable to the gain from switching from Adam to Muon.
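To make the construction concrete, below is a minimal PyTorch-style sketch of a bias-free linear layer carrying a learnable scalar multiplier together with learnable per-row and per-column multipliers. The class name `MultiplierLinear`, the 1/sqrt(fan-in) weight initialization, and the multipliers being initialized to 1 are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn


class MultiplierLinear(nn.Module):
    """Bias-free linear layer with learnable matrix-, row-, and column-wise multipliers.

    A minimal sketch of the idea described in the abstract; parameterization,
    initialization, and the optimizer/weight-decay treatment of the multipliers
    are assumptions here and may differ from the paper.
    """

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        # Base weight matrix W, trained with weight decay as usual.
        self.weight = nn.Parameter(
            torch.randn(out_features, in_features) / in_features ** 0.5
        )
        # Learnable multipliers that free the overall, per-row, and per-column
        # scales from the WD-noise equilibrium norm.
        self.scalar = nn.Parameter(torch.ones(()))                # matrix-wise scale
        self.row_scale = nn.Parameter(torch.ones(out_features))   # per-output-row scale
        self.col_scale = nn.Parameter(torch.ones(in_features))    # per-input-column scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Effective weight: scalar * diag(row_scale) @ W @ diag(col_scale)
        w_eff = self.scalar * self.row_scale[:, None] * self.weight * self.col_scale[None, :]
        return x @ w_eff.t()


# Usage: a drop-in replacement for a bias-free nn.Linear, e.g. inside an MLP block.
layer = MultiplierLinear(in_features=1024, out_features=4096)
y = layer(torch.randn(8, 1024))  # -> shape (8, 4096)
```

How the optimizer and weight decay should treat these extra parameters, and how their values should scale with width, are among the practical questions the abstract says the method surfaces; the sketch above deliberately leaves those choices open.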
Comments

Jingwei Zuo: Building on the μP multipliers applied in Falcon-H1 pretraining (https://huggingface.co/papers/2507.22448), this work extends the idea to learnable matrix-, row-, and column-wise scaling. We show that the weight-norm equilibrium induced by weight decay and gradient noise is suboptimal, and that freeing these scale constraints yields consistent gains, generalizes μP, and improves downstream performance with both Adam and Muon optimizers.

Librarian Bot: This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* [Controlling changes to attention logits](https://huggingface.co/papers/2511.21377) (2025)
* [Understanding the Mechanisms of Fast Hyperparameter Transfer](https://huggingface.co/papers/2512.22768) (2025)
* [The Unseen Bias: How Norm Discrepancy in Pre-Norm MLLMs Leads to Visual Information Loss](https://huggingface.co/papers/2512.08374) (2025)
* [Data-Free Pruning of Self-Attention Layers in LLMs](https://huggingface.co/papers/2512.20636) (2025)
* [Scaling Behavior of Discrete Diffusion Language Models](https://huggingface.co/papers/2512.10858) (2025)
* [Extending the Context of Pretrained LLMs by Dropping Their Positional Embeddings](https://huggingface.co/papers/2512.12167) (2025)
* [Correction of Decoupled Weight Decay](https://huggingface.co/papers/2512.08217) (2025)

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face, check out [this Space](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers).

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`

Avi (avahal): arXivlens breakdown of this paper 👉 https://arxivlens.com/PaperView/Details/learnable-multipliers-freeing-the-scale-of-language-model-matrix-layers-4763-7a709001

- Executive Summary
- Detailed Breakdown
- Practical Applications