Variance Control via Weight Rescaling in LLM Pre-training

Louis Owen, Abhay Kumar, Nilabhra Roy Chowdhury, Fabian Güra
Published on Mar 21, 2025
Abstract
The Layer Index Rescaling (LIR) and Target Variance Rescaling (TVR) techniques improve variance management during LLM pre-training, leading to better performance and mitigating quantization and low-precision training challenges.
The outcome of Large Language Model (LLM) pre-training strongly depends on weight initialization and variance control strategies. Although the importance of initial variance control has been well documented in neural networks in general, the literature on initialization and management of its growth during LLM pre-training, specifically, is somewhat sparse. In this paper, we introduce the Layer Index Rescaling (LIR) weight initialization scheme, and the Target Variance Rescaling (TVR) variance control strategy. Experiments on a 1B parameter LLaMA model demonstrate that better variance management using these techniques yields substantial improvements in downstream task performance (up to 4.6% on common pre-training benchmarks) and reduces extreme activation values, thus mitigating challenges associated with quantization and low-precision training. Our code is available at: https://github.com/bluorion-com/weight_rescaling.
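The abstract names the two techniques but does not spell out their mechanics, so below is a minimal, hypothetical PyTorch sketch of what a layer-index-scaled initialization and a target-variance rescaling step could look like. The 1/sqrt(layer_index) scaling rule, the 0.02 base standard deviation, the restriction to nn.Linear weights, and the periodic application during training are all assumptions made for illustration, not formulas confirmed by the abstract; the authors' actual implementation lives in the linked repository.

```python
# Hypothetical sketch of the two techniques named in the abstract.
# The exact rules used in the paper may differ; see the linked repository
# for the authors' implementation.

import math
import torch
import torch.nn as nn


def lir_init_(linear: nn.Linear, layer_index: int, base_std: float = 0.02) -> None:
    """Layer Index Rescaling (LIR)-style init: shrink the init std of deeper
    layers by a factor growing with the layer index (assumed 1/sqrt(index))."""
    std = base_std / math.sqrt(layer_index)  # layer_index counted from 1
    nn.init.normal_(linear.weight, mean=0.0, std=std)
    if linear.bias is not None:
        nn.init.zeros_(linear.bias)


@torch.no_grad()
def tvr_rescale_(model: nn.Module, target_std: float = 0.02) -> None:
    """Target Variance Rescaling (TVR)-style control: rescale each weight
    matrix so its empirical std matches a fixed target (assumed to be applied
    every N optimizer steps during pre-training)."""
    for module in model.modules():
        if isinstance(module, nn.Linear):
            current_std = module.weight.std()
            if current_std > 0:
                module.weight.mul_(target_std / current_std)
```

In a pre-training loop one would presumably call lir_init_ once per transformer block at construction time and tvr_rescale_ at some fixed interval of optimizer steps; how often to rescale, and whether embeddings and normalization layers are included, is left open in this sketch.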