Papers
arxiv:2503.17500

Variance Control via Weight Rescaling in LLM Pre-training

Published on Mar 21, 2025 · Submitted by Louis Owen on Mar 25, 2025

Abstract

The Layer Index Rescaling (LIR) and Target Variance Rescaling (TVR) techniques improve variance management during LLM pre-training, leading to better performance and mitigating quantization and low-precision training challenges.

AI-generated summary

The outcome of Large Language Model (LLM) pre-training strongly depends on weight initialization and variance control strategies. Although the importance of initial variance control has been well documented in neural networks in general, the literature on initialization and management of its growth during LLM pre-training, specifically, is somewhat sparse. In this paper, we introduce the Layer Index Rescaling (LIR) weight initialization scheme, and the Target Variance Rescaling (TVR) variance control strategy. Experiments on a 1B parameter LLaMA model demonstrate that better variance management using these techniques yields substantial improvements in downstream task performance (up to 4.6% on common pre-training benchmarks) and reduces extreme activation values, thus mitigating challenges associated with quantization and low-precision training. Our code is available at: https://github.com/bluorion-com/weight_rescaling.
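For intuition, here is a minimal, hypothetical sketch of what a layer-index-dependent weight initialization could look like in PyTorch. The 1/sqrt(layer_index) shrinkage used below is an assumption made for illustration only; the exact LIR rule is the one defined in the paper and in https://github.com/bluorion-com/weight_rescaling.

```python
import math
import torch

def init_linear_lir(weight: torch.Tensor, layer_index: int, base_std: float = 0.02) -> None:
    """Hypothetical layer-index-rescaled init: deeper layers get a smaller std.

    The 1/sqrt(layer_index) factor is an illustrative assumption, not the
    paper's exact LIR formula; see the authors' repository for the real rule.
    """
    # Assumed depth-dependent rescaling; layer_index is taken to start at 1.
    std = base_std / math.sqrt(layer_index)
    torch.nn.init.normal_(weight, mean=0.0, std=std)
```

In this reading, the initialization variance is made a function of a layer's position in the stack so that deeper layers start with smaller weights, which is one common way to keep the variance of residual-stream activations from growing with depth.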

Community

Paper author · Paper submitter • edited Mar 25, 2025

🚀 Controlling Weight Variance for Better LLM Performance 🚀

We trained over ๐Ÿฐ๐Ÿฌ ๐—ผ๐—ป๐—ฒ-๐—ฏ๐—ถ๐—น๐—น๐—ถ๐—ผ๐—ป-๐—ฝ๐—ฎ๐—ฟ๐—ฎ๐—บ๐—ฒ๐˜๐—ฒ๐—ฟ ๐—Ÿ๐—Ÿ๐—ฎ๐— ๐—” ๐—บ๐—ผ๐—ฑ๐—ฒ๐—น๐˜€ ๐—ณ๐—ผ๐—ฟ ๐Ÿญ๐Ÿฌ๐Ÿฌ ๐—•๐—ถ๐—น๐—น๐—ถ๐—ผ๐—ป ๐—ง๐—ผ๐—ธ๐—ฒ๐—ป๐˜€ and discovered that ๐—ฐ๐—ผ๐—ป๐˜๐—ฟ๐—ผ๐—น๐—น๐—ถ๐—ป๐—ด ๐˜„๐—ฒ๐—ถ๐—ด๐—ต๐˜ ๐˜ƒ๐—ฎ๐—ฟ๐—ถ๐—ฎ๐—ป๐—ฐ๐—ฒ ๐—ฎ๐˜ ๐—ถ๐—ป๐—ถ๐˜๐—ถ๐—ฎ๐—น๐—ถ๐˜‡๐—ฎ๐˜๐—ถ๐—ผ๐—ป ๐—ฎ๐—ป๐—ฑ ๐—ฑ๐˜‚๐—ฟ๐—ถ๐—ป๐—ด ๐—ฝ๐—ฟ๐—ฒ-๐˜๐—ฟ๐—ฎ๐—ถ๐—ป๐—ถ๐—ป๐—ด is crucial for improving downstream task performanceโ€”leading to gains of up to ๐Ÿฐ.๐Ÿฒ% ๐—ผ๐—ป ๐—ฐ๐—ผ๐—บ๐—บ๐—ผ๐—ป ๐—ฏ๐—ฒ๐—ป๐—ฐ๐—ต๐—บ๐—ฎ๐—ฟ๐—ธ๐˜€! ๐Ÿ“ˆ

To achieve this, we introduce:
✅ Layer Index Rescaling (LIR) – a weight initialization scheme
✅ Target Variance Rescaling (TVR) – a variance control strategy

Beyond performance gains, these techniques also help reduce extreme activation values, mitigating risks in quantization and low-precision training for LLMs.
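As a rough illustration of what a target-variance rescaling step during pre-training might involve, the sketch below periodically rescales each linear weight matrix so that its empirical standard deviation matches a target value. Which layers are touched, the target value, and the rescaling frequency are all assumptions made for this example; the actual TVR strategy is specified in the paper and the repository.

```python
import torch

@torch.no_grad()
def target_variance_rescale(model: torch.nn.Module, target_std: float = 0.02) -> None:
    """Illustrative variance control: pull each Linear weight's std back to a target.

    The layer selection, target value, and schedule are assumptions for this
    sketch only; the paper's TVR strategy may differ in all three respects.
    """
    for module in model.modules():
        if isinstance(module, torch.nn.Linear):
            current_std = module.weight.std()
            if current_std > 0:
                # Multiplicative rescaling sets the new std exactly to target_std.
                module.weight.mul_(target_std / current_std)

# Assumed usage inside the training loop, e.g. every 1000 optimizer steps:
# if step % 1000 == 0:
#     target_variance_rescale(model)
```

The idea this sketch is meant to convey is simply that weight variance is monitored and corrected during training rather than being fixed once at initialization.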

@louisowen6 @akanyaani @nilabhra @gueraf

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

- Scale-Distribution Decoupling: Enabling Stable and Effective Training of Large Language Models (2025) – https://huggingface.co/papers/2502.15499
- HybridNorm: Towards Stable and Efficient Transformer Training via Hybrid Normalization (2025) – https://huggingface.co/papers/2503.04598
- AdaGC: Improving Training Stability for Large Language Model Pretraining (2025) – https://huggingface.co/papers/2502.11034
- Binary Neural Networks for Large Language Model: A Survey (2025) – https://huggingface.co/papers/2502.19008
- Peri-LN: Revisiting Layer Normalization in the Transformer Architecture (2025) – https://huggingface.co/papers/2502.02732
- A Good Start Matters: Enhancing Continual Learning with Data-Driven Weight Initialization (2025) – https://huggingface.co/papers/2503.06385
- Hyperspherical Normalization for Scalable Deep Reinforcement Learning (2025) – https://huggingface.co/papers/2502.15280

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend


Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2503.17500 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2503.17500 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2503.17500 in a Space README.md to link it from this page.

Collections including this paper 1