Paper page - ECO: Quantized Training without Full-Precision Master Weights
\n","updatedAt":"2026-01-30T22:29:40.764Z","author":{"_id":"65243980050781c16f234f1f","avatarUrl":"/avatars/743a009681d5d554c27e04300db9f267.svg","fullname":"Avi","name":"avahal","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":3,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.7691949605941772},"editors":["avahal"],"editorAvatarUrls":["/avatars/743a009681d5d554c27e04300db9f267.svg"],"reactions":[],"isReport":false}},{"id":"697d5d87440ea4e00c831fd4","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false},"createdAt":"2026-01-31T01:40:23.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"This is an automated message from the [Librarian Bot](https://huggingface.co/librarian-bots). I found the following papers similar to this paper. \n\nThe following papers were recommended by the Semantic Scholar API \n\n* [FOAM: Blocked State Folding for Memory-Efficient LLM Training](https://huggingface.co/papers/2512.07112) (2025)\n* [HESTIA: A Hessian-Guided Differentiable Quantization-Aware Training Framework for Extremely Low-Bit LLMs](https://huggingface.co/papers/2601.20745) (2026)\n* [SASQ: Static Activation Scaling for Quantization-Aware Training in Large Language Models](https://huggingface.co/papers/2512.14481) (2025)\n* [Controlled LLM Training on Spectral Sphere](https://huggingface.co/papers/2601.08393) (2026)\n* [SignRoundV2: Closing the Performance Gap in Extremely Low-Bit Post-Training Quantization for LLMs](https://huggingface.co/papers/2512.04746) (2025)\n* [What Makes Low-Bit Quantization-Aware Training Work for Reasoning LLMs? A Systematic Study](https://huggingface.co/papers/2601.14888) (2026)\n* [Sliced-Wasserstein Distribution Alignment Loss Improves the Ultra-Low-Bit Quantization of Large Language Models](https://huggingface.co/papers/2601.07878) (2026)\n\n\n Please give a thumbs up to this comment if you found it helpful!\n\n If you want recommendations for any Paper on Hugging Face checkout [this](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers) Space\n\n You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`","html":"
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
\n
The following papers were recommended by the Semantic Scholar API
Please give a thumbs up to this comment if you found it helpful!
\n
If you want recommendations for any Paper on Hugging Face checkout this Space
\n
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: \n\n@librarian-bot\n\t recommend
\n","updatedAt":"2026-01-31T01:40:23.310Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":318,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.73231041431427},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2601.22101","authors":[{"_id":"697c69e5a67238fac88cc270","user":{"_id":"6526b8ebba9a8279c139616b","avatarUrl":"/avatars/09f6b677603a03be128996a0765233e6.svg","isPro":false,"fullname":"Mahdi Nikdan","user":"mnikdan97","type":"user"},"name":"Mahdi Nikdan","status":"claimed_verified","statusLastChangedAt":"2026-01-30T13:31:32.261Z","hidden":false},{"_id":"697c69e5a67238fac88cc271","name":"Amir Zandieh","hidden":false},{"_id":"697c69e5a67238fac88cc272","name":"Dan Alistarh","hidden":false},{"_id":"697c69e5a67238fac88cc273","name":"Vahab Mirrokni","hidden":false}],"publishedAt":"2026-01-29T18:35:01.000Z","submittedOnDailyAt":"2026-01-30T06:34:59.557Z","title":"ECO: Quantized Training without Full-Precision Master Weights","submittedOnDailyBy":{"_id":"6526b8ebba9a8279c139616b","avatarUrl":"/avatars/09f6b677603a03be128996a0765233e6.svg","isPro":false,"fullname":"Mahdi Nikdan","user":"mnikdan97","type":"user"},"summary":"Quantization has significantly improved the compute and memory efficiency of Large Language Model (LLM) training. However, existing approaches still rely on accumulating their updates in high-precision: concretely, gradient updates must be applied to a high-precision weight buffer, known as master weights. This buffer introduces substantial memory overhead, particularly for Sparse Mixture of Experts (SMoE) models, where model parameters and optimizer states dominate memory usage. To address this, we introduce the Error-Compensating Optimizer (ECO), which eliminates master weights by applying updates directly to quantized parameters. ECO quantizes weights after each step and carefully injects the resulting quantization error into the optimizer momentum, forming an error-feedback loop with no additional memory. We prove that, under standard assumptions and a decaying learning rate, ECO converges to a constant-radius neighborhood of the optimum, while naive master-weight removal can incur an error that is inversely proportional to the learning rate. We show empirical results for pretraining small Transformers (30-800M), a Gemma-3 1B model, and a 2.1B parameter Sparse MoE model with FP8 quantization, and fine-tuning DeepSeek-MoE-16B in INT4 precision. 
Throughout, ECO matches baselines with master weights up to near-lossless accuracy, significantly shifting the static memory vs validation loss Pareto frontier.","upvotes":6,"discussionId":"697c69e5a67238fac88cc274","ai_summary":"Error-compensating optimizer eliminates memory overhead from master weights in quantized LLM training while maintaining near-lossless accuracy.","ai_keywords":["quantization","Large Language Models","Sparse Mixture of Experts","master weights","gradient updates","error-compensating optimizer","error-feedback loop","convergence","Pareto frontier","FP8 quantization","INT4 precision"],"organization":{"_id":"5e6aca39878b8b2bf9806447","name":"google","fullname":"Google","avatar":"https://cdn-uploads.huggingface.co/production/uploads/5dd96eb166059660ed1ee413/WtA3YYitedOr9n02eHfJe.png"}},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"66d8512c54209e9101811e8e","avatarUrl":"/avatars/62dfd8e6261108f2508efe678d5a2a57.svg","isPro":false,"fullname":"M Saad Salman","user":"MSS444","type":"user"},{"_id":"687f8f525dcc8e6b36e4c71e","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/XixQoK2j0oEZutnNzAdt9.png","isPro":false,"fullname":"Croc-Prog-HF","user":"Croc-Prog-HF","type":"user"},{"_id":"63c1699e40a26dd2db32400d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/63c1699e40a26dd2db32400d/3N0-Zp8igv8-52mXAdiiq.jpeg","isPro":false,"fullname":"Chroma","user":"Chroma111","type":"user"},{"_id":"69783faf6b191c5d1f88b263","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/69783faf6b191c5d1f88b263/6SkN2eZwYigK1jEV1vX4-.png","isPro":false,"fullname":"Steven Lees","user":"Subzteveo","type":"user"},{"_id":"661ab1f1fa3b144a381fa454","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/661ab1f1fa3b144a381fa454/IlpZBb9NCjo7ntFwMIH53.png","isPro":false,"fullname":"Urro","user":"urroxyz","type":"user"},{"_id":"64834b399b352597e41816ac","avatarUrl":"/avatars/63d9d123bffa90f43186a0bdc4455cbd.svg","isPro":false,"fullname":"Shaobai Jiang","user":"shaobaij","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0,"organization":{"_id":"5e6aca39878b8b2bf9806447","name":"google","fullname":"Google","avatar":"https://cdn-uploads.huggingface.co/production/uploads/5dd96eb166059660ed1ee413/WtA3YYitedOr9n02eHfJe.png"}}">
AI-generated summary
Error-compensating optimizer eliminates memory overhead from master weights in quantized LLM training while maintaining near-lossless accuracy.

Abstract
Quantization has significantly improved the compute and memory efficiency of Large Language Model (LLM) training. However, existing approaches still rely on accumulating their updates in high precision: concretely, gradient updates must be applied to a high-precision weight buffer, known as master weights. This buffer introduces substantial memory overhead, particularly for Sparse Mixture of Experts (SMoE) models, where model parameters and optimizer states dominate memory usage. To address this, we introduce the Error-Compensating Optimizer (ECO), which eliminates master weights by applying updates directly to quantized parameters. ECO quantizes weights after each step and carefully injects the resulting quantization error into the optimizer momentum, forming an error-feedback loop with no additional memory. We prove that, under standard assumptions and a decaying learning rate, ECO converges to a constant-radius neighborhood of the optimum, while naive master-weight removal can incur an error that is inversely proportional to the learning rate. We show empirical results for pretraining small Transformers (30-800M parameters), a Gemma-3 1B model, and a 2.1B-parameter Sparse MoE model with FP8 quantization, as well as for fine-tuning DeepSeek-MoE-16B in INT4 precision. Throughout, ECO matches master-weight baselines with near-lossless accuracy, significantly shifting the static-memory vs. validation-loss Pareto frontier.
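To make the error-feedback idea in the abstract concrete, here is a minimal sketch of a step that applies updates directly to quantized weights and folds the quantization error back into the momentum buffer. This is not the authors' implementation: the `fake_quant_int8` quantizer, the SGD-with-momentum base optimizer, and the 1/lr scaling of the injected error are illustrative assumptions; ECO's exact formulation may differ.

```python
# Illustrative sketch (not the paper's code): one error-feedback step that keeps only
# quantized weights plus the usual momentum buffer, i.e. no FP32 master copy.
import torch

def fake_quant_int8(w: torch.Tensor) -> torch.Tensor:
    """Symmetric per-tensor INT8 round-to-nearest, dequantized back to float (assumed quantizer)."""
    scale = w.abs().max().clamp(min=1e-8) / 127.0
    return torch.round(w / scale).clamp(-127, 127) * scale

@torch.no_grad()
def eco_style_step(q_weight, momentum, grad, lr=1e-2, beta=0.9):
    # Standard SGD-with-momentum update, computed from the quantized weights.
    momentum.mul_(beta).add_(grad)
    target = q_weight - lr * momentum      # the "ideal" high-precision result of this step
    new_q = fake_quant_int8(target)        # snap it back onto the quantized grid
    quant_err = new_q - target             # error introduced by re-quantizing the weights
    # Error feedback: inject the error into the existing momentum buffer (here scaled by
    # 1/lr so that the next -lr * momentum step pushes the weights back toward the ideal
    # trajectory). No state beyond the momentum buffer is allocated.
    momentum.add_(quant_err / lr)
    q_weight.copy_(new_q)
```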
We present the Error-Compensating Optimizer (ECO), which integrates with standard optimizers and, for the first time, enables quantized training of large-scale LLMs without requiring high-precision master weights.
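A toy usage of the sketch above, to show how such a step slots into an ordinary training loop. The names `eco_style_step` and `fake_quant_int8` come from the illustrative sketch, not from the paper's API; only the quantized weight tensor and one momentum buffer persist across steps.

```python
# Toy training loop using the hypothetical helpers defined in the sketch above.
import torch

torch.manual_seed(0)
w = fake_quant_int8(torch.randn(256, 128) * 0.02)  # weights start on the quantized grid
m = torch.zeros_like(w)                            # momentum doubles as error-feedback state
x, y = torch.randn(64, 256), torch.randn(64, 128)

for step in range(200):
    w_live = w.clone().requires_grad_(True)        # temporary copy only for autograd
    loss = torch.nn.functional.mse_loss(x @ w_live, y)
    loss.backward()
    eco_style_step(w, m, w_live.grad, lr=5e-2, beta=0.9)
```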