Paper page - Canzona: A Unified, Asynchronous, and Load-Balanced Framework for Distributed Matrix-based Optimizers
https://arxivexplained.com/papers/canzona-a-unified-asynchronous-and-load-balanced-framework-for-distributed-matrix-based-optimizers

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* [PROBE: Co-Balancing Computation and Communication in MoE Inference via Real-Time Predictive Prefetching](https://huggingface.co/papers/2602.00509) (2026)
* [Revisiting Parameter Server in LLM Post-Training](https://huggingface.co/papers/2601.19362) (2026)
* [Diving into 3D Parallelism with Heterogeneous Spot Instance GPUs: Design and Implications](https://huggingface.co/papers/2512.20953) (2025)
* [BOOST: BOttleneck-Optimized Scalable Training Framework for Low-Rank Large Language Models](https://huggingface.co/papers/2512.12131) (2025)
* [Staggered Batch Scheduling: Co-optimizing Time-to-First-Token and Throughput for High-Efficiency LLM Inference](https://huggingface.co/papers/2512.16134) (2025)
* [DataStates-LLM: Scalable Checkpointing for Transformer Models Using Composable State Providers](https://huggingface.co/papers/2601.16956) (2026)
* [Horizon-LM: A RAM-Centric Architecture for LLM Training](https://huggingface.co/papers/2602.04816) (2026)

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out [this Space](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers).

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`
\n","updatedAt":"2026-02-10T01:42:15.078Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":317,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.7510803937911987},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2602.06079","authors":[{"_id":"69895559beecc443208d26a2","user":{"_id":"66224a84afbc88c1e4881ad7","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/66224a84afbc88c1e4881ad7/fGDiIiqhTBQri3khSqNcU.jpeg","isPro":false,"fullname":"Liangyu Wang","user":"ly4096","type":"user"},"name":"Liangyu Wang","status":"claimed_verified","statusLastChangedAt":"2026-02-11T22:17:13.834Z","hidden":false},{"_id":"69895559beecc443208d26a3","name":"Siqi Zhang","hidden":false},{"_id":"69895559beecc443208d26a4","name":"Junjie Wang","hidden":false},{"_id":"69895559beecc443208d26a5","name":"Yiming Dong","hidden":false},{"_id":"69895559beecc443208d26a6","name":"Bo Zheng","hidden":false},{"_id":"69895559beecc443208d26a7","name":"Zihan Qiu","hidden":false},{"_id":"69895559beecc443208d26a8","name":"Shengkun Tang","hidden":false},{"_id":"69895559beecc443208d26a9","name":"Di Wang","hidden":false},{"_id":"69895559beecc443208d26aa","name":"Rui Men","hidden":false},{"_id":"69895559beecc443208d26ab","name":"Dayiheng Liu","hidden":false}],"publishedAt":"2026-02-04T07:38:24.000Z","submittedOnDailyAt":"2026-02-09T05:54:16.367Z","title":"Canzona: A Unified, Asynchronous, and Load-Balanced Framework for Distributed Matrix-based Optimizers","submittedOnDailyBy":{"_id":"66224a84afbc88c1e4881ad7","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/66224a84afbc88c1e4881ad7/fGDiIiqhTBQri3khSqNcU.jpeg","isPro":false,"fullname":"Liangyu Wang","user":"ly4096","type":"user"},"summary":"The scaling of Large Language Models (LLMs) drives interest in matrix-based optimizers (e.g., Shampoo, Muon, SOAP) for their convergence efficiency; yet their requirement for holistic updates conflicts with the tensor fragmentation in distributed frameworks like Megatron. Existing solutions are suboptimal: synchronous approaches suffer from computational redundancy, while layer-wise partitioning fails to reconcile this conflict without violating the geometric constraints of efficient communication primitives. To bridge this gap, we propose Canzona, a Unified, Asynchronous, and Load-Balanced framework that decouples logical optimizer assignment from physical parameter distribution. For Data Parallelism, we introduce an alpha-Balanced Static Partitioning strategy that respects atomicity while neutralizing the load imbalance. For Tensor Parallelism, we design an Asynchronous Compute pipeline utilizing Micro-Group Scheduling to batch fragmented updates and hide reconstruction overhead. 
Extensive evaluations on the Qwen3 model family (up to 32B parameters) on 256 GPUs demonstrate that our approach preserves the efficiency of established parallel architectures, achieving a 1.57x speedup in end-to-end iteration time and reducing optimizer step latency by 5.8x compared to the baseline.","upvotes":18,"discussionId":"6989555abeecc443208d26ac","ai_summary":"Canzona presents a unified asynchronous framework that addresses the conflict between matrix-based optimizers and distributed tensor fragmentation in LLM training, improving efficiency and reducing latency.","ai_keywords":["Large Language Models","matrix-based optimizers","Shampoo","Muon","SOAP","distributed frameworks","Megatron","synchronous approaches","layer-wise partitioning","geometric constraints","Canzona","Data Parallelism","alpha-Balanced Static Partitioning","tensor fragmentation","Tensor Parallelism","Asynchronous Compute pipeline","Micro-Group Scheduling","optimizer step latency","end-to-end iteration time"],"organization":{"_id":"64c8b5837fe12ecd0a7e92eb","name":"Qwen","fullname":"Qwen","avatar":"https://cdn-uploads.huggingface.co/production/uploads/620760a26e3b7210c2ff1943/-s1gyJfvbE1RgO5iBeNOi.png"}},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"66224a84afbc88c1e4881ad7","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/66224a84afbc88c1e4881ad7/fGDiIiqhTBQri3khSqNcU.jpeg","isPro":false,"fullname":"Liangyu Wang","user":"ly4096","type":"user"},{"_id":"67971a528c7a5e66d4c792e2","avatarUrl":"/avatars/b155d2bbad4ae317305f85db535ada03.svg","isPro":false,"fullname":"Yiming Dong","user":"ymdongpku","type":"user"},{"_id":"64322f41dec2a70d81365aca","avatarUrl":"/avatars/2f75d094b1e0a94fdb14d415ff02c7af.svg","isPro":false,"fullname":"jjwang","user":"jie23","type":"user"},{"_id":"68b5bbe9ab4be4280d8e76ee","avatarUrl":"/avatars/d19bebf00ef9eccef488ac3a6a344722.svg","isPro":false,"fullname":"Xie","user":"Huanyiiiii","type":"user"},{"_id":"668318e62ca1c52c272bafb0","avatarUrl":"/avatars/83ef4e184a5be14a03a24ef790314e02.svg","isPro":false,"fullname":"Xinhai Wang","user":"wangx0t","type":"user"},{"_id":"63e76e2bfdb4097ef65e0745","avatarUrl":"/avatars/6d4d94ab6f44e23437488fd9fed2a383.svg","isPro":false,"fullname":"Tang","user":"Shengkun","type":"user"},{"_id":"6434d4989bd5a84b5dd0b0f5","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6434d4989bd5a84b5dd0b0f5/0Elf9qbfG9Hkgypm9pTGm.jpeg","isPro":false,"fullname":"Dayiheng Liu","user":"Losin94","type":"user"},{"_id":"610b70452719facd4ea85e28","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/610b70452719facd4ea85e28/S7nMy7D0Rxq0VIVblhYDG.jpeg","isPro":false,"fullname":"Chujie Zheng","user":"chujiezheng","type":"user"},{"_id":"650d82cffb7a5108875e9c35","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/noauth/vvDvhekO1YCaEBG5nJ-bK.jpeg","isPro":false,"fullname":"Jihao 
Xin","user":"JihaoXin","type":"user"},{"_id":"661ab1f1fa3b144a381fa454","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/661ab1f1fa3b144a381fa454/IlpZBb9NCjo7ntFwMIH53.png","isPro":false,"fullname":"Urro","user":"urroxyz","type":"user"},{"_id":"65294b334d7cf551ac50d6a6","avatarUrl":"/avatars/75d21e20b711b871616ef3850bb900b7.svg","isPro":false,"fullname":"ChengpengLi","user":"ChengpengLi","type":"user"},{"_id":"6687b233586426849536faff","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6687b233586426849536faff/q7EBRrWlk2eYidsKCPC9h.jpeg","isPro":false,"fullname":"Ru Peng","user":"RuPeng","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":0,"organization":{"_id":"64c8b5837fe12ecd0a7e92eb","name":"Qwen","fullname":"Qwen","avatar":"https://cdn-uploads.huggingface.co/production/uploads/620760a26e3b7210c2ff1943/-s1gyJfvbE1RgO5iBeNOi.png"}}">
AI-generated summary: Canzona presents a unified asynchronous framework that addresses the conflict between matrix-based optimizers and distributed tensor fragmentation in LLM training, improving efficiency and reducing latency.
We propose Canzona, a unified, asynchronous, and load-balanced framework that makes matrix-based optimizers (e.g., Muon, Shampoo, SOAP) work efficiently under Megatron-style tensor fragmentation by decoupling logical optimizer assignment from physical parameter distribution. It introduces α-Balanced Static Partitioning for Data Parallelism, which preserves update atomicity while balancing per-rank load, and an asynchronous compute pipeline with Micro-Group Scheduling for Tensor Parallelism, which batches fragmented updates and hides reconstruction overhead. On Qwen3-32B with 256 GPUs, this yields a 1.57× end-to-end iteration speedup and 5.8× lower optimizer-step latency.
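A minimal PyTorch sketch of what Micro-Group Scheduling could look like, assuming TP shards are reconstructed with `all_gather` and the holistic update is a placeholder. The two-stream prefetch scheme, the simplified stream handling, and all function names (`gather_group`, `matrix_update`, `micro_group_step`) are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch: fragments of several weight matrices are batched
# into "micro-groups"; while the compute stream runs the matrix-based
# update for group k, the comm stream already all-gathers group k+1,
# hiding reconstruction latency behind optimizer compute.
import torch
import torch.distributed as dist

def gather_group(group_shards, tp_group):
    """Reconstruct full matrices from TP shards (sharding on dim 0 assumed)."""
    full = []
    for shard in group_shards:
        parts = [torch.empty_like(shard) for _ in range(dist.get_world_size(tp_group))]
        dist.all_gather(parts, shard, group=tp_group)
        full.append(torch.cat(parts, dim=0))
    return full

def matrix_update(full_weights):
    """Placeholder for a holistic matrix-based step (e.g. a Muon-style
    orthogonalized update); a no-op stand-in here."""
    return [w for w in full_weights]

def micro_group_step(micro_groups, tp_group):
    comm = torch.cuda.Stream()
    compute = torch.cuda.current_stream()
    prefetched = None
    for k, group in enumerate(micro_groups):
        # First iteration: no prefetch yet, so gather on the comm stream now.
        if prefetched is None:
            with torch.cuda.stream(comm):
                prefetched = gather_group(group, tp_group)
        compute.wait_stream(comm)   # block compute until group k is rebuilt
        current, prefetched = prefetched, None
        # Enqueue the gather for group k+1 before updating group k, so the
        # two streams overlap on the GPU.
        if k + 1 < len(micro_groups):
            with torch.cuda.stream(comm):
                prefetched = gather_group(micro_groups[k + 1], tp_group)
        matrix_update(current)      # holistic update runs under the next gather
```

The key design point the sketch tries to capture is that batching fragments into micro-groups gives the gather enough work to amortize launch overhead, while the one-group-ahead prefetch keeps reconstruction off the critical path.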