Abstract
Researchers propose a novel router redesign for Mixture-of-Experts models that aligns router rows with the principal singular directions of expert matrices using Manifold Power Iteration to improve model effectiveness.
The router is the cornerstone component of Mixture-of-Experts (MoE) models. Serving as expert proxies, the rows of the router matrix compute their similarity to the MoE inputs to determine which subset of experts is activated. Ideally, each router row should condense its expert matrix into a single representative vector, so that its dot product with a token better reflects token-expert affinity. However, there exist no design principles to enforce this condensation. In this paper, we propose to align each router row with the principal singular direction of the associated expert, as this direction provides the most expressive mathematical description of a matrix. Based on this principle, we propose a router redesign with Manifold Power Iteration (MPI). Specifically, it introduces a "Power-then-Retract" paradigm, in which a power iteration step is performed on the router weights, followed by a retraction that imposes a norm constraint to ensure both efficiency and stability. Theoretically, we show that MPI drives router rows to converge toward the principal singular directions of their associated experts. Empirically, we pretrain MoE models at scales from 1B to 11B parameters to confirm that this alignment yields more effective MoE models.
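The "Power-then-Retract" idea can be illustrated with a small NumPy sketch. This is not the paper's implementation: the abstract does not give the exact update rule, the number of steps, or the retraction used. The sketch assumes each expert is a single matrix W applied as y = W x, that the router row lives in the input space (so the target is the top right singular vector of W), and that the retraction is plain renormalization to unit length. The names `make_expert`, `power_then_retract`, and `top_k_route` are hypothetical.

```python
import numpy as np

def make_expert(rng, d_out, d_in):
    # Toy expert with a known, well-separated spectrum (an assumption made
    # only so that the demo converges quickly).
    U, _ = np.linalg.qr(rng.standard_normal((d_out, d_in)))
    V, _ = np.linalg.qr(rng.standard_normal((d_in, d_in)))
    s = 2.0 ** -np.arange(d_in)          # singular values 1, 1/2, 1/4, ...
    return U @ np.diag(s) @ V.T

def power_then_retract(router, experts, steps=1):
    # Hypothetical sketch: a power step on each router row, followed by a
    # retraction that rescales the row back to unit norm.
    router = router.copy()
    for _ in range(steps):
        for i, W in enumerate(experts):
            v = W.T @ (W @ router[i])              # power step with W^T W
            router[i] = v / np.linalg.norm(v)      # retraction to the unit sphere
    return router

def top_k_route(router, x, k=2):
    # Standard dot-product scoring: pick the k experts with the highest score.
    scores = router @ x
    return np.argsort(scores)[::-1][:k]

rng = np.random.default_rng(0)
d_in, d_out, n_experts = 8, 16, 4
experts = [make_expert(rng, d_out, d_in) for _ in range(n_experts)]

router = rng.standard_normal((n_experts, d_in))
router /= np.linalg.norm(router, axis=1, keepdims=True)

aligned = power_then_retract(router, experts, steps=50)

# |cosine| between each aligned row and the top right singular vector from SVD.
alignment = np.array([
    abs(aligned[i] @ np.linalg.svd(W)[2][0]) for i, W in enumerate(experts)
])

chosen = top_k_route(aligned, rng.standard_normal(d_in), k=2)
```

With the toy spectrum above, every entry of `alignment` approaches 1 after a few dozen steps, since each power step shrinks the non-principal components by at least a factor of four relative to the principal one. Real expert matrices may have much smaller spectral gaps, and the abstract describes a single step interleaved with training rather than a run to convergence.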
Community
We propose a redesign of the MoE router that applies Power Iteration during the forward pass to couple router weights and expert parameters within the singular space of the expert parameters. We contend that this imposes an explicit constraint that forces router weights to better reflect the parametric characteristics of the expert weights, resulting in better expert routing. Our initial results and extensive analysis validate the effectiveness of this design. We hope our work inspires researchers to rethink MoE routers and leads to more valuable insights for future router designs.
This is a neat approach to MoE routing. I like the idea of moving away from arbitrary router weights and instead using the principal singular direction of the experts to guide the selection process. It feels like a much more grounded way to define token-expert affinity than how most models currently handle it.
Since this uses a Power-then-Retract paradigm, how much of a computational overhead does this add during the training loop compared to standard routing?
I made a podcast on it with ResearchPod, which makes it easy to get the key concepts on the go:
https://researchpod.app/episode/b091d9ea-bfd5-4ea9-bced-18546d1f87e4