Pruning and Distilling Mixture-of-Experts into Dense Language Models
Abstract
A systematic framework converts mixture-of-experts models into dense architectures through expert scoring, selection, grouping, and knowledge distillation, outperforming dense-to-dense pruning by 6.3 pp in average downstream accuracy at matched parameter count while training 1.6x faster.
Mixture-of-Experts (MoE) is now the dominant architecture for frontier language models, yet it requires all expert parameters to be loaded in memory, which makes it a poor fit for memory-constrained deployment. Existing compression methods reduce the number of experts, but the output remains an MoE model with the same fundamental limitation. We present the first systematic framework for converting a trained MoE into a standard, fully dense architecture: experts are scored, selected, and grouped, then concatenated into a dense FFN and refined by knowledge distillation from the MoE teacher. We evaluate 7 scoring, 5 grouping, and 2 magnitude-scaling methods across a range of selected expert counts on Qwen3-30B-A3B, yielding 350 configurations. We find that the choice of scoring method has the largest impact, with our novel diversity-aware scoring consistently outperforming prior methods on Qwen3-30B-A3B, DeepSeek-V2-Lite, and GPT-OSS-20B. Under a controlled comparison at matched parameter count, MoE-to-dense outperforms dense-to-dense pruning by +6.3 pp in average downstream accuracy after ~4B-token distillation, while training 1.6x faster in wall-clock time.
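The select-and-concatenate step can be illustrated with a minimal NumPy sketch. It rests on assumptions not stated in the abstract: each expert is taken to be a SwiGLU FFN with `gate`, `up`, and `down` matrices, the scores are random placeholders rather than the paper's diversity-aware scoring, and the function names (`select_experts`, `concat_to_dense`) are hypothetical. Grouping, magnitude scaling, and distillation are omitted. The sketch only shows that stacking the selected experts along the hidden dimension gives a dense FFN whose output equals the unweighted sum of those experts' outputs.

```python
import numpy as np


def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))


def expert_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward block (assumed expert shape, not confirmed by the source)."""
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down


def select_experts(scores, k):
    """Return the indices of the k highest-scoring experts, sorted ascending."""
    top = np.argsort(scores)[::-1][:k]
    return np.sort(top)


def concat_to_dense(experts, selected):
    """Stack the selected experts along the hidden dimension into one dense FFN."""
    w_gate = np.concatenate([experts[i]["gate"] for i in selected], axis=1)
    w_up = np.concatenate([experts[i]["up"] for i in selected], axis=1)
    w_down = np.concatenate([experts[i]["down"] for i in selected], axis=0)
    return w_gate, w_up, w_down


rng = np.random.default_rng(0)
d_model, d_expert, n_experts, k = 8, 4, 6, 3

# Toy experts with random weights (illustrative only).
experts = [
    {
        "gate": rng.normal(size=(d_model, d_expert)),
        "up": rng.normal(size=(d_model, d_expert)),
        "down": rng.normal(size=(d_expert, d_model)),
    }
    for _ in range(n_experts)
]

# Placeholder scores; a real pipeline would compute these from the trained MoE.
scores = rng.random(n_experts)
selected = select_experts(scores, k)
dense_gate, dense_up, dense_down = concat_to_dense(experts, selected)

x = rng.normal(size=(2, d_model))
dense_out = expert_ffn(x, dense_gate, dense_up, dense_down)
sum_out = sum(
    expert_ffn(x, experts[i]["gate"], experts[i]["up"], experts[i]["down"])
    for i in selected
)
```

Here `dense_out` and `sum_out` agree up to floating-point error. The original MoE layer instead combines a routed subset of experts with router weights, so the concatenated FFN is only an initialization, which the abstract says is then refined by distillation from the MoE teacher.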
Community
We introduce the first systematic framework for converting a trained MoE into a fully dense model: score, select, group, concatenate into a dense FFN, then distill. A 350-config sweep on Qwen3-30B-A3B (also DeepSeek-V2-Lite, GPT-OSS-20B) finds our novel diversity-aware scoring consistently wins. At matched params, MoE→dense beats dense→dense pruning by +6.3pp at 1.6× faster training.
Models citing this paper 2
EvanOLeary/laguna-xs2-dense-k8-recon