Unifying Mixture of Experts and Multi-Head Latent Attention for Efficient Language Models

Premium content

Access to this content requires a subscription. You must be a premium user to view this content.

Monthly subscription - $9.99 Pay per view - $4.99 Access through your institution Login with Underline account

Need help?

Contact us

EMNLP 2025

•

November 08, 2025

•

Suzhou, China

Would you like to see your presentation here, made available to a global audience of researchers?
Add your own presentation or have us affordably record your next conference.

We present MoE-MLA-RoPE, a novel architecture combination that combines Mixture of Experts (MoE) with Multi-head Latent Attention (MLA) and Rotary Position Embeddings (RoPE) for efficient small language models. Our approach addresses the fundamental trade-off between model capacity and computational efficiency through three key innovations: (1) fine-grained expert routing with 64 micro-experts and top-$k$ selection, enabling flexible specialization through $\binom{62}{6} \approx 3.6 \times 10^{7}$ possible expert combinations; (2) shared expert isolation that dedicates 2 always active experts for common patterns while routing to 6 of 62 specialized experts; and (3) gradient-conflict-free load balancing that maintains expert utilization without interfering with primary loss optimization. Extensive experiments on models ranging from 17M to 202M parameters demonstrate that \oursmoe with compression ratio $r=d/2$ achieves 68\% KV cache memory reduction and 3.2× inference speedup while maintaining competitive perplexity (0.8\% degradation). Compared to the parameters with 53.9M parameters, \oursmoe improves the validation loss by 6.9\% over the vanilla transformers while using 42\% fewer active parameters per forward pass. FLOP-matched experiments reveal even larger gains: 11.1\% improvement with 3.2× inference acceleration. Automated evaluation using GPT-4 as a judge confirms quality improvements in generation, with higher scores on coherence (8.1/10), creativity (7.9/10) and grammatical correctness (8.2/10). Our results establish that architectural synergy, not parameter scaling, defines the efficiency frontier for resource-constrained language model deployment.