Mixture-of-Experts (MoE) architectures have become a cornerstone for scaling large language models (LLMs) efficiently, yet how their sparse structure shapes knowledge acquisition during pre-training remains poorly understood. Existing interpretability methods predominantly focus on post-hoc analysis of dense models, overlooking the dynamic, architectural differences that define MoE. To bridge this gap, we introduce Gated-LPI, a neuron-level attribution metric that decomposes log-probability increase across neurons. We present the first time-resolved comparison of knowledge acquisition dynamics in MoE versus dense architectures by tracking checkpoints across 1.2M training steps ($\approx 5.2$T tokens). Our analysis reveals three key phenomena: (1) Early consolidation. The MoE model locks into a stable importance profile within $<$100K steps, whereas the dense model remains volatile throughout training. (2) Low-entropy backbone. The top $\approx 1\%$ of MoE neurons consistently receive $>$45\% of positive updates, creating a persistent, high-utility core absent in the dense baseline. (3) Functional robustness. Masking the ten most important MoE attention heads reduces relational HIT@10 by $<$10\%, compared with $>$50\% for the dense model, showing that sparsity fosters distributed, rather than brittle, knowledge storage. These phenomena collectively demonstrate that sparsity fosters an intrinsically stable and distributed computational backbone from early in training. Together, these findings bridge the gap between sparse architectures and training-time interpretability, offering actionable insights for expert-pruning and routing-strategy design in next-generation MoE models.
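The abstract defines Gated-LPI only at a high level: a neuron-level attribution that decomposes the log-probability increase between checkpoints, with expert routing gates determining which neurons can contribute. The sketch below is a first-order, illustrative toy of that idea, not the paper's actual implementation; the function name `gated_lpi` and all array names are hypothetical, and a real metric would operate on full model activations rather than the random vectors used here.

```python
import numpy as np

def gated_lpi(acts_old, acts_new, gates_old, gates_new, readout):
    """Toy sketch of a gated neuron-level attribution (hypothetical).

    acts_old, acts_new   : (n,) neuron activations at two checkpoints
    gates_old, gates_new : (n,) router gate values (all ones for a dense model;
                           mostly zero under sparse top-k routing)
    readout              : (n,) projection of each neuron onto the target
                           token's logit direction

    Returns a per-neuron vector whose entries attribute the change in the
    target log-probability to individual neurons; entries for neurons the
    router never activates are exactly zero.
    """
    contrib_old = gates_old * acts_old * readout
    contrib_new = gates_new * acts_new * readout
    return contrib_new - contrib_old

# Toy usage: 8 neurons, sparse routing keeps only the first 2 active.
rng = np.random.default_rng(0)
n = 8
acts0, acts1 = rng.normal(size=n), rng.normal(size=n)
gates = np.where(np.arange(n) < 2, 1.0, 0.0)  # top-2 routing mask
readout = rng.normal(size=n)

lpi = gated_lpi(acts0, acts1, gates, gates, readout)
ranking = np.argsort(-np.abs(lpi))  # rank neurons by attributed update magnitude
```

Under this kind of decomposition, ranking neurons by cumulative positive attribution over training would surface the "low-entropy backbone" the abstract describes: a small set of neurons receiving most of the positive updates.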