The Mixture-of-Experts (MoE) architecture with expert parallelism scales LLMs efficiently by activating only a subset of experts per input, avoiding proportional growth in training cost. However, the intensive and heterogeneous communication substantially hinders the efficiency and scalability of MoE training in resource-constrained scenarios. Existing communication compression techniques fall short in MoE training because: (\textit{i}) intensive communication amplifies compression overhead, compromising training efficiency; (\textit{ii}) accumulated compression errors propagate through the network, degrading training quality. In this paper, we propose RCMoE, a communication-efficient \textbf{R}andom \textbf{C}ompression framework for \uline{MoE} training with two core modules: (\textit{i}) \textit{Local-Stochastic Quantization} compresses the all-to-all communication by stochastically quantizing each row of an expert's intermediate results in parallel, improving compression efficiency and reducing compression error; (\textit{ii}) \textit{Probabilistic Thresholding Sparsification} compresses the all-reduce communication by sampling large gradients with high probability, reducing computational complexity while maintaining convergence efficiency. Experiments on four typical MoE training tasks show that RCMoE achieves 5.9$\times$-8.1$\times$ total communication compression ratios and 1.3$\times$-10.1$\times$ training speedups compared with state-of-the-art compression techniques, while maintaining MoE training accuracy.
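The two compression primitives named above can be illustrated in isolation. The sketch below is an assumption-laden toy, not the paper's implementation: it shows (a) row-wise stochastic quantization, where each row is scaled independently (so rows can be processed in parallel) and the fractional part decides a random round-up, keeping the quantizer unbiased in expectation; and (b) probabilistic thresholding sparsification, where coordinate $i$ is kept with probability $p_i = \min(1, |g_i|/\tau)$ and rescaled by $1/p_i$ so the sparsified gradient remains unbiased. The function names, the per-row min/max scaling, and the specific keep-probability are illustrative choices, not taken from the paper.

```python
import numpy as np

def stochastic_quantize_rows(X, bits=4, rng=None):
    """Row-wise stochastic quantization to 2**bits - 1 levels.

    Each row uses its own [min, max] range, so rows are independent
    and can be quantized in parallel. Rounding up with probability
    equal to the fractional part makes the quantizer unbiased.
    """
    rng = rng or np.random.default_rng()
    levels = 2 ** bits - 1
    lo = X.min(axis=1, keepdims=True)
    hi = X.max(axis=1, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / levels, 1.0)
    t = (X - lo) / scale                      # real-valued level index in [0, levels]
    q = np.floor(t)
    q += rng.random(X.shape) < (t - q)        # round up with prob = fractional part
    return q.astype(np.uint8), lo, scale

def dequantize_rows(q, lo, scale):
    """Invert the per-row affine mapping used by the quantizer."""
    return q.astype(np.float64) * scale + lo

def prob_threshold_sparsify(g, threshold, rng=None):
    """Keep g[i] with probability min(1, |g[i]|/threshold).

    Survivors are rescaled by 1/p so the result is an unbiased
    estimate of g; large gradients (|g[i]| >= threshold) always pass.
    """
    rng = rng or np.random.default_rng()
    p = np.minimum(1.0, np.abs(g) / threshold)
    mask = rng.random(g.shape) < p
    out = np.zeros_like(g)
    out[mask] = g[mask] / p[mask]
    return out
```

Note the design point both primitives share: randomness is used to keep the compressor unbiased, so errors average out across steps instead of accumulating, which is the property the abstract attributes to random compression.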
