AAAI 2026

January 22, 2026

Singapore, Singapore

To accelerate Mixture-of-Experts (MoE) inference, the hybrid parallelism paradigm first applies pipeline parallelism (PP) to divide the model vertically into stages, then divides each stage horizontally using tensor or expert parallelism. On the algorithm side, dynamic Top-K routing reduces computation by activating fewer experts per token on average. In this paper, we explore applying dynamic Top-K routing to PP-enabled MoE inference, aiming to fully unleash their combined potential. We identify key performance bottlenecks arising from Top-K value variation across layers, which conflicts with PP's typically uniform stage partitioning, as well as opportunities to optimize memory usage through their integration.
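To make the routing idea concrete, here is a minimal sketch of one common dynamic Top-K variant, in which the number of activated experts varies per token: experts are taken in descending router-probability order until a cumulative-confidence threshold is met, capped at `k_max`. The threshold rule, function name, and parameters are illustrative assumptions, not the paper's exact formulation.

```python
import math

def dynamic_top_k_route(router_logits, k_max=4, threshold=0.9):
    """For each token, select experts in descending probability order until
    their cumulative probability reaches `threshold`, capped at `k_max`.
    Confident tokens thus activate fewer experts than uncertain ones.
    (Illustrative threshold-based variant, not SMIDT's exact router.)"""
    selections = []
    for logits in router_logits:  # one list of per-expert logits per token
        m = max(logits)
        exps = [math.exp(x - m) for x in logits]  # numerically stable softmax
        z = sum(exps)
        probs = [e / z for e in exps]
        order = sorted(range(len(probs)), key=lambda i: -probs[i])
        chosen, mass = [], 0.0
        for e in order[:k_max]:
            chosen.append(e)
            mass += probs[e]
            if mass >= threshold:
                break
        selections.append(chosen)
    return selections
```

Because the activated-expert count differs per token (and, per the paper, the effective Top-K differs across layers), per-layer compute is no longer uniform, which is exactly what breaks even PP stage partitioning.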

To address these challenges, we present SMIDT, an efficient MoE inference framework tailored for dynamic Top-K routing. SMIDT features: (1) an adaptive, module-level uneven partitioning strategy to balance computation across PP stages, (2) a memory-aware expert replication scheme (DPMoE) that reduces communication overhead, and (3) a lightweight search algorithm combining binary search and dynamic programming to generate efficient parallelism plans. We implement SMIDT on SGLang, a state-of-the-art LLM inference framework, evaluate it on 32 A40 GPUs and 16 A100 GPUs, and compare with manually tuned parallelism strategies. Experimental results show that, when co-locating prefill and decoding phases, SMIDT achieves 1.20–3.13× throughput improvements for prefill-only tasks and 1.05–1.89× for prefill-decoding tasks. When disaggregating prefill and decoding tasks, SMIDT improves average and P99 time-to-first-token (TTFT) by 1.10–1.17× and 1.21–1.26×, respectively.
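One plausible shape for a binary-search-plus-DP plan search like the one described above: binary-search a bound on the heaviest stage's cost, and use a dynamic program to check whether the module sequence can be split into the given number of contiguous stages under that bound. This is a generic sketch of the technique, not SMIDT's actual planner; module costs and stage counts are assumed inputs.

```python
def partition_stages(module_costs, num_stages):
    """Split per-module integer costs into `num_stages` contiguous stages,
    minimizing the cost of the heaviest stage. The outer binary search
    tightens the bound; the inner DP checks feasibility.
    (Illustrative sketch, not SMIDT's exact search algorithm.)"""
    def feasible(bound):
        n = len(module_costs)
        INF = float("inf")
        # dp[i] = min number of stages covering the first i modules,
        # with every stage's total cost <= bound
        dp = [0] + [INF] * n
        for i in range(1, n + 1):
            total = 0
            for j in range(i, 0, -1):  # try stage = modules j..i
                total += module_costs[j - 1]
                if total > bound:
                    break
                dp[i] = min(dp[i], dp[j - 1] + 1)
        return dp[n] <= num_stages

    lo, hi = max(module_costs), sum(module_costs)
    while lo < hi:
        mid = (lo + hi) // 2
        if feasible(mid):
            hi = mid  # bound achievable: try tighter
        else:
            lo = mid + 1  # bound too tight: relax
    return lo
```

In a real planner, the per-module costs would themselves depend on the layer-wise Top-K distribution, which is why an uneven, module-level partition can outperform a uniform per-layer split.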
