Mixture-of-Experts (MoE) architectures have recently become a more prevalent choice than dense architectures for large language models (LLMs) due to their superior performance. However, their billions of parameters impose substantial deployment and inference costs. To address this, knowledge distillation (KD) has become a widely adopted technique for compressing LLMs. Existing KD methods for LLMs fall into dense-to-dense and MoE-to-dense distillation. Dense-to-dense distillation transfers knowledge between dense LLMs, while MoE-to-dense distillation transfers knowledge from an MoE LLM to a dense LLM. However, the architectural mismatch prevents the student from fully absorbing the teacher's knowledge when distilling MoE LLMs. To address this limitation, we investigate a new distillation setting, MoE-to-MoE, which aims to fully leverage the teacher's expert knowledge and enable the student to absorb it more effectively. Compared to dense-to-dense and MoE-to-dense, MoE-to-MoE suffers from two imbalance issues. First, expert-coverage deficiency reflects imbalanced knowledge transfer from teacher experts: traditional distillation utilizes only the few experts activated by the teacher router. Second, routing imbalance arises when the student's routing distribution drifts from the teacher's, making it difficult for the student to learn how to distribute tokens across experts. To overcome these issues, we propose a novel MoE-to-MoE distillation framework, Balanced Distillation (B-Distill), which spreads teacher expertise evenly across student experts while regularizing the student router toward teacher-consistent balance. First, to mitigate expert-coverage deficiency, we introduce Monte Carlo exploration, which stochastically perturbs router probabilities so that every teacher and student expert is sampled without enlarging the search space.
Second, to correct routing imbalance and avert load collapse, we propose an entropy-aware router distillation mechanism that aligns the student router with the teacher while curbing over-concentration. Experiments on various datasets show that B-Distill outperforms baselines by up to 6.6\% in ROUGE-L. Our code is available at https://anonymous.4open.science/r/moedistill-D5FC/.
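To make the Monte Carlo exploration idea concrete, the following is a minimal sketch (not the paper's implementation; the function name `mc_explore_topk` and the use of Gumbel noise are our own illustrative assumptions): Gumbel noise is added to the router logits before the usual top-k selection, so each forward pass still activates only k experts, yet over repeated samples every expert has a nonzero chance of being selected.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def mc_explore_topk(router_logits, k=2, noise_scale=1.0, rng=None):
    """One stochastic exploration step (illustrative, not the paper's code):
    perturb the router logits with Gumbel noise, then take top-k.  The
    search space per pass is unchanged (exactly k experts fire), but
    sampling repeatedly eventually covers all experts."""
    rng = np.random.default_rng() if rng is None else rng
    perturbed = router_logits + noise_scale * rng.gumbel(size=router_logits.shape)
    selected = np.argsort(perturbed)[-k:]        # indices of the k chosen experts
    weights = softmax(router_logits)[selected]   # gate on the *unperturbed* probabilities
    return selected, weights / weights.sum()     # renormalised gating weights
```

With zero noise this reduces to ordinary top-k routing; increasing `noise_scale` trades fidelity to the router for broader expert coverage.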
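The entropy-aware router distillation objective could be sketched as follows (again an illustrative assumption, not the paper's exact loss; the name `entropy_aware_router_loss` and the weighting `beta` are hypothetical): a KL term pulls the student's routing distribution toward the teacher's, while an entropy bonus on the student distribution penalises over-concentration on a few experts.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def entropy_aware_router_loss(student_logits, teacher_logits, beta=0.1):
    """Illustrative sketch: KL(teacher || student) aligns the student router
    with the teacher; subtracting beta * H(student) rewards a less
    concentrated routing distribution, discouraging load collapse."""
    p_t, p_s = softmax(teacher_logits), softmax(student_logits)
    kl = float(np.sum(p_t * (np.log(p_t) - np.log(p_s))))
    student_entropy = float(-np.sum(p_s * np.log(p_s)))
    return kl - beta * student_entropy
```

When the student matches the teacher the KL term vanishes and only the entropy bonus remains, so among router distributions equally close to the teacher's, the less concentrated one scores a lower loss.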