Recent advances in speech large language models (Speech LLMs) have led to significant progress in speech understanding tasks such as Automatic Speech Recognition (ASR) and Speech Emotion Recognition (SER). However, whether these models can achieve human-level auditory perception, particularly the ability to comprehend latent intentions and implicit emotions in real-world spoken language, remains underexplored. To this end, we introduce Human-level Perception in Spoken Speech Understanding (HPSU), a pioneering benchmark for systematically evaluating the human-level perceptual and understanding capabilities of Speech LLMs. HPSU comprises 20k expert-validated English and Chinese spoken language understanding instances. It establishes a comprehensive evaluation framework encompassing a spectrum of tasks, ranging from fundamental speaker attribute recognition to complex inference of latent intentions and implicit emotions. To address the challenges of data scarcity in real-world scenarios and the difficulty of fine-grained annotation, we developed an annotation pipeline that emulates human multimodal cognitive mechanisms. The pipeline fuses audio, textual, and visual information to enable precise speech understanding and labeling, significantly enhancing both annotation efficiency and quality. Our systematic evaluation of various open-source and proprietary Speech LLMs demonstrates that even the top-performing models still fall considerably short of human capabilities in understanding genuine spoken interactions. Consequently, HPSU will be instrumental in guiding the development of Speech LLMs toward human-level perception and cognition.