Multimodal instruction following is a fundamental capability of multimodal language models: accurately comprehending and executing user-provided instructions. However, existing multimodal instruction-following datasets and benchmarks suffer from the following shortcomings. (a) Lack of Difficulty Stratification: they collect diverse instruction categories but neglect to stratify difficulty levels across these categories, which leads to overlap, bias, and low interpretability. (b) Lack of Fine-Grained Metrics: they conflate the model's ability to "solve tasks" and "follow constraints" into a single metric, which fails to accurately reflect its instruction-following capability. (c) Lack of Multi-Task Instructions: they overlook the fact that real-world user instructions often combine multiple tasks. This paper proposes MMIFEvol, a framework for multimodal instruction evolving and benchmarking. First, we define the essential components of a carefully curated multimodal instruction set and establish corresponding difficulty levels, based on which we synthesize diverse instruction data. Next, we decouple the evaluation criteria for instruction following into three distinct metrics to construct a high-quality benchmark and assess existing models. Experimental results demonstrate that current models still struggle to follow complex instructions, while fine-tuning on MMIFEvol data effectively improves models' ability to follow multimodal instructions.
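To make the distinction in shortcoming (b) concrete, the following is a minimal, hypothetical sketch of scoring "task solving" and "constraint following" separately rather than as a single conflated metric. It is not the paper's actual evaluation protocol; the names `Sample`, `task_score`, and `constraint_score`, and the exact-match and constraint-fraction scoring rules, are illustrative assumptions.

```python
# Hypothetical sketch: decoupling task-solving from constraint-following,
# instead of collapsing both into one score.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Sample:
    response: str                              # model output
    reference: str                             # gold answer for the underlying task
    constraints: List[Callable[[str], bool]]   # checkable instruction constraints


def task_score(sample: Sample) -> float:
    """1.0 if the response solves the task (here: exact match), else 0.0."""
    return float(sample.response.strip() == sample.reference.strip())


def constraint_score(sample: Sample) -> float:
    """Fraction of instruction constraints the response satisfies."""
    if not sample.constraints:
        return 1.0
    return sum(c(sample.response) for c in sample.constraints) / len(sample.constraints)


# Example: a response that answers the question correctly but ignores a
# formatting constraint ("answer in uppercase") scores 1.0 on task solving
# yet 0.0 on constraint following -- a distinction a single metric would hide.
s = Sample(
    response="A cat sitting on a mat",
    reference="A cat sitting on a mat",
    constraints=[lambda r: r.isupper()],
)
print(task_score(s), constraint_score(s))  # 1.0 0.0
```

Reporting these scores separately (rather than one blended number) is the kind of decoupling the abstract argues is needed to interpret a model's instruction-following ability.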