Existing multimodal representation learning approaches often rely on simple feature concatenation or unified transformations, which fail to effectively disentangle and leverage common and private information across different modalities in a progressive manner. Moreover, they typically lack adaptive modeling tailored to specific task requirements. To address these limitations, we propose a Prototype-Induced Label Structuring for Disentangled Multimodal Representation Network (PLUM-Net). It first employs a multilevel semantic alignment module to synchronize global and local semantics across audio, visual and textual streams. On this aligned foundation, a prototype-based single-modal label generation module derives modality-specific hard and soft labels that subtly steer the network toward a cleaner split between shared and private cues. Guided by these labels, a task-conditioned feature bifurcator module channels information through the most beneficial common or private pathway for the given task, after which a private refinement module polishes and fuses each modality’s idiosyncratic signals. Extensive experiments show that PLUM-Net delivers strong performance on datasets such as CMU-MOSI, CMU-MOSEI and UR-FUNNY, achieving an ACC-2 of 90.3% on CMU-MOSI and 83.2% on UR-FUNNY.
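The abstract does not specify implementation details, but the two core ideas, prototype-based soft labels and task-conditioned routing between common and private pathways, can be illustrated with a minimal PyTorch sketch. Everything below is an assumption for illustration: the module names (`PrototypeSoftLabeler`, `FeatureBifurcator`), the cosine-similarity labeling, the softmax gate, and all dimensions are hypothetical, not the paper's actual architecture.

```python
# Minimal sketch of prototype-based soft labeling and task-conditioned
# common/private feature bifurcation. All names and design choices here
# are illustrative assumptions, not PLUM-Net's published implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PrototypeSoftLabeler(nn.Module):
    """Derives per-modality soft labels from cosine similarity between
    features and a set of learned class prototypes (assumed mechanism)."""

    def __init__(self, dim: int, num_classes: int, temperature: float = 0.1):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_classes, dim))
        self.temperature = temperature

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # Cosine similarity to each prototype -> temperature-scaled softmax
        # gives a soft label distribution; argmax would give the hard label.
        sim = F.normalize(feat, dim=-1) @ F.normalize(self.prototypes, dim=-1).T
        return F.softmax(sim / self.temperature, dim=-1)


class FeatureBifurcator(nn.Module):
    """Splits a modality feature into common and private projections, then
    weights each pathway with a task-conditioned gate (assumed mechanism)."""

    def __init__(self, dim: int, task_dim: int):
        super().__init__()
        self.common_proj = nn.Linear(dim, dim)
        self.private_proj = nn.Linear(dim, dim)
        # Gate maps a task embedding to two weights summing to one:
        # how much to route through the common vs. private pathway.
        self.gate = nn.Sequential(nn.Linear(task_dim, 2), nn.Softmax(dim=-1))

    def forward(self, feat: torch.Tensor, task_emb: torch.Tensor):
        common = self.common_proj(feat)
        private = self.private_proj(feat)
        w = self.gate(task_emb)  # (batch, 2)
        return w[..., :1] * common, w[..., 1:] * private


# Usage: one bifurcator per modality; the soft labels could supervise
# the split. Dimensions below are arbitrary for the sake of the demo.
dim, task_dim, num_classes, batch = 128, 16, 2, 4
audio_feat = torch.randn(batch, dim)     # aligned audio features (assumed)
task_emb = torch.randn(batch, task_dim)  # hypothetical task embedding
labeler = PrototypeSoftLabeler(dim, num_classes)
bifurcator = FeatureBifurcator(dim, task_dim)
common, private = bifurcator(audio_feat, task_emb)
soft_labels = labeler(audio_feat)        # (batch, num_classes)
```

In this reading, the gate makes the common/private trade-off explicitly task-dependent, matching the abstract's claim of adaptive modeling per task; how PLUM-Net actually conditions the bifurcator is not stated in the source.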
