Multimodal sarcasm detection (MSD) aims to identify sarcasm polarity through diverse modalities (i.e., image-text pairs) and has attracted increasing attention. Despite significant advances, existing approaches still face two major issues: a lack of explainability and weak generalizability. In this paper, we introduce a new large vision-language model (LVLM), dubbed S³-MSD, for explainable and generalizable MSD through three key components. For explainability, we develop (1) a self-training paradigm that automatically bootstraps answers with explanations, and (2) a self-calibrating mechanism that rectifies flawed explanations. For generalizability, we design (3) a self-focusing module that amplifies visual semantic entities through preference optimization, mitigating over-reliance on text. Experimental results on both in-distribution and out-of-distribution (OOD) benchmarks demonstrate that S³-MSD consistently outperforms state-of-the-art methods in detection performance. Furthermore, S³-MSD provides persuasive explanations, as validated by quantitative and human evaluations.
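The abstract names preference optimization as the mechanism behind the self-focusing module but gives no further detail. Purely as an illustrative sketch, the snippet below implements a standard DPO-style preference loss (Rafailov et al., 2023), one common instantiation of preference optimization; the function name, the beta value, and the pairing of visually grounded ("chosen") versus text-only ("rejected") responses are assumptions for illustration, not the authors' design.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO preference loss (illustrative sketch only).

    Each argument is a batch of summed token log-probabilities for a
    response, under the trainable policy or a frozen reference model.
    Hypothetically, "chosen" responses ground the sarcasm judgment in
    visual entities, while "rejected" ones rely on the text alone.
    """
    # Log-ratio of chosen vs. rejected under the policy and the reference.
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    # Reward margin implied by the two models; beta scales its sharpness.
    logits = beta * (pi_logratios - ref_logratios)
    # Maximize the probability that chosen is preferred over rejected.
    return -F.logsigmoid(logits).mean()

# Minimal usage demo with random log-probabilities for a batch of 4 pairs.
if __name__ == "__main__":
    lp = [torch.randn(4) for _ in range(4)]
    print(dpo_loss(*lp).item())
```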