Singapore

Satire, a form of artistic expression combining humor with implicit critique, holds significant social value by illuminating societal issues. Despite its cultural and societal significance, satire comprehension, particularly in purely visual forms, remains a challenging task for current vision-language models. This task requires not only detecting satire but also deciphering its nuanced meaning and identifying the implicated entities. Existing models often fail to effectively integrate local entity relationships with global context, leading to misinterpretation, comprehension biases, and hallucinations. To address these limitations, we propose SatireDecoder, a training-free framework designed to enhance satirical image comprehension. Our approach employs a multi-agent system that performs visual cascaded decoupling, decomposing images into fine-grained local and global semantic representations. In addition, we introduce a Chain-of-Thought reasoning strategy guided by uncertainty analysis, which breaks down the complex satire comprehension process into sequential subtasks with minimized uncertainty. Our method significantly improves interpretive accuracy while reducing hallucinations. Experimental results validate that SatireDecoder outperforms existing baselines in comprehending visual satire, offering a promising direction for vision-language reasoning in nuanced, high-level semantic tasks.

AAAI 2026

SatireDecoder: Visual Cascaded Decoupling for Enhancing Satirical Image Comprehension

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.

Training large language models (LLMs) with synthetic reasoning data has become a popular approach to enhancing their reasoning capabilities, and a key factor in the effectiveness of this paradigm is the quality of the generated multi-step reasoning data. To generate high-quality reasoning data, many recent methods generate synthetic reasoning paths and filter them based on final answer correctness, often overlooking flaws in intermediate reasoning steps. To enhance the verification of intermediate reasoning steps, prior work primarily resorts to code execution or symbolic reasoning engines. However, code-based validation is restricted to code or mathematical tasks, and reasoning engines require a well-structured and complete context. As a result, existing methods fail to function effectively in natural language reasoning tasks that involve ambiguous or incomplete contexts. In these tasks, synthetic data still lack reliable checks for verifying each reasoning step. To address this challenge, we introduce ORACLE, a structured data generation framework inspired by syllogistic reasoning. ORACLE integrates the generative strengths of LLMs with symbolic supervision: the LLM produces step-wise reasoning contexts, while a symbolic reasoning engine verifies the validity of each intermediate step. By employing a unified prompting template to elicit modular reasoning chains, ORACLE enables fine-grained, step-level validation, facilitating the construction of high-quality multi-step reasoning data. Across six logical, factual, and commonsense reasoning benchmarks, ORACLE consistently outperforms strong baselines on multiple models.

ORACLE: Optimizing Reasoning Abilities of Large Language Models via Constraint-Led Synthetic Data Elicitation
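The step-level validation idea in the ORACLE abstract can be illustrated with a toy checker. This is only a sketch under loud assumptions: the tuple-based data format, the `verify_chain` function, and the single "all A are B" inference rule are hypothetical stand-ins, not the symbolic reasoning engine or prompting template the authors use.

```python
# Toy step-level verifier for syllogistic chains (hypothetical format, not ORACLE's).
# A rule ("bird", "animal") reads "all birds are animals".
# A fact ("tweety", "bird") reads "tweety is a bird".
# A step is (rule, minor_premise, conclusion).

def verify_chain(rules, facts, steps):
    """Return the index of the first invalid step, or None if every step is valid."""
    known = set(facts)
    for index, (rule, minor, conclusion) in enumerate(steps):
        category_from, category_to = rule
        entity, category = minor
        valid = (
            rule in rules                      # the major premise is given
            and minor in known                 # the minor premise is given or already derived
            and category == category_from      # the premises share the middle term
            and conclusion == (entity, category_to)
        )
        if not valid:
            return index
        known.add(conclusion)                  # later steps may build on this conclusion
    return None


rules = {("bird", "animal"), ("animal", "mortal")}
facts = {("tweety", "bird")}

good_chain = [
    (("bird", "animal"), ("tweety", "bird"), ("tweety", "animal")),
    (("animal", "mortal"), ("tweety", "animal"), ("tweety", "mortal")),
]
bad_chain = [
    (("bird", "animal"), ("tweety", "bird"), ("tweety", "animal")),
    (("animal", "mortal"), ("tweety", "fish"), ("tweety", "mortal")),
]
```

Here `bad_chain` reaches the same final answer as `good_chain`, so filtering on final-answer correctness alone would accept it, while the step-level check rejects its second step.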

Multimodal Large Language Models (MLLMs) have achieved remarkable performance across vision-language tasks. Recent advancements allow these models to process multiple images as inputs. However, the vulnerabilities of multi-image MLLMs remain unexplored. Existing adversarial attacks focus on single-image settings and often assume a white-box threat model, which is impractical in many real-world scenarios. This paper introduces LAMP, a black-box method for learning universal adversarial perturbations (UAPs) targeting multi-image MLLMs. LAMP applies an attention-based constraint that prevents the model from effectively aggregating information across images. LAMP also introduces a novel cross-image contagious constraint that forces perturbed tokens to influence clean tokens, spreading adversarial effects without requiring all inputs to be modified. Additionally, an index-attention suppression loss yields a robust, position-invariant attack. Experimental results show that LAMP outperforms state-of-the-art baselines and achieves the highest attack success rates across multiple vision-language tasks.

LAMP: Learning Universal Adversarial Perturbations for Multi-Image Tasks via Pre-trained Models

Large language models have demonstrated remarkable capabilities across many tasks, yet face significant challenges when dealing with recursive reasoning problems—those requiring the resolution of nested hierarchical structures. While prior research has extensively studied length generalization (a model’s ability to handle longer sequences than seen during training), we investigate a distinct and underexplored limitation: depth generalization. Here, depth refers to the number of nested levels in a hierarchical problem, such as the layers of parentheses in a mathematical expression or the nesting of logical clauses in a Boolean formula.

Our work reveals that standard transformer architectures struggle with problems involving deeper recursion than encountered during training, even when they perform well on longer but non-nested sequences. This limitation stems from their inability to maintain stack-like behavior—the capacity to track and resolve multiple levels of nested dependencies. Through systematic analysis, we demonstrate how this architectural constraint leads to rapid performance decay as the depth of recursion increases.

To address this challenge, we develop a novel looped locate-and-replace pipeline that decomposes recursive problems into manageable subcomponents. The approach employs two specialized models: a locator that identifies solvable subexpressions and a replacer that evaluates these components while preserving the overall structure. We evaluate this method in three carefully designed domains—Boolean algebra, recursive arithmetic, and propositional logic—each with a controllable depth of recursion. Our results show that the proposed method effectively alleviates performance decay when tested on out-of-distribution recursion depth.

Exploring Depth Generalization in Large Language Models for Solving Recursive Logic Tasks
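The looped locate-and-replace idea described in this abstract can be sketched on Boolean expressions. In the paper the locator and replacer are learned models; in this minimal sketch they are replaced by a regular expression and a hand-written evaluator, and the token vocabulary (`T`, `F`, `not`, `and`, `or`) and all function names are assumptions made for illustration.

```python
import re

# "Locator": an innermost parenthesised group, i.e. one with no parentheses inside.
INNERMOST = re.compile(r"\(([^()]*)\)")


def split_on(tokens, separator):
    """Split a token list into groups at each occurrence of `separator`."""
    groups, current = [], []
    for token in tokens:
        if token == separator:
            groups.append(current)
            current = []
        else:
            current.append(token)
    groups.append(current)
    return groups


def evaluate_flat(expression):
    """"Replacer": evaluate a parenthesis-free expression to 'T' or 'F'.

    Precedence is not > and > or.
    """
    def literal(tokens):
        *negations, atom = tokens
        value = atom == "T"
        return value if len(negations) % 2 == 0 else not value

    result = any(
        all(literal(factor) for factor in split_on(term, "and"))
        for term in split_on(expression.split(), "or")
    )
    return "T" if result else "F"


def solve(expression):
    """Loop: locate one innermost subexpression, replace it by its value, repeat."""
    steps = 0
    while True:
        match = INNERMOST.search(expression)
        if match is None:
            return evaluate_flat(expression), steps
        value = evaluate_flat(match.group(1))
        expression = expression[:match.start()] + value + expression[match.end():]
        steps += 1


print(solve("not ((T and F) or (not (F or F)))"))  # prints ('F', 4)
```

Each pass only ever handles a flat subexpression, so the nesting depth of the input affects the number of loop iterations rather than the difficulty of any single step.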

Video Anomaly Understanding (VAU) is a novel task focused on describing unusual occurrences in videos. Despite growing interest in the task, the evaluation of VAU remains an open challenge. Existing benchmarks rely on n-gram-based metrics (e.g., BLEU, ROUGE-L) or LLM-based evaluation. The former fails to capture the rich, free-form, and visually grounded nature of large vision-language model (LVLM) responses, while the latter focuses on assessing language quality over factual relevance, often leading to subjective judgments misaligned with human perception. In this work, we address this issue by proposing FineVAU, a new benchmark for VAU that shifts the focus towards rich, fine-grained, and domain-specific understanding of anomalous videos. We formulate VAU as a three-fold problem, with the goal of comprehensively understanding key descriptive elements of anomalies in video: events (What), participating entities (Who), and location (Where). Our benchmark introduces a) FV-Score, a novel, human-aligned evaluation metric that assesses the presence of critical visual elements in LVLM answers, providing interpretable, fine-grained feedback; and b) FineW³, a novel, comprehensive dataset curated through a structured and fully automatic procedure that augments existing human annotations with high-quality, fine-grained visual information. Human evaluation reveals that our proposed metric aligns better with human perception of anomalies than current approaches. Detailed experiments on FineVAU unveil critical limitations in LVLMs' ability to perceive anomalous events that require spatial and fine-grained temporal understanding, despite strong performance on coarse-grained, static information and on events that typically comprise strong visual cues.

FineVAU: A Novel Human-Aligned Benchmark for Fine-Grained Video Anomaly Understanding

Large Language Models (LLMs) and Vision-Language Models (VLMs) remain highly vulnerable to adversarial attacks despite widespread adoption. Existing defenses typically require retraining, rely on heuristics, or fail under adaptive and out-of-distribution (OOD) conditions. We introduce EigenShield, a principled, inference-time, architecture-agnostic defense that leverages Random Matrix Theory (RMT) to suppress adversarial noise in high-dimensional embeddings. EigenShield uses spiked covariance modeling and a Robustness-based Nonconformity Score (RbNS) with quantile thresholding to isolate and preserve causal eigenvectors, filtering out adversarial components without model access or adversarial training. We develop a theoretical framework establishing conditions for asymptotic noise suppression and demonstrate effectiveness in both unimodal and multimodal settings. Empirically, EigenShield consistently improves robustness across threat models, reducing attack success rates (ASR) by up to 48% over state-of-the-art defenses, including adversarial training, UNIGUARD, CIDER, and input transformations. On jailbreak attacks, EigenShield lowers LLM ASR by up to 92.9% relative to undefended models. Under multimodal adversarial attacks, it reduces VLM ASR by up to 76.5%. Against adaptive attacks on LLMs, it achieves ASR reductions of up to 77.7%. In OOD settings, EigenShield maintains strong performance, reducing ASR by up to 88.4% for LLMs and 80.4% for VLMs. Warning: This paper contains data, prompts, and model outputs that may be offensive.

EigenShield: Inference-Time, Model-Agnostic Jailbreaking Defense via Causal Subspace Filtering
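The spiked covariance modeling mentioned in the EigenShield abstract rests on a standard random matrix theory fact: for pure noise with variance σ², the eigenvalues of a sample covariance matrix built from n samples in p dimensions concentrate below the Marchenko-Pastur upper edge σ²(1 + √(p/n))², so eigenvalues well above that edge indicate structure. The sketch below uses that edge as a threshold. Note that this is a swapped-in, simpler criterion for illustration: the abstract says EigenShield selects eigenvectors with its RbNS score and quantile thresholding, which is not reproduced here, and the function names are hypothetical.

```python
import math


def mp_upper_edge(n_samples, dim, noise_var=1.0):
    """Marchenko-Pastur upper bulk edge for a sample covariance of pure noise."""
    gamma = dim / n_samples  # aspect ratio p / n
    return noise_var * (1.0 + math.sqrt(gamma)) ** 2


def spiked_indices(eigenvalues, n_samples, dim, noise_var=1.0):
    """Indices of eigenvalues that stick out above the noise bulk.

    Illustrative stand-in only; not EigenShield's RbNS-based selection.
    """
    edge = mp_upper_edge(n_samples, dim, noise_var)
    return [i for i, value in enumerate(eigenvalues) if value > edge]


# With 400 samples in 100 dimensions and unit noise variance the edge is 2.25.
print(mp_upper_edge(400, 100))
print(spiked_indices([9.0, 4.1, 2.2, 1.0, 0.3], 400, 100))
```

A filter built on this idea would project embeddings onto the eigenvectors at the returned indices and discard the remaining directions as noise.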

Medical images exhibit inherent community structures, such as organs, tissues, and pathological regions, that standard Vision Transformers (ViTs) fail to exploit.
While recent work such as SBM-Transformer attempts to incorporate such structures through stochastic binary masking, it suffers from non-differentiability, training instability, and an inability to model complex community structure.
We present DCMM-Transformer, a novel Vision Transformer architecture for medical image analysis that incorporates a Degree-Corrected Mixed-Membership (DCMM) model as an additive bias in self-attention. Unlike prior approaches that rely on multiplicative masking and binary sampling, our method introduces community structure and degree heterogeneity in a fully differentiable and interpretable manner. 
Comprehensive experiments across diverse medical imaging datasets, including brain, chest, breast, and ocular modalities, demonstrate the superior performance and generalizability of the proposed approach. Furthermore, the learned group structure and structured attention modulation substantially enhance interpretability by yielding attention maps that are anatomically meaningful and semantically coherent.

DCMM-Transformer: Degree-Corrected Mixed-Membership Attention for Medical Imaging
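To make "DCMM as an additive bias in self-attention" concrete, here is a minimal pure-Python sketch. In the standard DCMM formulation, the connectivity between nodes i and j is θ_i θ_j π_iᵀ P π_j, where θ is a degree parameter, π a mixed-membership vector, and P a community affinity matrix. How the paper turns this quantity into an attention bias is not stated in the abstract; adding its logarithm to the attention logits is an assumption made here, as are all function names.

```python
import math


def dcmm_connectivity(theta, pi, P):
    """Omega[i][j] = theta_i * theta_j * pi_i^T P pi_j (standard DCMM form)."""
    n, k = len(theta), len(P)
    return [
        [
            theta[i] * theta[j]
            * sum(pi[i][a] * P[a][b] * pi[j][b] for a in range(k) for b in range(k))
            for j in range(n)
        ]
        for i in range(n)
    ]


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def biased_attention(scores, omega, eps=1e-9):
    """Add log(Omega) to the attention logits before the softmax (assumed form).

    The bias is additive and smooth in theta, pi, and P, in contrast to a
    sampled binary mask that multiplies the attention weights.
    """
    return [
        softmax([s + math.log(o + eps) for s, o in zip(score_row, omega_row)])
        for score_row, omega_row in zip(scores, omega)
    ]


# Four tokens, two communities: tokens 0-1 in the first, tokens 2-3 in the second.
theta = [1.0, 1.0, 1.0, 1.0]
pi = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
P = [[1.0, 0.1], [0.1, 1.0]]
scores = [[0.0] * 4 for _ in range(4)]  # uniform content scores, so only the bias matters

attention = biased_attention(scores, dcmm_connectivity(theta, pi, P))
```

With uniform content scores, token 0 attends roughly ten times more strongly to its own community than to the other one, which mirrors the 1.0 versus 0.1 entries of `P`.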

Modern preference alignment methods, such as DPO, rely on divergence regularization to a reference model for training stability, but this creates a fundamental problem we call "reference mismatch." In this paper, we investigate the negative impacts of reference mismatch in aligning text-to-image (T2I) diffusion models, showing that larger reference mismatch hinders effective adaptation given the same amount of data, e.g., when learning new artistic styles or personalizing to specific objects. We demonstrate this phenomenon across T2I diffusion models and introduce margin-aware preference optimization (MaPO), a reference-agnostic approach that breaks free from this constraint. By directly optimizing the likelihood margin between preferred and dispreferred outputs under the Bradley-Terry model without anchoring to a reference, MaPO transforms diverse T2I tasks into unified pairwise preference optimization. We validate MaPO's versatility across five challenging domains: (1) safe generation, (2) style adaptation, (3) cultural representation, (4) personalization, and (5) general preference alignment. Our results reveal that MaPO's advantage grows dramatically with reference mismatch severity, outperforming both DPO and specialized methods like DreamBooth while reducing training time by 15%. MaPO thus emerges as a versatile and memory-efficient method for generic T2I adaptation tasks.
Warning: This paper contains examples of harmful content, including explicit text and images.

Margin-Aware Preference Optimization for Aligning Diffusion Models Without Reference
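The difference between a reference-anchored and a reference-free preference loss can be shown in a few lines. Under the Bradley-Terry model, the probability that output w is preferred over output l is σ(s_w − s_l) for scores s. The sketch below contrasts the well-known DPO loss, whose scores are log-likelihood ratios against a reference model, with a plain margin loss on the model's own log-likelihoods. The margin loss is only the reference-free margin term suggested by the abstract, not the paper's full objective, and the scalar log-likelihood inputs are an assumption: for diffusion models they would have to be estimated rather than computed exactly.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=1.0):
    """DPO: Bradley-Terry loss on log-likelihood ratios against a reference model."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -log_sigmoid(beta * margin)


def margin_loss(logp_w, logp_l, beta=1.0):
    """Reference-free Bradley-Terry loss on the raw likelihood margin (sketch only)."""
    return -log_sigmoid(beta * (logp_w - logp_l))


# The reference-free loss depends only on the model's own margin.
print(margin_loss(-1.0, -3.0))
# The DPO loss for the same model depends on where the reference sits.
print(dpo_loss(-1.0, -3.0, ref_logp_w=-1.0, ref_logp_l=-3.0))
```

In the second call the model matches the reference exactly, so the DPO margin is zero and the loss equals log 2, even though the model already prefers the chosen output; the reference-free loss reflects that preference directly.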

The latest advancements in scene relighting have been predominantly driven by inverse rendering with 3D Gaussian Splatting (3DGS). However, existing methods remain overly reliant on precise camera parameters under static illumination conditions, which are prohibitively expensive or even impractical to obtain in real-world scenarios. In this paper, we propose UV-RGS, a novel method for learning relightable 3D Gaussian Splatting from Unposed views under Varied illuminations, which addresses this challenge by jointly optimizing camera poses, 3DGS representations, surface materials, and environment illuminations (i.e., unknown and varied lighting conditions in training) using only unposed views under varied lighting. First, UV-RGS presents a viewpoint dividing strategy that groups inputs into constituent units, so that the views within each unit share similar poses and illuminations. Next, to obtain a constituent model for each unit, UV-RGS establishes an incremental pose learning module that estimates coarse camera parameters, which are further refined using proxy views to alleviate the difficulty of sparse-view learning. Additionally, across all constituent unit models, we introduce a holistic model learning strategy that integrates a progressive unit aggregation component with joint optimization of the 3DGS and camera poses, achieving high-fidelity scene perception through physically based rendering. Extensive experiments on challenging real-world and synthetic datasets demonstrate the effectiveness of UV-RGS, which achieves state-of-the-art performance for scene inverse rendering by learning 3DGS from only unposed views under varied illuminations.

UV-RGS: Relightable 3D Gaussian Splatting from Unposed Views Under Varied Illuminations

Multiplex heterogeneous networks are common in real-world scenarios, where entities interact through diverse types of relations across multiple semantic layers. Recent advances in multiplex heterogeneous graph neural networks have achieved remarkable results by incorporating node and relation types into message passing and designing relation-aware architectures. However, most existing methods either decouple relations and risk losing complex semantics or require handcrafted relation patterns, which limit scalability. Moreover, prevailing models are typically restricted to Euclidean space, making it difficult to capture non-Euclidean topologies and to distinguish complex interactions among heterogeneous nodes and relations. Standard GNN message passing, grounded in the homophily assumption, also proves inadequate for the intricate, coupled structures in multiplex heterogeneous graphs. To address these challenges, we propose MRiemGNN, a novel multiplex heterogeneous graph neural network that synergizes Euclidean and Riemannian spaces through a geometry-aware, relation-specific message passing scheme and cross-space mutual learning. Experiments on multiple real-world datasets show that MRiemGNN achieves superior performance, efficiency, and scalability on both node classification and link prediction tasks. The code of our model is provided in the appendix.

Multiplex Heterogeneous Graph Neural Networks with Euclidean-Riemannian Mutual Space Synergy

Large language model (LLM) agents have demonstrated strong capabilities across diverse domains, yet automated agent design remains a significant challenge. Current automated agent design approaches are often constrained by limited search spaces that primarily optimize workflows but fail to integrate crucial human-designed components like memory, planning, and tool use. Furthermore, these methods are hampered by high evaluation costs, as evaluating even a single new agent on a benchmark can require tens of dollars. The difficulty of this exploration is further exacerbated by inefficient search strategies that struggle to navigate the large design space effectively, making the discovery of novel agents a slow and resource-intensive process. To address these challenges, we propose AgentSwift, a novel framework for automated agent design. We formalize a hierarchical search space that jointly models agentic workflow and composable functional components. This structure moves beyond optimizing workflows alone by co-optimizing functional components, which enables the discovery of more complex and effective agent architectures. To make exploration within this expansive space feasible, we mitigate high evaluation costs by training a value model on a high-quality dataset, generated via a novel strategy combining combinatorial coverage and balanced Bayesian sampling for low-cost evaluation. Guiding the entire process is a hierarchical Monte Carlo Tree Search (MCTS) strategy, which is informed by uncertainty to efficiently navigate the search space. Evaluated across a comprehensive set of seven benchmarks spanning embodied, math, web, tool, and game domains, AgentSwift discovers agents that achieve an average performance gain of 8.34% over both existing automated agent search methods and manually designed agents. Moreover, our framework exhibits steeper and more stable search trajectories.
By enabling the efficient, automated composition of workflow with functional components, AgentSwift provides a scalable methodology to explore complex agent designs. Our framework serves as a launchpad for researchers to rapidly prototype and discover powerful agent architectures without the impediment of prohibitive evaluation costs.
