Singapore

In the rapidly evolving landscape of Multimodal Large Language Models (MLLMs), the safety concerns of their outputs have earned significant attention. Although numerous datasets have been proposed, they may become outdated with MLLM advancements and are susceptible to data contamination issues. To address these problems, we propose SDEval, the first safety dynamic evaluation framework to controllably adjust the distribution and complexity of safety benchmarks. Specifically, SDEval mainly adopts three dynamic strategies: text, image, and text-image dynamics to generate new samples from original benchmarks. We first explore the individual effects of text and image dynamics on model safety. Then, we find that injecting text dynamics into images can further impact safety, and conversely, injecting image dynamics into text also leads to safety risks. SDEval is general enough to be applied to various existing safety and even capability benchmarks. Experiments across safety benchmarks, MLLMGuard and VLSBench, and capability benchmarks, MMBench and MMVet, show that SDEval significantly influences safety evaluation, mitigates data contamination, and exposes safety limitations of MLLMs.

AAAI 2026

SDEval: Safety Dynamic Evaluation for Multimodal Large Language Models

mllm

safety

evaluation

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.<br><br>

To access this event page, you need to log in with the **email address you registered with**. <br>Access credentials will be sent to your email from Underline -  subject line "Welcome to AAAI 2026". Please be sure to check your spam email folder if you do not see an email confirmation right away.

Please log in

To access this event page, you are required to register.
Please complete your registration to continue.

We recommend reading [**the registration information**](https://aaai.org/conference/aaai/aaai-26/registration/) first.

**Online Registration Form**: https://aaai.getregistered.net/conference-2026 

Registration Required

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

Knowledge Graph (KG)-based Retrieval-Augmented Generation (RAG) shifts the contents of retrieval from narrative text to a relational knowledge network, empowering large language models (LLMs) to harness structured relationships between entities. However, conventional KG-RAG approaches are resource-intensive, requiring either query decomposition with multiple LLM rounds or parameterized static knowledge injection to update the model. Although subgraph reasoning aims to address these issues, most current methods are based on heuristic shortest path and multi-hop graph traversal algorithms. The retrieved subgraphs suffer from incompleteness and semantic drift, and neglect the interaction between subgraph and LLMs in terms of fine-grained structural semantics. We propose a dual-constraint subgraph optimization for KG-RAG (DCTR). It improves subgraph retrieval and generates high-quality subgraphs with structural integrity and information salience for LLMs. Specifically, it formulates subgraph generation as a two-stage graph-theoretic constrained optimization problem to create compact and complete pseudo-labels. Since these pseudo-labels are discrete, a smooth approximation is employed to convert them into a differentiable representation, thereby optimizing the retriever to highlight key information while extracting subgraphs. On two benchmark datasets, DCTR significantly enhances subgraph quality, achieving state-of-the-art performance in LLM reasoning.

DCTR: Dual-Constraint Subgraph Optimization for Knowledge Graph-based Retrieval-Augmented Generation

While Large Vision Language Models (LVLMs) are increasingly deployed in real-world applications, their ability to interpret abstract visual inputs remains limited. Specifically, they struggle to comprehend hand-drawn sketches, a modality that offers an intuitive means of expressing concepts that are difficult to describe textually. We identify the primary bottleneck as the absence of a large-scale dataset that jointly models sketches, photorealistic images, and corresponding natural language instructions. To address this, we present two key contributions: (1) a new, large-scale dataset of image-sketch-instruction triplets designed to facilitate both pretraining and instruction tuning, and (2) O3SLM, an LVLM trained on this dataset. Comprehensive evaluations on multiple sketch-based tasks: (a) object localization, (b) counting, (c) image retrieval i.e., (SBIR and fine-grained SBIR), and (d) visual question answering (VQA); while incorporating the three existing sketch datasets, namely QuickDraw!, Sketchy, and Tu-Berlin, along with our generated SketchVCL dataset, show that O3SLM achieves state-of-the-art performance, substantially outperforming existing LVLMs in sketch comprehension and reasoning.

O3SLM: Open Weight, Open Data, and Open Vocabulary Sketch-Language Model

Efficient reference structures are essential in video compression, enabling the exploitation of temporal dependencies across frames to reduce redundancy. In this paper, we delve into the inter-frame reference management mechanism in neural video codecs (NVCs). Previous schemes have inherited the reference propagation mechanism with the guidance of predefined reference structure, but the reference modeling across diverse reference sources remains underexplored. Moreover, the mismatch between the reference structure used for motion estimation and motion compensation limits the effectiveness of inter-frame prediction. To address the above limitations, we propose the unified reference hierarchy that integrates a learned hierarchical reference structure into the existing inherent reference propagation mechanism. Specifically, we first propose the hierarchical reference structure (HRS) to manage the multiple temporal contexts in the propagated reference feature, where a hierarchy-aware reference modulation module is integrated to select the most relevant reference features across different quality levels under the guidance of the reference balance loss. In addition, we propose the HRS-guided feature-wise inter-frame prediction that learns the low-rank approximation of the selected reference feature for ensuring the consistency and improving the inter-frame prediction performance. We conduct experiments on a state-of-the-art NVC, DCVC-DC. Experimental results show that our codec achieves an average 26\% bitrate saving over H.266/VVC, and a 28.2\% bitrate reduction compared to DCVC-DC without increasing the decoding complexity.

Neural Video Compression with Reference Hierarchy

3D human motion generation has seen a substantial rise in interest over the recent years, and while considerable progress has been made performance wise, many of the approaches in the state-of-the-art still struggle with complex and detailed generations unseen in the original data. This is commonly attributed to the scarcity of available motion datasets, and the prohibitive cost for generating more training examples. Motivated by this set of challenges, we introduce CoMA, A multimodal framework designed for complex human motion generation, editing and comprehension. CoMA employs multiple independent agents, powered by large language and vision models, as well as a mask transformer-based motion generator with body part specific encoders and codebooks for fine-grained, detailed generations. This recipe allows for generation of short and long motion sequences with detailed instructions, editing generations with user provided text instructions and also self-correcting output sequences for even better motions. We evaluate our method with the two most popular benchmark human motion datasets, using novel splits that separate them into basic and complex actions, and subsequently compare CoMA's performance with state-of-the-art methods.

CoMA: Compositional Human Motion Generation with Multi-modal Agents

3D Human Pose Estimation (3D HPE) is vital in various applications, from person re-identification and action recognition to virtual reality. However, the reliance on annotated 3D data collected in controlled environments poses challenges for generalization to diverse in-the-wild scenarios. Existing domain adaptation (DA) paradigms like general DA and source-free DA for 3D HPE overlook the issues of non-stationary target pose datasets. To address these challenges, we propose a novel task named lifelong domain adaptive 3D HPE. To our knowledge, we are the first to introduce the lifelong domain adaptation to the 3D HPE task. In this lifelong DA setting, the pose estimator is pretrained on the source domain and subsequently adapted to distinct target domains. Moreover, during adaptation to the current target domain, the pose estimator cannot access the source and all the previous target domains. The lifelong DA for 3D HPE involves overcoming challenges in adapting to current domain poses and preserving knowledge from previous domains, particularly combating catastrophic forgetting. We present an innovative Generative Adversarial Network (GAN) framework, which incorporates 3D pose generators, a 2D pose discriminator, and a 3D pose estimator. This framework effectively mitigates domain shifts and aligns original and augmented poses. Moreover, we construct a novel 3D pose generator paradigm, integrating pose-aware, temporal-aware, and domain-aware knowledge to enhance the current domain's adaptation and alleviate catastrophic forgetting on previous domains. Our method demonstrates superior performance through extensive experiments on diverse domain adaptive 3D HPE datasets.

Lifelong Domain Adaptive 3D Human Pose Estimation

Generating editable 3D CAD geometry from natural language remains an open challenge: existing text-to-CAD systems either output surface meshes or require scarce design-history supervision, making them unsuitable for real-world engineering workflows. We introduce NURBGen, the first framework for generating high-fidelity 3D CAD models directly from natural language descriptions using Non-Uniform Rational B-Splines (NURBS). To achieve this, we fine-tune a large language model (LLM) to translate free-form texts into JSON representations containing NURBS surface parameters (\textit{i.e}, control points, knot vectors, degrees, and rational weights) which can be directly converted into BRep format using Python. Furthermore, we introduce partABC, a curated subset of the ABC dataset consisting of individual CAD components, annotated with detailed captions using an automated annotation pipeline. We further propose a hybrid representation that combines untrimmed NURBS with analytic primitives to handle trimmed surfaces and degenerate regions more robustly, while reducing token complexity. NURBGen demonstrates strong performance on diverse prompts, surpassing prior methods in geometric fidelity and dimensional accuracy, as confirmed by expert evaluations. Code and dataset will be released publicly.

NURBGen: High-Fidelity Text-to-CAD Generation Through LLM-Driven NURBS Modeling

Text-driven multi-object image editing which aims to precisely modify multiple objects within an image based on text descriptions, has recently attracted considerable interest. Existing works primarily follow the localize-editing paradigm, focusing on independent object localization and editing while neglecting critical inter-object interactions. However, this work points out that the neglected attention entanglements in inter-object conflict regions, inherently hinder disentangled multi-object editing, leading to either inter-object editing leakage or intra-object editing constraints. We thereby propose a novel multi-layer disentangled editing framework LayerEdit, a training-free method which, for the first time, through precise object-layered decomposition and coherent fusion, enables conflict-free object-layered editing. Specifically, LayerEdit introduces a novel “decompose-editing-fusion” framework, consisting of: (1) Conflict-aware Layer Decomposition module, which utilizes an attention-aware IoU scheme and time-dependent region removing, to enhance conflict awareness and suppression for layer decomposition. (2) Object-layered Editing module, to establish coordinated intra-layer text guidance and cross-layer geometric mapping, achieving disentangled semantic and structural modifications. (3) Transparency-guided Layer Fusion module, to facilitate structure-coherent inter-object layer fusion through precise transparency guidance learning. Extensive experiments verify the superiority of LayerEdit over existing methods, showing unprecedented intra-object controllability and inter-object coherence in complex multi-object scenarios.

LayerEdit: Disentangled Multi-Object Editing via Conflict-Aware Multi-Layer Learning

Multimodal Large Language Models (MLLMs) are becoming integral to autonomous driving (AD) systems due to their strong vision-language reasoning capabilities. 
However, MLLMs are vulnerable to adversarial attacks—particularly adversarial patch attacks—which can pose serious threats in real-world scenarios. Existing patch-based attack methods are primarily designed for object detection models. Due to the more complex architectures and strong reasoning capabilities of MLLMs, these approaches perform poorly when transferred to MLLM-based systems.
To address these limitations, we propose PhysPatch, a physically realizable and transferable adversarial patch framework tailored for MLLM-based AD systems. PhysPatch jointly optimizes patch location, shape, and content to enhance attack effectiveness and real-world applicability. It introduces a semantic-based mask initialization strategy for realistic placement, an SVD-based local alignment loss with patch-guided crop-resize to improve transferability, and a potential field-based mask refinement method. Extensive experiments across open-source, commercial, and reasoning-capable MLLMs demonstrate that PhysPatch significantly outperforms state-of-the-art (SOTA) methods in steering MLLM-based AD systems toward target-aligned perception and planning outputs. Moreover, PhysPatch consistently places adversarial patches in physically feasible regions of AD scenes, ensuring strong real-world applicability and deployability.

PhysPatch: A Physically Realizable and Transferable Adversarial Patch Attack for Multimodal Large Language Models-based Autonomous Driving Systems

Since high-fidelity reference images are difficult to obtain in real underwater scenes, most deep models trained by synthetic paired data cannot match real-world data exactly. In this paper, we propose an unsupervised training framework for underwater image enhancement by leveraging an iterative training strategy and quantification of specific neural units. Specifically, to eliminate the heavy color cast and distortion in the underwater images, we decompose the unsupervised image enhancement as two targeted sub-tasks, namely colorization and color compensation. First, a diffusion model is introduced for colorization to correct the green and blue color casts. Then, to intensify the learning ability of balanced color information, we introduce an extra network branch and propose a quantification mechanism for color compensation. The extra branch encodes style information from normal images into the generative model, while the quantification mechanism identifies and adjusts neural units relevant to warm colors, improving the model’s ability to learn balanced color feature representations for robust generation. In the end, through iterative training, color cast and distortion are progressively reduced, leading to a gradual improvement in the quality of the generated images. Experimental results on various widely used underwater datasets demonstrate that our approach achieves competitive performance, even when compared to recent supervised methods.

Learning Underwater Image Enhancement Iteratively Without Reference Images

Large Language Models (LLMs) often produce fluent but factually incorrect responses, a phenomenon known as hallucination. Abstention, where the model chooses not to answer and instead outputs phrases such as "I don't know", is a common safeguard. However, existing abstention methods typically rely on post-generation signals, such as generation variations or feedback, which limits their ability to prevent unreliable responses in advance. In this paper, we introduce Aspect-Based Causal Abstention (ABCA), a new framework that enables early abstention by analysing the internal diversity of LLM knowledge through causal inference. This diversity reflects the multifaceted nature of parametric knowledge acquired from various sources, representing diverse aspects such as disciplines, legal contexts, or temporal frames. ABCA estimates causal effects conditioned on these aspects to assess the reliability of knowledge relevant to a given query. Based on these estimates, we enable two types of abstention: Type-1, where aspect effects are inconsistent (knowledge conflict), and Type-2, where aspect effects consistently support abstention (knowledge insufficiency). Experiments on standard benchmarks demonstrate that ABCA improves abstention reliability, achieves state-of-the-art performance, and enhances the interpretability of abstention decisions. The source code and technical appendix are provided in the Supplementary Material.

Downloads

Next from AAAI 2026

DCTR: Dual-Constraint Subgraph Optimization for Knowledge Graph-based Retrieval-Augmented Generation

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES

.css-70qvj9{display:-webkit-box;display:-webkit-flex;display:-ms-flexbox;display:flex;-webkit-align-items:center;-webkit-box-align:center;-ms-flex-align:center;align-items:center;}Downloads

Next from AAAI 2026

DCTR: Dual-Constraint Subgraph Optimization for Knowledge Graph-based Retrieval-Augmented Generation

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES

Downloads