Singapore

While Large Vision Language Models (LVLMs) are increasingly deployed in real-world applications, their ability to interpret abstract visual inputs remains limited. Specifically, they struggle to comprehend hand-drawn sketches, a modality that offers an intuitive means of expressing concepts that are difficult to describe textually. We identify the primary bottleneck as the absence of a large-scale dataset that jointly models sketches, photorealistic images, and corresponding natural language instructions. To address this, we present two key contributions: (1) a new, large-scale dataset of image-sketch-instruction triplets designed to facilitate both pretraining and instruction tuning, and (2) O3SLM, an LVLM trained on this dataset. Comprehensive evaluations on multiple sketch-based tasks, namely (a) object localization, (b) counting, (c) image retrieval (SBIR and fine-grained SBIR), and (d) visual question answering (VQA), using three existing sketch datasets (QuickDraw!, Sketchy, and TU-Berlin) along with our generated SketchVCL dataset, show that O3SLM achieves state-of-the-art performance, substantially outperforming existing LVLMs in sketch comprehension and reasoning.

AAAI 2026

O3SLM: Open Weight, Open Data, and Open Vocabulary Sketch-Language Model

cv: large vision models

cv: multi-modal vision

cv: language and vision

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held at Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.

Efficient reference structures are essential in video compression, enabling the exploitation of temporal dependencies across frames to reduce redundancy. In this paper, we delve into the inter-frame reference management mechanism in neural video codecs (NVCs). Previous schemes have inherited the reference propagation mechanism under the guidance of a predefined reference structure, but reference modeling across diverse reference sources remains underexplored. Moreover, the mismatch between the reference structures used for motion estimation and motion compensation limits the effectiveness of inter-frame prediction. To address these limitations, we propose a unified reference hierarchy that integrates a learned hierarchical reference structure into the existing inherent reference propagation mechanism. Specifically, we first propose the hierarchical reference structure (HRS) to manage the multiple temporal contexts in the propagated reference feature, where a hierarchy-aware reference modulation module is integrated to select the most relevant reference features across different quality levels under the guidance of the reference balance loss. In addition, we propose HRS-guided feature-wise inter-frame prediction, which learns a low-rank approximation of the selected reference feature to ensure consistency and improve inter-frame prediction performance. We conduct experiments on a state-of-the-art NVC, DCVC-DC. Experimental results show that our codec achieves an average 26% bitrate saving over H.266/VVC and a 28.2% bitrate reduction compared to DCVC-DC without increasing the decoding complexity.

Neural Video Compression with Reference Hierarchy

3D human motion generation has attracted substantially more interest in recent years. While considerable progress has been made in performance, many state-of-the-art approaches still struggle with complex and detailed generations unseen in the original data. This is commonly attributed to the scarcity of available motion datasets and the prohibitive cost of generating more training examples. Motivated by these challenges, we introduce CoMA, a multimodal framework designed for complex human motion generation, editing, and comprehension. CoMA employs multiple independent agents, powered by large language and vision models, as well as a mask transformer-based motion generator with body-part-specific encoders and codebooks for fine-grained, detailed generations. This recipe allows for generating short and long motion sequences from detailed instructions, editing generations with user-provided text instructions, and self-correcting output sequences for even better motions. We evaluate our method on the two most popular benchmark human motion datasets, using novel splits that separate them into basic and complex actions, and subsequently compare CoMA's performance with state-of-the-art methods.

CoMA: Compositional Human Motion Generation with Multi-modal Agents

3D Human Pose Estimation (3D HPE) is vital in various applications, from person re-identification and action recognition to virtual reality. However, the reliance on annotated 3D data collected in controlled environments poses challenges for generalization to diverse in-the-wild scenarios. Existing domain adaptation (DA) paradigms for 3D HPE, such as general DA and source-free DA, overlook the issue of non-stationary target pose datasets. To address these challenges, we propose a novel task named lifelong domain adaptive 3D HPE. To our knowledge, we are the first to introduce lifelong domain adaptation to the 3D HPE task. In this lifelong DA setting, the pose estimator is pretrained on the source domain and subsequently adapted to distinct target domains. Moreover, during adaptation to the current target domain, the pose estimator cannot access the source domain or any previous target domain. Lifelong DA for 3D HPE involves adapting to current domain poses while preserving knowledge from previous domains, in particular combating catastrophic forgetting. We present an innovative Generative Adversarial Network (GAN) framework, which incorporates 3D pose generators, a 2D pose discriminator, and a 3D pose estimator. This framework effectively mitigates domain shifts and aligns original and augmented poses. Moreover, we construct a novel 3D pose generator paradigm, integrating pose-aware, temporal-aware, and domain-aware knowledge to enhance adaptation to the current domain and alleviate catastrophic forgetting on previous domains. Our method demonstrates superior performance through extensive experiments on diverse domain adaptive 3D HPE datasets.

Lifelong Domain Adaptive 3D Human Pose Estimation

Generating editable 3D CAD geometry from natural language remains an open challenge: existing text-to-CAD systems either output surface meshes or require scarce design-history supervision, making them unsuitable for real-world engineering workflows. We introduce NURBGen, the first framework for generating high-fidelity 3D CAD models directly from natural language descriptions using Non-Uniform Rational B-Splines (NURBS). To achieve this, we fine-tune a large language model (LLM) to translate free-form text into JSON representations containing NURBS surface parameters (i.e., control points, knot vectors, degrees, and rational weights), which can be directly converted into BRep format using Python. Furthermore, we introduce partABC, a curated subset of the ABC dataset consisting of individual CAD components, annotated with detailed captions using an automated annotation pipeline. We further propose a hybrid representation that combines untrimmed NURBS with analytic primitives to handle trimmed surfaces and degenerate regions more robustly, while reducing token complexity. NURBGen demonstrates strong performance on diverse prompts, surpassing prior methods in geometric fidelity and dimensional accuracy, as confirmed by expert evaluations. Code and dataset will be released publicly.

NURBGen: High-Fidelity Text-to-CAD Generation Through LLM-Driven NURBS Modeling
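To make the NURBS parameters named in the NURBGen abstract concrete, the following is a minimal sketch of how a surface point can be evaluated from control points, knot vectors, degrees, and rational weights stored as JSON. The JSON keys (`degree_u`, `knots_u`, `control_points`, `weights`, and so on) and the function names are hypothetical and chosen only for illustration; they are not NURBGen's actual schema or code. The evaluation itself uses the standard Cox-de Boor recursion.

```python
import json

# Hypothetical JSON layout (an assumption, not NURBGen's schema): a quarter
# cylinder of radius 1 and height 2, quadratic in u and linear in v.
EXAMPLE_JSON = """
{
  "degree_u": 2, "degree_v": 1,
  "knots_u": [0, 0, 0, 1, 1, 1],
  "knots_v": [0, 0, 1, 1],
  "control_points": [
    [[1, 0, 0], [1, 0, 2]],
    [[1, 1, 0], [1, 1, 2]],
    [[0, 1, 0], [0, 1, 2]]
  ],
  "weights": [[1.0, 1.0],
              [0.7071067811865476, 0.7071067811865476],
              [1.0, 1.0]]
}
"""


def basis(i, p, t, knots):
    """Cox-de Boor recursion: i-th B-spline basis function of degree p at t."""
    if p == 0:
        # Half-open knot spans; the last non-empty span is closed at the
        # end of the parameter domain so that t == knots[-1] is covered.
        if knots[i] <= t < knots[i + 1]:
            return 1.0
        if t == knots[-1] and knots[i] < t == knots[i + 1]:
            return 1.0
        return 0.0
    left = right = 0.0
    d1 = knots[i + p] - knots[i]
    if d1 > 0:
        left = (t - knots[i]) / d1 * basis(i, p - 1, t, knots)
    d2 = knots[i + p + 1] - knots[i + 1]
    if d2 > 0:
        right = (knots[i + p + 1] - t) / d2 * basis(i + 1, p - 1, t, knots)
    return left + right


def nurbs_surface_point(surface, u, v):
    """Evaluate the rational surface at (u, v) as a weighted average."""
    p, q = surface["degree_u"], surface["degree_v"]
    knots_u, knots_v = surface["knots_u"], surface["knots_v"]
    numerator = [0.0, 0.0, 0.0]
    denominator = 0.0
    for i, row in enumerate(surface["control_points"]):
        nu = basis(i, p, u, knots_u)
        if nu == 0.0:
            continue
        for j, point in enumerate(row):
            coeff = nu * basis(j, q, v, knots_v) * surface["weights"][i][j]
            denominator += coeff
            for k in range(3):
                numerator[k] += coeff * point[k]
    return [c / denominator for c in numerator]


if __name__ == "__main__":
    surface = json.loads(EXAMPLE_JSON)
    print(nurbs_surface_point(surface, 0.5, 0.5))
```

The rational weights are what allow the quadratic arc in the `u` direction to trace an exact circle, which a non-rational B-spline cannot do. That is one reason a NURBS parameterization suits CAD geometry.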

Text-driven multi-object image editing, which aims to precisely modify multiple objects within an image based on text descriptions, has recently attracted considerable interest. Existing works primarily follow the localize-editing paradigm, focusing on independent object localization and editing while neglecting critical inter-object interactions. However, this work points out that the neglected attention entanglements in inter-object conflict regions inherently hinder disentangled multi-object editing, leading to either inter-object editing leakage or intra-object editing constraints. We therefore propose LayerEdit, a novel training-free, multi-layer disentangled editing framework that, for the first time, enables conflict-free object-layered editing through precise object-layered decomposition and coherent fusion. Specifically, LayerEdit introduces a novel “decompose-editing-fusion” framework, consisting of: (1) a Conflict-aware Layer Decomposition module, which utilizes an attention-aware IoU scheme and time-dependent region removal to enhance conflict awareness and suppression for layer decomposition; (2) an Object-layered Editing module, which establishes coordinated intra-layer text guidance and cross-layer geometric mapping, achieving disentangled semantic and structural modifications; and (3) a Transparency-guided Layer Fusion module, which facilitates structure-coherent inter-object layer fusion through precise transparency guidance learning. Extensive experiments verify the superiority of LayerEdit over existing methods, showing unprecedented intra-object controllability and inter-object coherence in complex multi-object scenarios.

LayerEdit: Disentangled Multi-Object Editing via Conflict-Aware Multi-Layer Learning
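As a rough illustration of what an inter-object conflict region is, the sketch below computes the plain intersection-over-union (IoU) of two binary object masks and extracts their overlap. This is ordinary mask IoU, not the attention-aware IoU scheme the LayerEdit abstract refers to, whose details are not given here. The function names and the toy masks are assumptions made for the example.

```python
def mask_iou(mask_a, mask_b):
    """Plain IoU between two binary masks given as equally sized 2D lists."""
    intersection = union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            intersection += 1 if (a and b) else 0
            union += 1 if (a or b) else 0
    return intersection / union if union else 0.0


def conflict_region(mask_a, mask_b):
    """Pixels claimed by both objects, i.e. where their edits could collide."""
    return [
        [1 if (a and b) else 0 for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(mask_a, mask_b)
    ]


if __name__ == "__main__":
    # Two toy object masks (hypothetical) that overlap in the middle column.
    left_object = [[1, 1, 0], [1, 1, 0], [0, 0, 0]]
    right_object = [[0, 1, 1], [0, 1, 1], [0, 0, 0]]
    print(mask_iou(left_object, right_object))
    print(conflict_region(left_object, right_object))
```

A nonzero IoU between two objects' masks signals that an edit applied to one object may leak into the other unless the overlap is handled explicitly.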

Multimodal Large Language Models (MLLMs) are becoming integral to autonomous driving (AD) systems due to their strong vision-language reasoning capabilities. 
However, MLLMs are vulnerable to adversarial attacks—particularly adversarial patch attacks—which can pose serious threats in real-world scenarios. Existing patch-based attack methods are primarily designed for object detection models. Due to the more complex architectures and strong reasoning capabilities of MLLMs, these approaches perform poorly when transferred to MLLM-based systems.
To address these limitations, we propose PhysPatch, a physically realizable and transferable adversarial patch framework tailored for MLLM-based AD systems. PhysPatch jointly optimizes patch location, shape, and content to enhance attack effectiveness and real-world applicability. It introduces a semantic-based mask initialization strategy for realistic placement, an SVD-based local alignment loss with patch-guided crop-resize to improve transferability, and a potential field-based mask refinement method. Extensive experiments across open-source, commercial, and reasoning-capable MLLMs demonstrate that PhysPatch significantly outperforms state-of-the-art (SOTA) methods in steering MLLM-based AD systems toward target-aligned perception and planning outputs. Moreover, PhysPatch consistently places adversarial patches in physically feasible regions of AD scenes, ensuring strong real-world applicability and deployability.

PhysPatch: A Physically Realizable and Transferable Adversarial Patch Attack for Multimodal Large Language Models-based Autonomous Driving Systems

Since high-fidelity reference images are difficult to obtain in real underwater scenes, most deep models trained on synthetic paired data cannot match real-world data exactly. In this paper, we propose an unsupervised training framework for underwater image enhancement by leveraging an iterative training strategy and quantification of specific neural units. Specifically, to eliminate the heavy color cast and distortion in underwater images, we decompose unsupervised image enhancement into two targeted sub-tasks, namely colorization and color compensation. First, a diffusion model is introduced for colorization to correct the green and blue color casts. Then, to strengthen the model's ability to learn balanced color information, we introduce an extra network branch and propose a quantification mechanism for color compensation. The extra branch encodes style information from normal images into the generative model, while the quantification mechanism identifies and adjusts neural units relevant to warm colors, improving the model’s ability to learn balanced color feature representations for robust generation. Finally, through iterative training, color cast and distortion are progressively reduced, leading to a gradual improvement in the quality of the generated images. Experimental results on various widely used underwater datasets demonstrate that our approach achieves competitive performance, even when compared to recent supervised methods.

Learning Underwater Image Enhancement Iteratively Without Reference Images

Large Language Models (LLMs) often produce fluent but factually incorrect responses, a phenomenon known as hallucination. Abstention, where the model chooses not to answer and instead outputs phrases such as "I don't know", is a common safeguard. However, existing abstention methods typically rely on post-generation signals, such as generation variations or feedback, which limits their ability to prevent unreliable responses in advance. In this paper, we introduce Aspect-Based Causal Abstention (ABCA), a new framework that enables early abstention by analysing the internal diversity of LLM knowledge through causal inference. This diversity reflects the multifaceted nature of parametric knowledge acquired from various sources, representing diverse aspects such as disciplines, legal contexts, or temporal frames. ABCA estimates causal effects conditioned on these aspects to assess the reliability of knowledge relevant to a given query. Based on these estimates, we enable two types of abstention: Type-1, where aspect effects are inconsistent (knowledge conflict), and Type-2, where aspect effects consistently support abstention (knowledge insufficiency). Experiments on standard benchmarks demonstrate that ABCA improves abstention reliability, achieves state-of-the-art performance, and enhances the interpretability of abstention decisions. The source code and technical appendix are provided in the Supplementary Material.

Hallucinate Less by Thinking More: Aspect-Based Causal Abstention for Large Language Models
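The two abstention types described in the ABCA abstract can be pictured with a toy decision rule. The sketch below is not the paper's estimator: it assumes each aspect has already been assigned a score in [0, 1] for how strongly it supports answering, and it uses a spread threshold and a support threshold (`conflict_spread`, `support_floor`) that are made up for illustration.

```python
from statistics import mean, pstdev


def abstention_decision(aspect_effects, conflict_spread=0.3, support_floor=0.5):
    """Toy rule over hypothetical per-aspect support scores in [0, 1].

    aspect_effects maps an aspect name (e.g. a discipline or time frame)
    to a score for how strongly that aspect supports giving an answer.
    """
    values = list(aspect_effects.values())
    # Type-1: aspects disagree with each other (knowledge conflict).
    if pstdev(values) > conflict_spread:
        return "abstain_type1"
    # Type-2: aspects agree, but agree that support is weak
    # (knowledge insufficiency).
    if mean(values) < support_floor:
        return "abstain_type2"
    return "answer"


if __name__ == "__main__":
    print(abstention_decision({"legal": 0.9, "historical": 0.1, "scientific": 0.85}))
    print(abstention_decision({"legal": 0.1, "historical": 0.15, "scientific": 0.2}))
    print(abstention_decision({"legal": 0.9, "historical": 0.85, "scientific": 0.8}))
```

The point of the sketch is only the distinction between the two cases: disagreement across aspects versus agreement that the knowledge is insufficient.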

Accurate segmentation of ultrasound images is essential for reliable medical diagnoses but is challenged by poor image quality and scarce labeled data. Prior approaches have relied on manually designed, complex network architectures to improve multi-scale feature extraction. However, such handcrafted models offer limited gains when prior knowledge is inadequate and are prone to overfitting on small datasets. In this paper, we introduce DeNAS-ViT, a Data efficient NAS-optimized Vision Transformer, the first method to leverage neural architecture search (NAS) for ultrasound image segmentation by automatically optimizing model architecture through token-level search. Specifically, we propose an efficient NAS module that performs multi-scale token search prior to the ViT’s attention mechanism, effectively capturing both contextual and local features while minimizing computational costs. Given ultrasound’s data scarcity and NAS’s inherent data demands, we further develop a NAS-guided semi-supervised learning (SSL) framework. This approach integrates network independence and contrastive learning within a stage-wise optimization strategy, significantly enhancing model robustness under limited-data conditions. Extensive experiments on public datasets demonstrate that DeNAS-ViT achieves state-of-the-art performance, maintaining robustness with minimal labeled data. Moreover, we highlight DeNAS-ViT’s generalization potential beyond ultrasound imaging, underscoring its broader applicability.

DeNAS-ViT: Data Efficient NAS-Optimized Vision Transformer for Ultrasound Image Segmentation

Facial Expression Recognition (FER) is crucial to human-computer interaction. Existing cross-domain FER (CD-FER) methods mainly focus on single-source closed-set scenarios, transferring knowledge from a single source domain to a target domain with identical class sets. However, CD-FER faces two real-world challenges: 1) the need to leverage information from multiple sources, leading to multi-domain shift, and 2) the necessity to recognize unseen target classes, resulting in class shift. These issues give rise to a novel and challenging task, which we define as Multi-domain Open-set FER (MO-FER). In this paper, we propose PromptEmo, a novel CLIP-based framework that leverages bilateral textual prompts to address both shifts in the MO-FER task. Leveraging the generalizability of LLMs, PromptEmo constructs trainable positive prompts with LLM-generated emotion descriptions for seen classes, as well as template-derived negative prompts to enhance the reasoning for unseen classes. Then, we introduce a modal-task optimization paradigm organized from two perspectives, textual semantics and visual domains, yielding Intra-modal Space-specific Optimization (ISO) and Cross-modal Emotion-aware Interaction (CEI) strategies. ISO refines the CLIP-based textual space to ensure semantic separation between bilateral prompts and improves the latent visual space by promoting inter-domain alignment. Building on ISO, CEI facilitates effective vision-language interactions, resulting in four joint loss terms that improve emotion recognition by shaping a domain-invariant, discriminative feature space. PromptEmo surpasses the current SOTA method by 7.7% AUC on unseen classes across four FER datasets, serving as a strong baseline for the MO-FER task.
