Recent advances in instruction-based image editing have shown remarkable progress. However, current methods are often limited to simple instructions, hindering real-world applications that usually involve complex editing instructions. In this work, we address this from the perspectives of architectural design, data, and evaluation protocols. Specifically, we identify insufficient instruction compliance and background inconsistency in previous models when performing this task. To this end, we propose MCIE-E1, a Multimodal Large Language Model-Driven Complex Instruction Image Editing framework that includes two key modules: a Spatial-Aware Cross Attention module and a Background-Consistent Cross Attention module. The former significantly improves instruction-following capability by explicitly aligning semantic instructions with spatial locations through the injection of spatial guidance across denoising timesteps. The latter enhances background features, thereby preserving consistency in unedited regions. To facilitate MCIE-E1 training, we propose a dedicated data construction pipeline to address the scarcity of datasets for complex instruction-based image editing. This pipeline integrates both fine-grained automatic filtering by a powerful MLLM and rigorous human filtering to ensure high-quality data. To evaluate MCIE-E1's capability for complex instruction-based image editing, we introduce CIE-Bench, along with two new evaluation metrics. Experimental results on CIE-Bench demonstrate that MCIE-E1 surpasses the previous state-of-the-art method in both quantitative and qualitative evaluations, achieving a 23.96% improvement in instruction compliance.
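The two modules above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the additive `spatial_bias` term (standing in for the injected spatial guidance), and the mask-based blending (standing in for the background-consistent attention) are illustrative assumptions about how such mechanisms are commonly realized.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def spatial_aware_cross_attention(img_feats, instr_feats, spatial_bias, scale=None):
    # img_feats: (N_img, d) image queries; instr_feats: (N_instr, d) keys/values.
    # spatial_bias: (N_img, N_instr) additive logit bias tying each instruction
    # token to the image locations it refers to (hypothetical grounding signal,
    # which would be re-injected at every denoising timestep).
    d = img_feats.shape[-1]
    scale = scale if scale is not None else 1.0 / np.sqrt(d)
    logits = img_feats @ instr_feats.T * scale + spatial_bias
    return softmax(logits, axis=-1) @ instr_feats

def background_consistent_blend(edited_feats, original_feats, edit_mask):
    # edit_mask: (N_img, 1) in [0, 1]; 1 = edited region, 0 = background.
    # Unedited regions keep their original features, preserving consistency.
    return edit_mask * edited_feats + (1.0 - edit_mask) * original_feats
```

With an all-zero mask the blend returns the original features unchanged, which is the background-consistency property the abstract describes.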
