Recent diffusion-based image editing methods have made great strides in text-guided tasks but often struggle with complex, indirect instructions. Current models also frequently fail to preserve identity, make unintended edits, or rely on manual masks. To overcome these limitations, we introduce X-Planner, a Multimodal Large Language Model (MLLM)-based planning system that bridges user intent with editing model capabilities. X-Planner uses chain-of-thought reasoning to systematically decompose complex instructions into simpler sub-instructions. For each sub-instruction, X-Planner automatically generates a precise edit type and segmentation mask, enabling localized, identity-preserving edits without invoking external tools or models at inference time. To enable training such a planner, we also introduce a fully automated, reproducible pipeline that generates large-scale, high-quality training data. Our complete system achieves state-of-the-art results on both existing and newly proposed complex instruction-based editing benchmarks.
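To make the described decomposition concrete, below is a minimal Python sketch of the kind of structured plan the abstract attributes to X-Planner: one complex instruction broken into sub-instructions, each paired with an edit type and a localization mask. All names, the edit-type vocabulary, and the box-shaped mask placeholder are illustrative assumptions, not the authors' implementation; the real planner produces this decomposition via MLLM chain-of-thought reasoning rather than hardcoded rules.

```python
# Hypothetical sketch of a planner's output structure; not the X-Planner code.
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple

class EditType(Enum):
    # Assumed edit-type vocabulary; the paper's actual taxonomy may differ.
    LOCAL = "local"    # region-restricted attribute/object change
    GLOBAL = "global"  # whole-image change (e.g., style, weather)
    ADD = "add"
    REMOVE = "remove"

@dataclass
class EditStep:
    sub_instruction: str        # one simple, directly executable instruction
    edit_type: EditType         # predicted category for the downstream editor
    mask_box: Tuple[int, int, int, int]  # stand-in for a segmentation mask:
                                         # a bounding box (x0, y0, x1, y1)

def plan(instruction: str) -> List[EditStep]:
    """Stand-in for the MLLM planner. In X-Planner this decomposition is
    produced by chain-of-thought reasoning; here it is hardcoded just to
    show the expected output shape."""
    if "winter" in instruction and "dog" in instruction:
        return [
            EditStep("change the season to winter", EditType.GLOBAL,
                     (0, 0, 512, 512)),
            EditStep("add a red scarf on the dog", EditType.ADD,
                     (180, 140, 330, 260)),
        ]
    return [EditStep(instruction, EditType.LOCAL, (0, 0, 512, 512))]

if __name__ == "__main__":
    # A downstream editor would consume each step's mask and edit type,
    # restricting changes to the masked region to preserve identity elsewhere.
    for step in plan("make it look like winter and put a red scarf on the dog"):
        print(f"{step.edit_type.value:>6} -> {step.sub_instruction}")
```

The point of the structure is that each step is simple enough for an off-the-shelf instruction-guided editor to execute, while the per-step mask confines the edit region so unrelated content is left untouched.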