
AAAI 2026

January 24, 2026

Singapore


Recent advancements in instruction-based image editing have shown remarkable progress. However, current methods are often limited to simple instructions, hindering real-world applications that usually involve complex editing instructions. In this work, we address this from the perspectives of architectural design, data, and evaluation protocols. Specifically, we identify two shortcomings of previous models on this task: insufficient instruction compliance and background inconsistency. To this end, we propose MCIE-E1, a Multimodal Large Language Model-Driven Complex Instruction Image Editing framework that includes two key modules: a Spatial-Aware Cross Attention module and a Background-Consistent Cross Attention module. The former significantly improves instruction-following capability by explicitly aligning semantic instructions with spatial locations through the injection of spatial guidance across denoising timesteps. The latter enhances background features, thereby preserving consistency in unedited regions. To facilitate MCIE-E1 training, we propose a dedicated data construction pipeline that addresses the scarcity of datasets for complex instruction-based image editing; it combines fine-grained automatic filtering by a powerful MLLM with rigorous human filtering to ensure high-quality data. To evaluate MCIE-E1's capability for complex instruction-based image editing, we introduce CIE-Bench, along with two new evaluation metrics. Experimental results on CIE-Bench demonstrate that MCIE-E1 surpasses the previous state-of-the-art method in both quantitative and qualitative evaluations, achieving a 23.96% improvement in instruction compliance.
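The abstract describes injecting spatial guidance into cross-attention so that instruction semantics are applied at the right image locations while unedited regions are left untouched. The paper's actual architecture is not reproduced here; the following is a minimal illustrative sketch of one way such spatially gated cross-attention could work, where image tokens attend to instruction tokens and the residual update is gated by a spatial mask (function names, the gating scheme, and all shapes are assumptions for illustration, not the MCIE-E1 implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_gated_cross_attention(img_tokens, text_tokens, spatial_mask):
    """Illustrative spatially gated cross-attention (not the paper's module).

    img_tokens:   (N, d) image/latent tokens acting as queries
    text_tokens:  (M, d) instruction tokens acting as keys and values
    spatial_mask: (N,) values in [0, 1]; 1 = region the instruction edits

    The cross-attention output is added back to the image tokens only
    where the mask is active, so masked-out (background) tokens pass
    through unchanged.
    """
    d = img_tokens.shape[-1]
    scores = img_tokens @ text_tokens.T / np.sqrt(d)   # (N, M)
    attn = softmax(scores, axis=-1)                    # attention over text tokens
    update = attn @ text_tokens                        # (N, d) instruction-conditioned update
    return img_tokens + spatial_mask[:, None] * update
```

In this sketch, tokens with `spatial_mask == 0` are returned exactly as they came in, which is one simple way to read the abstract's claim that spatial guidance improves instruction compliance while background consistency is preserved in unedited regions.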


