Recent advances in instruction-based image editing have shown remarkable progress. However, current methods are often limited to simple instructions, hindering real-world applications that usually involve complex editing instructions. In this work, we address this from the perspectives of architectural design, data, and evaluation protocols. Specifically, we identify insufficient instruction compliance and background inconsistency in previous models when performing this task. To this end, we propose MCIE-E1, a Multimodal Large Language Model-Driven Complex Instruction Image Editing framework that includes two key modules: a Spatial-Aware Cross Attention module and a Background-Consistent Cross Attention module. The former significantly improves instruction-following capability by explicitly aligning semantic instructions with spatial locations through the injection of spatial guidance across denoising timesteps. The latter enhances background features, thereby preserving consistency in unedited regions. To facilitate MCIE-E1 training, we propose a dedicated data construction pipeline to address the scarcity of datasets for complex instruction-based image editing. This pipeline integrates both fine-grained automatic filtering by a powerful MLLM and rigorous human filtering to ensure high-quality data. To evaluate MCIE-E1's capability for complex instruction-based image editing, we introduce CIE-Bench, along with two new evaluation metrics. Experimental results on CIE-Bench demonstrate that MCIE-E1 surpasses the previous state-of-the-art method in both quantitative and qualitative evaluations, achieving a 23.96% improvement in instruction compliance.
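The two modules above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the additive `spatial_bias` term (standing in for the injected spatial guidance), and the mask-based blending (standing in for the background-consistent attention) are illustrative assumptions about how such mechanisms are commonly realized.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def spatial_aware_cross_attention(img_feats, instr_feats, spatial_bias, scale=None):
    # img_feats: (N_img, d) image queries; instr_feats: (N_instr, d) keys/values.
    # spatial_bias: (N_img, N_instr) additive logit bias tying each instruction
    # token to the image locations it refers to (hypothetical grounding signal,
    # which would be re-injected at every denoising timestep).
    d = img_feats.shape[-1]
    scale = scale if scale is not None else 1.0 / np.sqrt(d)
    logits = img_feats @ instr_feats.T * scale + spatial_bias
    return softmax(logits, axis=-1) @ instr_feats

def background_consistent_blend(edited_feats, original_feats, edit_mask):
    # edit_mask: (N_img, 1) in [0, 1]; 1 = edited region, 0 = background.
    # Unedited regions keep their original features, preserving consistency.
    return edit_mask * edited_feats + (1.0 - edit_mask) * original_feats
```

With an all-zero mask the blend returns the original features unchanged, which is the background-consistency property the abstract describes.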
