Precise and controllable image editing, especially object removal and insertion, is one of the most common demands in image manipulation. However, existing methods suffer from severe limitations: mask-based inpainting often introduces visual artifacts and semantic inconsistencies, while instruction-based approaches lack accurate spatial control and tend to modify background regions unintentionally. To address these issues, we make two key contributions. First, we develop a fully automated, self-improving pipeline for synthetic data generation. The pipeline uses a Large Language Model (LLM) to generate diverse prompts, an evolutionarily fine-tuned Diffusion Transformer (DiT) to synthesize high-quality images, and a Multimodal LLM (MLLM) combined with an open-set object detector for automated quality control and annotation. This process produces the Remove/Add Dataset (RAD), consisting of over 514,510 high-quality image pairs, each richly annotated with bounding boxes, segmentation masks, and a variety of editing instructions. Second, building on RAD, we introduce Remove/Add Anything (RAA), a novel editing framework with precise spatial control. Built upon a diffusion-based inpainting model, RAA achieves high editing accuracy by conditioning on both textual instructions and an explicitly defined region of interest (ROI), enabling efficient fine-tuning while maintaining global visual coherence. Extensive experiments demonstrate that RAA significantly outperforms existing open-source methods on both addition and removal tasks, and even slightly surpasses costly proprietary models.
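To make the data-generation loop concrete, the sketch below mirrors the three stages named in the abstract: LLM prompt generation, DiT image synthesis, and MLLM-plus-detector quality control with annotation. It is a hypothetical outline only; every function and type here (Sample, generate_prompts, synthesize_image, quality_check, build_dataset) is a stand-in we introduce for illustration, not the authors' actual interface.

```python
# Hypothetical sketch of the self-improving data-generation loop described
# in the abstract. All names below are illustrative stand-ins: the paper's
# real LLM, DiT, MLLM, and open-set detector interfaces are not specified.

from dataclasses import dataclass, field

@dataclass
class Sample:
    prompt: str
    image: bytes                      # synthesized image (placeholder type)
    bbox: tuple | None = None         # detector-provided bounding box
    mask: bytes | None = None         # segmentation mask (placeholder)
    instructions: list[str] = field(default_factory=list)

def generate_prompts(llm, n: int) -> list[str]:
    """Stage 1 (assumed): ask an LLM for n diverse editing prompts."""
    return [f"a photo containing object {i}" for i in range(n)]  # placeholder

def synthesize_image(dit, prompt: str) -> bytes:
    """Stage 2 (assumed): render the prompt with the fine-tuned DiT."""
    return b"..."  # placeholder pixels

def quality_check(mllm, detector, image: bytes, prompt: str):
    """Stage 3 (assumed): the MLLM judges image/prompt fidelity and the
    open-set detector localizes the named object. Returns (ok, bbox)."""
    return True, (10, 10, 100, 100)  # placeholder verdict

def build_dataset(llm, dit, mllm, detector, n: int) -> list[Sample]:
    dataset = []
    for prompt in generate_prompts(llm, n):
        image = synthesize_image(dit, prompt)
        ok, bbox = quality_check(mllm, detector, image, prompt)
        if not ok:
            continue  # rejected samples are dropped, keeping quality high
        dataset.append(Sample(prompt=prompt, image=image, bbox=bbox,
                              instructions=[f"remove the object at {bbox}"]))
    return dataset
```

The key design point the abstract highlights is that rejection happens automatically: only samples that pass the MLLM/detector check enter RAD, which is what allows the pipeline to run without human curation.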
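Likewise, the core editing idea of RAA, conditioning a diffusion inpainting backbone on both a textual instruction and an explicit ROI, can be illustrated with off-the-shelf components. The sketch below is not RAA itself: it uses a generic diffusers inpainting pipeline as a stand-in backbone, and the checkpoint name, rectangular ROI-to-mask conversion, and file names are all assumptions made for the example.

```python
# Minimal sketch of ROI-conditioned editing in the spirit of RAA, using a
# generic diffusers inpainting pipeline as a stand-in backbone. The
# checkpoint and ROI handling below are assumptions, not the RAA release.

import torch
from PIL import Image, ImageDraw
from diffusers import StableDiffusionInpaintPipeline

def roi_to_mask(size: tuple[int, int], roi: tuple[int, int, int, int]) -> Image.Image:
    """Rasterize an (x0, y0, x1, y1) region of interest into a binary mask:
    white inside the ROI (editable), black elsewhere (preserved)."""
    mask = Image.new("L", size, 0)
    ImageDraw.Draw(mask).rectangle(roi, fill=255)
    return mask

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting",  # assumed stand-in checkpoint
    torch_dtype=torch.float16,
).to("cuda")

image = Image.open("scene.png").convert("RGB").resize((512, 512))
mask = roi_to_mask(image.size, roi=(96, 160, 288, 352))  # example ROI

# Conditioning on both the instruction and the explicit ROI, as the abstract
# describes: pixels outside the mask are left untouched by the sampler.
edited = pipe(prompt="remove the bicycle and fill in the pavement",
              image=image, mask_image=mask).images[0]
edited.save("edited.png")
```

This illustrates why an explicit ROI helps with the failure modes the abstract lists: the mask guarantees that background regions outside the ROI are preserved, while the text prompt steers what happens inside it.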