Although deep networks excel on RGB images, their performance degrades sharply under severe domain shifts such as sketch recognition, where color and texture cues are missing. In this work, we propose a pipeline that leverages semantic cues extracted from sketches to guide the synthesis of photorealistic RGB images with diffusion-based generative models. Our framework extracts two complementary cues from the input sketch: a semantic caption via the BLIP model and a structural outline via Canny edge detection. These cues are integrated through ControlNet to guide a Stable Diffusion model, ensuring the synthesized RGB image is both semantically consistent with the sketch's content and structurally faithful to its outline. We evaluate the synthesized images by benchmarking classification performance: standard architectures, from convolutional to transformer-based, are trained on Tiny-ImageNet subsets and tested on sketches, their synthesized counterparts, and the original RGB images. Experimental results demonstrate that our approach produces realistic, identity-preserving images that significantly improve classification accuracy and effectively bridge the semantic gap. While BLIP-based captioning and ControlNet-guided diffusion are established methods, our contribution lies in their integration into a unified, caption-guided pipeline that improves the semantic consistency of sketch-to-RGB translation. The proposed method generalizes well across architectures, providing a scalable and cost-efficient solution for sketch-based image synthesis.
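
To make the pipeline concrete, here is a minimal sketch of the described flow using Hugging Face `transformers` and `diffusers` with public checkpoints. The model IDs, edge thresholds, and inference settings below are illustrative assumptions, not the authors' exact configuration.

```python
# Hedged sketch of the caption-guided sketch-to-RGB pipeline.
# Assumptions: public BLIP and ControlNet-Canny checkpoints; hyperparameters
# are placeholders, not the paper's reported settings.
import cv2
import numpy as np
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"

# Cue 1: semantic caption extracted from the sketch via BLIP.
blip_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
).to(device)

def caption_sketch(sketch: Image.Image) -> str:
    inputs = blip_processor(images=sketch, return_tensors="pt").to(device)
    out = blip.generate(**inputs, max_new_tokens=30)
    return blip_processor.decode(out[0], skip_special_tokens=True)

# Cue 2: structural outline via Canny edge detection (thresholds are assumed).
def canny_map(sketch: Image.Image, low: int = 100, high: int = 200) -> Image.Image:
    edges = cv2.Canny(np.array(sketch.convert("L")), low, high)
    # ControlNet expects a 3-channel conditioning image.
    return Image.fromarray(np.stack([edges] * 3, axis=-1))

# Synthesis: Canny-conditioned ControlNet guiding Stable Diffusion.
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny")
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet
).to(device)

def sketch_to_rgb(sketch: Image.Image) -> Image.Image:
    prompt = caption_sketch(sketch)   # semantic cue
    control = canny_map(sketch)       # structural cue
    return pipe(prompt, image=control, num_inference_steps=30).images[0]

# Usage: synthesize an RGB image from a sketch file (path is a placeholder).
rgb = sketch_to_rgb(Image.open("sketch.png").convert("RGB"))
rgb.save("synthesized.png")
```

In this reading of the pipeline, the caption supplies the semantic condition while the edge map constrains spatial layout; either checkpoint could in principle be swapped (e.g., a larger BLIP variant or a different ControlNet) without changing the overall structure.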