Recent advances in controllable text-to-image (T2I) generation have shown promising results for natural images. However, controllable remote sensing (RS) T2I generation remains challenging due to the unique characteristics and requirements of geospatial data. Existing methods struggle to effectively integrate diverse spatial control conditions (e.g., edge maps, segmentation masks) into a coherent generation process. They often fail to model the complex spatial relationships among different geographic elements and to maintain semantic consistency with textual descriptions, which are typically vague or incomplete in RS applications. Additionally, constrained by the small scale, low description quality, and limited scene variety of existing datasets, these models tend to produce outputs with structurally inconsistent layouts and visually unrealistic content. To address these issues, we propose Any2RSI, a flexible framework for controllable RS T2I generation that supports arbitrary combinations of control conditions. At its core, Any2RSI introduces a Cross-Modal Multi-Control Adapter capable of extracting modality-agnostic embeddings from heterogeneous inputs, enabling precise spatial guidance. Furthermore, to overcome the limitations of the sparse and ambiguous textual prompts commonly found in RS tasks, we design a Vision Language Model (VLM)-Empowered Enriched Description Generation module. This module enhances input descriptions by integrating cross-modal semantic information, generating richer and more accurate textual representations that guide the generation of semantically coherent images. Finally, to mitigate data scarcity in the RS T2I generation task, we construct RST2I-110K, a new large-scale, multi-scene dataset containing over 115,000 high-quality RS images paired with detailed textual descriptions.
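The core idea of the adapter can be illustrated in miniature: each control modality gets its own encoder that projects its input into a shared embedding space, and the resulting embeddings are fused so that downstream guidance is agnostic to which subset of conditions was supplied. The sketch below is a minimal NumPy illustration of that pattern, not the paper's implementation; the class name, the linear projections, and the mean-fusion step are all simplifying assumptions.

```python
import numpy as np

def make_encoder(in_ch, dim, rng):
    # Hypothetical per-modality encoder: a random linear projection
    # from the modality's channel count into the shared embedding dim.
    W = rng.standard_normal((in_ch, dim)) * 0.02
    return lambda x: x.reshape(-1, in_ch) @ W  # flatten spatial grid, project

class MultiControlAdapter:
    """Sketch of a cross-modal multi-control adapter: heterogeneous
    control maps (edge, segmentation, ...) are each projected into one
    shared space, then fused into a modality-agnostic embedding."""

    def __init__(self, modality_channels, dim=64, seed=0):
        rng = np.random.default_rng(seed)
        self.encoders = {m: make_encoder(c, dim, rng)
                         for m, c in modality_channels.items()}

    def __call__(self, controls):
        # controls: dict of modality -> (H, W, C) array; any subset works,
        # which is what allows flexible combinations of conditions.
        embs = [self.encoders[m](x) for m, x in controls.items()]
        return np.mean(embs, axis=0)  # (H*W, dim) fused embedding

adapter = MultiControlAdapter({"edge": 1, "segmentation": 3}, dim=64)
edge = np.zeros((16, 16, 1))
seg = np.zeros((16, 16, 3))
z = adapter({"edge": edge, "segmentation": seg})
print(z.shape)  # → (256, 64)
```

Because every modality lands in the same space, the fused output has a fixed shape regardless of how many conditions are provided, so the same guidance pathway serves one control map or several.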
Extensive experiments on both existing and newly proposed datasets demonstrate that Any2RSI achieves state-of-the-art performance, significantly improving both the realism and structural accuracy of generated RS imagery.
