Spatial understanding is a critical capability for Large Vision-Language Models (LVLMs) to advance embodied AI applications. Existing works primarily focus on enhancing spatial understanding within a single frame, i.e., injecting 3D spatial concepts into LVLMs under a single coordinate system. However, such improvements struggle in real-world tasks that require consistent cross-view spatial reasoning. In this paper, we propose \textbf{CVVG-Reasoner} (\textbf{C}ross-\textbf{V}iew \textbf{V}isual \textbf{G}eometries), which lifts single-frame spatial comprehension to unified cross-view spatial understanding by mimicking \textit{\textbf{human-like cross-view reasoning mechanisms}}. First, we introduce \textbf{MV3DSR} (\textbf{M}ulti-\textbf{V}iew \textbf{3D} \textbf{S}patial \textbf{R}easoning), a scalable pipeline for generating cross-view spatial reasoning data, and use it to construct MV3DSR-Dataset, a large-scale dataset covering diverse 3D cross-view reasoning tasks. Building on MV3DSR, we further propose MV3DSR-Bench, a comprehensive benchmark for evaluating cross-view spatial reasoning capabilities. Second, we design a three-stage training strategy: the first two stages progressively equip the model with (1) fundamental spatial knowledge and (2) human-like cross-view reasoning patterns, while the final stage employs reinforcement learning to further boost performance. Extensive experiments demonstrate that our \textbf{CVVG-Reasoner} significantly outperforms existing 3D Large Language Models (LLMs) and advanced LVLMs on cross-view tasks while maintaining robust performance on out-of-domain data. Ablation studies further reveal that injecting human-like reasoning patterns yields a remarkable 44\% performance gain, validating the effectiveness of our design.