AAAI 2026

January 25, 2026

Singapore, Singapore


Spatial understanding is a critical capability for Large Vision-Language Models (LVLMs) to advance embodied AI applications. Existing works primarily focus on enhancing spatial understanding within a single frame, i.e., injecting 3D spatial concepts into LVLMs under a single coordinate system. However, such improvements fall short in real-world tasks that require consistent cross-view spatial reasoning. In this paper, we propose CVVG-Reasoner (Cross-View Visual Geometries), which lifts single-frame spatial comprehension to unified cross-view spatial understanding by mimicking human-like cross-view reasoning mechanisms. First, we introduce MV3DSR (Multi-View 3D Spatial Reasoning), a scalable pipeline for generating cross-view spatial reasoning data, and construct MV3DSR-Dataset, a large-scale dataset with diverse 3D cross-view reasoning tasks. Based on MV3DSR, we propose MV3DSR-Bench, a comprehensive benchmark for evaluating cross-view spatial reasoning capabilities. Second, we design a three-stage training strategy: the first two stages progressively equip the model with (1) fundamental spatial knowledge and (2) human-like cross-view reasoning patterns, while the final stage employs reinforcement learning to further boost performance. Extensive experiments demonstrate that CVVG-Reasoner significantly outperforms existing 3D Large Language Models (LLMs) and advanced LVLMs on cross-view tasks while maintaining robust performance on out-of-domain data. Ablation studies further show that injecting human-like reasoning patterns yields a 44% performance gain, validating the effectiveness of our design.
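
To illustrate why reasoning under a single coordinate system does not carry over to cross-view tasks, here is a minimal sketch in plain Python. It is not the method or the data pipeline of CVVG-Reasoner. The scene, the object names, the camera poses, and the helper functions `yaw_rotation`, `world_to_camera`, and `left_or_right` are all assumptions made for illustration. It expresses the same two objects in two camera frames and shows that a relation such as "left of" can flip between views.

```python
import math

def yaw_rotation(theta):
    """Rotation matrix for a turn of theta radians about the vertical (y) axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, 0.0, s],
            [0.0, 1.0, 0.0],
            [-s, 0.0, c]]

def world_to_camera(point, rotation, position):
    """Express a world-frame point in a camera frame.

    rotation is the camera-to-world rotation R and position is the camera
    centre in world coordinates, so p_cam = R^T (p_world - position).
    """
    d = [point[i] - position[i] for i in range(3)]
    return [sum(rotation[r][c] * d[r] for r in range(3)) for c in range(3)]

def left_or_right(a_cam, b_cam):
    """Relation of a to b in a camera frame whose +x axis points right."""
    return "left" if a_cam[0] < b_cam[0] else "right"

# Hypothetical scene: two objects in a shared world frame.
chair = [-1.0, 0.0, 3.0]
table = [1.0, 0.0, 3.0]

# View A: camera at the origin, looking along +z.
rot_a, pos_a = yaw_rotation(0.0), [0.0, 0.0, 0.0]
# View B: camera on the opposite side of the objects, turned 180 degrees.
rot_b, pos_b = yaw_rotation(math.pi), [0.0, 0.0, 6.0]

relation_a = left_or_right(world_to_camera(chair, rot_a, pos_a),
                           world_to_camera(table, rot_a, pos_a))
relation_b = left_or_right(world_to_camera(chair, rot_b, pos_b),
                           world_to_camera(table, rot_b, pos_b))

print("view A: chair is", relation_a, "of table")
print("view B: chair is", relation_b, "of table")
```

In this toy setup the chair is to the left of the table in view A and to the right of it in view B, so an answer that is correct in one frame is wrong in the other unless the change of viewpoint is accounted for.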


