Navigating real-world urban environments using natural language instructions introduces unique challenges, such as ambiguous spatial references, diverse landmark types, and dynamic street scenes. Existing approaches often rely on synthetic environments or simplified goal formats, failing to generalize to city-scale, language-driven navigation. To address these limitations, we present UrbanNav, a large-scale framework for training embodied agents to follow free-form language commands in complex urban settings. We leverage web-scale human navigation videos and introduce a multimodal supervision pipeline that aligns visual trajectories with automatically extracted language instructions grounded in real-world landmarks. UrbanNav comprises over 1,500 hours of city navigation data and 3 million grounded instruction-landmark pairs, covering diverse urban contexts. Experiments demonstrate that agents trained with UrbanNav exhibit improved spatial reasoning, robustness to ambiguous commands, and generalization to unseen real-world urban layouts. Our work highlights the importance of large-scale, language-grounded supervision for enabling practical deployment of language-guided robots in real-world cities.
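To make the data format concrete, the grounded instruction-landmark pairs described above might look like the following minimal Python sketch. All class and field names here are illustrative assumptions, not UrbanNav's actual schema.

```python
from dataclasses import dataclass

# Hypothetical record types; the real UrbanNav schema is not specified
# in the abstract, so these names and fields are assumptions.
@dataclass
class Landmark:
    name: str            # e.g. "red brick church"
    category: str        # e.g. "building", "storefront", "intersection"
    frame_index: int     # video frame where the landmark is visible
    bbox: tuple          # (x_min, y_min, x_max, y_max) in pixels

@dataclass
class InstructionPair:
    instruction: str     # free-form command extracted from the video
    landmarks: list      # Landmark objects grounding the instruction
    trajectory: list     # sequence of (x, y, heading) poses along the route

def landmarks_in_window(pair, start, end):
    """Return landmarks whose grounding frame falls inside [start, end)."""
    return [lm for lm in pair.landmarks if start <= lm.frame_index < end]

# Example: one instruction grounded in a single visual landmark.
pair = InstructionPair(
    instruction="turn left after the red brick church",
    landmarks=[Landmark("red brick church", "building", 120, (40, 60, 200, 300))],
    trajectory=[(0.0, 0.0, 90.0), (5.0, 0.0, 90.0), (5.0, 5.0, 180.0)],
)
print(len(landmarks_in_window(pair, 100, 150)))  # 1
```

A windowed lookup like `landmarks_in_window` is one plausible way the supervision pipeline could align a language instruction with the segment of the visual trajectory where its landmark is actually observed.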
