Aerial Visual Object Search (AVOS) tasks in urban environments require Unmanned Aerial Vehicles (UAVs) to autonomously search for and identify target objects based on visual inputs, without external guidance. Existing approaches struggle in complex urban environments due to redundant semantic processing, ambiguity among similar objects, and the exploration-exploitation dilemma. To advance research on the AVOS task, we introduce CityAVOS, the first benchmark dataset for autonomous search of static urban objects. It comprises 2,420 tasks of varying difficulty across six object categories, designed to rigorously evaluate UAV search strategies. To solve the AVOS task, we also propose PRPSearcher (Perception-Reasoning-Planning Searcher), a novel agentic method powered by multi-modal large language models (MLLMs) that enables a UAV agent to reason over visual cues in a human-like manner when searching for objects. Specifically, PRPSearcher constructs three specialized maps: an object-centric dynamic semantic map that enhances spatial perception, a 3D cognitive map based on semantic "attraction" values for target reasoning, and a 3D uncertainty map for balancing exploration and exploitation during search. Moreover, we propose a denoising mechanism to mitigate interference from similar objects and design an Inspiration Promote Thought prompting mechanism for adaptive action planning. Experimental results on CityAVOS demonstrate that PRPSearcher surpasses existing baselines in both success rate and search efficiency (on average: +37.69% SR, +28.96% SPL, -30.69% MSS, and -46.40% NE). Our work paves the way for future advances in embodied visual target search.
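The interplay between the cognitive map's "attraction" values and the uncertainty map can be illustrated with a minimal sketch. The additive scoring rule, the weight `beta`, and all names below are illustrative assumptions for exposition; the abstract does not specify how PRPSearcher actually combines the two maps.

```python
# Hypothetical sketch of the exploration-exploitation trade-off described in
# the abstract: each candidate 3D cell carries a semantic "attraction" value
# (how strongly local cues suggest the target) and an uncertainty value
# (how little the cell has been observed). The additive score and beta weight
# are illustrative assumptions, not the paper's actual method.

def select_next_waypoint(candidates, beta=0.5):
    """Pick the cell maximizing attraction plus an uncertainty bonus.

    candidates: list of (cell_id, attraction, uncertainty) tuples,
                with attraction and uncertainty in [0, 1].
    beta:       weight on exploration; higher favors unvisited cells.
    """
    best_cell, best_score = None, float("-inf")
    for cell_id, attraction, uncertainty in candidates:
        score = attraction + beta * uncertainty
        if score > best_score:
            best_cell, best_score = cell_id, score
    return best_cell

# Example: a high-attraction cell vs. a barely observed one.
cells = [
    ("near_landmark", 0.8, 0.1),   # strong semantic cue, well observed
    ("unexplored",    0.3, 0.9),   # weak cue, mostly unknown
]
print(select_next_waypoint(cells, beta=0.5))  # near_landmark: 0.85 > 0.75
```

With a larger `beta` the same candidates flip toward exploration (at `beta=1.0`, the unexplored cell scores 1.2 against 0.9), which is the dilemma the uncertainty map is meant to manage.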