Large models achieve strong performance on Vision-and-Language Navigation (VLN) tasks but are costly to run in resource-limited environments. Token pruning, by reducing model input size, offers an appealing efficiency tradeoff with minimal performance loss. However, existing pruning strategies, designed for general Vision-Language tasks, overlook VLN-specific challenges: the loss of input information makes navigation agents less certain of their decisions and thereby lengthens their walks, increasing computational cost. Non-VLN-specific strategies also often fail to recognize uninformative instruction tokens, undermining the intended efficiency gains. We propose NAP, a navigation-aware pruning framework that addresses these issues from three angles: pruning only background view tokens while preserving action-relevant ones; removing low-importance nodes from the navigation map to discourage backtracking and thus shorten navigation length; and leveraging a Large Language Model to pre-identify instruction words irrelevant to navigation, which helps the agent prioritize pruning them during navigation. Experiments on standard VLN benchmarks show that NAP significantly outperforms prior pruning strategies, preserving higher success rates while saving more than 50% of FLOPs.
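The first of the three angles, pruning only background view tokens while keeping action-relevant ones, can be sketched as a simple budgeted selection. This is a hypothetical illustration, not the authors' implementation: the function name, the per-token importance scores, and the action-relevance flags are all assumed inputs.

```python
# Hypothetical sketch of navigation-aware view-token pruning (not NAP's actual code).
# Idea: action-relevant view tokens are never pruned; the remaining budget is
# filled with the highest-scoring "background" tokens, and the rest are dropped.

def prune_view_tokens(scores, action_relevant, budget):
    """Return the sorted indices of view tokens to keep.

    scores          -- per-token importance scores (list of floats)
    action_relevant -- per-token flags; True tokens are always kept
    budget          -- total number of tokens to keep
    """
    # Always keep action-relevant tokens.
    keep = [i for i, a in enumerate(action_relevant) if a]
    # Rank background tokens by importance, highest first.
    background = sorted(
        (i for i, a in enumerate(action_relevant) if not a),
        key=lambda i: scores[i],
        reverse=True,
    )
    # Fill whatever budget remains with the best background tokens.
    keep.extend(background[: max(0, budget - len(keep))])
    return sorted(keep)

# Example: 6 view tokens, tokens 1 and 4 are action-relevant, keep 4 in total.
kept = prune_view_tokens(
    scores=[0.1, 0.9, 0.5, 0.2, 0.8, 0.7],
    action_relevant=[False, True, False, False, True, False],
    budget=4,
)
print(kept)  # → [1, 2, 4, 5]
```

Because action-relevant tokens bypass the ranking entirely, the agent's decision-critical views survive even at aggressive budgets, which is the property the abstract attributes to NAP's view-token pruning.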