

Poster
Action Inference for Destination Prediction in Vision-and-Language Navigation
Keywords: destination prediction, action inference, vision-and-language navigation
Vision-and-Language Navigation (VLN) involves interacting with autonomous vehicles through language and visual input in a mobility setting. Most previous work in this field focuses on spatial reasoning and the semantic grounding of visual information, whereas reasoning about the actions of pedestrians in the scene has received little attention. In this study, we present a VLN dataset for destination prediction with action inference, in order to investigate the extent to which current VLN models can perform action inference. We introduce a two-step crowd-sourcing process to construct the dataset: (1) collecting beliefs about the user's next action and (2) annotating the destination in light of that next action. Our benchmark results with existing destination-prediction models suggest that, although these models can learn to reason about actions and infer the next action to a certain extent, there is still much room for improvement.
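To make the two-step annotation pipeline concrete, below is a minimal sketch of what a single example in such a dataset might look like. All field and class names here (ActionBelief, DestinationExample, destination_bbox, etc.) are hypothetical illustrations, not the paper's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ActionBelief:
    """Step (1): a crowd worker's belief about the user's next action."""
    worker_id: str
    next_action: str  # e.g. "cross the street", "wait at the bus stop"


@dataclass
class DestinationExample:
    """One destination-prediction example built from the two annotation steps."""
    scene_id: str                 # identifier of the scene image/frame
    utterance: str                # the user's language request to the vehicle
    action_beliefs: List[ActionBelief] = field(default_factory=list)  # step (1)
    destination_bbox: Tuple[int, int, int, int] = (0, 0, 0, 0)        # step (2): (x, y, w, h)


# A hypothetical record: the annotated destination reflects where the user
# will be after completing the inferred next action.
example = DestinationExample(
    scene_id="frame_0042",
    utterance="Pick up the person who is about to cross the road.",
    action_beliefs=[ActionBelief("w1", "cross the street")],
    destination_bbox=(120, 80, 60, 140),
)
```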