EMNLP 2025

November 05, 2025

Suzhou, China


Vision-language-action models have gained significant attention for their ability to model multimodal sequences in embodied instruction following tasks. However, most existing models rely on causal attention, which we find to be suboptimal for processing sequences composed of interleaved segments from different modalities. In this paper, we introduce Actra, a novel Transformer architecture featuring trajectory attention and learnable action queries, designed to efficiently process segmented multimodal trajectories and predict 6D pose actions for imitation learning. Furthermore, we propose a contrastive dynamics learning objective to enhance the model's understanding of environment dynamics and multimodal alignment, complementing the primary behavior cloning objective. Through extensive experiments on three large-scale robot manipulation benchmarks, Actra demonstrates substantial performance improvements over previous models.
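The trajectory attention described above replaces plain causal attention for interleaved multimodal segments. As an illustrative sketch (not the paper's implementation), one way to realize this is a mask that is bidirectional within each segment and causal across segments, with the learnable action queries placed as their own final segment; the segment layout below is a hypothetical example.

```python
import numpy as np

def trajectory_attention_mask(segment_ids):
    """Build a boolean attention mask where token i may attend to token j
    iff j belongs to the same segment as i or to an earlier segment:
    intra-segment attention is bidirectional, while cross-segment
    attention remains causal at the segment level."""
    seg = np.asarray(segment_ids)
    # mask[i, j] is True when attention from token i to token j is allowed
    return seg[None, :] <= seg[:, None]

# Hypothetical trajectory: a language segment (0), an image segment (1),
# and two learnable action-query tokens forming the last segment (2).
mask = trajectory_attention_mask([0, 0, 1, 1, 1, 2, 2])
```

Under this sketch, the action queries can attend to the full language and vision context, while earlier observation tokens cannot peek at future segments; the resulting mask can be passed to any standard attention implementation.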

