Vision-language-action models have gained significant attention for their ability to model multimodal sequences in embodied instruction following tasks. However, most existing models rely on causal attention, which we find to be suboptimal for processing sequences composed of interleaved segments from different modalities. In this paper, we introduce Actra, a novel Transformer architecture featuring trajectory attention and learnable action queries, designed to efficiently process segmented multimodal trajectories and predict 6D pose actions for imitation learning. Furthermore, we propose a contrastive dynamics learning objective to enhance the model's understanding of environment dynamics and multimodal alignment, complementing the primary behavior cloning objective. Through extensive experiments on three large-scale robot manipulation benchmarks, Actra demonstrates substantial performance improvements over previous models.
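To make the two main ideas more concrete, the sketch below shows one way trajectory attention and learnable action queries could be wired up in PyTorch. It is an illustration under our own assumptions, not the authors' implementation: the segment-level masking rule, the names `trajectory_attention_mask` and `ActionQueryHead`, and all hyperparameters are hypothetical.

```python
import torch
import torch.nn as nn


def trajectory_attention_mask(segment_ids: torch.Tensor) -> torch.Tensor:
    """Attention mask for an interleaved multimodal trajectory.

    mask[q, k] is True when query token q may attend to key token k:
    tokens attend bidirectionally within their own segment and to every
    earlier segment, but never to later segments. (This reading of
    "trajectory attention" is our assumption.)

    segment_ids: (seq_len,) integer segment index, increasing over time.
    """
    return segment_ids.unsqueeze(1) >= segment_ids.unsqueeze(0)


class ActionQueryHead(nn.Module):
    """Learnable action queries decoded into 6D pose actions.

    A minimal sketch: the number of queries, the nn.TransformerDecoder
    cross-attention, and the 6-dim (translation + rotation) output head
    are illustrative choices, not details taken from the paper.
    """

    def __init__(self, d_model: int = 256, num_queries: int = 1, num_layers: int = 2):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, d_model))
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.to_pose = nn.Linear(d_model, 6)  # x, y, z, roll, pitch, yaw

    def forward(self, trajectory_tokens: torch.Tensor) -> torch.Tensor:
        # trajectory_tokens: (batch, seq_len, d_model) encoded multimodal trajectory
        batch = trajectory_tokens.size(0)
        queries = self.queries.unsqueeze(0).expand(batch, -1, -1)
        decoded = self.decoder(queries, trajectory_tokens)  # queries cross-attend to the trajectory
        return self.to_pose(decoded)  # (batch, num_queries, 6)
```

The contrastive dynamics learning objective can be sketched in a similarly hedged way, here as an InfoNCE loss with in-batch negatives; the specific loss form and temperature are our assumptions rather than details from the paper.

```python
import torch
import torch.nn.functional as F


def contrastive_dynamics_loss(pred_next: torch.Tensor,
                              true_next: torch.Tensor,
                              temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE-style objective: each predicted next-state embedding is pulled
    toward its ground-truth counterpart, with other batch items as negatives.

    pred_next, true_next: (batch, d) embeddings.
    """
    pred = F.normalize(pred_next, dim=-1)
    true = F.normalize(true_next, dim=-1)
    logits = pred @ true.t() / temperature            # (batch, batch) similarity matrix
    targets = torch.arange(pred.size(0), device=pred.device)
    return F.cross_entropy(logits, targets)
```

In this reading, the behavior cloning loss on the predicted pose actions would be combined with the contrastive term, consistent with the abstract's description of the contrastive objective as complementary to behavior cloning.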