China

Recent advances in conversational AI have been substantial, but developing real-time systems for perceptual task guidance remains challenging. These systems must provide interactive, proactive assistance based on streaming visual inputs, yet their development is constrained by the costly and labor-intensive process of data collection and system evaluation. To address these limitations, we present a comprehensive framework with three key contributions. First, we introduce a novel data curation pipeline that synthesizes dialogues from annotated egocentric videos, resulting in ProAssist, a large-scale synthetic dialogue dataset spanning multiple domains. Second, we develop a suite of automatic evaluation metrics, validated through extensive human studies. Third, we propose an end-to-end model that processes streaming video inputs to generate contextually appropriate responses, incorporating novel techniques for handling data imbalance and long-duration videos. This work lays the foundation for developing real-time, proactive AI assistants capable of guiding users through diverse tasks.

EMNLP 2025

Proactive Assistant Dialogue Generation from Streaming Egocentric Videos

multi-modal dialogue system

evaluation and metrics

streaming video

embodied agent

vision-language model

technical paper

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first
workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)






Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

Current instruction data synthesis methods primarily focus on single-turn instructions and often neglect cross-turn coherence, resulting in context drift and reduced task completion rates in extended conversations. To address this limitation, we propose Skeleton-Guided Multi-Turn Dialogue Generation, a framework that constrains multi-turn instruction synthesis by explicitly modeling human conversational intent. It operates in two stages: (1) Intent Modeling, which captures the global structure of human dialogues by assigning each conversation to one of nine well-defined intent trajectories, ensuring a coherent and goal-oriented information flow; and (2) Skeleton Generation, which constructs a structurally grounded sequence of user queries aligned with the modeled intent, thereby serving as a scaffold that constrains and guides the downstream instruction synthesis process. Based on this process, we construct ConsistentChat, a multi-turn instruction dataset with approximately 15,000 multi-turn conversations and 224,392 utterances. Experiments on the Light, Topdial, and MT-Eval benchmarks show that models fine-tuned on ConsistentChat achieve a 20–30% improvement in chat consistency and up to a 15% increase in task success rate, significantly outperforming models trained on existing single-turn and multi-turn instruction datasets.

ConsistentChat: Building Skeleton-Guided Consistent Multi-Turn Dialogues for Large Language Models from Scratch

Repository-level code completion automatically predicts unfinished code based on broader information from the repository. With recent strides in Code Large Language Models (code LLMs), various repository-level code completion methods have been proposed and show promising results. Nevertheless, they still suffer from issues such as inappropriate query construction, single-path code retrieval, and misalignment between the code retriever and the code LLM. To address these problems, this paper introduces CodeRAG, a framework tailored to identify relevant and necessary knowledge for retrieval-augmented repository-level code completion. The main techniques used by CodeRAG are log-probability-guided query construction, a multi-path code retrieval mechanism, and preference-aligned reranking to find the necessary code knowledge. Extensive experiments on the ReccEval and CrossCodeEval benchmarks using four representative code LLMs demonstrate that CodeRAG outperforms state-of-the-art methods. We provide our implementation at https://anonymous.4open.science/r/CodeRAG-B2E0.

CodeRAG: Finding Relevant and Necessary Knowledge for Retrieval-Augmented Repository-Level Code Completion

Large Language Models (LLMs) face a crucial challenge from fixed context windows and inadequate memory management, leading to a severe shortage of long-term memory capabilities and limited personalization in the interactive experience with AI agents. To overcome this challenge, we propose a Memory Operating System, i.e., MemoryOS, to achieve comprehensive and efficient memory management for AI agents. Inspired by the memory management principles in operating systems, MemoryOS designs a hierarchical storage architecture and consists of four key modules: memory Storage, Updating, Retrieval, and Generation. Specifically, the architecture comprises three levels of storage units: short-term memory, mid-term memory, and long-term personal memory. Key operations within MemoryOS include dynamic updates between storage units: short-term to mid-term updates follow a dialogue-chain-based FIFO principle, while mid-term to long-term updates use a segmented page organization strategy. MemoryOS thereby enables hierarchical memory integration and dynamic updating. Extensive experiments on the LoCoMo benchmark show an average improvement of 48.36% on F1 and 46.18% on BLEU-1 over the baselines on GPT-4o-mini, indicating improved contextual coherence and personalized memory retention in long conversations.

Memory OS of AI Agent
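
The short-term to mid-term update described in the abstract above can be illustrated with a plain FIFO queue. This is a minimal sketch under stated assumptions, not the MemoryOS implementation: the class name `TieredMemory`, the `short_capacity` parameter, and the use of plain strings as dialogue pages are all hypothetical, and the mid-term to long-term step (segmented page organization) is left out.

```python
from collections import deque


class TieredMemory:
    """Toy two-tier store: a bounded short-term queue that spills into mid-term.

    Hypothetical illustration only; names and capacity are assumptions.
    """

    def __init__(self, short_capacity: int = 3):
        self.short_capacity = short_capacity
        self.short_term = deque()  # newest dialogue pages, oldest at the left
        self.mid_term = []         # pages evicted from short-term, in arrival order

    def add(self, dialogue_page: str) -> None:
        """Append a page; evict the oldest pages (FIFO) once capacity is exceeded."""
        self.short_term.append(dialogue_page)
        while len(self.short_term) > self.short_capacity:
            self.mid_term.append(self.short_term.popleft())


memory = TieredMemory(short_capacity=3)
for turn in ["t1", "t2", "t3", "t4", "t5"]:
    memory.add(turn)
```

After the loop, the three most recent pages remain in `short_term` and the two oldest have moved to `mid_term` in the order they arrived.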

Recent advances in large language models (LLMs) have shown impressive performance in passage reranking tasks. Despite their success, LLM-based methods still face challenges in efficiency and sensitivity to external biases. (i) Existing models rely mostly on autoregressive generation and sliding window strategies to rank passages, which incurs heavy computational overhead as the number of passages increases. (ii) External biases, such as positional or semantic bias, hinder the model’s ability to represent passages accurately and increase its sensitivity to input order. To address these limitations, we introduce a novel passage reranking model, called Multi-View-guided Passage Reranking (MVP). MVP is a non-generative LLM-based reranking method that encodes query–passage information into diverse view embeddings without being influenced by external biases. For each view, it combines query-aware passage embeddings to produce a distinct anchor vector, which is used to directly compute relevance scores in a single decoding step. In addition, it employs an orthogonal loss to make the views more distinctive. Extensive experiments demonstrate that MVP, with just 220M parameters, matches the performance of much larger 7B-scale fine-tuned models while achieving a 100× reduction in inference latency. Notably, the 3B-parameter variant of MVP achieves state-of-the-art performance on both in-domain and out-of-domain benchmarks.

Multi-view-guided Passage Reranking with Large Language Models

Lemmatization is crucial for NLP tasks in morphologically rich languages with ambiguous orthography like Arabic, but existing tools face challenges due to inconsistent standards and limited genre coverage. This paper introduces two novel approaches that frame lemmatization as classification into a Lemma-POS-Gloss (LPG) tagset, leveraging machine translation and semantic clustering. We also present a new Arabic lemmatization test set covering diverse genres, standardized alongside existing datasets. We evaluate character-level sequence-to-sequence models, which perform competitively and offer complementary value, but are limited to lemma prediction (not LPG) and prone to hallucinating implausible forms. Our results show that classification and clustering yield more robust, interpretable outputs, setting new benchmarks for Arabic lemmatization.

Lemmatization as a Classification Task: Results from Arabic across Multiple Genres

Pairwise preference optimization, such as Direct Preference Optimization (DPO), was originally designed to align large language models (LLMs) with human values. It has recently been used to improve the supervised fine-tuning (SFT) performance of LLMs. Using pairs of single samples, DPO estimates the probability distribution of the preferences of picking one response over another. However, in tasks that involve more complicated preferences (e.g., reasoning tasks) than those in the human value alignment task, this sampling method is likely to bring deviations from the ground-truth distribution. To solve the problem, extra efforts (e.g., external annotations or amendment of the loss function) are often required. In this paper, we hypothesize that the preferences can be better estimated through a multi-sampling process. Accordingly, we propose an Expectation Preference Optimization (EPO) algorithm that takes pairs of sample groups, instead of pairs of single samples as in DPO, for preference learning. Compared to pairwise DPO, the proposed EPO tends to produce more reliable preference estimations. Applying different preference optimization methods in a self-training paradigm, we have conducted extensive experiments on various reasoning benchmarks. The results show that our EPO approach outperforms a range of baseline approaches in terms of zero-shot accuracy on all benchmarks.

Expectation Preference Optimization: Reliable Preference Estimation for Improving the Reasoning Capability of Large Language Models
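
To make the contrast between pairs of single samples and pairs of sample groups concrete, the sketch below places the standard DPO loss next to a group-averaged variant. The DPO formula is the usual one; the group version is only one plausible reading of "pairs of sample groups" and is an assumption, not the EPO objective from the paper. The function names, the `beta` default, and the example margins are all hypothetical.

```python
import math


def dpo_loss(margin_chosen: float, margin_rejected: float, beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair.

    Each margin is log pi(y|x) - log pi_ref(y|x) for that response.
    Returns -log(sigmoid(beta * (margin_chosen - margin_rejected))).
    """
    z = beta * (margin_chosen - margin_rejected)
    return math.log1p(math.exp(-z))  # equals -log(sigmoid(z))


def group_preference_loss(chosen_margins, rejected_margins, beta: float = 0.1) -> float:
    """Assumed group variant: compare the mean margin of each sample group.

    This averaging is an illustrative guess, not the paper's formulation.
    """
    mean_chosen = sum(chosen_margins) / len(chosen_margins)
    mean_rejected = sum(rejected_margins) / len(rejected_margins)
    return dpo_loss(mean_chosen, mean_rejected, beta)


single = dpo_loss(1.2, -0.3)
grouped = group_preference_loss([1.2, 0.8, 1.0], [-0.3, 0.1, -0.4])
```

Averaging over several samples per side reduces the influence of any one unrepresentative sample on the estimated preference, which is the intuition the abstract gives for multi-sampling.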

Many constructs that characterize language, like its complexity or emotionality, have a naturally continuous semantic structure; a public speech is not just "simple" or "complex," but exists on a continuum between extremes. Scalar measurement of text, or text scoring, is thus a central problem for computational social science and text-as-data applications. Although large language models (LLMs) are an attractive measurement tool, their idiosyncratic treatment of numerical output complicates their application in text scoring, raising questions of how to best use these tools for scalar measurement. This paper addresses these questions by conducting a comprehensive study of text scoring with LLMs for various complex social science constructs. We compare finetuned and prompted models in a variety of configurations, evaluating with multiple datasets sourced from the political science literature. Our study yields actionable findings for applied researchers: LLMs prompted to score texts directly can yield discontinuous distributions over scales; adjusting their response through token-probability weighting can mitigate this problem; adjusted LLM scores align moderately to strongly with human ground-truth; finetuning smaller models with as few as 1,000 training pairs can match or exceed the performance of prompted LLMs.

Measuring scalar constructs in social science with LLMs
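
The token-probability weighting mentioned in the abstract above can be read as taking an expected value over the candidate score tokens rather than the single most likely token. The sketch below shows that idea under assumptions: the function name `expected_score`, the dictionary input format, and the example log-probabilities are hypothetical, and the paper's exact procedure may differ.

```python
import math


def expected_score(token_logprobs: dict) -> float:
    """Probability-weighted score over integer score tokens.

    token_logprobs maps a candidate token (e.g. "1".."5") to the
    log-probability the model assigned it. Hypothetical input format.
    """
    probs = {}
    for token, logprob in token_logprobs.items():
        text = token.strip()
        if not text.isdigit():
            continue  # skip tokens that are not integer scores
        score = int(text)
        probs[score] = probs.get(score, 0.0) + math.exp(logprob)
    total = sum(probs.values())
    if total == 0:
        raise ValueError("no integer score tokens found")
    # Renormalize over the score tokens, then take the expectation.
    return sum(score * p / total for score, p in probs.items())


# Hypothetical log-probabilities for a 1-5 complexity scale.
example = {"2": math.log(0.2), "3": math.log(0.5), "4": math.log(0.3)}
score = expected_score(example)
```

A model that always emits "3" as its top token would give a flat distribution of scores, whereas the weighted value shifts continuously as probability mass moves between neighboring scale points.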

Multilingual speakers often switch languages in the middle of a conversation. Similarly, recent reasoning-focused bilingual large language models (LLMs) exhibit language mixing—alternating languages within their chain of thought. Discouraging language mixing in DeepSeek-R1 was found to degrade accuracy, suggesting that language mixing may benefit reasoning performance. In this work, we study language switching in Chinese-English bilingual reasoning models. We identify reinforcement learning with outcome-based rewards as the critical training stage that leads to language mixing. We demonstrate that language mixing can enhance reasoning: enforcing monolingual decoding reduces accuracy by 2% on math reasoning tasks. We further show that a lightweight probe can predict whether a potential language switch would benefit or harm reasoning, and use this to guide decoding, increasing accuracy by up to 4.10%. Our findings suggest that language mixing is not merely a byproduct of multilingual training, but is a strategic reasoning behavior.

The Impact of Language Mixing on Bilingual LLM Reasoning

Recent studies have shown that deep vision-only and language-only models, trained on disjoint modalities, nonetheless project their inputs into a partially aligned representational space. Yet we still lack a clear picture of where in each network this convergence emerges, what visual or linguistic cues support it, and whether it endures the many-to-many nature of real image–text relationships. In this work, we systematically investigate these questions. First, we show that representational alignment emerges most strongly in mid-to-late layers of both vision and language models, suggesting a hierarchical progression from modality-specific to conceptually shared representations. Second, this alignment is robust to appearance-only changes but collapses when semantic content is altered (e.g., object removal in images or word order shuffling that disrupts thematic roles in sentences), highlighting that the shared code is truly semantic rather than form-based. Critically, we move beyond the conventional one-to-one image-caption paradigm to investigate alignment in many-to-many contexts, acknowledging that neither modality uniquely determines the other. Using a forced-choice "Pick-a-Pic" task, we find that human preferences for image-caption matches are mirrored in the learned embedding spaces across all vision-language model pairs. This pattern holds bidirectionally when multiple captions correspond to a single image, demonstrating that models capture fine-grained semantic distinctions similar to human judgments. Surprisingly, aggregating embeddings across multiple images or phrases referring to the same concept amplifies alignment. Rather than "blurring" representational detail, aggregation appears to distill a more universal semantic core. Together, these results demonstrate that vision and language networks converge on a shared semantic code, where the alignment mirrors human judgments and becomes more pronounced when multiple exemplars of the same concept within a single modality are averaged in representational space. Our work provides compelling evidence for a universal code of meaning that transcends modality, offering critical insights into how neural networks represent and align semantic information across the vision-language divide.

Seeing Through Words, Speaking Through Pixels: Deep Representational Alignment Between Vision and Language Models

The remarkable ability of large language models (LLMs) to comprehend, interpret, and generate complex language has rapidly integrated LLM-generated text into various aspects of daily life, where users increasingly accept it. However, the growing reliance on LLMs underscores the urgent need for effective detection mechanisms to identify LLM-generated text. Such mechanisms are critical to mitigating misuse and safeguarding domains like artistic expression and social networks from potential negative consequences. LLM-generated text detection, conceptualized as a binary classification task, seeks to determine whether an LLM produced a given text. Recent advances in this field stem from innovations in watermarking techniques, statistics-based detectors, and neural-based detectors. Human-assisted methods also play a crucial role. In this survey, we consolidate recent research breakthroughs in this field, emphasizing the urgent need to strengthen detector research. Additionally, we review existing datasets, highlighting their limitations and developmental requirements. Furthermore, we examine various LLM-generated text detection paradigms, shedding light on challenges like out-of-distribution problems, potential attacks, real-world data issues, and ineffective evaluation frameworks. Finally, we outline intriguing directions for future research in LLM-generated text detection to advance responsible artificial intelligence. This survey aims to provide a clear and comprehensive introduction for newcomers while offering seasoned researchers valuable updates in the field.
