China

Escape rooms present a unique cognitive challenge that demands exploration-driven planning: with the sole instruction to escape the room, players must actively search their environment, collecting information, and finding solutions through repeated trial and error. Motivated by this, we introduce VisEscape, a benchmark of 20 virtual escape rooms specifically designed to evaluate AI models under these challenging conditions, where success depends not only on solving isolated puzzles but also on iteratively constructing and refining spatial-temporal knowledge of a dynamically changing environment. On VisEscape, we observe that even state-of-the-art multi-modal models generally fail to escape the rooms, showing considerable variation in their progress and problem-solving approaches. We find that integrating memory management and reasoning contributes to efficient exploration and enables successive hypothesis formulation and testing, thereby leading to significant improvements in dynamic and exploration-driven environments.

EMNLP 2025

VisEscape: A Benchmark for Evaluating Exploration-driven Decision-making in Virtual Escape Rooms

self-driven exploration

agent

reasoning

poster

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first
workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."<br>

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)





<br>

Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

Despite the rapid advancements in LLM agents, they still face the challenge of generating meaningful reflections due to inadequate error analysis and a reliance on rare successful trajectories, especially in complex tasks. In this work, we propose SAMULE, a new framework for self-learning agents powered by a retrospective language model that is trained based on Multi-Level Reflection Synthesis. It first synthesizes high-quality reflections across three complementary levels: Single-Trajectory Learning (micro-level) for detailed error correction; Intra-Task Learning (meso-level) to build error taxonomies across multiple trials of the same task, and Inter-Task Learning (macro-level) to extract transferable insights based on same typed errors from diverse task failures. Then we fine-tune a language model serving as the retrospective model to generate reflections during inference. We further extend our framework to interactive settings through a foresight-based reflection mechanism, enabling agents to proactively reflect and adapt during user interactions by comparing predicted and actual responses. Extensive experiments on three challenging benchmarks—TravelPlanner, NATURAL PLAN, and Tau-bench—demonstrate that our approach significantly outperforms reflection-based baselines. Our results highlight the critical role of well-designed reflection synthesis and failure-centric learning in building self-improving LLM agents.

SAMULE: Self-Learning Agents Enhanced by Multi-level Reflection

Large language models (LLMs) are trained on vast amounts of text from the Internet, but do they truly understand the viral content that rapidly spreads online---commonly known as memes? In this paper, we introduce CHIME, a dataset for CHinese Internet Meme Explanation. The dataset comprises popular phrase-based memes from the Chinese Internet, annotated with detailed information on their meaning, origin, example sentences, types, etc. To evaluate whether LLMs understand these memes, we designed two tasks. In the first task, we assessed the models' ability to explain a given meme, identify its origin, and generate appropriate example sentences. The results show that while LLMs can explain the meanings of some memes, their performance declines significantly for culturally and linguistically nuanced meme types. Additionally, they consistently struggle to provide accurate origins for the memes. In the second task, we created a set of multiple-choice questions (MCQs) requiring LLMs to select the most appropriate meme to fill in a blank within a contextual sentence. While the evaluated models were able to provide correct answers, their performance remains noticeably below human levels. We include CHIME with the submission and hope it will facilitate future research on computational meme understanding.

Are Large Language Models Chronically Online Surfers? A Dataset for Chinese Internet Meme Explanation

Forecasting weather and climate events is crucial for making appropriate measures to mitigate environmental hazards and minimize losses. However, existing environmental forecasting research focuses narrowly on predicting numerical meteorological variables (e.g., temperature), neglecting the translation of these variables into actionable textual narratives of events and their consequences. To bridge this gap, we proposed Weather and Climate Event Forecasting (WCEF), a new task that leverages numerical meteorological raster data and textual event data to predict weather and climate events. This task is challenging to accomplish due to difficulties in aligning multimodal data and the lack of supervised datasets. To address these challenges, we present CLLMate, the first multimodal dataset for WCEF, using 26,156 environmental news articles aligned with ERA5 reanalysis data. We systematically benchmark 32 existing models on CLLMate, including closed-source, open-source, and our fine-tuned models. Our experiments reveal the advantages and limitations of existing MLLMs and the value of CLLMate for the training and benchmarking of the WCEF task.

CLLMate: A Multimodal Benchmark for Weather and Climate Events Forecasting

Large Language Models (LLMs) are catalyzing a paradigm shift in scientific discovery, evolving from task-specific automation tools into increasingly autonomous agents and fundamentally redefining research processes and human-AI collaboration. This survey systematically charts this burgeoning field, placing a central focus on the changing roles and escalating capabilities of LLMs in science. Through the lens of the scientific method, we introduce a foundational three-level taxonomy—Tool, Analyst, and Scientist—to delineate their escalating autonomy and evolving responsibilities within the research lifecycle. We further identify pivotal challenges and future research trajectories such as robotic automation, self-improvement, and ethical governance. Overall, this survey provides a conceptual architecture and strategic foresight to navigate and shape the future of AI-driven scientific discovery, fostering both rapid innovation and responsible advancement.

From Automation to Autonomy: A Survey on Large Language Models in Scientific Discovery

Non-monotonic reasoning (NMR) refers to the fact that conclusions may be invalidated by new information. It is widely used in daily life and legal reasoning. An NMR task usually has multiple extensions, which are sets of plausible conclusions. There are two reasoning modes -- skeptical and credulous reasoning, depending on whether to believe facts in all extensions or any one extension. Despite some preliminary works exploring NMR abilities of LLMs, the multi-extension NMR capabilities of LLMs remain underexplored. In this paper, we synthesize a multi-extension NMR dataset MultiLogicNMR, and construct two variants of the dataset with more extensions or text diversity. We propose a neural-symbolic framework MultiLogicNMRer for multi-extension NMR. Experimental evaluation with the datasets show that LLMs still face significant challenges in NMR abilities, and reveal the effectiveness of our neural-symbolic framework, with an average accuracy gain of about 15\% compared to prompt-based methods, and even outperforming some fine-tuning methods. All the code and data are publicly available\footnote{https://anonymous.4open.science/r/NMRer-C5B3 }.

MultiLogicNMR(er): A Benchmark and Neural-Symbolic Framework for Non-monotonic Reasoning with Multiple Extensions

In this paper, we propose a novel, automatic tree-based evaluation metric for LLM-generated step-by-step assembly instructions, that more accurately reflects spatiotemporal aspects of construction than traditional metrics such as BLEU and BERT similarity scores. We apply our proposed metric to the domain of sewing instructions, and show that our metric better correlates with manually-annotated error counts, demonstrating our metric's superiority for evaluating the spatiotemporal soundness of sewing instructions. Further experiments show that our metric is more robust than traditional approaches against artificially-constructed counterfactual examples that are specifically constructed to confound metrics that rely on textual similarity.

Evaluating Spatiotemporal Consistency in Automatically Generated Sewing Instructions

Although the power of LLM tool-use agents has ignited a flurry of recent research in this area, the curation of tool-use training data remains an open problemtextemdashespecially for online RL training. Existing approaches to synthetic tool-use data generation tend to be non-interactive and/or non-compositional. We introduce RandomWorld, a pipeline for the procedural generation of interactive tools and compositional tool-use data. We show that models tuned via SFT and RL on synthetic RandomWorld data improve on a range of tool-use benchmarks, and set the new SoTA for two metrics on the NESTFUL dataset. Further experiments show that downstream performance scales with the amount of RandomWorld-generated training data, opening up the possibility of further improvement through the use of entirely synthetic data.

Procedural Environment Generation for Tool-Use Agents

Despite Bengali being among the most spoken languages bearing cultural importance and richness, the NLP endeavors on it, remain relatively limited. Figures of Speech (FoS) not only contribute to the phonetic and semantic nuances of a language, but they also exhibit aesthetics, expression, and creativity in literature. To our knowledge, in this paper, we present the first ever Bengali figures of speech classification dataset, **BengFoS**, on works of six renowned poets of Bengali literature. We deploy state-of-the-art Large Language Models (LLMs) to this dataset in the zero-shot setup, thereafter fine-tuning the best performing models, and finally dissect them for language model probing. This reveals novel insights on the intrinsic behavior of two open-source LLMs (Llama and DeepSeek) in FoS detection. **Though we have limited ourselves to Bengali, the experimental framework can be reproduced for English as well as for other low-resource languages**.

Can LLMs be Literary Companions?: Analysing LLMs on Bengali Figures of Speech Identification

When leveraging large language models (LLMs) for multiple-choice question answering (MCQA), no convention exists regarding how the space following the last colon should be tokenized. We highlight this as an example for unclean LLM evaluation that complicates both NLP research and model use in an interdisciplinary context: we observe accuracy differences of up to 6% due to this (seemingly irrelevant) tokenization variation as well as reshuffled model rankings – a practical concern for leaderboards reliant on this prompt style. Surprisingly, we are able to recommend one specific strategy – tokenizing the space *together* with the answer letter – as we observe consistent and statistically significant performance improvements. Additionally, it improves model calibration. Our findings underscore the importance of careful evaluation design and highlight the need for standardized, transparent evaluation protocols to ensure reliable and comparable results.

Mind the Gap: A Closer Look at Tokenization for Multiple-Choice Question Answering with LLMs

Prompt sensitivity, referring to the phenomenon where minor variations in phrasing lead to significant changes in large language model (LLM) performance, has been widely accepted as a core limitation of LLMs. In this work, we revisit this issue and ask: Is the widely reported high prompt sensitivity truly an inherent weakness of LLMs, or is it largely an artifact of evaluation processes? To answer this question, we systematically evaluate 7 LLMs (e.g., GPT and Gemini family) across 6 benchmarks, including both multiple-choice and open-ended tasks on 12 diverse prompt templates. We find that much of the prompt sensitivity stems from heuristic evaluation methods, including log-likelihood scoring and rigid answer matching, which often overlook semantically correct responses expressed through alternative phrasings, such as synonyms or paraphrases. When we adopt LLM-as-judge evaluations, we observe a substantial reduction in performance variance and a consistently higher correlation in model rankings across prompts. Our findings suggest that modern LLMs are more robust to prompt templates than previously believed, and that prompt sensitivity may be more an artifact of evaluation than a flaw in the models.

Downloads

Next from EMNLP 2025

SAMULE: Self-Learning Agents Enhanced by Multi-level Reflection

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES