
Vision-and-Language Navigation (VLN) poses significant challenges for agents, which must interpret natural language instructions and navigate complex 3D environments. While recent progress has been driven by large-scale pre-training and data augmentation, current methods still struggle to generalize to unseen scenarios, particularly when complex spatial and temporal reasoning is required. In this work, we propose SkillNav, a modular framework that introduces structured, skill-based reasoning into Transformer-based VLN agents. Our method decomposes navigation into a set of interpretable atomic skills (e.g., Vertical Movement, Area and Region Identification, Stop and Pause), each handled by a specialized agent. To support targeted skill training without manual data annotation, we construct a synthetic dataset pipeline that generates diverse, linguistically natural, skill-specific instruction-trajectory pairs. We then introduce a novel training-free Vision-Language Model (VLM)-based router, which dynamically selects the most suitable agent at each time step by aligning sub-goals with visual observations and historical actions. SkillNav obtains competitive results on commonly used benchmarks and establishes state-of-the-art generalization on GSA-R2R, a benchmark with novel instruction styles and unseen environments.

EMNLP 2025

Breaking Down and Building Up: Mixture of Skill-Based Vision-and-Language Navigation Agents

workshop paper

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)






Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

The rise of Large Language Models (LLMs) has enabled a new paradigm for bridging authorial intent and player agency in interactive narrative. We consider this paradigm through the example of Dramamancer, a system that uses an LLM to transform author-created story schemas into player-driven playthroughs. This extended abstract outlines some design techniques and evaluation considerations associated with this system.

Design Techniques for LLM-Powered Interactive Storytelling: A Case Study of the Dramamancer System

Tabletop role-playing games (TRPGs) require game masters (GMs) to manage complex scenarios, enforce rules, and maintain narrative consistency. Large language models (LLMs) have shown promise as automated GMs, but preliminary experiments reveal challenges such as rule violations, scenario deviations, and spoilers. To address these issues, we propose a multi-agent system in which specialized LLM agents provide feedback to refine GM responses. Experimental evaluation with experienced TRPG players showed that the multi-agent approach improved scenario progression, but also led to increased rule violations and spoilers due to inappropriate feedback from the agents. Furthermore, response times were slower, negatively impacting conversational smoothness. These results highlight both the potential and current limitations of multi-agent LLM-based TRPG game mastering, suggesting directions for future improvement.

TRPG Game Mastering Using LLM-Based Multi-Agent System

Human-like virtual characters are crucial for games, storytelling, and virtual reality, yet current methods rely heavily on annotated data or handcrafted persona prompts, making it difficult to scale up and generate realistic, contextually coherent personas. We create the first QA dataset for BaZi-based persona reasoning, where real human experiences categorized into wealth, health, kinship, career, and relationships are represented as life-event questions and answers. Furthermore, we propose the first BaZi–LLM system that integrates symbolic reasoning with large language models to generate temporally dynamic and fine-grained virtual personas. Compared with mainstream LLMs such as DeepSeek-v3 and GPT-5-mini, our method achieves a 30.3%–62.6% accuracy improvement. In addition, when incorrect BaZi information is used, our model's accuracy drops by 20%–45%, showing the potential of culturally grounded symbolic–LLM integration for realistic character simulation.

BaZi-Based Character Simulation Benchmark: Evaluating AI on Temporal and Persona Reasoning

This paper explores the application of Large Language Models (LLMs) and reasoning to predict Dungeons & Dragons (DnD) player actions and format them as Avrae Discord bot commands. Using the FIREBALL dataset, we evaluated a reasoning model, DeepSeek-R1-Distill-LLaMA-8B, and an instruct model, LLaMA-3.1-8B-Instruct, for command generation. Our findings highlight the importance of providing specific instructions to models, show that even single-sentence changes in prompts can greatly affect model output, and indicate that instruct models are sufficient for this task compared to reasoning models.

Does Reasoning Help LLM Agents Play Dungeons and Dragons? A Prompt Engineering Experiment

Large language model (LLM) agents have shown impressive capabilities in human language comprehension and reasoning, yet their potential in cybersecurity remains underexplored. We introduce DefenderBench, a practical, open-source toolkit for evaluating language agents across offense, defense, and cybersecurity knowledge-based tasks. DefenderBench includes environments for network intrusion, malicious content detection, code vulnerability analysis, and cybersecurity knowledge assessment. It is intentionally designed to be affordable and easily accessible for researchers while providing fair and rigorous assessment. We benchmark several state-of-the-art (SoTA) and popular LLMs, including both open- and closed-weight models, using a standardized agentic framework. Our results show that Claude-3.7-sonnet performs best with a DefenderBench score of 81.65, followed by Claude-3.7-sonnet-think with 78.40, while the best open-weight model, Llama 3.3 70B, is not far behind with a DefenderBench score of 71.81. DefenderBench's modular design allows seamless integration of custom LLMs and tasks, promoting reproducibility and fair comparisons. An anonymized version of DefenderBench is available at https://github.com/microsoft/DefenderBench.

DefenderBench: A Toolkit for Evaluating Language Agents in Cybersecurity Environments

Evaluating the creative capabilities of large language models (LLMs) often requires human assessments that are difficult to scale, and previous studies have not focused on LLMs' capabilities to produce realistic social structures in storytelling. We introduce a novel, scalable methodology for evaluating LLM story generation by analyzing underlying social structures in narratives as signed character networks. In this study, we conduct a large-scale comparative analysis using networks from over 1,200 stories, generated by four leading LLMs and a human-written corpus. Our analysis of network properties like density, clustering, and signed edge weights shows that LLM-generated stories consistently exhibit a strong bias toward tightly-knit, positive relationships, which aligns with findings from prior research using human assessment. Our proposed approach provides a valuable tool for understanding limitations and tendencies in the storytelling of current and future LLMs and is also applicable to the setting of interactive narratives.

Evaluating LLM Story Generation through Large-scale Network Analysis of Social Structures
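
The network properties named in the abstract above (density, clustering, and signed edge weights) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function name `network_stats`, the edge representation, and the toy character network are all hypothetical, and the clustering measure used here is the standard unweighted local clustering coefficient, which ignores edge signs.

```python
from itertools import combinations


def network_stats(edges):
    """Summarize a signed, undirected character network.

    `edges` maps frozenset({a, b}) to a signed weight (positive = friendly,
    negative = hostile). Self-loops are assumed not to occur.
    This layout is a hypothetical choice for illustration only.
    """
    nodes = set()
    for pair in edges:
        nodes |= pair
    n = len(nodes)

    # Density: observed edges divided by all possible undirected pairs.
    possible = n * (n - 1) / 2
    density = len(edges) / possible if possible else 0.0

    # Share of relationships whose signed weight is positive.
    positive = sum(1 for w in edges.values() if w > 0)
    positive_share = positive / len(edges) if edges else 0.0

    # Build adjacency sets for the clustering computation.
    neighbors = {v: set() for v in nodes}
    for pair in edges:
        a, b = tuple(pair)
        neighbors[a].add(b)
        neighbors[b].add(a)

    # Local clustering: fraction of a node's neighbor pairs that are linked.
    local = []
    for nbrs in neighbors.values():
        k = len(nbrs)
        if k < 2:
            local.append(0.0)
            continue
        links = sum(
            1 for a, b in combinations(nbrs, 2) if frozenset((a, b)) in edges
        )
        local.append(links / (k * (k - 1) / 2))
    clustering = sum(local) / n if n else 0.0

    return {
        "density": density,
        "positive_share": positive_share,
        "clustering": clustering,
    }


# Toy story with four characters (made-up data, not from the paper).
example_edges = {
    frozenset(("Ann", "Bo")): 1.0,
    frozenset(("Bo", "Cy")): 0.5,
    frozenset(("Ann", "Cy")): -1.0,
    frozenset(("Cy", "Di")): 1.0,
}
stats = network_stats(example_edges)
print(stats)
```

Under these assumptions, a story with a strong bias toward tightly knit, positive relationships would show high `density` and `clustering` together with a `positive_share` close to 1.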

Simulating interactive world models remains a core challenge for Large Language Models (LLMs). In this work, we introduce ByteSized32Refactored, a refactored, modular, and extensible implementation of the original ByteSized32 corpus for exploring the task of text game generation. We further optimize the code structure of each text game and create the GameBasic.py foundation library, which centralizes common logic across all 32 games by abstracting 7 base classes (GameObject, etc.) into reusable modules, thereby reducing the total Python code from 20k to 10k lines compared to the original ByteSized32. Our refactored implementation enables extensibility: with our centralized design, ByteSized32Refactored can be more efficiently extended to include text games of new scenarios and specifications by reusing the shared logic and functionalities. Extensive experiments with GPT-4o show mixed performance: with ByteSized32Refactored, the generated text games for unseen scenarios improve in quality on two of the four evaluation dimensions but decline on the other two, indicating that the hierarchical structure of the refactored code presents new challenges for LLMs. Overall, we highlight that our extensible code structure, centered on the foundation library and the modular optimization, not only facilitates LLM adaptation to environment specifications but also establishes a scalable environment that supports future extensions.

ByteSized32Refactored: Towards an Extensible Interactive Text Games Corpus for LLM World Modeling and Evaluation

In the rapidly expanding streaming media landscape, engaging Promotional Introduction Texts (PITs) are essential for attracting viewers to various forms of media arts, such as movies and comics. Traditionally, these texts are manually written, leading to inconsistencies in quality and higher production costs. This paper addresses these challenges by proposing an end-to-end framework for automatically generating attractive PITs directly from storylines. However, there is currently insufficient data and a lack of evaluation methods specifically designed for PIT generation. We constructed a dataset of 263 storylines extracted from Japanese media arts and their associated PITs. Using the dataset, we evaluated the generations of six large language models through manual evaluation and automated evaluation (GPT-4) on attractiveness, consistency, and quality. Results demonstrated that there are trade-offs between generating attractive texts and maintaining the storyline, and that achieving both objectives at the same time is a challenging task. We also find that there is a significant gap between automatic evaluation and human evaluation.

On Generating Consistent and Attractive Promotional Introduction Text for Narrative Media Arts

The emergence of large language models (LLMs) has opened new opportunities for creating dynamic NPCs in gaming environments, enabling both functional task execution and persona-consistent dialogue generation. In this paper, we (Tu_Character_lab) report our participation in the Commonsense Persona-Grounded Dialogue Challenge (CPDC) 2025 Round 2, which evaluates agents across three tracks: task-oriented dialogue, context-aware dialogue, and their integration. Our approach combines two complementary strategies: (i) lightweight prompting techniques in the API track, including a Deflanderization prompting method to suppress excessive role-play and improve task fidelity, and (ii) fine-tuned large models in the GPU track, leveraging Qwen3-14B with supervised finetuning (SFT) and LoRA adaptation. Our best submissions ranked 2nd on Task 1, 2nd on Task 3 (API track), and 4th on Task 3 (GPU track).

Deflanderization for Game Dialogue: Balancing Character Authenticity with Task Execution in LLM-based NPCs

Large Language Models (LLMs) increasingly serve as role-playing agents, yet their ability to consistently portray version-specific characters, such as superheroes from different comic or movie universes, remains underexplored. Superhero universes such as Marvel and DC offer a uniquely rich and interesting testbed: decades of narratives define multiple incarnations of the same character, each with distinct histories, values, and moral codes. To study this challenge, we introduce Beyond One World, a benchmark designed to evaluate LLMs on character-based role-playing across 30 iconic heroes and 90 specific versions. The benchmark contains two tasks: (i) Canon Events, probing factual recall of pivotal life stages, and (ii) Moral Dilemmas, confronting models with ethically charged scenarios. Responses are scored for both accuracy and reasoning fidelity using a framework that distinguishes internal deliberation (thinking) from external actions (acting). We propose a Think–Act Matching metric to quantify alignment between reasoning and action, potentially indicating model trustworthiness. Experimental results across reasoning and non-reasoning models reveal three key findings: (1) chain-of-thought prompting improves coherence in weaker models but can reduce canonical accuracy in stronger ones; (2) cross-character generalization across versions remains a significant obstacle; and (3) models often excel in either thinking or acting but rarely both. Beyond One World highlights critical gaps in multiversal consistency and reasoning alignment, providing a new and challenging role-playing LLM evaluation.
