China

Large language models (LLMs) have demonstrated strong capabilities in text understanding and generation. However, they often lack factuality, producing a mixture of true and false information, especially in long-form generation. In this work, we investigates the factuality of long-form text generation across various large language models (LLMs), including GPT-4, Gemini-1.5-Pro, Claude-3-Opus, Llama-3-70B, and Mistral. Our analysis reveals that factuality tend to decline in later sentences of the generated text, accompanied by a rise in the number of unsupported claims. Furthermore, we explore the effectiveness of different evaluation settings to assess whether LLMs can accurately judge the correctness of their own outputs: Self-Known (the percentage of supported atomic claims, decomposed from LLM outputs, that the corresponding LLMs judge as correct) and Self-Unknown (the percentage of unsupported atomic claims that the corresponding LLMs judge as incorrect). The results indicate that even advanced models fail to achieve perfect Self-Known scores, while their Self-Unknown scores remain notably above zero, reflecting ongoing uncertainty in their self-assessments. Moreover, we find a correlation between higher Self-Known scores and improved factuality, while higher Self-Unknown scores are associated with lower factuality. Even without significant changes in the models’ self-judgment (Self-Known and Self-Unknown), the number of unsupported claims can increases, likely as an artifact of long-form generation. Additional Retrieval-Augmented Generation (RAG) experiments also show the limitations of current LLMs in long-form generation, and provide the more research is needed to improve factuality in long-form text generation.

EMNLP 2025

Investigating Factuality in Long-Form Text Generation: The Roles of Self-Known and Self-Unknown

long-form generation

factuality

workshop paper

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first
workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."<br>

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)





<br>

Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

This paper examines the potential of combining gamification and speech technology to support early reading development in children. We introduce Wordcaster, a game where players cast spells by pronouncing words aloud, using real-time automatic speech recognition (ASR) to provide feedback and encourage participation. A pilot study was conducted with three bilingual children—native Arabic speakers learning English—using a fixed list of 20 English words. Results showed increased ASR accuracy between sessions, and all children reported high enjoyment. These findings support a direction for scalable, speech-driven reading support.

Wordcaster: A Gamified Speech-Based Reading Game for Young Learners Using Real-Time ASR

Large Language Model (LLM) agents are reshaping the game industry, particularly with more intelligent and human-preferable game characters. However, existing game benchmarks fall short of practical needs: they lack evaluations of diverse LLM capabilities across various game genres, studies of agentic modules crucial for complex gameplay, and fine-tuning datasets for aligning pre-trained LLMs into gaming agents. To fill these gaps, we present Orak, a foundational benchmark designed to train and evaluate LLM agents across diverse real-world video games. Unlike existing benchmarks, Orak includes 12 popular video games spanning all major genres, enabling comprehensive studies of LLM capabilities and agentic modules essential for intricate game scenarios. To support consistent evaluation of LLMs, we introduce a plug-and-play interface that enables LLMs to seamlessly connect with games and manipulate agentic modules. Additionally, we propose a fine-tuning dataset, consisting of LLM gameplay trajectories across diverse game genres. Orak offers a comprehensive evaluation framework, encompassing general game score leaderboards, LLM battle arenas, and in-depth analyses of visual input state, agentic strategies, and fine-tuning effects, establishing a foundation towards building generic gaming agents.

Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games

We present a novel architecture for safely integrating Large Language Models (LLMs) into interactive game engines, allowing players to "program" new behaviors using natural language. Our framework mitigates risks by using an LLM to translate commands into a constrained Domain-Specific Language (DSL), which configures a custom Entity-Component-System (ECS) at runtime. We evaluated this system in a 2D spell-crafting game prototype by experimentally assessing models from the Gemini, GPT, and Claude families with various prompting strategies. A validated LLM judge qualitatively rated the outputs, showing that while larger models better captured creative intent, the optimal prompting strategy is task-dependent: Chain-of-Thought improved creative alignment, while few-shot examples were necessary to generate more complex DSL scripts. This work offers a validated LLM-ECS pattern for emergent gameplay and a quantitative performance comparison for developers.

Real-Time World Crafting: Generating Structured Game Behaviors from Natural Language with Large Language Models

Reasoning is a fundamental capability of large language models (LLMs), enabling them to comprehend, analyze, and solve complex problems. In this paper, we introduce TextGames, an innovative benchmark specifically crafted to assess LLMs through demanding text-based games that require advanced skills in pattern recognition, spatial awareness, arithmetic, and logical reasoning. Our analysis probes LLMs' performance in both single-turn and multi-turn reasoning, and their abilities in leveraging feedback to correct subsequent answers through self-reflection. Our findings reveal that, although LLMs exhibit proficiency in addressing most easy and medium-level problems, they face significant challenges with more difficult tasks. In contrast, humans are capable of solving all tasks when given sufficient time. Moreover, we observe that LLMs show improved performance in multi-turn predictions through self-reflection, yet they still struggle with sequencing, counting, and following complex rules consistently. Additionally, models optimized for reasoning outperform pre-trained LLMs that prioritize instruction following, highlighting the crucial role of reasoning skills in addressing highly complex problems.

TextGames: Learning to Self-Play Text-Based Puzzle Games via Language Model Reasoning

GUI agents powered by LLMs show promise in interacting with diverse digital environments. Among these, video games offer a valuable testbed due to their varied interfaces, with adventure games posing additional challenges through complex, narrative-driven interactions. Existing game benchmarks, however, lack diversity and rarely evaluate agents on completing entire storylines. To address this, we introduce FlashAdventure, a benchmark of 34 Flash-based adventure games designed to test full story arc completion and tackle the observation-behavior gap—the challenge of remembering and acting on earlier gameplay information. We also propose CUA-as-a-judge, an automated gameplay evaluator, and COAST, an agentic framework leveraging long-term clue memory to better plan and solve sequential tasks. Experiments show current GUI agents struggle with full story arcs, while COAST improves milestone completion by bridging the observation-behavior gap. Nonetheless, a marked discrepancy between humans and best-performing agents warrants continued research efforts to narrow this divide.

FlashAdventure: A Benchmark for GUI Agents Solving Full Story Arcs in Diverse Adventure Games

We present a multi-expert system for creating Non-Player Characters (NPCs) capable of both natural dialogue and contextual action execution in interactive environments. Our approach leverages Qwen3 as the base model with specialized Low-Rank Adaptation (LoRA) adapters to create three distinct expert modules: tool calling, tool response interpretation, and direct dialogue. The system not only meets but exceeds the computational constraints, delivering responses in an average of 3 seconds (well under the 7-second limit) on L40S GPUs while utilizing less than 30GB of the available 48GB VRAM, demonstrating efficiency alongside performance. This computational efficiency also contributes to reduced energy consumption and lower carbon footprint compared to less optimized approaches. The proposed solution achieved top performance in the Commonsense Persona-Grounded Dialogue Challenge 2025, securing the second position in the competition.

Efficient Tool-Calling Multi-Expert NPC Agent for Commonsense Persona-Grounded Dialogue

Large Language Models (LLMs) have demonstrated significant potential in story generation and role-playing, enabling the creation and exploration of novel content based on existing stories. However, previous interactive storytelling systems have shown limitations in immersion and agency. To address these challenges, we propose ECHIDNA, a novel two-stage framework designed around three core dimensions of interactive narrative systems: author intent, character autonomy, and player modeling. We employ Perturbation-Driven Branching to expand the interactive space in the generation stage and introduce the GM-NPC architecture to enable dynamic character responses and state management during the interactive narrative phase. The evaluation results show that ECHIDNA achieves superior performance in creating diverse branching narratives and providing engaging interactive experiences compared to existing methods. Our code and prompts are available at https://github.com/MoidzzZ/Echidna.

Explore Branches the Story didn't Narrate: An LLM Solution

The rise of Large Language Models (LLMs) has enabled a new paradigm for bridging authorial intent and player agency in interactive narrative. We consider this paradigm through the example of Dramamancer, a system that uses an LLM to transform author-created story schemas into player-driven playthroughs. This extended abstract outlines some design techniques and evaluation considerations associated with this system.

Design Techniques for LLM-Powered Interactive Storytelling: A Case Study of the Dramamancer System

Tabletop role-playing games (TRPGs) require game masters (GMs) to manage complex scenarios, enforce rules, and maintain narrative consistency. Large language models (LLMs) have shown promise as automated GMs, but preliminary experiments reveal challenges such as rule violations, scenario deviations, and giving spoilers. To address these issues, we propose a multi-agent system in which specialized LLM agents provide feedback to refine GM responses.Experimental evaluation with experienced TRPG players showed that the multi-agent approach improved scenario progression, but also led to increased rule violations and spoilers due to inappropriate feedback from agent. Furthermore, response times were slower, negatively impacting conversational smoothness. These results highlight both the potential and current limitations of multi-agent LLM-based TRPG game mastering, suggesting directions for future improvement.

TRPG Game Mastering Using LLM-Based Multi-Agent System

Human-like virtual characters are crucial for games, storytelling, and virtual reality, yet current methods rely heavily on annotated data or handcrafted persona prompts, making it difficult to scale up and generate realistic, contextually coherent personas. We create the first QA dataset for BaZi-based persona reasoning, where real human experiences categorized into wealth, health, kinship, career, and relationships are represented as life-event questions and answers. Furthermore, we propose the first BaZi–LLM system that integrates symbolic reasoning with large language models to generate temporally dynamic and fine-grained virtual personas. Compared with mainstream LLMs such as DeepSeek-v3 and GPT-5-mini, our method achieves a 30.3%–62.6% accuracy improvement. In addition, when incorrect BaZi information is used, our model's accuracy drops by 20%–45%, showing the potential of culturally grounded symbolic–LLM integration for realistic character simulation.

Content not yet available

Downloads

Next from EMNLP 2025

Wordcaster: A Gamified Speech-Based Reading Game for Young Learners Using Real-Time ASR

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES

Content not yet available

.css-70qvj9{display:-webkit-box;display:-webkit-flex;display:-ms-flexbox;display:flex;-webkit-align-items:center;-webkit-box-align:center;-ms-flex-align:center;align-items:center;}Downloads

Next from EMNLP 2025

Wordcaster: A Gamified Speech-Based Reading Game for Young Learners Using Real-Time ASR

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES

Downloads