Large Language Models (LLMs) are increasingly applied to socially grounded tasks, yet their ability to mirror human behavior in emotionally and strategically complex contexts remains unclear. This study assesses the behavioral fidelity of personality-prompted LLMs in adversarial dispute resolution by simulating multi-turn negotiation dialogues. Each LLM is guided by a matched Five-Factor personality profile to control for individual variation and enhance realism. We evaluate alignment across three dimensions: linguistic style, emotional expression, and strategic behavior. GPT-4.1 aligns most closely with humans in language and emotion, while Claude-3.7-Sonnet best reflects strategic behavior. Despite these strengths, notable gaps between model and human behavior remain. Our findings provide a benchmark for LLM-human alignment in socially complex interactions and highlight both the promise and limitations of personality conditioning in dialogue modeling.
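As a concrete illustration of the setup the abstract describes, the sketch below shows one way to condition two chat agents on Five-Factor (OCEAN) profiles and alternate them through a multi-turn negotiation. It assumes an OpenAI-compatible chat API; the trait scores, prompt wording, scenario, and function names are illustrative placeholders, not the study's actual materials or prompts.

```python
"""Minimal sketch: personality-conditioned multi-turn negotiation.

Assumes an OpenAI-compatible chat API (OPENAI_API_KEY set); all trait
values and prompt text here are hypothetical.
"""
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4.1"


def personality_prompt(traits: dict[str, float]) -> str:
    """Render a Five-Factor (OCEAN) profile as a system instruction."""
    profile = "\n".join(f"- {name}: {score:.1f}/5" for name, score in traits.items())
    return (
        "You are one party in an adversarial dispute negotiation. "
        "Stay in character with this Big Five personality profile:\n" + profile
    )


def next_turn(system: str, transcript: list[str], own_parity: int) -> str:
    """Generate the speaking agent's reply given the shared transcript.

    The agent sees its own earlier turns as 'assistant' messages and the
    counterpart's turns as 'user' messages.
    """
    messages = [{"role": "system", "content": system}]
    for i, utterance in enumerate(transcript):
        role = "assistant" if i % 2 == own_parity else "user"
        messages.append({"role": role, "content": utterance})
    response = client.chat.completions.create(model=MODEL, messages=messages)
    return response.choices[0].message.content


# Hypothetical trait scores on a 1-5 scale; the study matches profiles
# to real disputants, which is not reproduced here.
claimant = personality_prompt({"Openness": 3.8, "Conscientiousness": 4.2,
                               "Extraversion": 2.6, "Agreeableness": 2.1,
                               "Neuroticism": 3.4})
respondent = personality_prompt({"Openness": 3.1, "Conscientiousness": 3.6,
                                 "Extraversion": 4.0, "Agreeableness": 4.3,
                                 "Neuroticism": 2.2})

# Seed the dialogue, then alternate agents for a fixed number of turns.
transcript = ["Your late delivery cost us the contract. We expect compensation."]
for turn in range(1, 6):
    system = respondent if turn % 2 == 1 else claimant
    transcript.append(next_turn(system, transcript, own_parity=turn % 2))

print("\n\n".join(transcript))
```

Transcripts produced this way could then be scored against human negotiation dialogues on linguistic style, emotional expression, and strategic behavior, the three alignment dimensions the study evaluates.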