Accurate trust assessment of large language models (LLMs), which can enable selective prediction and improve user confidence, is challenging due to the diverse multimodal input paradigms. We propose Functionally Equivalent Sampling for Trust Assessment (FESTA), an input sampling technique for multimodal models that generates an uncertainty measure based on equivalent and complementary input sampling. The sampling approach expands the input space to measure the consistency (through equivalent samples) and sensitivity (through complementary samples) of the model's predictions. These two uncertainty measures are combined to form the final FESTA estimate. The approach requires only black-box access to the model and is unsupervised. Experiments are conducted with various off-the-shelf multimodal LLMs on visual and audio reasoning tasks. FESTA significantly improves the area under the receiver operating characteristic curve (AUROC) on these reasoning tasks, with a 33.3% relative improvement for vision-LLMs and a 29.6% relative improvement for audio-LLMs.
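The sketch below illustrates the core idea in minimal Python. It assumes a hypothetical black-box `model` callable and pre-generated `equivalent_inputs` and `complementary_inputs`; the function names, the agreement test, and the product-based combination rule are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def festa_uncertainty(model, original_input, equivalent_inputs, complementary_inputs):
    """Sketch of a FESTA-style black-box uncertainty estimate.

    `model` maps an input to a predicted answer (black-box access only).
    `equivalent_inputs` are perturbed inputs that preserve the original
    meaning; `complementary_inputs` are perturbed inputs that invert it.
    """
    answer = model(original_input)

    # Consistency: a trustworthy prediction should be repeated on
    # functionally equivalent inputs.
    consistency = np.mean([model(x) == answer for x in equivalent_inputs])

    # Sensitivity: a trustworthy model should change its answer on
    # complementary inputs whose meaning is flipped.
    sensitivity = np.mean([model(x) != answer for x in complementary_inputs])

    # Combine the two scores; a simple product rewards both properties.
    # (The paper's actual combination rule may differ.)
    confidence = consistency * sensitivity
    return 1.0 - confidence  # higher value = less trustworthy prediction
```

Because the score uses only the model's input-output behavior, it can be computed for any off-the-shelf multimodal LLM without logits or fine-tuning, which is what makes the approach unsupervised and black-box.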