EMNLP 2025

November 06, 2025

Suzhou, China


Safety alignment is critical for pre-trained large language models (LLMs) to generate responses aligned with human values and to refuse harmful queries. Unlike LLMs, the safety alignment of vision-language models (VLMs) is currently often achieved with post-hoc safety fine-tuning, and these methods are less effective against white-box attacks. To address this, we propose Adversary-aware DPO (ADPO), a novel training framework that explicitly accounts for the adversary by integrating adversarial training into Direct Preference Optimization (DPO), enhancing the safety alignment of VLMs under worst-case adversarial perturbations. ADPO introduces two key components: (1) an adversarially trained reference model that generates human-preferred responses under worst-case perturbations, and (2) an adversary-aware DPO loss that generates winner-loser pairs accounting for adversarial distortions. Together, these innovations ensure that VLMs remain robust and reliable even in the presence of sophisticated jailbreak attacks. Extensive experiments demonstrate that ADPO outperforms baselines in both the safety alignment and the general utility of VLMs.
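The abstract describes ADPO only at a high level. As an illustration of the general idea, here is a minimal PyTorch sketch of one adversary-aware DPO training step, assuming an L-infinity PGD inner loop over the image input; the `policy.logp` and `ref_model.logp` helpers (returning per-example response log-probabilities) are hypothetical, and the paper's exact perturbation model and reference-model training may differ.

```python
import torch
import torch.nn.functional as F

def dpo_loss(pi_logp_w, pi_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # Standard DPO objective on (winner, loser) response log-probabilities.
    logits = beta * ((pi_logp_w - ref_logp_w) - (pi_logp_l - ref_logp_l))
    return -F.logsigmoid(logits).mean()

def adversary_aware_dpo_step(policy, ref_model, image, prompt, y_w, y_l,
                             eps=8 / 255, step=2 / 255, pgd_iters=3, beta=0.1):
    """One training step (illustrative sketch): an inner PGD loop finds a
    worst-case image perturbation that maximizes the DPO loss, then the
    outer step evaluates the DPO loss under that perturbation so it can be
    minimized w.r.t. the policy parameters."""
    delta = torch.zeros_like(image)
    for _ in range(pgd_iters):
        delta.requires_grad_(True)
        adv = (image + delta).clamp(0.0, 1.0)
        loss = dpo_loss(policy.logp(adv, prompt, y_w),
                        policy.logp(adv, prompt, y_l),
                        ref_model.logp(adv, prompt, y_w).detach(),
                        ref_model.logp(adv, prompt, y_l).detach(),
                        beta)
        # Adversary ascends the loss; project back into the eps-ball.
        (grad,) = torch.autograd.grad(loss, delta)
        delta = (delta + step * grad.sign()).clamp(-eps, eps).detach()
    adv = (image + delta).clamp(0.0, 1.0)
    return dpo_loss(policy.logp(adv, prompt, y_w),
                    policy.logp(adv, prompt, y_l),
                    ref_model.logp(adv, prompt, y_w).detach(),
                    ref_model.logp(adv, prompt, y_l).detach(),
                    beta)
```

The returned loss would then be backpropagated through the policy only, with the reference model held fixed, mirroring standard DPO training.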
