China

We propose a novel inference-time out-of-domain (OOD) detection algorithm for specialized large language models (LLMs). Despite achieving state-of-the-art performance on in-domain tasks through fine-tuning, specialized LLMs remain vulnerable to incorrect or unreliable outputs when presented with OOD inputs, posing risks in critical applications. Our method leverages the Inductive Conformal Anomaly Detection (ICAD) framework, using a new non-conformity measure based on the model's dropout tolerance. Motivated by recent findings on polysemanticity and redundancy in LLMs, we hypothesize that in-domain inputs exhibit higher dropout tolerance than OOD inputs. We aggregate dropout tolerance across multiple layers via a valid ensemble approach, improving detection while maintaining theoretical false alarm bounds from ICAD. Experiments with medical-specialized LLMs show that our approach detects OOD inputs better than baseline methods, with AUROC improvements of 2% to 37% when treating OOD datapoints as positives and in-domain test datapoints as negatives.
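The ICAD decision rule mentioned in the abstract can be illustrated with a minimal Python sketch. It assumes the non-conformity scores are already computed (the abstract does not spell out the dropout-tolerance score, so the numbers here are hypothetical placeholders where a higher score means "stranger"). The function names are illustrative, and the Bonferroni-style merge is only one well-known valid way to combine per-layer p-values, not necessarily the ensemble rule used in the paper.

```python
from typing import Sequence


def icad_p_value(calibration_scores: Sequence[float], test_score: float) -> float:
    """Conformal p-value: share of calibration scores at least as
    non-conforming as the test score, with the usual +1 correction."""
    n = len(calibration_scores)
    at_least_as_strange = sum(1 for s in calibration_scores if s >= test_score)
    return (at_least_as_strange + 1) / (n + 1)


def bonferroni_merge(p_values: Sequence[float]) -> float:
    """Combine K p-values into one valid p-value (assumed merge rule):
    K times the smallest p-value, capped at 1."""
    return min(1.0, len(p_values) * min(p_values))


def is_ood(p_value: float, epsilon: float) -> bool:
    """Flag the input as OOD when the p-value falls below epsilon.
    For exchangeable in-domain data the false alarm rate is at most epsilon."""
    return p_value < epsilon


# Hypothetical per-layer calibration scores and test scores for one input.
layer_calibration = [
    [0.1, 0.2, 0.3, 0.4],
    [0.2, 0.3, 0.5, 0.6],
]
layer_test_scores = [0.35, 0.9]

per_layer = [
    icad_p_value(cal, score)
    for cal, score in zip(layer_calibration, layer_test_scores)
]
merged = bonferroni_merge(per_layer)
print(per_layer, merged, is_ood(merged, epsilon=0.5))
```

With only four calibration points per layer the smallest achievable p-value is 0.2, so a realistic calibration set would need to be much larger for small values of epsilon to be usable.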

EMNLP 2025

Polysemantic Dropout: Conformal OOD Detection for Specialized LLMs

polysemanticity

trustworthiness

ood detection

interpretability

reliability


poster

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)






Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

The advancement of Large Language Models (LLMs) has transformed natural language processing; however, their safety mechanisms remain under-explored in low-resource, multilingual settings. Here, we aim to bridge this gap. In particular, we introduce SGToxicGuard, a novel dataset and evaluation framework for benchmarking LLM safety in Singapore's diverse linguistic context, including Singlish, Chinese, Malay, and Tamil. SGToxicGuard adopts a red-teaming approach to systematically probe LLM vulnerabilities in three real-world scenarios: conversation, question-answering, and content composition. We conduct extensive experiments with state-of-the-art multilingual LLMs, and the results uncover critical gaps in their safety guardrails. By offering actionable insights into cultural sensitivity and toxicity mitigation, we lay the foundation for safer and more inclusive AI systems in linguistically diverse environments. (Link to the dataset will be released upon acceptance.) Disclaimer: This paper contains sensitive content that may be disturbing to some readers.

Toxicity Red-Teaming: Benchmarking LLM Safety in Singapore’s Low-Resource Languages

Nüshu is an endangered language from Jiangyong County, China, and the world’s only known writing system created and used exclusively by women. Recent Natural Language Processing (NLP) work has digitized small Nüshu-Chinese corpora, but the script remains computationally inaccessible due to its handwritten, mixed-media form and dearth of multimodal resources. We address this gap with two novel datasets: NüshuVision, an image corpus of 500 rendered sentences in traditional vertical, right-to-left orthography, and NüshuStrokes, the first sequential handwriting recordings of all 397 Unicode Nüshu characters by an expert calligrapher. Evaluating five state-of-the-art Chinese Optical Character Recognition (OCR) systems on NüshuVision shows that all fail entirely, each yielding a Character Error Rate (CER) of 1.0. Fine-tuning Microsoft’s TrOCR on NüshuVision lowers CER to 0.67, a modest yet meaningful improvement. These contributions establish the first multimodal foundation for Nüshu revitalization and offer a culturally grounded framework for language preservation.
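For readers unfamiliar with the metric, Character Error Rate is conventionally the character-level edit distance between the reference and the OCR output, divided by the reference length. The sketch below assumes that standard definition (the abstract does not state the exact evaluation script), and the function names are illustrative.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of character insertions, deletions, and
    substitutions needed to turn `hypothesis` into `reference`."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            current.append(
                min(
                    previous[j] + 1,                      # drop a reference char
                    current[j - 1] + 1,                   # drop a hypothesis char
                    previous[j - 1] + substitution_cost,  # match or substitute
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by the reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Placeholder strings, not Nüshu data: no character is recognized correctly.
print(character_error_rate("abcd", "wxyz"))
# One substitution in four characters.
print(character_error_rate("abcd", "abxd"))
```

Under this definition a CER of 1.0 on an output of the same length as the reference means no character was recognized correctly, which matches the "fail entirely" reading in the abstract.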

Recontextualizing Revitalization: A Mixed Media Approach to Reviving the Nüshu Language

As Large Language Models (LLMs) demonstrate increasingly strong human-like capabilities, the need to align them with human values has become significant. Recent advanced techniques, such as prompt learning and reinforcement learning, are being employed to bring LLMs closer to human values. While these techniques address broad ethical and helpfulness concerns, they rarely consider simulating individualized human values. To bridge this gap, we propose SimVBG, a framework that simulates individual values based on individual backstories that reflect a person's past experience and demographic information. SimVBG transforms structured data on an individual into a backstory and uses a multi-module architecture inspired by the Cognitive–Affective Personality System to simulate individual values based on the backstories. We test SimVBG on a self-constructed benchmark derived from the World Values Survey and show that SimVBG improves top-1 accuracy by more than 10% over the retrieval-augmented generation method. Further analysis shows that performance increases as additional user interaction history becomes available, indicating that the model can refine its persona over time. Code, dataset, and complete experimental results are anonymously available at https://anonymous.4open.science/r/SimVBG-029C.

SimVBG: Simulating Individual Values by Backstory Generation

Conversational agents have typically been developed for either task-oriented dialogue (TOD) or open-ended chitchat, with limited success in integrating both. Yet, real-world conversations often involve fluid transitions between these modes. To address this, we introduce TACT (TOD-And-Chitchat Transition), a dataset for transition-aware dialogue modeling that features structurally diverse and integrated mode flows. TACT supports both user- and agent-driven mode switches, enabling robust modeling of complex dialogue dynamics. To evaluate an agent’s ability to initiate and recover from mode transitions, we propose two new performance metrics: Switch and Recovery. Models trained on TACT outperform baselines in both intent detection and mode transition handling. Moreover, applying Direct Preference Optimization (DPO) to TACT-trained models yields extra gains, achieving 75.74% joint mode-intent accuracy and a 40.86% win rate against GPT-4o in human evaluation. This shows that pairing structurally diverse data with DPO boosts response quality and transition control, facilitating the development of proactive agents.

Beyond Task-Oriented and Chitchat Dialogues: Proactive and Transition-Aware Conversational Agents

Effective teaching necessitates adapting pedagogical strategies to the inherent diversity of students, encompassing variations in aptitude, learning styles, and personality, a critical challenge in education and teacher training. Large Language Models (LLMs) offer a powerful tool to simulate complex classroom dynamics, providing a controlled environment for exploring optimal teaching patterns. However, existing simulation frameworks often fall short by neglecting comprehensive student modeling beyond basic knowledge states and, more importantly, by lacking mechanisms for teachers to dynamically adapt their approach based on student feedback and collective performance. Addressing these limitations, we propose a simulation framework that integrates LLM-based diverse student agents with a self-evolving teacher agent. We use genetic algorithms to automatically tune and optimize the teacher's pedagogical parameters based on simulated student performance, enabling the teacher agent to discover and refine teaching patterns tailored to specific class characteristics. Complementing this, we introduce Persona-RAG, a novel Retrieval-Augmented Generation method specifically designed for personalized knowledge retrieval in pedagogical contexts, allowing students to retrieve information as per their learning styles. We show how Persona-RAG remains competitive with standard RAG baselines in accurately retrieving relevant information while adding a touch of personalization for students. Crucially, we perform extensive experiments and highlight the different patterns learnt by the teacher agent while optimizing over classes with students of various learning styles. Our work presents a significant step towards creating adaptive educational technologies and improving teacher training through realistic, data-driven simulation.

Investigating Pedagogical Teacher and Student LLM Agents: Genetic Adaptation Meets Retrieval-Augmented Generation Across Learning Styles

Channel prediction can greatly reduce the pilot overhead and is a critical technology in the fifth-generation (5G) and the coming 6G wireless communications systems. Conventional model-based channel prediction methods suffer from limited accuracy due to imperfect temporal modeling, while existing AI-based methods suffer from limited generalization due to inadequate training strategies. Recently, large language models (LLMs) have demonstrated remarkable generalization and generation capabilities across diverse domains such as computer vision, quantitative economics, and bioinformatics, which motivates us to apply LLMs in channel prediction. In this paper, we formulate the 'channel sentence' based on channel correlation, where the channel is regarded as a 'word'. Subsequently, we propose a generative pre-trained language model for channel prediction (CP-GPT). We collect 12M channel data according to the 3GPP 38.901 protocol and train CP-GPT based on the transformer decoder architecture. Moreover, we design two pre-training tasks based on the characteristics of wireless channels to enhance CP-GPT's understanding of communications channels. We further propose a comprehensive benchmark to rigorously evaluate the capabilities of CP-GPT across multiple dimensions. The simulation results demonstrate that CP-GPT has successfully learned various channel characteristics and exhibits impressive capabilities across numerous downstream tasks.

A Generative Pre-Trained Language Model for Channel Prediction in Wireless Communications Systems

Large language models (LLMs) have demonstrated remarkable capabilities in tool learning. In real-world scenarios, user queries are often ambiguous and incomplete, requiring effective clarification. However, existing interactive clarification approaches face two critical limitations: reliance on manually constructed datasets, which inherently constrains training data scale and diversity, and lack of error correction mechanisms during multi-turn clarification, leading to error accumulation that compromises both accuracy and efficiency. We present AskToAct, which addresses these challenges by exploiting the structural mapping between queries and their tool invocation solutions. Our key insight is that tool parameters naturally represent explicit user intents. By systematically removing key parameters from queries while retaining them as ground truth, we enable automated construction of high-quality training data. We further enhance model robustness through error-correction pairs and selective masking, enabling dynamic error detection during clarification interactions. Comprehensive experiments demonstrate that AskToAct significantly outperforms existing approaches, achieving above 57% accuracy in recovering critical unspecified intents and enhancing clarification efficiency by an average of 10.46% while maintaining high accuracy in tool invocation. Our framework exhibits robust performance across different model architectures and successfully generalizes to entirely unseen APIs without additional training, achieving performance comparable to GPT-4o with substantially fewer computational resources.

AskToAct: Enhancing LLMs Tool Use via Self-Correcting Clarification

With the rise of mental health challenges, social media has become a key platform for emotional expression. Deep learning offers a promising solution for analyzing mental health but lacks flexibility and interpretability. Large language models (LLMs) introduce greater adaptability and can explain their decisions, yet they still underperform deep learning in complex psychological analysis. We present C-IMHI, the first multi-task Chinese social media interpretable mental health instruction dataset (9K samples) with quality control and manual validation. Additionally, we introduce MentalGLM, the first open-source series of Chinese LLMs for explainable mental health analysis, trained on 50K instructions. The proposed models excelled in three mental health downstream tasks, outperforming or matching deep learning models and other LLMs. A portion of the generated decision explanations was validated by experts, demonstrating promising accuracy and reliability. We evaluated the proposed models on a clinical dataset, where they significantly outperformed other LLMs, demonstrating their potential for clinical applications. Our models show strong performance, validated across tasks and domains. The decision explanations enhance usability and facilitate better understanding and practical application of the models. Both the constructed dataset and the models are publicly available via: https://anonymous.4open.science/r/MentalGLM-F416.

MentalGLM Series: Explainable Large Language Models for Mental Health Analysis on Chinese Social Media

The widespread deployment of large language models (LLMs) across various domains has made their safety a critical priority. Inspired by think-tank decision-making philosophy, we propose DiplomacyAgent, an LLM-based multi-agent system for diplomatic position analysis. With DiplomacyAgent, we are able to systematically assess how LLMs balance “interests” against “ethical principles” when addressing various international events, and hence to understand the safety implications of LLMs in diplomacy. Specifically, this helps to assess the consistency of LLM stances with globally recognized ethical standards, as well as the potential risks or ideological biases that may arise. Through integrated quantitative metrics, our research uncovers unexpected decision-making patterns in LLM responses to sensitive issues such as human rights protection, environmental sustainability, and regional conflicts. It shows that LLMs can exhibit a strong bias towards interests, leading to unsafe decisions that violate ethical and moral principles. Our experimental results suggest that deploying LLMs in high-stakes domains, particularly in the formulation of diplomatic policies, necessitates a comprehensive assessment of potential moral and social implications, as well as the implementation of stringent safety protocols. Our code and data will be publicly released soon.

DiplomacyAgent: Do LLMs Balance Interests and Ethical Principles in International Events?

Converging societal and technical factors have transformed language technologies into user-facing applications employed across languages. Machine Translation (MT) has become a global tool, with cross-lingual services now also supported by dialogue systems powered by multilingual Large Language Models (LLMs). Such accessibility has expanded MT’s reach to a vast base of lay users, often with little to no expertise in the languages or the technology itself. And yet, the understanding of MT consumed by this diverse group of users—their needs, experiences, and interactions with these systems—remains limited. We trace the shift in MT user profiles, focusing on non-expert users and how their engagement with these systems may change with LLMs. We identify three key factors—usability, trust, and literacy—that shape these interactions and must be addressed to align MT with user needs. By exploring these dimensions, we offer insights to guide future MT with a user-centered approach.
