China

With advancements in large audio-language models (LALMs), which enhance large language models (LLMs) with auditory capabilities, these models are expected to demonstrate universal proficiency across various auditory tasks. While numerous benchmarks have emerged to assess LALMs&#39; performance, they remain fragmented and lack a structured taxonomy. To bridge this gap, we conduct a comprehensive survey and propose a systematic taxonomy for LALM evaluations, categorizing them into four dimensions based on their objectives: (1) General Auditory Awareness and Processing, (2) Knowledge and Reasoning, (3) Dialogue-oriented Ability, and (4) Fairness, Safety, and Trustworthiness. We provide detailed overviews within each category and highlight challenges in this field, offering insights into promising future directions. To the best of our knowledge, this is the first survey specifically focused on the evaluations of LALMs, providing clear guidelines for the community.

EMNLP 2025

Towards Holistic Evaluation of Large Audio-Language Models: A Comprehensive Survey

large audio-language models

survey

evaluation

With advancements in large audio-language models (LALMs), which enhance large language models (LLMs) with auditory capabilities, these models are expected to demonstrate universal proficiency across various auditory tasks. While numerous benchmarks have emerged to assess LALMs' performance, they remain fragmented and lack a structured taxonomy. To bridge this gap, we conduct a comprehensive survey and propose a systematic taxonomy for LALM evaluations, categorizing them into four dimensions based on their objectives: (1) General Auditory Awareness and Processing, (2) Knowledge and Reasoning, (3) Dialogue-oriented Ability, and (4) Fairness, Safety, and Trustworthiness. We provide detailed overviews within each category and highlight challenges in this field, offering insights into promising future directions. To the best of our knowledge, this is the first survey specifically focused on the evaluations of LALMs, providing clear guidelines for the community.

poster

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first
workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."<br>

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)





<br>

Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

Sparse Autoencoders (SAEs) have proven to be powerful tools for interpreting neural networks by decomposing hidden representations into disentangled, interpretable features via sparsity constraints. However, conventional SAEs are constrained by the fixed sparsity level chosen during training; meeting different sparsity requirements therefore demands separate models and increases the computational footprint during both training and evaluation. We introduce a novel training objective, HierarchicalTopK, which trains a single SAE to optimise reconstructions across multiple sparsity levels simultaneously. Experiments with Gemma-2 2B demonstrate that our approach achieves Pareto-optimal trade-offs between sparsity and explained variance, outperforming traditional SAEs trained at individual sparsity levels. Further analysis shows that HierarchicalTopK preserves high interpretability scores even at higher sparsity. The proposed objective thus closes an important gap between flexibility and interpretability in SAE design.

Train One Sparse Autoencoder Across Multiple Sparsity Budgets to Preserve Interpretability and Accuracy

Deep learning models often learn and exploit spurious correlations in training data, using these non-target features to inform their predictions. Such reliance leads to performance degradation and poor generalization on unseen data. To address these limitations, we introduce a more general form of counterfactual data augmentation, termed *counterbias* data augmentation, which simultaneously tackles multiple biases (e.g., gender bias, simplicity bias) and enhances out-of-distribution robustness. We present **CoBA**: **Co**unter**B**ias **A**ugmentation, a unified framework that operates at the semantic triple level: first decomposing text into subject-predicate-object triples, then selectively modifying these triples to disrupt spurious correlations. By reconstructing the text from these adjusted triples, **CoBA** generates *counterbias* data that mitigates spurious patterns. Through extensive experiments, we demonstrate that **CoBA** not only improves downstream task performance, but also effectively reduces biases and strengthens out-of-distribution resilience, offering a versatile and robust solution to the challenges posed by spurious correlations.

CoBA: Counterbias Text Augmentation for Mitigating Various Spurious Correlations via Semantic Triples

Conversational breakdowns in close relationships are deeply shaped by personal histories and emotional context, yet most NLP research treats conflict detection as a general task, overlooking the relational dynamics that influence how messages are perceived. In this work, we leverage nonviolent communication (NVC) theory to evaluate LLMs in detecting conversational breakdowns and assessing how relationship backstory influences both human and model perception of conflicts. Given the sensitivity and scarcity of real-world datasets featuring conflict between familiar social partners with rich personal backstories, we contribute the PersonaConflicts Corpus, a dataset of N=5,772 naturalistic simulated dialogues spanning diverse conflict scenarios between friends, family members, and romantic partners. Through a controlled human study, we annotate a subset of dialogues and obtain fine-grained labels of communication breakdown types on individual turns, and assess the impact of backstory on human and model perception of conflict in conversation. We find that the polarity of relationship backstories significantly shifted human perception of communication breakdowns and impressions of the social partners, yet models struggle to meaningfully leverage those backstories in the detection task. Additionally, we find that models consistently overestimate how positively a message will make a listener feel. Our findings underscore the critical role of personalization to relationship contexts in enabling LLMs to serve as effective mediators in human communication for authentic connection.

Words Like Knives: Backstory-Personalized Modeling and Detection of Violent Communication

We propose a novel inference-time out-of-domain (OOD) detection algorithm for specialized large language models (LLMs). Despite achieving state-of-the-art performance on in-domain tasks through fine-tuning, specialized LLMs remain vulnerable to incorrect or unreliable outputs when presented with OOD inputs, posing risks in critical applications. Our method leverages the Inductive Conformal Anomaly Detection (ICAD) framework, using a new non-conformity measure based on the model's dropout tolerance. Motivated by recent findings on polysemanticity and redundancy in LLMs, we hypothesize that in-domain inputs exhibit higher dropout tolerance than OOD inputs. We aggregate dropout tolerance across multiple layers via a valid ensemble approach, improving detection while maintaining theoretical false alarm bounds from ICAD. Experiments with medical-specialized LLMs show that our approach detects OOD inputs better than baseline methods, with AUROC improvements of 2% to 37% when treating OOD datapoints as positives and in-domain test datapoints as negatives.

Polysemantic Dropout: Conformal OOD Detection for Specialized LLMs

The advancement of Large Language Models (LLMs) has transformed natural language processing; however, their safety mechanisms remain under-explored in low-resource, multilingual settings. Here, we aim to bridge this gap. In particular, we introduce \textsf{SGToxicGuard}, a novel dataset and evaluation framework for benchmarking LLM safety in Singapore's diverse linguistic context, including Singlish, Chinese, Malay, and Tamil. SGToxicGuard adopts a red-teaming approach to systematically probe LLM vulnerabilities in three real-world scenarios: conversation, question-answering, and content composition. We conduct extensive experiments with state-of-the-art multilingual LLMs, and the results uncover critical gaps in their safety guardrails. By offering actionable insights into cultural sensitivity and toxicity mitigation, we lay the foundation for safer and more inclusive AI systems in linguistically diverse environments.\footnote{Link to the dataset will be released upon acceptance.} Disclaimer: This paper contains sensitive content that may be disturbing to some readers.

Toxicity Red-Teaming: Benchmarking LLM Safety in Singapore’s Low-Resource Languages

Nüshu is an endangered language from Jiangyong County, China, and the world’s only known writing system created and used exclusively by women. Recent Natural Language Processing (NLP) work has digitized small Nüshu-Chinese corpora, but the script remains computationally inaccessible due to its handwritten, mixed-media form and dearth of multimodal resources. We address this gap with two novel datasets: NüshuVision, an image corpus of 500 rendered sentences in traditional vertical, right-to-left orthography, and NüshuStrokes, the first sequential handwriting recordings of all 397 Unicode Nüshu characters by an expert calligrapher. Evaluating five state-of-the-art Chinese Optical Character Recognition (OCR) systems on NüshuVision shows that all fail entirely, each yielding a Character Error Rate (CER) of 1.0. Fine-tuning Microsoft’s TrOCR on NüshuVision lowers CER to 0.67, a modest yet meaningful improvement. These contributions establish the first multimodal foundation for Nüshu revitalization and offer a culturally grounded framework for language preservation.

Recontextualizing Revitalization: A Mixed Media Approach to Reviving the Nüshu Language

As Large Language Models (LLMs) demonstrate increasingly strong human-like capabilities, the need to align them with human values has become significant. Recent advanced techniques, such as prompt learning and reinforcement learning, are being employed to bring LLMs closer to aligning with human values. While these techniques address broad ethical and helpfulness concerns, they rarely consider simulating individualized human values. To bridge this gap, we propose SimVBG, a framework that simulates individual values based on individual backstories that reflect their past experience and demographic information. SimVBG transforms structured data on an individual to a backstory and utilizes a multi‐module architecture inspired by the Cognitive–Affective Personality System to simulate individual value based on the backstories. We test SimVBG on a self-construct benchmark derived from the World Values Survey and show that SimVBG improves top-1 accuracy by more than 10% over the retrieval-augmented generation method. Further analysis shows that performance increases as additional interaction user history becomes available, indicating that the model can refine its persona over time. Code, dataset, and complete experimental results are anonymously available at https://anonymous.4open.science/r/SimVBG-029C.

SimVBG: Simulating Individual Values by Backstory Generation

Conversational agents have typically been developed for either task-oriented dialogue (TOD) or open-ended chitchat, with limited success in integrating both. Yet, real-world conversations often involve fluid transitions between these modes. To address this, we introduce TACT (TOD-And-Chitchat Transition), a dataset for transition-aware dialogue modeling that features structurally diverse and integrated mode flows. TACT supports both user- and agent-driven mode switches, enabling robust modeling of complex dialogue dynamics. To evaluate an agent’s ability to initiate and recover from mode transitions, we propose new performance metrics---Switch and Recovery. Models trained on TACT outperform baselines in both intent detection and mode transition handling. Moreover, applying Direct Preference Optimization (DPO) to TACT-trained models yields extra gains, achieving 75.74% joint mode-intent accuracy and a 40.86% win rate against GPT-4o in human evaluation. This shows that pairing structurally diverse data with DPO boosts response quality and transition control, facilitating the development of proactive agents.

Beyond Task-Oriented and Chitchat Dialogues: Proactive and Transition-Aware Conversational Agents

Effective teaching necessitates adapting pedagogical strategies to the inherent diversity of students, encompassing variations in aptitude, learning styles, and personality, a critical challenge in education and teacher training. Large Language Models (LLMs) offer a powerful tool to simulate complex classroom dynamics, providing a controlled environment for exploring optimal teaching patterns. However, existing simulation frameworks often fall short by neglecting comprehensive student modeling beyond basic knowledge states and, more importantly, by lacking mechanisms for teachers to dynamically adapt their approach based on student feedback and collective performance. Addressing these limitations, \textbf{we propose a simulation framework that integrates LLM-based diverse student agents with a self-evolving teacher agent}. We use genetic algorithms to automatically tune and optimize the teacher's pedagogical parameters based on simulated student performance, enabling the teacher agent to discover and refine teaching patterns tailored to specific class characteristics. Complementing this, \textbf{we introduce Persona-RAG, a novel Retrieval-Augmented Generation method specifically designed for personalized knowledge retrieval in pedagogical contexts, allowing students to retrieve information as per their learning styles}. We show how Persona-RAG remains competitive with standard RAG baselines in accurately retrieving relevant information while adding a touch of personalization for students. Crucially, we perform extensive experiments and highlight the different patterns learnt by the teacher agent while optimizing over classes with students of various learning styles. Our work presents a significant step towards creating adaptive educational technologies and improving teacher training through realistic, data-driven simulation.

Investigating Pedagogical Teacher and Student LLM Agents: Genetic Adaptation Meets Retrieval-Augmented Generation Across Learning Styles

Channel prediction can greatly reduce the pilot overhead and is a critical technology in the fifth-generation (5G) and the coming 6G wireless communications systems. Conventional model-based channel prediction methods suffer from limited accuracy due to imperfect temporal modeling, while existing AI-based methods suffer from limited generalization due to inadequate training strategies. Recently, large language models (LLMs) have demonstrated remarkable generalization and generation capabilities across diverse domains such as computer vision, quantitative economics, and bioinformatics, which motivates us to apply LLMs in channel prediction. In this paper, we formulate the `channel sentence' based on channel correlation, where the channel is regarded as a 'word'. Subsequently, we propose a generative pre-trained language model for channel prediction (CP-GPT). We collect 12M channel data according to the 3GPP 38.901 protocol and train CP-GPT based on the transformer decoder architecture. Moreover, we design two pre-training tasks based on the characteristics of wireless channels to enhance CP-GPT's understanding of communications channels. We further propose a comprehensive benchmark to rigorously evaluate the capabilities of CP-GPT across multiple dimensions. The simulation results demonstrate that CP-GPT has successfully learned various channel characteristics and exhibits impressive capabilities across numerous downstream tasks.

Downloads

Next from EMNLP 2025

Train One Sparse Autoencoder Across Multiple Sparsity Budgets to Preserve Interpretability and Accuracy

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES