China

Neural models for Japanese pronunciation estimation often suffer from errors such as
hallucinations (generating pronunciations that are not grounded in the input) and 
omissions (skipping parts of the input).
Although attention-based alignment has been used to detect such errors,
selecting reliable attention heads is difficult,
and developing methods that can both detect and correct these errors
remains challenging.


In this paper, we propose a simple method called
existence-based alignment check.
In this approach,
we consider alignment candidates
independently extracted from all attention heads,
and check whether at least one of these candidates satisfies two conditions
derived from the linguistic properties of Japanese pronunciation:
monotonicity and pronunciation length per character.
We generate multiple hypotheses using beam search
and use the alignment check as a filtering mechanism
to correct hallucinations and omissions.


We apply this method to a dataset of Japanese facility names
and demonstrate that it improves pronunciation estimation accuracy
by over 2.5\%.

EMNLP 2025

Just One is Enough: An Existence-based Alignment Check for Robust Japanese Pronunciation Estimation

japanese nlp

transformer attention

error detection and correction

hallucination

grapheme-to-phoneme conversion

poster

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first
workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."<br>

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)





<br>

Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

The reasoning capabilities of Large Language Models (LLMs) play a critical role in many downstream tasks, yet depend strongly on the quality of training data. Despite various proposed data construction methods, their practical utility in real-world pipelines remains underexplored. In this work, we conduct a comprehensive analysis of open-source datasets and data synthesis techniques for mathematical reasoning, evaluating them under a unified pipeline designed to mirror training and deployment scenarios. We further distill effective data selection strategies and identify practical methods suitable for industrial applications. Our findings highlight that structuring data in more interpretable formats, or distilling from stronger models often outweighs simply scaling up data volume. This study provides actionable guidance for integrating training data to enhance LLM capabilities, supporting both cost-effective data curation and scalable model enhancement. We hope this work will inspire further research on how to balance "more data" versus "better data" for real-world reasoning tasks.

More Data or Better Data? A Critical Analysis of Data Selection and Synthesis for Mathematical Reasoning

As large language models (LLMs) become increasingly prevalent, ensuring their robustness against adversarial misuse is crucial. This paper introduces the Graph of Attacks with Pruning (GAP) framework, an advanced approach for generating stealthy jailbreak prompts to evaluate and enhance LLM safeguards. GAP addresses limitations in existing tree-based LLM jailbreak methods by implementing an interconnected graph structure that enables knowledge sharing across attack paths. Our experimental evaluation demonstrates GAP's superiority over existing techniques, achieving a 20.8% increase in attack success rates while reducing query costs by 62.7%. GAP consistently outperforms state-of-the-art methods for attacking both open and closed LLMs, with attack success rates of >96%. Additionally, we present specialized variants like GAP-Auto for automated seed generation and GAP-VLM for multimodal attacks. GAP-generated prompts prove highly effective in improving content moderation systems, increasing true positive detection rates by 108.5% and accuracy by 183.6% when used for fine-tuning.

Graph of Attacks with Pruning: Optimizing Stealthy Jailbreak Prompt Generation for Enhanced LLM Content Moderation

Slot filling is a crucial subtask in spoken language understanding (SLU), traditionally implemented as a cascade of speech recognition followed by one or more natural language understanding (NLU) components. The recent advent of speech-based large language models (speechLLMs), which integrate speech and textual foundation models, has opened new avenues for achieving speech understanding tasks in a more unified, generative, and instruction-following manner while promising data and compute efficiency with zero-shot abilities, generalizing to unseen slot labels. We address the slot filling task by creating an empirical upper bound for the task, identifying performance, robustness, and generalization gaps, and proposing improvements to the training data, architecture, and training strategies to bridge the gap. We show that each of these measures reduces the gap substantially, while highlighting practical challenges and providing empirical guidance and insights for harnessing these emerging models.

SpeechLLMs for Large-scale Contextualized Zero-shot Slot Filling

In customer-oriented industries, compliance-guaranteed chatbots are essential for delivering accurate, verifiable responses. Retrieval-based systems leverage a Match-and-Respond mechanism with predefined knowledge bases of human-verified question-answer (QA) pairs, providing hallucination-free responses. To effectively handle diverse customer inquiries, augmenting the knowledge base with similar questions that maintain semantic consistency and increase schematic diversity is crucial for improved query matching. In this paper, we define the Similar Question Generation (SQG) task for LLM training and inference, presenting context-aware methods for generating similar questions through a cost-efficient one-to-many paradigm. This approach facilitates broader semantic exploration and better alignment with the source question-answer pairs. We also propose combinatorial optimization techniques to construct in-context prompt demonstrations and to select an optimal subset of similar questions for knowledge base expansion. Experiments on a conversational QA-based SQG dataset demonstrate the effectiveness of these methods in both quantitative and human evaluations. A 92\% user satisfaction rate in deployed chatbot systems, reflecting an 18\% relative improvement, further highlights the practical advantages of the proposed augmentation approach in ensuring compliance and enhancing customer service chatbot applications.

Augmenting Compliance Guaranteed Conversational AI: Context-Aware Knowledge Base Expansion with LLMs and Combinatorial Optimization

LLMs often fail to meet specialized needs of distinct user groups due to their one-size-fits-all approach, and there is limited understanding of what personalization each group expects.
To address this, we propose GPA a group-aware personalization framework that captures context-specific preference variations and steers LLMs accordingly.
Our approach involves: (1) Group-Aware Preference Extraction, which distills divergent preferences from real-world conversation logs into interpretable rubrics, and (2) Tailored Response Generation, using (a) GPA-CT, which adapts responses using learnt rubrics, and (b) GPA-FT, which finetunes models using rubric-guided synthetic data.
Automatic and Human evaluations confirm that GPA improves group alignment without compromising perfomance on standard instruction-following benchmarks.

Group Preference Alignment: Customizing LLM Responses from In-Situ Conversations Only When Needed

Large language models (LLMs) such as GPT-4o and o1 have demonstrated strong performance on clinical natural language processing (NLP) tasks across multiple medical benchmarks. Nonetheless, two high-impact NLP tasks — structured tabular reporting from nurse dictations and medical order extraction from doctor-patient consultations — remain underexplored due to data scarcity and sensitivity, despite active industry efforts. Practical solutions to these real-world clinical tasks can significantly reduce the documentation burden on healthcare providers, allowing greater focus on patient care. In this paper, we investigate these two challenging tasks using private and open-source clinical datasets, evaluating the performance of both open- and closed-weight LLMs, and analyzing their respective strengths and limitations. Furthermore, we propose an agentic pipeline for generating realistic, non-sensitive nurse dictations, enabling structured extraction of clinical observations. To support further research in both areas, we release SYNUR and SIMORD, the first open-source datasets for nurse observation extraction and medical order extraction.

Empowering Healthcare Practitioners with Language Models: Structuring Speech Transcripts in Two Real-World Clinical Applications

Entity Linking (EL) has traditionally relied on large annotated datasets and extensive model fine-tuning. While recent few-shot methods leverage large language models (LLMs) through prompting to reduce training requirements, they often suffer from inefficiencies due to expensive LLM-based reasoning. ARTER (Adaptive Routing and Targeted Entity Reasoning) presents a structured pipeline that achieves high performance without deep fine-tuning by strategically combining candidate generation, context-based scoring, adaptive routing, and selective reasoning. ARTER computes a small set of complementary signals (both embedding and LLM-based) over the retrieved candidates to categorize contextual mentions into easy and hard cases. The cases are then handled by low-computational entity linker (e.g. ReFinED) and more expensive targeted LLM-based reasoning respectively. On standard benchmarks, ARTER outperforms ReFinED by up to +4.13\%, with an average gain of +2.47\% on 5 out of 6 datasets, and performs comparably to pipelines using LLM-based reasoning for all mentions, while being twice efficient in terms of the number of LLM tokens.

Leveraging the Power of Large Language Models in Entity Linking via Adaptive Routing and Targeted Reasoning

The ubiquity of payment networks generates vast transactional data encoding rich consumer and merchant behavioral patterns. Recent foundation models for transaction analysis process tabular data sequentially but rely on index-based representations for categorical merchant fields, causing substantial semantic information loss by converting rich textual data into discrete tokens. While Large Language Models (LLMs) can address this limitation through superior semantic understanding, their computational overhead challenges real-time financial deployment. We introduce a hybrid framework that uses LLM-generated embeddings as semantic initializations for lightweight transaction models, balancing interpretability with operational efficiency. Our approach employs multi-source data fusion to enrich merchant categorical fields and a one-word constraint principle for consistent embedding generation across LLM architectures. We systematically address data quality through noise filtering and context-aware enrichment. Experiments on large-scale transaction datasets demonstrate significant performance improvements across multiple transaction understanding tasks.

Enhancing Foundation Models in Transaction Understanding with LLM-based Sentence Embeddings

Retrieval-Augmented Generation (RAG) systems leverage Large Language Models (LLMs) to generate accurate and reliable responses that are grounded in retrieved context. However, LLMs often generate inconsistent outputs for semantically equivalent inputs, a problem exacerbated by limited consistency-focused data and the limitations of existing fine-tuning methods for improving consistency. We propose a new approach combining systematic synthetic data generation, triplet loss for better embeddings, and a novel layer-wise model merging approach. Using consistency-aware weights derived from intermediate layer activations, our method effectively integrates knowledge from specialized models. Experimental results how that our merged model significantly enhances output consistency, achieving approximately 47.5% improvement in response similarity over the baseline, thus offering a practical solution for increasing the the reliability of an industrial RAG system.

Harmonizing Diverse Models: A Layer-wise Merging Strategy for Consistent Generation

Large Language Models (LLMs) hold significant promise for transforming digital health by enabling automated medical question answering. However, ensuring these models meet critical industry standards for factual accuracy, usefulness, and safety remains a challenge, especially for open-source solutions. We present a rigorous benchmarking framework via a dataset of over 1,000 health questions. We assess model performance across honesty, helpfulness, and harmlessness. Our results highlight trade-offs between factual reliability and safety among evaluated models---Mistral-7B, BioMistral-7B-DARE, and AlpaCare-13B. AlpaCare-13B achieves the highest accuracy (91.7%) and harmlessness (0.92), while domain-specific tuning in BioMistral-7B-DARE boosts safety (0.90) despite smaller scale. Few-shot prompting improves accuracy from 78% to 85%, and all models show reduced helpfulness on complex queries, highlighting challenges in clinical QA. Our code is available at: \url{https://anonymous.4open.science/r/TTT-C21B/README.md}

Downloads

Next from EMNLP 2025

More Data or Better Data? A Critical Analysis of Data Selection and Synthesis for Mathematical Reasoning

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES

.css-70qvj9{display:-webkit-box;display:-webkit-flex;display:-ms-flexbox;display:flex;-webkit-align-items:center;-webkit-box-align:center;-ms-flex-align:center;align-items:center;}Downloads

Next from EMNLP 2025

More Data or Better Data? A Critical Analysis of Data Selection and Synthesis for Mathematical Reasoning

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES

Downloads