EMNLP 2025

November 06, 2025

Suzhou, China


Abstract: Sign Language Translation has advanced with deep learning, yet evaluations remain signer-dependent, with overlapping signers across training, development, and test sets. This raises concerns about whether models truly generalise or instead rely on signer-specific features. To address this, signer-fold cross-validation is conducted on GFSLT-VLP, GASLT, and SignCL, three leading, publicly available, non-proprietary gloss-free sign language translation models. Experiments are performed on two benchmark datasets, CSL-Daily and PHOENIX14T. The results reveal a significant performance drop under signer-independent settings. On PHOENIX14T, GFSLT-VLP sees BLEU-4 fall from 21.44 to as low as 3.59 and ROUGE-L from 42.49 to 11.89; GASLT drops from a reported 15.74 to 8.26; and SignCL from 22.74 to 3.66. Similarly, on CSL-Daily, GASLT’s BLEU-4 drops from 4.07 to an average of 3.63 under signer-fold cross-validation, despite the increased training data. These findings highlight the substantial overestimation of SLT model performance when evaluations are conducted under signer-dependent assumptions. This work proposes three key recommendations: (1) adopting signer-independent evaluation protocols to ensure generalisation to unseen signers, (2) restructuring existing datasets to include explicit signer-independent splits for consistent benchmarking, and (3) encouraging the reporting of both signer-dependent and signer-independent results to improve transparency and comparability.
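To make the evaluation protocol concrete, below is a minimal sketch of signer-fold cross-validation, assuming each video sample is tagged with the ID of the signer who produced it. It uses scikit-learn's GroupKFold as a generic grouping tool; the toy sample structure and the nine-signer setup (chosen to mirror PHOENIX14T's nine signers) are illustrative assumptions, not the paper's actual data-loading code.

# Minimal sketch of signer-fold cross-validation for SLT evaluation.
# Assumption: each sample carries a signer ID; the fields below are
# synthetic stand-ins, not CSL-Daily or PHOENIX14T loader output.
from sklearn.model_selection import GroupKFold

# Toy dataset: (video_id, signer_id) pairs standing in for real samples.
samples = [(f"video_{i:03d}", f"signer_{i % 9}") for i in range(90)]
video_ids = [v for v, _ in samples]
signer_ids = [s for _, s in samples]

# GroupKFold guarantees that no group (here, signer) appears in both the
# training and the held-out fold, which is the signer-independent condition:
# every model is scored only on signers it never saw during training.
gkf = GroupKFold(n_splits=9)
for fold, (train_idx, test_idx) in enumerate(gkf.split(video_ids, groups=signer_ids)):
    train_signers = {signer_ids[i] for i in train_idx}
    test_signers = {signer_ids[i] for i in test_idx}
    assert train_signers.isdisjoint(test_signers)  # no signer overlap
    print(f"fold {fold}: train on {len(train_signers)} signers, "
          f"evaluate on {sorted(test_signers)}")

Per recommendation (3), BLEU-4 and ROUGE-L would then be reported both on the original signer-dependent split and averaged over these signer-independent folds.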

Downloads

  • Slides
  • Paper
  • Transcript English (automatic)

