
We assess whether AI systems can credibly evaluate investment risk appetite, a task that must be thoroughly validated before automation. Our analysis covers proprietary systems (GPT, Claude, Gemini) and open-weight models (LLaMA, DeepSeek, Mistral), using carefully curated user profiles that reflect real users with varying attributes such as country and gender. We find that the models exhibit significant variance in score distributions when user attributes that should not influence risk computation, such as country or gender, are changed. For example, GPT-4o assigns higher risk scores to Nigerian and Indonesian profiles. While some models align closely with expected scores in the low- and mid-risk ranges, none maintain consistent scores across regions and demographics, thereby violating AI and finance regulations.

EMNLP 2025

Evaluating AI for Finance: Is AI Credible at Assessing Investment Risk Appetite?

generative ai evaluations

finance safety

bias

poster
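The kind of check described in the abstract above, changing one attribute that should not matter and watching the score move, can be sketched roughly as follows. This is only an illustration, not the authors' evaluation code: `attribute_sensitivity`, `score_spread`, the profile fields, and `toy_biased_scorer` are all hypothetical names, and the toy scorer stands in for a call to an actual model.

```python
from typing import Any, Callable, Dict, Hashable, List

Profile = Dict[str, Any]


def attribute_sensitivity(
    base_profile: Profile,
    attribute: str,
    values: List[Hashable],
    score: Callable[[Profile], float],
) -> Dict[Hashable, float]:
    """Score copies of base_profile that differ only in `attribute`."""
    scores: Dict[Hashable, float] = {}
    for value in values:
        variant = dict(base_profile)  # shallow copy; other fields stay fixed
        variant[attribute] = value
        scores[value] = score(variant)
    return scores


def score_spread(scores: Dict[Hashable, float]) -> float:
    """Gap between the highest and lowest score; 0.0 means no sensitivity."""
    return max(scores.values()) - min(scores.values())


def toy_biased_scorer(profile: Profile) -> float:
    # Hypothetical stand-in for a model: the score should depend only on
    # savings_ratio, but this scorer leaks a fixed offset for country "B".
    base = 10.0 * profile["savings_ratio"]
    return base + (2.0 if profile["country"] == "B" else 0.0)


base = {"country": "A", "gender": "F", "savings_ratio": 0.5}
by_country = attribute_sensitivity(base, "country", ["A", "B", "C"], toy_biased_scorer)
by_gender = attribute_sensitivity(base, "gender", ["F", "M"], toy_biased_scorer)
```

A non-zero spread for an attribute that should be irrelevant (here `country`) flags the behaviour the abstract reports; a spread of zero (here `gender`) means the toy scorer ignores that attribute.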

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."<br>

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)






Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

Training large language models at scale suffers from costly instabilities. We introduce the R-Metric, a proactive reliability metric that predicts failures before they occur by combining hardware monitoring, training dynamics, and model performance. Achieving 0.973-1.00 F1-Score with 12-minute lead time, our lightweight approach (1.8% overhead) democratizes enterprise-grade reliability monitoring for resource-constrained organizations.

A Proactive Reliability Metric for Detecting Failures in Language Model Training

Large language models (LLMs) acquire general linguistic knowledge from massive-scale pretraining. However, pretraining data mainly comprised of web-crawled texts contain undesirable social biases which can be perpetuated or even amplified by LLMs. In this study, we propose an efficient yet effective annotation pipeline to investigate social biases in the pretraining corpora. Our pipeline consists of protected attribute detection to identify diverse demographics, followed by regard classification to analyze the language polarity towards each attribute. Through our experiments, we demonstrate the effect of our bias analysis and mitigation measures, focusing on Common Crawl as the most representative pretraining corpus.

Bias Analysis and Mitigation through Protected Attribute Detection and Regard Classification

People increasingly seek healthcare information from Large Language Models (LLMs), yet the nature of these conversational interactions and their inherent risks remain largely unexplored. In this paper, we filter large-scale conversational AI datasets to obtain HealthChat-14K, a curated dataset of 14K real-world conversations composed of 62K user messages. We use HealthChat-14K and a clinician-driven taxonomy for how users interact with LLMs when seeking healthcare information in order to systematically study users' conversational trajectories, interaction patterns, emotional behaviors, and sycophancy-inducing interactions. Our analysis reveals insights into how users seek healthcare information, including the nature of health information users seek, their typical conversational trajectories, expressions of affect, and specific interaction patterns related to conversational challenges and leading questions, underscoring the need for improvements in the healthcare support capabilities of LLMs deployed as conversational AI. We will release our analyzed conversations and corresponding analysis artifacts in a curated dataset to foster future research.

What's Up, Doc?: Analyzing How Users Seek Health Information in Large-Scale Conversational AI Datasets

Sentiment analysis of historical literature provides valuable insights for humanities research, yet remains challenging due to scarce annotations and limited generalization of models trained on modern texts. Prior work has primarily focused on two directions: using sentiment lexicons or leveraging large language models (LLMs) for annotation. However, lexicons are often unavailable for historical texts due to limited linguistic resources, and LLM-generated labels often reflect modern sentiment norms and fail to capture the implicit, ironic, or morally nuanced expressions typical of historical literature, resulting in noisy supervision. To address these issues, we introduce a role-guided annotation strategy that prompts LLMs to simulate historically situated perspectives when labeling sentiment. Furthermore, we design a prototype-aligned framework that learns sentiment prototypes from high-resource data and aligns them with low-resource representations via symmetric contrastive loss, improving robustness to noisy labels. Experiments across multiple historical literature datasets show that our method outperforms state-of-the-art baselines, demonstrating its effectiveness.

Role-Guided Annotation and Prototype-Aligned Representation Learning for Historical Literature Sentiment Classification
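For readers unfamiliar with the term, a symmetric contrastive loss of the general kind mentioned in the abstract above can be sketched as below. This is a generic InfoNCE-style formulation averaged over both matching directions, written in plain Python; it is an assumption about the general technique, not the paper's exact loss, and the function names and the `temperature` default are illustrative.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def normalize(v: Vector) -> List[float]:
    """Scale a non-zero vector to unit length so dot products are cosines."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def diagonal_cross_entropy(logits: List[List[float]]) -> float:
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(
    a: Sequence[Vector], b: Sequence[Vector], temperature: float = 0.1
) -> float:
    """Average of the a-to-b and b-to-a contrastive losses; a[i] pairs with b[i]."""
    a_n = [normalize(v) for v in a]
    b_n = [normalize(v) for v in b]
    logits = [[dot(x, y) / temperature for y in b_n] for x in a_n]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(transposed))


aligned = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
mismatched = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy setup `aligned` is close to zero and `mismatched` is large, which is the behaviour one expects when paired representations (for example, a prototype and the embedding it should match) are pulled together and non-pairs pushed apart.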

Contextual biasing in ASR systems is critical for recognizing rare, domain-specific terms but becomes impractical with large keyword dictionaries due to prompt size and latency constraints. We present RECAST--a lightweight retrieval-augmented approach that repurposes decoder states of a pretrained ASR model to retrieve relevant keywords without requiring audio exemplars. RECAST introduces a contrastively trained retriever that aligns decoder-state embeddings with textual keyword representations, enabling fast token-level retrieval over large dictionaries. Retrieved keywords are ranked and formatted into a prompt to guide a downstream speech language model. Trained solely on LibriSpeech and evaluated on out-of-domain benchmarks covering up to 4,000 keywords across diverse domains, RECAST consistently outperforms full-list prompt biasing and strong phonetic/text baselines. It achieves up to 54.3% relative reduction in entity WER and 41.3% overall WER improvement over the baseline, along with up to 2.5x higher recall in challenging settings. Furthermore, RECAST remains effective for diverse languages such as Hindi, demonstrating its scalability, language-agnostic design, and practicality for real-world contextual ASR.

RECAST: Retrieval-Augmented Contextual ASR via Decoder-State Keyword Spotting

While Vision-Language Models (VLMs) have achieved competitive performance in various tasks, their comprehension of the underlying structure and semantics of a scene remains understudied. To investigate the understanding of VLMs, we study their capability regarding object properties and relations in a controlled and interpretable manner. To this end, we introduce CIVET, a novel and extensible framework for systemati**C** evaluat**I**on **V**ia controll**E**d s**T**imuli. CIVET addresses the lack of standardized systematic evaluation for assessing VLMs' understanding, enabling researchers to test hypotheses with statistical rigor. With CIVET, we evaluate five state-of-the-art VLMs on exhaustive sets of stimuli, free from annotation noise, dataset-specific biases, and uncontrolled scene complexity. Our findings reveal that 1) current VLMs can accurately recognize only a limited set of basic object properties; 2) their performance heavily depends on the position of the object in the scene; 3) they struggle to understand basic relations among objects. Furthermore, a comparative evaluation with human annotators reveals that VLMs still fall short of achieving human-level accuracy.

CIVET: Systematic Evaluation of Understanding in VLMs

We propose a training-free approach to improve sentence embeddings leveraging test-time compute by applying generative text models for data augmentation at inference time. Unlike conventional data augmentation that utilises synthetic training data, our approach does not require access to model parameters or the computational resources typically required for fine-tuning state-of-the-art models. Generatively Augmented Sentence Encoding varies the input text by paraphrasing, summarising, or extracting keywords, followed by pooling the original and synthetic embeddings. Experimental results on the Massive Text Embedding Benchmark for Semantic Textual Similarity (STS) demonstrate performance improvements across a range of embedding models using different generative models for augmentation. We find that generative augmentation leads to larger performance improvements for embedding models with lower baseline performance. These findings suggest that integrating generative augmentation at inference time adds semantic diversity and can enhance the robustness and generalisability of sentence embeddings. Our results show that performance gains depend on the embedding model and the dataset.

GASE: Generatively Augmented Sentence Encoding
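The augment-then-pool idea in the abstract above can be illustrated with a minimal sketch. Several things here are assumptions rather than details from the paper: the pooling is a plain element-wise mean, `augmented_embedding` and `mean_pool` are hypothetical names, and `toy_embed` and `toy_keywords` are trivial stand-ins for a real embedding model and a real generative augmenter.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def mean_pool(vectors: Sequence[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same dimension")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def augmented_embedding(
    text: str,
    embed: Callable[[str], Vector],
    augmenters: Sequence[Callable[[str], str]],
) -> Vector:
    """Embed the original text plus each augmented variant, then pool."""
    variants = [text] + [augment(text) for augment in augmenters]
    return mean_pool([embed(variant) for variant in variants])


def toy_embed(text: str) -> Vector:
    # Hypothetical embedder: [character count, word count].
    return [float(len(text)), float(len(text.split()))]


def toy_keywords(text: str) -> str:
    # Hypothetical "keyword extraction": keep words longer than three characters.
    return " ".join(word for word in text.split() if len(word) > 3)


pooled = augmented_embedding("a cat sat down", toy_embed, [toy_keywords])
```

Because the embedder is only ever called as a black box, nothing in this sketch needs access to model parameters, which matches the training-free framing of the abstract; swapping in a real encoder and real paraphrase or summary generators would leave the pooling step unchanged.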

Generative large language models (LLMs) are increasingly used for data augmentation tasks, where text samples are paraphrased (or generated anew) and then used for downstream model fine-tuning. This is especially useful in low-resource settings. For better augmentations, LLMs are prompted with examples (few-shot scenarios). Yet, the samples are mostly selected randomly, and a comprehensive overview of the effects of other (more "informed") sample selection strategies is lacking. In this work, we compare sample selection strategies existing in the few-shot learning literature and investigate their effects in LLM-based textual augmentation in a low-resource setting. We evaluate this on in-distribution and out-of-distribution model performance. Results indicate that while some "informed" selection strategies increase the performance of models, especially for out-of-distribution data, this happens only seldom and with marginal performance increases. Unless further advances are made, a default of random sample selection remains a good option for augmentation practitioners.

Use Random Selection for Now: Investigation of Few-Shot Selection Strategies in LLM-based Text Augmentation

The capabilities of large language models (LLMs) are advancing at a remarkable pace, along with a surge in cloud services powered by LLMs. Their convenience has gradually transformed the way people work. However, for services such as document summarization and editing, users need to upload relevant files or context to obtain the desired results, which may inadvertently expose private information. This paper aims to address the challenging balance between the convenience of LLM services and user privacy concerns. Specifically, based on the structural and functional characteristics of LLMs, we have developed a strategy that safeguards user prompts while accessing LLM cloud services, even in scenarios where advanced reconstruction attacks are adopted. We comprehensively evaluate the efficacy of our method across prominent LLM benchmarks. The empirical results show that our method not only effectively thwarts reconstruction attacks but also, in certain tasks, even improves model performance, surpassing the outcomes reported in official model cards.

LLMs are Privacy Erasable

Understanding the core dimensions of conceptual semantics is fundamental to uncovering how meaning is organized in language and the brain. Existing approaches often rely on predefined semantic dimensions that offer only broad representations, overlooking finer conceptual distinctions. This paper proposes a novel framework to investigate the subdimensions underlying coarse-grained semantic dimensions. Specifically, we introduce a Disentangled Continuous Semantic Representation Model (DCSRM) that decomposes word embeddings from large language models into multiple sub-embeddings, each encoding specific semantic information. Using these sub-embeddings, we identify a set of interpretable semantic subdimensions. To assess their neural plausibility, we apply voxel-wise encoding models to map these subdimensions to brain activation. Our work offers more fine-grained, interpretable semantic subdimensions of conceptual meaning. Further analyses reveal that semantic dimensions are structured according to distinct principles, with polarity emerging as a key factor driving their decomposition into subdimensions. The neural correlates of the identified subdimensions support their cognitive and neuroscientific plausibility.

Downloads

Next from EMNLP 2025

A Proactive Reliability Metric for Detecting Failures in Language Model Training

