China

Two-step approaches combining pre-trained large language model embeddings and anomaly detectors show good performance in text anomaly detection by leveraging rich semantic representations. However, high-dimensional dense embeddings extracted by large language models create challenges in substantial memory requirements and high computation time. To address this challenge, we introduce the Simplified Isolation Kernel (SIK), which maps high-dimensional dense embeddings to lower-dimensional sparse representations while preserving crucial anomaly characteristics. SIK has linear-time complexity and significantly reduces space complexity through its innovative boundary-focused feature mapping. Experiments across 7 datasets demonstrate that SIK achieves better detection performance than 11 SOTA anomaly detection algorithms while maintaining computational efficiency and low memory cost. All code and demonstrations are available at https://anonymous.4open.science/r/SIK-6577/.

EMNLP 2025

Text Anomaly Detection with Simplified Isolation Kernel

poster

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first
workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."<br>

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)





<br>

Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

Recent advancements in Large Multimodal Models (LMMs) have showcased their impressive capabilities in mathematical reasoning tasks in visual contexts. As a step toward developing AI models to conduct rigorous multi-step multimodal reasoning, we introduce StatsChartMWP, a real-world educational dataset for evaluating visual mathematical reasoning abilities on math word problems (MWPs) with statistical charts. Our dataset contains 8,514 chart-based MWPs, meticulously curated by K-12 educators within real-world teaching scenarios. We provide detailed preprocessing steps and manual annotations to help evaluate state-of-the-art models on StatsChartMWP. Comparing baselines, we find that current models struggle in undertaking meticulous multi-step mathematical reasoning among technical languages, diagrams, tables, and equations. Towards alleviate this gap, we introduce CoTAR, a chain-of-thought (CoT) augmented reasoning solution that fine-tunes the LMMs with solution-oriented CoT-alike reasoning steps. The LMM trained with CoTAR is more effective than current open-source approaches. We conclude by shedding lights on challenges and opportunities in enhancement in LMMs and steer future research and development efforts in the realm of statistical chart comprehension and analysis. The code and data are available at \url{https://anonymous.4open.science/r/StatsChartMWP-D25A}.

StatsChartMWP: A Dataset for Evaluating Multimodal Mathematical Reasoning Abilities on Math Word Problems with Statistical Charts

Complex chart understanding tasks demand advanced visual recognition and reasoning capabilities from multimodal large language models (MLLMs). However, current research provides limited coverage of complex chart scenarios and computation-intensive reasoning tasks prevalent in real-world applications. This study proposes an automated multi-stage code-driven pipeline for systematically generating visual reasoning datasets to address these limitations. The pipeline integrates retrieval-augmented generation (RAG) to retrieve professional chart templates and employs chain-of-thought (CoT) strategies to generate reasoning codes that simulate real data distributions, thereby driving chart rendering and question-related statistical computations. Through model-based evaluation, the pipeline enhances chart diversity and data quality. Using this framework, we construct ChartM³, a multi-dimensional and multi-step dataset containing 38K charts and 142K Q&A pairs for training, along with 2,871 high-quality evaluation samples for enabling practical performance assessment. Supervised fine-tuning (SFT) and reinforcement learning (RL) experiments demonstrate that our dataset significantly improves reasoning capabilities and cross-domain generalization performance, enabling smaller models to achieve performance comparable to larger-scale models in complex chart comprehension.

ChartM³: A Multi-Stage Code-Driven Pipeline for Constructing Multi-Dimensional and Multi-Step Visual Reasoning Data in Chart Comprehension

As the textual data given as the context of various tasks lengthens, having necessary information scattered throughout makes it more difficult for large language models (LLMs) to capture relevant details. This challenge is particularly prominent in tasks such as question answering (QA), where key information is often not evenly distributed within the context. This problem of information sparsity has led to the attempts of various approaches, such as direct context adjustment and retrieval-based methods. However, these approaches typically leverage compressed contexts, which increases the risk that key information may be contained in the dropped portions. Therefore, research from the perspective of addressing the information sparsity while not losing key details in contexts is required. To address this issue, we propose Highlighting entity-AWare Knowledge (HAWK) framework. HAWK consists of three main steps: i) entity extraction, ii) entity-aware subcontext selection, and iii) triplet construction. The core mechanism of HAWK is to highlight key information in a context and structuralize it in an entity-aware manner, facilitating knowledge-enhanced generation. Through extensive experiments and comprehensive analysis, HAWK confirms significant improvements in QA tasks with long contexts, achieving up to a 27.6-point F1 score increase and at least an average win rate of 76.75% over existing methods.

HAWK: Highlighting Entity-aware Knowledge for Alleviating Information Sparsity in Long Contexts

Code-switching (CS), the alternation between two or more languages within a single speaker's utterances, is common in real-world conversations and poses significant challenges for multilingual speech technology. However, systems capable of handling this phenomenon remain underexplored, primarily due to the scarcity of suitable datasets. To resolve this issue, we propose Universal Code-Mixer (UniCoM), a novel pipeline for generating high-quality, natural CS samples without altering sentence semantics. Our approach utilizes an algorithm we call Substituting WORDs with Synonyms (SWORDS), which generates CS speech by replacing selected words with their translations while considering their parts of speech. Using UniCoM, we construct Code-Switching FLEURS (CS-FLEURS), a multilingual CS corpus designed for automatic speech recognition (ASR) and speech-to-text translation (S2TT). Experimental results show that CS-FLEURS achieves high intelligibility and naturalness, performing comparably to existing datasets on both objective and subjective metrics. We expect our approach to advance CS speech technology and enable more inclusive multilingual systems.

UniCoM: A Universal Code-Switching Speech Generator

We study continual information extraction (IE), which aims to extract emerging information across diverse IE tasks incessantly while avoiding forgetting. Existing approaches are either task-specialized for a single IE task or suffer from catastrophic forgetting and insufficient knowledge transfer in continual IE. This paper proposes a new continual IE model using token-level mixture of LoRA experts with LLMs. We leverage a LoRA router to route each token to the most relevant LoRA experts, facilitating effective knowledge transfer among IE tasks. We guide task experts' selection by task keys to retain the IE task-specific knowledge and mitigate catastrophic forgetting. We design a gate reflection method based on knowledge distillation to address forgetting in the LoRA router and task keys. The experimental results show that our model achieves state-of-the-art performance, effectively mitigating catastrophic forgetting and enhancing knowledge transfer in continual IE.

Mixture of LoRA Experts for Continual Information Extraction with LLMs

As general large language models continue to advance, their real-world adaptation through effective fine-tuning remains a significant challenge. We introduce Hierarchical Multilevel Contrastive Learning (HMCL), a new contrastive learning framework that improve task-specific text representation for general models. HMCL integrates three-level semantic differentiation (positive, weak-positive, and negative) and unifies contrastive learning, pair classification, and ranking objectives into a cohesive optimization strategy. HMCL demonstrates exceptional results across multi-domain and multilingual benchmarks, including text similarity, retrieval, reranking and RAG tasks. It outperforms top unsupervised methods and supervised fine-tuning approaches while maintaining broad compatibility with architectures ranging from BERT to Qwen, 330M to 7B. In real-world merchant consultation scenarios, HMCL shows a 0.70-6.24 point improvement over original fine-tuning methods in large-scale base models. This establishes HMCL as a versatile solution that bridges the gap between general-purpose models and specialized industrial applications.

HMCL: Task-Optimal Text Representation Adaptation through Hierarchical Contrastive Learning

Although retrieval-augmented generation (RAG) remains essential for knowledge-based question answering (KBQA), current paradigms face critical challenges under specific domains. Existing methods struggle with targeted adaptation on small-scale KBs: vanilla unsupervised training exhibits poor effectiveness, while fine-tuning incurs prohibitive costs of external signals. We present KBAlign, a self-supervised framework that enhances RAG systems through efficient model adaptation. Our key insight is to leverage the model's intrinsic capabilities for knowledge alignment through two innovative mechanisms: multi-grained self-annotation that captures global knowledge for data construction, and iterative tuning that accelerates convergence through self verification. This framework enables cost-effective model adaptation to specific textual KBs, without human supervision or external model assistance. Experiments demonstrate that KBAlign can achieve 90% of the performance gain obtained through GPT-4-supervised adaptation, while relying entirely on self-annotation of much smaller models. KBAlign significantly improves downstream QA accuracy across multiple domains with tiny costs, particularly benefiting scenarios requiring deep knowledge integration from specialized corpora. We release our experimental data, models, and process analyses to the community for further exploration(https://anonymous.4open.science/r/KBAlign-D160).

KBAlign: Efficient Self Adaptation on Specific Textual Knowledge Bases

The ability to critique is vital for models to self-improve and serve as reliable AI assistants. While extensively studied in language-only settings, multimodal critique of Large Multimodal Models (LMMs) remains underexplored despite their growing capabilities in tasks like captioning and visual reasoning. In this work, we introduce MM-Critic, a holistic benchmark for evaluating the critique ability of LMMs across three dimensions: basic, correlation, and comparison. Covering 8 task types and over 500 tasks, MM-Critic collects responses from LMMs with various model sizes. To enhance the evaluation reliability, we design expert-informed scoring rubrics that guide GPT-4o in annotating responses and generating reference critiques, which serve as anchors for trustworthy judgments. Extensive experiments validate the effectiveness of MM-Critic and provide a comprehensive assessment of leading LMMs’ critique capabilities. Further analysis reveals key insights, including the correlation between response quality and critique, and varying critique difficulty across evaluation dimensions.

MM-CRITIC: A Holistic Evaluation of Large Multimodal Models as Multimodal Critique

In this paper, we investigate the relationship between Information Content (IC) and the squared norm of embeddings at the text level, focusing on the mechanisms and composition functions used to combine token embeddings. i) We formally derive two sufficient conditions for this correspondence to hold in embedding models. ii) We empirically examine the correspondence and the validity of these conditions at word level for both static and contextual embeddings and different subword token composition mechanisms. iii) Building on Shannon’s Constant Entropy Rate principle, we explore whether embedding mechanisms exhibit a linearly monotonic increase in information content as text length increases. Our formal analysis and experiments reveal that: i) At the word embedding level, models satisfy the sufficient conditions and show a strong correspondence when certain subword composition functions are applied. ii) Only scaled embedding averages proposed in this paper and certain information-theoretic composition functions preserve the correspondence. Some non-compositional representations—such as the CLS token in BERT or the EOS token in LLaMA—tend to converge toward a fixed point. The CLS token in ModernBERT, however, exhibits behavior that aligns more closely with the CER hypothesis.

On the Correspondence between the Squared Norm and Information Content in Text Embeddings

Retrieval-augmented generation (RAG) systems have advanced large language models (LLMs) in complex deep search scenarios requiring multi-step reasoning and iterative information retrieval. However, existing approaches face critical limitations that lack high-quality training trajectories or suffer from the distributional mismatches in simulated environments and prohibitive computational costs for real-world deployment. This paper introduces SimpleDeepSearcher, a lightweight yet effective framework that bridges this gap through strategic data engineering rather than complex training paradigms. Our approach synthesizes high-quality training data by simulating realistic user interactions in live web search environments, coupled with a multi-criteria curation strategy that optimizes the diversity and quality of input and output side. Experiments on five benchmarks across diverse domains demonstrate that SFT on only 871 curated samples yields significant improvements over RL-based baselines. Our work establishes SFT as a viable pathway by systematically addressing the data-scarce bottleneck, offering practical insights for efficient deep search systems. Our anonymous code is available at https://anonymous.4open.science/r/SimpleDeepSearcher.

Downloads

Next from EMNLP 2025

StatsChartMWP: A Dataset for Evaluating Multimodal Mathematical Reasoning Abilities on Math Word Problems with Statistical Charts

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES