This paper defines and explores the design space for information extraction (IE) from layout-rich documents using large language models (LLMs). The three core challenges of layout-aware IE with LLMs are 1) data structuring, 2) model engagement, and 3) output refinement. Our study investigates the sub-problems within these core challenges, such as input representation, chunking, prompting, and the selection of LLMs and multimodal models. It examines the effect of different design choices through a new layout-aware IE test suite, benchmarking against traditional, fine-tuned IE models. Our results on two datasets show that our one-factor-at-a-time (OFAT) exploration method achieves near-optimal results: it scores only 0.8--1.8 points below the best configuration found by full factorial exploration while requiring a fraction (~2.8%) of the computation, and it gains 13.3--37.5 points over a baseline configuration. We demonstrate that, if well-configured, general-purpose LLMs match the performance of specialized models, providing a cost-effective, label-free alternative.
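To illustrate why OFAT exploration needs only a small fraction of the computation of a full factorial sweep, here is a minimal sketch. The factor names, levels, and scoring function below are hypothetical stand-ins, not the paper's actual design dimensions; OFAT here fixes each factor at its best-scoring level in turn, evaluating one configuration per untried level instead of every combination.

```python
from itertools import product

# Hypothetical design factors (names and levels are illustrative only).
factors = {
    "input_repr": ["plain_text", "xml_tags", "coordinates"],
    "chunking":   ["none", "page", "sliding_window"],
    "prompting":  ["zero_shot", "few_shot"],
    "model":      ["model_a", "model_b"],
}

def full_factorial(factors):
    """Enumerate every combination of every factor level."""
    keys = list(factors)
    return [dict(zip(keys, combo)) for combo in product(*factors.values())]

def ofat(factors, score):
    """One-factor-at-a-time: start from a baseline (the first level of
    each factor), then sweep each factor in turn, keeping its best level.
    Returns the chosen configuration and the number of evaluations."""
    config = {k: levels[0] for k, levels in factors.items()}
    n_evals = 1  # the baseline configuration itself
    for k, levels in factors.items():
        best = max(levels, key=lambda lv: score({**config, k: lv}))
        n_evals += len(levels) - 1  # baseline level was already scored
        config[k] = best
    return config, n_evals

# Toy scoring function for demonstration; a real one would run IE
# with the given configuration and return an F1 score.
score = lambda cfg: sum(len(v) for v in cfg.values())

best_cfg, n_ofat = ofat(factors, score)
n_full = len(full_factorial(factors))
print(n_ofat, n_full)  # 7 evaluations vs. 3*3*2*2 = 36
```

With four factors of sizes 3, 3, 2, and 2, OFAT evaluates 7 configurations against 36 for the full factorial, and the gap widens rapidly as factors and levels are added.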