EMNLP 2025

November 05, 2025

Suzhou, China


Large Language Models (LLMs) have demonstrated remarkable performance across various tasks. However, in long-form question answering (LFQA), they often struggle with factual accuracy, frequently generating hallucinated responses. In this work, we introduce FinLFQA, a benchmark designed to evaluate LLMs' ability to generate long-form answers to financial questions with reliable attributions. FinLFQA evaluates three key aspects: (1) evidence-supported content that enhances factual grounding and verifiability, (2) step-by-step calculations expressed as executable code for numerical reliability, and (3) reasoning informed by domain-specific financial knowledge. Using our automated evaluation protocol, we conduct an extensive evaluation of eight LLMs. Our findings show that GPT-4o outperforms the other models, while open-source models are closing the gap with proprietary ones, demonstrating that they are becoming competitive alternatives for real-world applications. We also find that post-hoc and end-to-end attribution generation perform similarly, and that iterative self-feedback yields no significant improvement unless an external signal is provided.
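The numerical-reliability aspect above can be illustrated with a minimal sketch: execute a model-generated calculation snippet and compare its result to a gold value within a relative tolerance. All names here (`run_generated_code`, the sample snippet, the tolerance) are illustrative assumptions, not FinLFQA's actual evaluation API.

```python
def run_generated_code(code: str) -> float:
    """Execute a generated calculation snippet and return its `result` variable.

    In a real evaluation harness this should run in a sandbox; plain exec()
    is used here only to keep the sketch self-contained.
    """
    namespace: dict = {}
    exec(code, namespace)
    return float(namespace["result"])


def is_numerically_correct(predicted: float, gold: float, rel_tol: float = 1e-4) -> bool:
    """Compare with a relative tolerance to absorb rounding differences."""
    return abs(predicted - gold) <= rel_tol * max(1.0, abs(gold))


# Hypothetical example: net profit margin = net income / revenue
generated = (
    "revenue = 394328.0\n"
    "net_income = 99803.0\n"
    "result = net_income / revenue\n"
)
gold = 99803.0 / 394328.0

print(is_numerically_correct(run_generated_code(generated), gold))  # True
```

Executing the generated code, rather than string-matching the final number, lets an evaluator verify each intermediate step of the calculation deterministically.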

Downloads: slides, paper, and an automatic English transcript are available.

