China

Automating the classification of negative treatment in legal precedent is a critical yet nuanced NLP task where misclassification carries significant risk. To address the shortcomings of standard accuracy, this paper introduces a more robust evaluation framework. We benchmark modern Large Language Models on a new, expert-annotated dataset of 239 real-world legal citations and propose a novel Average Severity Error metric to better measure the practical impact of classification errors. Our experiments reveal a performance split: Google's Gemini 2.5 Flash achieved the highest accuracy on a high-level classification task (79.1%), while OpenAI's GPT-5-mini was the top performer on the more complex fine-grained schema (67.7%). This work establishes a crucial baseline, provides a new context-rich dataset, and introduces an evaluation metric tailored to the demands of this complex legal reasoning task.

EMNLP 2025

Validate Your Authority: Benchmarking LLMs on Multi-Label Precedent Treatment Classification
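The abstract names an Average Severity Error metric without defining it. As a rough sketch of why such a metric can separate systems that plain accuracy treats as equal, one plausible reading is to give each treatment label an ordinal severity rank and average the absolute rank distance between gold and predicted labels. The label names, the rank values, and the single-label simplification below are all assumptions made for illustration, not the paper's schema or formula.

```python
from typing import Mapping, Sequence

# Hypothetical severity ranks (assumption): higher means more severe
# negative treatment. These labels are illustrative, not the paper's.
SEVERITY: Mapping[str, int] = {
    "neutral": 0,
    "distinguished": 1,
    "criticized": 2,
    "overruled": 3,
}


def accuracy(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Fraction of citations whose predicted label equals the gold label."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and equally long")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def average_severity_error(
    gold: Sequence[str],
    pred: Sequence[str],
    severity: Mapping[str, int] = SEVERITY,
) -> float:
    """Mean absolute distance between gold and predicted severity ranks.

    This is one assumed formulation: a correct prediction costs 0, and a
    wrong one costs more the further it lands from the gold severity.
    """
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and equally long")
    total = sum(abs(severity[g] - severity[p]) for g, p in zip(gold, pred))
    return total / len(gold)


gold = ["overruled", "neutral", "criticized", "distinguished"]
# Both systems make exactly one mistake on the first citation.
pred_far = ["neutral", "neutral", "criticized", "distinguished"]
pred_near = ["criticized", "neutral", "criticized", "distinguished"]
```

Under these assumed ranks, both prediction lists have the same accuracy, but mistaking an overruled precedent for a neutral one is penalized three times as heavily as mistaking it for a criticized one.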

workshop paper

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."<br>

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)





Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

The performance of Retrieval-Augmented Generation (RAG) systems on Thai legal question answering is still limited, especially for questions requiring extensive, complex legal reasoning. To address these limitations, we introduce a resource-efficient approach that aligns Large Language Models (LLMs) for improved citation accuracy and response quality using Group-Relative Policy Optimization (GRPO). Our proposed method leverages BGE-M3 embeddings as a cost-efficient semantic-similarity reward, reducing computational expenses by up to 2.5x compared to an LLM-based reward model. Experiments on the NitiBench benchmark demonstrate substantial improvements: GRPO achieves up to 90% citation-F1 gains relative to the base model and a 31% increase in joint quality metrics over instruction tuning. Crucially, our approach provides a practical and effective solution for enhancing legal LLMs in resource-constrained environments.

Aligning LLMs for Thai Legal Question Answering with Efficient Semantic-Similarity Rewards
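To make the idea of an embedding-based reward concrete, here is a minimal sketch of the two pieces the abstract combines: a cosine-similarity reward between a sampled answer and a reference answer, and the group-relative normalization that GRPO applies to the rewards of one prompt's samples. The embeddings are assumed to be precomputed (for example by BGE-M3) and are represented as plain float lists; the function names, the `eps` constant, and the use of the population standard deviation are illustrative choices, not details taken from the paper.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_rewards(
    sample_embeddings: Sequence[Sequence[float]],
    reference_embedding: Sequence[float],
) -> List[float]:
    """One reward per sampled answer: similarity to the reference answer."""
    return [cosine_similarity(e, reference_embedding) for e in sample_embeddings]


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of samples for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so samples
    are scored relative to their siblings rather than on an absolute scale.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy 2-d "embeddings" standing in for real sentence embeddings.
reference = [1.0, 0.0]
samples = [[1.0, 0.0], [0.0, 1.0]]
rewards = similarity_rewards(samples, reference)
advantages = group_relative_advantages(rewards)
```

The appeal of this kind of reward is that it needs only one embedding pass per sample instead of a forward pass through a separate LLM judge.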

Large language models (LLMs) are moving into legal workflows, yet we lack a jurisdiction-grounded way to gauge their basic competence therein. We use India’s public legal examinations as a transparent proxy. Our multi-year benchmark assembles objective screens from top national and state exams and evaluates open and frontier LLMs under real-world exam conditions. To probe beyond MCQs, we also include a lawyer-graded, paired-blinded study of long-form answers from the Supreme Court’s Advocate-on-Record exam. This is, to our knowledge, the first exam-grounded, India-specific yardstick for LLM court-readiness released with datasets and protocols. Our work shows that while frontier systems consistently clear historical cutoffs and often match or exceed recent top-scorer bands on objective exams, none surpasses the human topper on long-form reasoning. Grader notes converge on three reliability failure modes: procedural/format compliance, authority/citation discipline, and forum-appropriate voice/structure. These findings delineate where LLMs can assist (checks, cross-statute consistency, statute and precedent lookups) and where human leadership remains essential: forum-specific drafting and filing, procedural and relief strategy, reconciling authorities and exceptions, and ethical, accountable judgment.

Are LLMs Court-Ready? Evaluating Frontier Models on Indian Legal Reasoning

Definitions in alliance contracts play a critical role in shaping agreements, yet they can also lead to costly misunderstandings. This is exemplified by the multimillion-dollar AstraZeneca-European Commission (EC) dispute, where the interpretation of 'best reasonable effort' became the focal point of contention. In this interdisciplinary study, we leverage natural language processing (NLP) to systematically analyze patterns in the definitions included in alliance contracts. More specifically, we categorize the content of definitions into topics, identify common terms versus outliers that are semantically dissimilar and infrequently used, and track how definitions evolve over time. Analyzing a dataset of 380,131 definitions from 12,468 alliance contracts in the biopharmaceutical industry, we find that definitions span legal, technological, and social topics, with social terms showing the highest dissimilarity across contracts. Using dynamic topic modeling, we explore how the content of definitions has shifted over two decades (2000–2020) and identify prevalent trends suggesting that contractual definitions reflect broader economic contexts. Notably, our results reveal that the AstraZeneca-EC dispute arose from an outlier, a highly unusual definition, that could have been flagged using NLP. Overall, these findings highlight the potential of data-driven approaches to uncover patterns in alliance contracts.

Tracing Definitions: Lessons from Alliance Contracts in the Biopharmaceutical Industry

Text summarization systems face significant adaptation costs when deployed across diverse domains, requiring expensive few-shot learning or manual prompt engineering. We propose a cost-effective domain adaptation framework that generates reusable summarization guidelines using only two reference summaries and three LLM inferences. In a one-time preparation step, the model compares its own generated summaries against domain-specific reference summaries and derives concise natural language guidelines that capture the summarization patterns of the target domain. These guidelines are then appended to the summarization prompt to adapt the LLM to the target domain at minimal cost. We evaluate our method across diverse model sizes on three distinct summarization domains: Lawsuits, ArXiv papers, and Patents. Automatic metrics show that guideline-based adaptation achieves comparable or superior performance compared to in-context learning and zero-shot baselines. An LLM preference evaluation using the latest models shows that summaries generated with such guidelines are preferred over those from zero-shot or in-context learning summarization prompts. Our method enables efficient domain adaptation of LLM-based text summarizers with minimal resource overhead, making specialized summarization particularly accessible for agentic systems that need to process heterogeneous texts in enterprise environments.

Domain Adapted Text Summarization with Self-Generated Guidelines

Legal Argument Mining (LAM) is a complex challenge for humans and language models alike. This paper explores the application of Large Language Models (LLMs) in LAM, focusing on the identification of fine-grained argument types within judgment texts. We compare the performance of Flan-T5 and Llama 3 models against a baseline RoBERTa model to study whether the advantages of much larger LLMs can be leveraged for this task. Our study investigates the effectiveness of fine-tuning and prompting strategies in enhancing the models’ ability to discern nuanced argument types. Despite employing state-of-the-art techniques, our findings indicate that neither fine-tuning nor prompting could surpass the performance of a domain-pre-trained encoder-only model. This highlights the challenges and limitations in adapting general-purpose large language models to the specialized domain of legal argumentation. The insights gained from this research contribute to the ongoing discourse on optimizing NLP models for complex, domain-specific tasks. Our code and data for reproducibility are available at https://github.com/trusthlt/legal-argument-spans.

Contemporary LLMs struggle with extracting formal legal arguments

Judicial work depends on close reading of long records (charge sheets, pleadings, annexures, orders), often spanning hundreds of pages. With limited staff support, exhaustive reading during hearings is impractical. We present CourtNav, a voice-guided, anchor-first navigator for legal PDFs that maps a judge’s spoken command (e.g., “go to paragraph 23”, “highlight the contradiction in the cross-examination”) directly to a highlighted paragraph in seconds. CourtNav transcribes the command, classifies intent with a grammar-first, LLM-backed router, retrieves over a layout-aware hybrid index, and auto-scrolls the viewer to the cited span while highlighting it and close alternates. By design, the interface shows only grounded passages, never free text, keeping evidence verifiable and auditable. This need is acute in India, where judgments and cross-examinations are notoriously long. In a pilot on representative charge sheets, pleadings, and orders, median time-to-relevance drops from 3–5 minutes (manual navigation) to 10–15 seconds; with quick visual verification included, 30–45 seconds. Under fixed time budgets, this navigation-first design increases the breadth of the record actually consulted while preserving control and transparency.

CourtNav: Voice-Guided, Anchor-Accurate Navigation of Long Legal Documents in Courtrooms
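The phrase "grammar-first, LLM-backed router" suggests a two-stage design: cheap deterministic patterns handle explicit navigation commands, and only commands that match no pattern are passed to a language model. The sketch below illustrates that general pattern under stated assumptions; the regular expressions, the intent names, and the `llm_fallback` callable are hypothetical and are not taken from CourtNav itself.

```python
import re
from typing import Callable, Dict, Optional, Union

Intent = Dict[str, Union[str, int]]

# Hypothetical grammar rules (assumption): explicit structural commands.
PARAGRAPH_RE = re.compile(
    r"\b(?:go to|jump to|open)\s+paragraph\s+(\d+)\b", re.IGNORECASE
)
PAGE_RE = re.compile(r"\b(?:go to|jump to|open)\s+page\s+(\d+)\b", re.IGNORECASE)


def route(
    command: str,
    llm_fallback: Optional[Callable[[str], Intent]] = None,
) -> Intent:
    """Classify a transcribed command, trying grammar rules before any LLM."""
    match = PARAGRAPH_RE.search(command)
    if match:
        return {"intent": "goto_paragraph", "target": int(match.group(1))}
    match = PAGE_RE.search(command)
    if match:
        return {"intent": "goto_page", "target": int(match.group(1))}
    # No rule matched: defer to the (hypothetical) LLM-backed classifier.
    if llm_fallback is not None:
        return llm_fallback(command)
    return {"intent": "unknown", "query": command}


def fake_llm(command: str) -> Intent:
    """Stand-in for a real model call, used only to exercise the fallback."""
    return {"intent": "semantic_search", "query": command}


explicit = route("Go to paragraph 23")
open_ended = route("highlight the contradiction in the cross-examination", fake_llm)
```

Putting the grammar stage first keeps frequent commands fast and deterministic, and reserves the slower model call for open-ended requests.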

This position paper argues that European copyright law has struggled to keep pace with the development of large language models (LLMs), possibly creating a fundamental epistemic misalignment: copyright compliance relies on qualitative, context-dependent standards, while LLM development is governed by quantitative, proactive metrics. This gap means that technical safeguards, by themselves, may be insufficient to reliably demonstrate legal compliance. We identify several practical limitations in the existing EU legal frameworks, including ambiguous "lawful access" rules, fragmented opt-outs, and vague disclosure duties. We then discuss technical measures such as provenance-first data governance, machine unlearning for post-hoc removal, and synthetic data generation, showing their promise but also their limits. Finally, we propose a path forward grounded in legal-technical co-design, suggesting directions for standardising machine-readable opt-outs and disclosure templates, clarifying core legal terms, and developing legally-informed benchmarks and evidence standards. We conclude that such an integrated framework is essential to make compliance auditable, thus protecting creators' rights while enabling responsible AI innovation at scale.

Copyright Infringement by Large Language Models in the EU: Misalignment, Safeguards, and the Path Forward

One of the first steps in the judicial process is finding the applicable statutes/laws based on the facts of the current situation. Manually searching through multiple pieces of legislation to find the relevant statutes can be time-consuming, making the Legal Statute Identification (LSI) task important for reducing the workload and improving the efficiency of the judicial system. To this end, we present a novel knowledge graph-enhanced approach for Legal Statute Identification (LSI) in Indian legal documents using Large Language Models, incorporating structural relationships from the Indian Penal Code (IPC), the main legislation codifying criminal laws in India. On the IL-TUR benchmark, explicit KG inference significantly enhances recall without sacrificing competitive precision. Augmenting LLM prompts with KG context, though, merely enhances coverage at the expense of precision, underscoring the importance of good reranking techniques. This research provides the first complete IPC knowledge graph and shows that organized legal relations richly augment statute retrieval, provided they are integrated into language models judiciously. Our code and data are publicly available on GitHub (https://github.com/SiddharthShukla48/NyayGraph).

NyayGraph: A Knowledge Graph Enhanced Approach for Legal Statute Identification in Indian Law using Large Language Models

The rise of influencer marketing has blurred boundaries between organic content and sponsored content, making the enforcement of legal rules relating to transparency challenging. Effective regulation requires applying legal knowledge with a clear purpose and reason, yet current detection methods of undisclosed sponsored content generally lack legal grounding or operate as opaque "black boxes." Using 1,143 Instagram posts, we compare gpt-5-nano and gemini-2.5-flash-lite under three prompting strategies with controlled levels of legal knowledge provided. Both models perform strongly in classifying content as sponsored or not (F1 up to 0.93), with Gemini favouring recall (0.93) and GPT favouring precision (0.95), though performance drops by over 10 points on ambiguous cases. We further develop a taxonomy of reasoning errors, showing frequent citation omissions (28.57%), unclear references (20.71%), and hidden ads exhibiting the highest miscue rate (28.57%). While adding regulatory text to the prompt improves explanation quality, it does not consistently improve detection accuracy. The contribution of this paper is threefold. First, it makes a novel addition to regulatory compliance technology by providing a taxonomy of common errors in LLM-generated legal reasoning to evaluate whether automated moderation is not only accurate but also legally robust, thereby advancing the transparent detection of influencer marketing content. Second, it features an original dataset of LLM explanations annotated by two students who were trained in influencer marketing law. Third, it combines quantitative and qualitative evaluation strategies for LLM explanations and critically reflects on how these findings can support advertising regulatory bodies in automating moderation processes on a solid legal foundation.

Evaluating LLM-Generated Legal Explanations for Regulatory Compliance in Social Media Influencer Marketing

As social systems become more complex, legal articles have grown increasingly intricate, making it harder for humans to identify potential conflicts among them, particularly when drafting new laws or applying existing ones. Despite its importance, no method has been proposed to detect such conflicts. We introduce a new legal NLP task, Legal Article Conflict Detection (LACD), which aims to identify conflicting articles within a given body of law. To address this task, we propose GReX, a novel graph neural network-based retrieval method. Experimental results show that GReX significantly outperforms existing methods, achieving improvements of 44.8% in nDCG@50, 32.8% in Recall@50, and 39.8% in Retrieval F1@50. Our code is available at github.com/asmath472/LACD-public.
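For readers unfamiliar with the reported retrieval metrics, the following sketch shows the standard binary-relevance definitions of Recall@k and nDCG@k. It assumes each query has a set of relevant article IDs and a ranked list without duplicates; the article IDs are made up for illustration, and the paper's exact evaluation setup (including how Retrieval F1@50 is computed) may differ.

```python
import math
from typing import Hashable, Sequence, Set


def recall_at_k(ranked: Sequence[Hashable], relevant: Set[Hashable], k: int) -> float:
    """Share of relevant items that appear in the top-k of the ranking."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked: Sequence[Hashable], relevant: Set[Hashable], k: int) -> float:
    """Binary-relevance nDCG@k: DCG of the ranking over DCG of an ideal one."""
    # A relevant item at 0-based position i contributes 1 / log2(i + 2).
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, item in enumerate(ranked[:k])
        if item in relevant
    )
    # The ideal ranking places all relevant items first.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical article IDs: two articles truly conflict with the query article.
ranked_articles = ["art_12", "art_40", "art_7", "art_99"]
conflicting = {"art_12", "art_7"}

recall_2 = recall_at_k(ranked_articles, conflicting, k=2)
ndcg_2 = ndcg_at_k(ranked_articles, conflicting, k=2)
```

Recall@k only asks whether conflicting articles were retrieved at all, while nDCG@k also rewards placing them near the top of the list.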
