China

Large Language Models (LLMs) deployed as autonomous agents with tool access present unique safety challenges that extend beyond standalone model vulnerabilities. Existing red-teaming frameworks like AgentHarm use static prompts and hardcoded toolsets, limiting their applicability to custom production systems.


We introduce a dual-component automated red-teaming framework: AgentHarm-Gen generates adversarial tasks and evaluation functions tailored to arbitrary toolsets, while Red-Agent-Reflect employs iterative prompt refinement with self-reflection to develop progressively more effective attacks.


Evaluating across 115 harmful tasks (71 generated, 44 from AgentHarm) spanning 8 risk categories, our method achieves substantial improvements: up to 162\% increase in attack success rate on o4-mini and 86\% success on Gemini 2.5 Pro. Successful attacks systematically decompose adversarial objectives into benign-appearing sub-tasks that circumvent safety alignment, highlighting the need for agent-specific guardrails.


We contribute our implementation to the AgentHarm repository, enabling systematic identification of safety vulnerabilities in custom agentic workflows before deployment.

EMNLP 2025

Agent vs. Agent: Automated Data Generation and Red-Teaming for Custom Agentic Workflows

responsible ai (rai)

automated red teaming

ai agents

poster

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first
workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."<br>

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)





<br>

Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

The Enterprise Intelligence Platform must integrate logs from numerous third-party vendors in order to perform various downstream tasks. However, vendor documentation is often unavailable at test time. It is either misplaced, mismatched, poorly formatted, or incomplete, which makes schema mapping challenging. We introduce a reinforcement learning agent that can self-improve without labeled examples or model weight updates. During inference, the agent: 1) Identifies ambiguous field-mapping attempts. 2) Generates targeted web-search queries to gather external evidence. 3) Applies a confidence-based reward to iteratively refine its mappings. To demonstrate this concept, we converted Microsoft Defender for Endpoint logs into a common schema. Our method increased mapping accuracy from 72.73\% to 93.94\% over 100 iterations using GPT-4o. At the same time, it reduced the number of low-confidence mappings requiring expert review by 85\%. This new approach provides an evidence-driven, transparent method for solving future industry problems, paving the way for more robust, accountable, scalable, efficient, flexible, adaptable, and collaborative solutions.

Mapping Smarter, Not Harder: A Test-Time Reinforcement Learning Agent That Improve Without Labels or Model Updates

Abstractive summarization is a core application in contact centers, where Large Language Models (LLMs) generate millions of summaries of call transcripts daily. Despite their apparent quality, it remains unclear whether LLMs systematically under- or over-attend to specific aspects of the transcript, potentially introducing biases in the generated summary. While prior work has examined social and positional biases, the specific forms of bias pertinent to contact center operations—which we term *Operational Bias*—have remained unexplored. To address this gap, we introduce ***BlindSpot***, a framework built upon a taxonomy of 15 operational bias dimensions (e.g., disfluency, speaker, topic) for the identification and quantification of these biases. ***BlindSpot*** leverages an LLM as a zero-shot classifier to derive categorical distributions for each bias dimension in a pair of transcript and its summary. The bias is then quantified using two metrics: *Fidelity Gap* (the JS Divergence between distributions) and *Coverage* (the percentage of source labels omitted). Using ***BlindSpot***, we conducted an empirical study with 2500 real call transcripts and their summaries generated by 20 LLMs of varying scales and families (e.g., GPT, Llama, Claude). Our analysis reveals that biases are systemic and present across all evaluated models, regardless of size or family.

Spot the BlindSpots: Systematic Identification and Quantification of Fine-Grained LLM Biases in Contact Center Summaries

Retrieval-Augmented Generation (RAG) is one of the leading and most widely used techniques for enhancing LLM retrieval capabilities, but it still faces significant limitations in commercial use cases. RAG primarily relies on the query-chunk text-to-text similarity in the embedding space for retrieval and can fail to capture deeper semantic relationships across chunks, is highly sensitive to chunking strategies, and is prone to hallucinations.
To address these challenges, we propose TOBUGraph, a graph-based retrieval framework that first constructs the knowledge graph from unstructured data dynamically and automatically. Using LLMs, TOBUGraph extracts structured knowledge and diverse relationships among data, going beyond RAG's text-to-text similarity. Retrieval is achieved through graph traversal, leveraging the extracted relationships and structures to enhance retrieval accuracy, eliminating the need for chunking configurations while reducing hallucination. We demonstrate TOBUGraph’s effectiveness in TOBU, a real-world application in production for personal memory organization and retrieval. Our evaluation using real user data demonstrates that TOBUGraph outperforms multiple RAG implementations in both precision and recall, significantly improving user experience through improved retrieval accuracy.

TOBUGraph: Knowledge Graph-Based Retrieval for Enhanced LLM Performance Beyond RAG

Deploying natural language processing (NLP) models on mobile platforms requires models that can adapt across diverse applications while remaining efficient in memory and computation. We investigate pre-finetuning strategies to enhance the adaptability of lightweight BERT-like encoders for two fundamental NLP task families: named entity recognition (NER) and text classification. While pre-finetuning improves downstream performance for each task family individually, we find that naïve joint pre-finetuning introduces conflicting optimization signals that degrade overall performance. To address this, we propose a simple yet effective multi-task pre-finetuning framework based on task-primary LoRA modules, which enables a single shared encoder backbone with task-specific adapters. Our approach achieves performance comparable to individual pre-finetuning while supporting practical deployment requirements. Experiments on 21 downstream tasks show average improvements of +0.8% for NER and +8.8% for text classification, demonstrating the effectiveness of our method for versatile mobile NLP applications.

Multi-Task Pre-Finetuning of Lightweight Transformer Encoders for Text Classification and NER

Natural language to SQL (NL-to-SQL) systems are increasingly critical in industry for enabling non-technical users to access structured data efficiently, supporting faster decision-making and data accessibility. However, state-of-the-art systems often depend on large proprietary models, which introduce serious concerns around privacy. While open-source LLMs offer a viable substitute, high-performing variants (e.g., 70B or 405B) require substantial GPU memory, making them impractical for many production environments. Smaller open-source models that fit on a single 80GB GPU present a more deployable alternative, yet existing efforts to enhance their Text-to-SQL performance rely heavily on fine-tuning, limiting flexibility. We propose textbftextttRoSL, a plug-and-play framework that improves SQL generation for smaller LLMs without any task-specific training. While schema linking is often omitted for larger models, we show it remains essential for smaller ones. Further, we are the first to apply question decomposition at the schema linking stage, rather than during SQL generation as in prior work, to address the precision-recall tradeoff. Our approach improves schema linking recall by textbf25.1\% and execution accuracy by textbf8.2\% on the BIRD benchmark using textttibm-granite/granite-3.3-8b-instruct, making it an effective and industry-friendly NL-to-SQL solution.

Divide, Link, and Conquer: Recall-oriented Schema Linking for NL-to-SQL via Question Decomposition

In many industrial settings, users wish to ask questions in natural language, the answers to which require assembling information from diverse structured data sources. With the advent of Large Language Models (LLMs), applications can now translate natural language questions into a set of API calls or database calls, execute them, and combine the results into an appropriate natural language response.
However, these applications remain impractical in realistic industrial settings because they do not cope with the data source heterogeneity that typifies such environments. In this paper, we simulate the heterogeneity of real industry settings by introducing two extensions of the popular Spider benchmark dataset that require a combination of database and API calls. Then, we introduce and evaluate a declarative approach to handling such data heterogeneity.
We demonstrate that our declarative approach does a significantly better job of coping with data source heterogeneity than state-of-the-art LLM-based agentic or imperative code generation systems. Our augmented benchmarks will soon be available to the research community.

Declarative Techniques for NL Queries over Heterogeneous Data

We present the Conversational Data Retrieval (CDR) benchmark, the first comprehensive test set for evaluating systems that retrieve conversation data for product insights. With 1.6k queries across five analytical tasks and 9.1k conversations, our benchmark provides a reliable standard for measuring conversational data retrieval performance. Our evaluation of 16 popular embedding models shows that even the best models reach only around NDCG@10 of 0.51, revealing a substantial gap between document and conversational data retrieval capabilities. Our work identifies unique challenges in conversational data retrieval (implicit state recognition, turn dynamics, contextual references) while providing practical query templates and detailed error analysis across different task categories. The benchmark dataset and code are available at https://anonymous.4open.science/r/CDR-Benchmark-B53A.

Finding Diamonds in Conversation Haystacks: A Benchmark for Conversational Data Retrieval

The specialized vocabulary and nuanced concepts of the telecommunications industry pose persistent challenges for standard Natural Language Processing (NLP) models. Generic embedding models often struggle to represent telecom-specific semantics, limiting their utility in retrieval and downstream tasks. We present T-VEC (Telecom Vectorization Model), a domain-adapted embedding model fine-tuned from the gte-Qwen2-1.5B-instruct backbone using a triplet loss objective over 100K curated telecom triplets. T-VEC sets a new benchmark in telecom retrieval, achieving CosineSim@1 of 0.8814, Recall@5 of 0.9249, and Top1 Exact Match of 0.9310—significantly outperforming leading general-purpose models like MPNet, BGE, and E5 by 20-30\% relative margin. These gains confirm T-VEC’s superior domain grounding and retrieval precision, with embedding visualizations further showcasing tight clustering of telecom-relevant concepts. We release T-VEC and its tokenizer to support more robust and semantically faithful NLP applications within the telecom domain.

T-VEC: A Telecom-Specific Vectorization Model with Enhanced Semantic Understanding via Deep Triplet Loss Fine-Tuning

Cascade systems for open-ended text generation face a fundamental challenge: determining output reliability when generation quality lies on a continuous spectrum, often with multiple valid responses. To address this, we propose _semantic agreement_—meaning-level consensus between ensemble outputs—as a training-free signal for reliable deferral. We show that when diverse model outputs agree semantically, their consensus is a stronger reliability signal than token-level confidence. Evaluated from 500M to 70B-parameter models, semantic cascades improve deferral accuracy, match or surpass target-model quality at 40\% of the cost, and reduce latency by up to 60\%. Our method requires no model internals, works across black-box APIs, and remains robust to model updates, making it a practical baseline for real-world LLM deployment.

Semantic Agreement Enables Efficient Open-Ended LLM Cascades

Large Language Models (LLMs) have transformed natural language processing, yet they still struggle with direct text editing tasks that demand precise, context-aware modifications. While models like ChatGPT excel in text generation and analysis, their editing abilities often fall short, addressing only superficial issues rather than deeper structural or logical inconsistencies. In this work, we introduce a dual approach to enhance LLMs editing performance. First, we present InstrEditBench, a high-quality benchmark dataset comprising over 20,000 structured editing tasks spanning Wiki articles, LaTeX documents, code, and database Domain-specific Languages (DSL). InstrEditBench is generated using an innovative automated workflow that accurately identifies and evaluates targeted edits, ensuring that modifications adhere strictly to specified instructions without altering unrelated content. Second, we propose FineEdit, a specialized model trained on this curated benchmark. Experimental results demonstrate that FineEdit achieves significant improvements around 10% compared with Gemini on direct editing tasks, convincingly validating its effectiveness.

Downloads

Next from EMNLP 2025

Mapping Smarter, Not Harder: A Test-Time Reinforcement Learning Agent That Improve Without Labels or Model Updates

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES

.css-70qvj9{display:-webkit-box;display:-webkit-flex;display:-ms-flexbox;display:flex;-webkit-align-items:center;-webkit-box-align:center;-ms-flex-align:center;align-items:center;}Downloads

Next from EMNLP 2025

Mapping Smarter, Not Harder: A Test-Time Reinforcement Learning Agent That Improve Without Labels or Model Updates

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES

Downloads