Keywords: semantic-aware reuse, LLM inference optimization, KV cache compression, token-level matching
As large language models (LLMs) continue to scale, the memory footprint of Key-Value (KV) caches during inference has become a significant bottleneck. Existing approaches primarily focus on compressing KV caches within a single prompt or on reusing shared prefixes and frequently occurring text segments across prompts. However, such strategies are limited in scenarios where prompts are semantically similar but lexically different, which arise frequently in tasks such as multi-document summarization and conversational agents. We propose SemShareKV, a KV cache sharing and compression framework that accelerates LLM inference by reusing the KV caches of semantically similar prompts. Instead of relying on exact token matches, SemShareKV applies fuzzy token matching using Locality-Sensitive Hashing (LSH) on token embeddings and incorporates Rotary Position Embedding (RoPE) to better preserve positional information. By selectively reusing relevant KV pairs from a reference prompt's cache, SemShareKV reduces redundant computation while maintaining output quality. Experiments on diverse summarization datasets show up to 6.25$\times$ speedup and 42\% lower GPU memory usage on 5k-token inputs, with negligible quality degradation. These results highlight the potential of semantic-aware cache sharing for efficient LLM inference.
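To make the core idea concrete, the sketch below illustrates fuzzy token matching via random-hyperplane LSH on token embeddings, used to decide which KV-cache entries from a reference prompt could be reused for a new, semantically similar prompt. This is a minimal illustration rather than the SemShareKV implementation: all function names, shapes, and thresholds (e.g., `num_bits`, `max_hamming`) are assumptions, and a full system would additionally re-apply positional encoding (RoPE) to reused entries, since matched tokens may occupy different positions in the two prompts.

```python
# Illustrative sketch only (not the authors' implementation): fuzzy token matching
# with random-hyperplane LSH, then reuse of reference KV entries for matched positions.
import numpy as np

rng = np.random.default_rng(0)

def lsh_signatures(embeddings: np.ndarray, hyperplanes: np.ndarray) -> np.ndarray:
    """Hash each token embedding to a bit signature via random hyperplanes."""
    # embeddings: (num_tokens, dim); hyperplanes: (num_bits, dim)
    return (embeddings @ hyperplanes.T > 0).astype(np.uint8)

def match_tokens(new_emb: np.ndarray, ref_emb: np.ndarray,
                 num_bits: int = 16, max_hamming: int = 2) -> dict:
    """Map positions in the new prompt to reference positions whose LSH
    signatures lie within a small Hamming distance (fuzzy, not exact, match)."""
    planes = rng.standard_normal((num_bits, new_emb.shape[1]))
    sig_new = lsh_signatures(new_emb, planes)
    sig_ref = lsh_signatures(ref_emb, planes)
    matches = {}
    for i, sig in enumerate(sig_new):
        dists = np.count_nonzero(sig_ref != sig, axis=1)  # Hamming distances
        j = int(np.argmin(dists))
        if dists[j] <= max_hamming:
            matches[i] = j
    return matches

# Toy demo: the new prompt's embeddings are near-duplicates of the reference's.
dim, n_ref, n_new = 64, 10, 8
ref_emb = rng.standard_normal((n_ref, dim))
new_emb = ref_emb[:n_new] + 0.05 * rng.standard_normal((n_new, dim))
ref_kv = rng.standard_normal((n_ref, 2, dim))  # placeholder (K, V) pair per token

matches = match_tokens(new_emb, ref_emb)
reused_kv = {i: ref_kv[j] for i, j in matches.items()}   # reuse matched entries
to_recompute = [i for i in range(n_new) if i not in matches]  # recompute the rest
print(f"reused {len(reused_kv)} / {n_new} positions; recompute {to_recompute}")
```

In this toy setting most positions match and their KV pairs are reused, while unmatched positions fall back to normal prefill computation; the speedup reported in the abstract comes from avoiding that recomputation for the reused fraction.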