India

Discourse relation parsing plays a crucial role in uncovering the logical structure of text, yet existing corpora focus almost exclusively on general‐domain genres, leaving specialized fields like engineering under‐resourced. We introduce ENG‑DRB, the first PDTB‑style discourse relation corpus derived from transcripts of hands‑on engineering tutorial videos. ENG‑DRB comprises 11 tutorials spanning civil, mechanical, and electrical/electronics engineering (155 minutes total) with 1,215 annotated relations. Compared to general‑domain benchmarks, this dataset features a high proportion of explicit senses, dense causal and temporal relations, and frequent overlapping and embedded senses. Our benchmarking experiments underscore the dataset’s difficulty. A top parser (HITS) detects segment boundaries well (98.6% F1), but its relation classification is more than 11 F1 points lower than on the standard PDTB. In addition, state‑of‑the‑art LLMs (OpenAI o4‑mini, Claude 3.7, LLaMA‑3.1) achieve at best 41% F1 on explicit relations and less than 9% F1 on implicit relations, revealing systematic errors in temporal and causal sense detection. The dataset can be accessed at: https://doi.org/10.57967/hf/6895. Code to reproduce our results is available at: https://github.com/chengzhangedu/ENG-DRB.

IJCNLP-AACL 2025

ENG-DRB: PDTB-style Discourse Relation Bank on Engineering Tutorial Video Scripts

corpus benchmarking

tutorial transcripts

engineering discourse

pdtb-style annotation

discourse relation parsing

poster

### Welcome to IJCNLP-AACL 2025! 
 It is a great honor to host this joint conference in Mumbai, India, from December 20 to 24, 2025. The joint conferences of IJCNLP and AACL are organized with alternating leadership in the Asia-Pacific region. The event is run by the Asian Federation of Natural Language Processing (AFNLP) in odd years, and by AACL in even years, while it is organized solely by ACL when the annual ACL meeting is held in the region. This year, the conference is primarily organized by AFNLP. 
*Kentaro Inui
MBZUAI, UAE
General Chair, IJCNLP-AACL 2025* 
Read full message and download the Conference Handbook [**here**](https://drive.google.com/file/d/1UTwxkAqSqI-GAoJC3wE1zZt5VP1Y8GX0/view?usp=sharing).

The 14th IJCNLP & 4th AACL will be held in Mumbai, India from December 20th to December 24th, 2025.

Retrieval-Augmented Generation (RAG) has emerged as a powerful framework to overcome the knowledge limitations of Large Language Models (LLMs) by integrating external retrieval with language generation. While early RAG systems based on static pipelines have shown effectiveness in well-structured tasks, they struggle in real-world scenarios requiring complex reasoning, dynamic retrieval, and multi-modal integration. To address these challenges, the field has shifted toward Reasoning Agentic RAG, a paradigm that embeds decision-making and adaptive tool use directly into the retrieval process. In this paper, we present a comprehensive review of Reasoning Agentic RAG methods, categorizing them into two primary systems: predefined reasoning, which follows fixed modular pipelines to boost reasoning, and agentic reasoning, where the model autonomously orchestrates tool interaction during inference. We analyze representative techniques under both paradigms, covering architectural design, reasoning strategies, and tool coordination. Finally, we discuss key research challenges and propose future directions to advance the flexibility, robustness, and applicability of Reasoning Agentic RAG systems.

Reasoning RAG via System 1 or System 2: A Survey on Reasoning Agentic Retrieval-Augmented Generation for Industry Challenges

The rapid progress of generative AI (Gen-AI) and large language models (LLMs) offers significant potential for geospatial applications, but simultaneously introduces critical privacy, security, and ethical risks. Existing general-purpose AI safety frameworks inadequately cover GeoAI-specific risks such as geolocation privacy violations and re-identification, with False Safe Rates exceeding 40% in some models. To address this, we present GeoSAFE (Geospatial Safety Assurance Framework and Evaluation), introducing the first GeoAI-specific safety taxonomy with six hazard categories and a multimodal GeoSAFE-Dataset. The dataset includes 11,694 textual prompts with explanations, augmented by real-world queries and images to reduce synthetic bias and reflect operational use. We benchmark model performance on detecting `unsafe` geospatial queries. Additionally, we present GeoSAFEGuard, an instruction-tuned LLM achieving a 4.6% False Safe Rate, a 0.4% False Unsafe Rate, and a 97% F1-score on text-to-text evaluation of GeoSAFE-Dataset. An anonymous user survey confirms human-GeoSAFE alignment, emphasizing the urgent need for domain-specific safety evaluations as general-purpose LLMs fail to detect unsafe location-powered queries.

GeoSAFE - A Novel Geospatial Artificial Intelligence Safety Assurance Framework and Evaluation for LLM Moderation

Multilingual vision-language models (VLMs) promise universal image-text retrieval, yet their social biases remain under-explored. We perform the first systematic audit of four public multilingual CLIP variants (M-CLIP, NLLB-CLIP, CAPIVARA-CLIP, and the debiased SigLIP-2), covering ten languages that differ in resource availability and morphological gender marking. Using balanced subsets of FairFace and the PATA stereotype suite in a zero-shot setting, we quantify race and gender bias and measure stereotype amplification. Contrary to the intuition that multilinguality mitigates bias, *every* model exhibits stronger gender skew than its English-only baseline. CAPIVARA-CLIP shows its largest biases precisely in the low-resource languages it targets, while the shared encoder of NLLB-CLIP and SigLIP-2 transfers English gender stereotypes into gender-neutral languages; loosely coupled encoders largely avoid this leakage. Although SigLIP-2 reduces agency and communion skews, it inherits, and in caption-sparse contexts (e.g., Xhosa) amplifies, the English anchor’s crime associations. Highly gendered languages consistently magnify all bias types, yet gender-neutral languages remain vulnerable whenever cross-lingual weight sharing imports foreign stereotypes. Aggregated metrics thus mask language-specific “hot spots,” underscoring the need for fine-grained, language-aware bias evaluation in future multilingual VLM research.

Breaking Language Barriers or Reinforcing Bias? A Study of Gender and Racial Disparities in Multilingual Contrastive Vision Language Models

Hassles and uplifts are psychological constructs of individuals' positive or negative responses to daily minor incidents, with cumulative impacts on mental health. These concepts are largely overlooked in NLP, where existing tasks and models focus on identifying general sentiment expressed in text. These, however, cannot satisfy targeted information needs in psychological inquiry. To address this, we introduce Hassles and Uplifts Detection (HUD), a novel NLP application to identify these constructs in social media language.
We evaluate various language models and task adaptation approaches on a probing dataset collected from a private, real-time emotional venting platform. Some of our models achieve F scores close to 80%. We also identify open opportunities to improve affective language understanding in support of studies in psychology.

Hassles and Uplifts Detection on Social Media Narratives

Can today’s Text-to-SQL benchmarks still stretch modern LLMs? We argue no. Spider1.0 and BIRD, painstakingly hand-built, remain small, costly, and skewed toward moderately complex SQL. Meanwhile, LLM-generated corpora are inexpensive but often superficial and fragile, suffering from shallow nesting, semantic drift, template fatigue, and insufficient quality checks.

We address this gap with a Chain-of-Verifications framework that turns a handful of expert-labelled seeds into a large, reliably checked dataset at a fraction of the usual cost. The resulting corpus, AIGT2S, delivers: (1) 18k Question–SQL pairs across 113 databases, 41–77% larger than current English sets; (2) 55% of queries in the Ultra band of our four-level difficulty taxonomy; (3) 87.5% inter-annotator agreement; (4) ≥80% labour and ≥98% monetary savings versus earlier efforts.

Baselines, including GPT-4o, Llama3, RESDSQL, and MAC-SQL, achieve at most 56% execution accuracy, indicating substantial room for improvement.

High-Quality Complex Text-to-SQL Data Generation through Chain-of-Verification

In this paper, we address the persistent challenges that figurative language expressions pose for natural language processing (NLP) systems, particularly in low-resource languages such as Konkani. We present a hybrid model that integrates a pre-trained Multilingual BERT (mBERT) with a bidirectional LSTM and a linear classifier. This architecture is fine-tuned on a newly introduced annotated dataset for metaphor classification, developed as part of this work. To improve the model’s efficiency, we implement a gradient-based attention head pruning strategy. For metaphor classification, the pruned model achieves an accuracy of 78%. We also apply our pruning approach to an existing idiom classification task, achieving 83% accuracy. These results demonstrate the effectiveness of attention head pruning for building efficient NLP tools in underrepresented languages.

Pruning for Performance: Efficient Idiom and Metaphor Classification in Low-Resource Konkani Using mBERT

Self-supervised speech models have demonstrated the ability to learn rich acoustic representations. However, interpreting which specific phonological or acoustic features these models leverage within their highly polysemantic activations remains challenging. In this paper, we propose a straightforward and unsupervised probing method for model interpretability. We extract the activations from the final MLP layer of a pretrained HuBERT model and train a sparse autoencoder (SAE) using dictionary learning techniques to generate an over-complete set of latent representations. Analyzing these latent codes, we observe that a small subset of high-variance units consistently aligns with phonetic events, suggesting their potential utility as interpretable acoustic detectors. Our proposed method does not require labeled data beyond raw audio, providing a lightweight and accessible tool to gain insights into the internal workings of self-supervised speech models.

Interpretable Sparse Features for Probing Self-Supervised Speech Models

Transformer-based models, especially large language models (LLMs), dominate the field of NLP with their mass adoption in tasks such as text generation, summarization, and fake news detection. These models offer ease of deployment and reliability for most applications; however, they require significant amounts of computational power for training as well as inference. This poses challenges for their adoption in resource-constrained applications, especially in the open-source community, where compute availability is usually scarce. This work proposes a graph-based approach for Environmental Claim Detection, exploring Graph Neural Networks (GNNs) and Hyperbolic Graph Neural Networks (HGNNs) as lightweight yet effective alternatives to transformer-based models. Re-framing the task as a graph classification problem, we transform claim sentences into dependency parsing graphs, utilizing a combination of word2vec and learnable part-of-speech (POS) tag embeddings for the node features and encoding syntactic dependencies in the edge relations. Our results show that our graph-based models, particularly HGNNs in the Poincaré space (P-HGNNs), achieve performance superior to the state-of-the-art on environmental claim detection while using up to **30x fewer parameters**. We also demonstrate that HGNNs benefit vastly from explicitly modeling data in hierarchical (tree-like) structures, enabling them to significantly improve over their Euclidean counterparts.

Efficient Environmental Claim Detection with Hyperbolic Graph Neural Networks

Continual learning (CL) presents a significant challenge for large pre-trained models, primarily due to catastrophic forgetting and the high computational cost of sequential knowledge updating. Parameter-Efficient Transfer Learning (PETL) methods offer reduced computational burdens but often struggle to effectively mitigate forgetting. This paper introduces Stacked Low-Rank Adaptation (SLoRA), a novel parameter-efficient approach that leverages the additive composition of task-specific, frozen low-rank adapters to enable modular continual learning with inherent support for explicit knowledge modification. SLoRA was evaluated on vision benchmarks, BERT-base, and the 1-billion-parameter Llama-3.2-1B model. Experiments demonstrated that SLoRA almost completely eliminated catastrophic forgetting, achieving a final average accuracy of 92.75% on Llama-3.2-1B while perfectly preserving prior task performance. Furthermore, SLoRA is computationally efficient, enabling up to a 15x training speed-up over full fine-tuning with 99.7% fewer trainable parameters per update. SLoRA offers a compelling balance of forgetting mitigation, parameter efficiency, and modularity, representing a promising direction for developing adaptable and efficient lifelong knowledgeable foundation models.

Stacked LoRA: Isolated Low-Rank Adaptation for Lifelong Knowledge Management

Reliable evaluation of Question Answering (QA) systems in low-resource Indic languages presents a significant challenge due to limited annotated datasets, linguistic diversity, and a lack of suitable evaluation metrics. Languages such as Sindhi, Manipuri, Dogri, Konkani, and Maithili are particularly underrepresented, making it difficult to assess Large Language Models (LLMs) on QA tasks. Existing metrics, including BLEU, ROUGE-L, and BERTScore, are effective in machine translation and high-resource settings; however, they often fail in low-resource QA due to score compression, zero-inflation, and poor scale alignment. To overcome this, LRMGS (Language-Robust Metric for Generative QA) is introduced to capture semantic and lexical agreement while preserving the score scale across languages. LRMGS is evaluated across 8 Indic languages and multiple LLMs, demonstrating consistently higher concordance with reference-based chrF++ scores, measured using the Concordance Correlation Coefficient (CCC). Experimental results indicate that LRMGS provides more accurate discrimination of system performance in very low-resource languages compared to existing metrics. This work establishes a robust and interpretable framework for evaluating QA systems in low-resource Indic languages, supporting more reliable multilingual model assessment.
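For readers unfamiliar with the Concordance Correlation Coefficient named in the abstract above, the following is a minimal sketch of the standard definition (Lin's CCC, using population variances). It is not the LRMGS metric and not the authors' evaluation code; the function name and the example scores are hypothetical and chosen only for illustration.

```python
from statistics import fmean


def concordance_correlation(x, y):
    """Lin's concordance correlation coefficient between two score lists.

    CCC = 2 * cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y)) ** 2),
    with population (divide-by-n) variance and covariance.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x, mean_y = fmean(x), fmean(y)
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    denom = var_x + var_y + (mean_x - mean_y) ** 2
    if denom == 0:
        raise ValueError("CCC is undefined for two identical constant series")
    return 2 * cov_xy / denom


# Hypothetical scores: `shifted` is perfectly correlated with `reference`
# but offset by a constant, so CCC penalises it where Pearson's r would not.
reference = [1.0, 2.0, 3.0]
shifted = [2.0, 3.0, 4.0]
ccc_identical = concordance_correlation(reference, reference)
ccc_shifted = concordance_correlation(reference, shifted)
```

Unlike Pearson correlation, CCC drops below 1 when two metrics are perfectly correlated but sit on different scales or offsets, which is why it suits a comparison where scale alignment matters.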
