Singapore

The exponential growth of video content has created an urgent need for efficient multimodal video retrieval systems. However, existing approaches face three critical challenges: (1) fixed-weight fusion strategies fail under cross-modal noise and ambiguous queries, (2) temporal modeling struggles to capture coherent event sequences while penalizing unrealistic gaps, and (3) systems require manual modality selection, reducing usability. We propose a unified multimodal video retrieval system with three key innovations. First, a cascaded dual-embedding pipeline combines BEiT-3 and SigLIP for broad retrieval, refined by BLIP-2-based reranking to balance recall and precision. Second, a temporal-aware scoring mechanism applies exponential decay penalties to large temporal gaps via beam search, constructing coherent event sequences rather than isolated frames. Third, LLM-guided query decomposition (GPT-4o) automatically interprets ambiguous queries, decomposes them into modality-specific sub-queries (visual/OCR/ASR), and performs adaptive score fusion, eliminating manual modality selection. Qualitative analysis demonstrates that our system effectively handles ambiguous queries, retrieves temporally coherent sequences, and dynamically adapts fusion strategies, advancing interactive video search capabilities.
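
The temporal-aware scoring idea can be illustrated with a minimal sketch. It assumes each event in a query has a list of candidate frames given as (timestamp, similarity) pairs, and that a gap larger than `max_gap` seconds is penalized by a multiplicative factor `exp(-decay * excess)`. The function name `temporal_beam_search`, its parameters, and the exact penalty form are illustrative assumptions, not the authors' implementation.

```python
import math


def temporal_beam_search(candidates, beam_width=3, max_gap=10.0, decay=0.1):
    """Pick one frame per event so that the sequence is temporally coherent.

    candidates: one list per event, each holding (timestamp_seconds, similarity).
    Returns (best_score, timestamps) or (0.0, []) if no ordered sequence exists.
    """
    # Each beam entry is (cumulative_score, timestamps_chosen_so_far).
    beams = [(0.0, [])]
    for event_candidates in candidates:
        expanded = []
        for score, path in beams:
            for t, sim in event_candidates:
                if path:
                    gap = t - path[-1]
                    if gap <= 0:
                        continue  # events must move forward in time
                    # Assumed penalty: only the part of the gap beyond max_gap decays.
                    penalty = math.exp(-decay * max(0.0, gap - max_gap))
                else:
                    penalty = 1.0
                expanded.append((score + sim * penalty, path + [t]))
        # Keep only the highest-scoring partial sequences.
        beams = sorted(expanded, key=lambda b: b[0], reverse=True)[:beam_width]
    return beams[0] if beams else (0.0, [])


# Toy input: the frame at 100 s looks best in isolation, but the pair (5 s, 8 s)
# forms a coherent sequence.
toy_candidates = [
    [(5.0, 0.9), (100.0, 0.95)],
    [(8.0, 0.8), (300.0, 0.85)],
]
best_score, best_path = temporal_beam_search(toy_candidates)
```

In this toy input the sequence starting at 100 s loses because its only forward continuation lies 200 s away and is almost entirely discounted.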

AAAI 2026

Unified Interactive Multimodal Moment Retrieval via Cascaded Embedding-Reranking and Temporal-Aware Score Fusion

workshop paper

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.



The increasing diversity and scale of video data demand retrieval systems capable of multimodal understanding, adaptive reasoning, and domain-specific knowledge integration. This paper presents LLandMark, a modular multi-agent framework for landmark-aware multimodal video retrieval to handle real-world complex queries. The framework features specialized agents that collaborate across four stages: query parsing and planning, landmark reasoning, multimodal retrieval, and reranked answer synthesis. A key component, the Landmark Knowledge Agent, detects cultural or spatial landmarks and reformulates them into descriptive visual prompts, enhancing CLIP-based semantic matching for Vietnamese scenes. To expand capabilities, we introduce an LLM-assisted image-to-image pipeline, where a large language model (Gemini 2.5 Flash) autonomously detects landmarks, generates image search queries, retrieves representative images, and performs CLIP-based visual similarity matching, removing the need for manual image input. In addition, an OCR refinement module leveraging Gemini and LlamaIndex improves Vietnamese text recognition. Experimental results show that LLandMark achieves adaptive, culturally grounded, and explainable retrieval performance. Our code will be released soon.

LLandMark: A Multi-Agent Framework for Landmark-Aware Multimodal Interactive Video Retrieval

Modern generative pre-trained language models excel at open-ended text generation, yet continue to underperform on structure-related tasks such as NER, relation extraction, and semantic role labeling, especially when compared to encoder-only models of similar sizes. While this gap has been attributed to limited structure knowledge, we hypothesize this is also due to the missing connection between the model’s internal representations of linguistic structure and the output space used during supervised fine-tuning. We propose the Structured Language Generation Model (SLGM), a model- and task-agnostic framework that reformulates structured prediction as a classification problem through three components: (1) reinforced input formatting with structural cues, (2) loss design, and (3) format-aware decoding that constrains generation to task-valid outputs. Across 5 tasks and 13 datasets, SLGM substantially improves structure prediction without relying on dataset-specific engineering or additional model parameters. It outperforms baseline fine-tuning on models of the same size, achieves comparable performance to much larger models when used with sub-1B-parameter models, and acts as a zero-weight adapter that reproduces the benefits of dataset-specific fine-tuning in low-resource settings.

Structured Language Generation Model: Loss Calibration and Formatted Decoding for Robust Structure Prediction
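
As an illustration of what constraining generation to task-valid outputs can look like, the sketch below restricts a single decoding step to an allowed label set. The abstract does not describe SLGM's actual decoding procedure, so the function `constrained_argmax` and the toy vocabulary are assumptions made for this example only.

```python
def constrained_argmax(logits, vocab, allowed):
    """Return the highest-scoring token among those in `allowed`.

    logits: one score per vocabulary entry.
    vocab: token strings aligned with `logits`.
    allowed: set of tokens that are valid outputs for the task.
    Returns None if no allowed token is present in the vocabulary.
    """
    best_token, best_score = None, float("-inf")
    for token, score in zip(vocab, logits):
        # Tokens outside the task-valid set are never considered.
        if token in allowed and score > best_score:
            best_token, best_score = token, score
    return best_token


# Toy step for an NER-style label decision: "banana" has the highest raw score
# but is not a valid label, so the best valid label is chosen instead.
toy_vocab = ["PER", "LOC", "banana", "ORG"]
toy_logits = [1.2, 0.3, 4.0, 2.5]
label = constrained_argmax(toy_logits, toy_vocab, {"PER", "LOC", "ORG"})
```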

Optimizing Retrieval-Augmented Generation (RAG) configurations for specific tasks is a complex and resource-intensive challenge. Motivated by this challenge, frameworks for RAG hyper-parameter optimization (HPO) have recently emerged, yet their effectiveness has not been rigorously benchmarked. To fill this gap, we present a comprehensive study involving five HPO algorithms over five datasets from diverse domains, including a newly curated one on real-world product documentation. Our study explores the largest HPO search space considered to date, with three evaluation metrics as optimization targets. Analysis of the results shows that RAG HPO can be done efficiently, either greedily or with random search, and that it significantly boosts RAG performance for all datasets. For greedy HPO approaches, we show that optimizing model selection first is preferable to the common practice of following the RAG pipeline order during optimization.

An Analysis of Hyper-Parameter Optimization Methods for Retrieval Augmented Generation
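
A greedy HPO pass of the kind the abstract discusses can be sketched as a coordinate-wise search in which the caller chooses the order of parameters. Everything below is hypothetical: the function `greedy_hpo`, the parameter names, and the scoring function `toy_score`, which is a made-up stand-in for a real RAG evaluation and reports no results from the paper.

```python
def greedy_hpo(search_space, order, evaluate, defaults):
    """Optimize one hyper-parameter at a time, in the given order.

    search_space: {name: [candidate values]}.
    order: parameter names, in the order they are optimized.
    evaluate: callable taking a full config dict and returning a score.
    defaults: starting config with a value for every parameter.
    """
    config = dict(defaults)
    for name in order:
        best_value, best_score = config[name], float("-inf")
        for value in search_space[name]:
            trial = {**config, name: value}  # vary only the current parameter
            score = evaluate(trial)
            if score > best_score:
                best_value, best_score = value, score
        config[name] = best_value  # freeze the winner before moving on
    return config


def toy_score(config):
    # Hypothetical additive objective, used only to make the sketch runnable.
    score = 1.0 if config["model"] == "large" else 0.5
    score += 0.2 if config["chunk_size"] == 512 else 0.1
    score += 0.1 if config["top_k"] == 5 else 0.05
    return score


toy_space = {"model": ["small", "large"], "chunk_size": [256, 512], "top_k": [3, 5]}
toy_defaults = {"model": "small", "chunk_size": 256, "top_k": 3}
# "Model first" ordering, as opposed to following the pipeline order.
best = greedy_hpo(toy_space, ["model", "chunk_size", "top_k"], toy_score, toy_defaults)
```

Because the toy objective is additive, the order makes no difference here. In a real pipeline the parameters interact, which is where the choice of order starts to matter.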

We present Fisher-LD, a layer-wise knowledge distillation framework for transformer-based recommenders. Fisher-LD uses the Fisher Information Matrix to quantify per-layer importance, allocating supervision where it matters most. On MovieLens-1M, our 6-layer student achieves HR@10=0.978 with 3.3× compression and 3.2× speedup, exceeding the 12-layer teacher (HR@10=0.934) by 4.4 percentage points. Cross-domain validation on Amazon Reviews confirms generalization. Experiments against five baselines show consistent improvements. Our three-phase protocol—Fisher analysis, selective distillation, task fine-tuning—enables effective compression for production LLM-based recommenders. Code will be released.

Fisher-LD: Layer-Wise Knowledge Distillation for LLM Recommenders
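
One common way to turn Fisher information into a per-layer importance score is the empirical diagonal approximation: average the squared per-sample gradients of the log-likelihood and sum them within each layer. The sketch below assumes the gradients have already been computed elsewhere. The helper names and the step that normalizes scores into supervision weights are assumptions, not the Fisher-LD procedure itself.

```python
def layer_importance(per_sample_grads):
    """Empirical diagonal Fisher trace per layer.

    per_sample_grads: {layer_name: [gradient vector for each sample]}.
    Returns {layer_name: mean over samples of the summed squared gradients}.
    """
    importance = {}
    for layer, grads in per_sample_grads.items():
        total = sum(g * g for vec in grads for g in vec)
        importance[layer] = total / len(grads)
    return importance


def to_weights(importance):
    """Normalize importance scores so they sum to 1 (assumed allocation rule)."""
    z = sum(importance.values())
    return {layer: value / z for layer, value in importance.items()}


# Toy gradients for two layers and two samples.
toy_grads = {
    "layer0": [[1.0, 0.0], [1.0, 0.0]],
    "layer1": [[1.0, 1.0], [2.0, 0.0]],
}
weights = to_weights(layer_importance(toy_grads))
```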

jina-reranker-v3 is a 0.6B-parameter multilingual listwise reranker that introduces a novel "last but not late" (LBNL) interaction mechanism. Unlike late interaction models like ColBERT that encode documents separately before multi-vector matching, our approach enables cross-document interactions during encoding by processing queries and all candidate documents simultaneously within shared context windows. We extract contextual embeddings from special tokens at each document's end (the "last" position), but crucially, interactions occur throughout the encoding process ("not late"). The model achieves state-of-the-art BEIR performance with 61.9 nDCG@10 while being significantly smaller than comparable alternatives.

jina-reranker-v3: Last but Not Late Interaction for Listwise Document Reranking

The rapid growth of connected devices has made federated learning (FL) essential for privacy-preserving intelligence at the edge. Combined with the growing demand for over-parameterized neural networks and large-scale foundation and reasoning models, this creates a fundamental communication and information retrieval bottleneck. To alleviate the FL uplink communication bottleneck, we propose Digital MAC Federated Learning (DigiMAC-FL), a novel, fully digital-transmission-based aggregation framework that performs quantized model averaging directly within the multiple access channel. Each client transmits QAM-encoded model increments, enabling simultaneous uplink aggregation and exact arithmetic averaging in the complex baseband. Unlike analog over-the-air schemes, DigiMAC-FL operates within standard digital modulation while preserving the linearity required for theoretical analysis. Experiments on Fashion-MNIST and CIFAR-10 confirm that DigiMAC-FL achieves near-FedAvg accuracy with more than a tenfold reduction in communication cost. The observed Pareto-optimal point at four bits per parameter matches the analytically predicted convergence threshold, demonstrating an exact bit-accuracy trade-off. These results highlight DigiMAC-FL as a practical bridge between wireless communication and distributed reasoning systems, paving the way for scalable, communication-aware retrieval and foundation model training.

DigiMAC-FL: Towards Communication-Aware Federated Optimization for Reasoning-Enhanced Retrieval
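
The claim that averaging can happen inside the channel rests on linearity: if every client sends a complex symbol at the same time, an ideal multiple access channel delivers their sum, and dividing by the number of clients gives the mean. The sketch below assumes a noiseless channel with unit gains and perfect synchronization, and uses 16-QAM amplitude levels for the toy data. These are simplifying assumptions for illustration, not the paper's system model.

```python
def to_symbol(i_level, q_level):
    """Map two quantized increments to one complex baseband symbol (I + jQ)."""
    return complex(i_level, q_level)


def aggregate_over_mac(client_symbols):
    """Idealized multiple access channel followed by averaging.

    The channel superposes (adds) all transmitted symbols. The receiver
    divides by the number of clients to obtain the arithmetic mean.
    """
    superposed = sum(client_symbols)
    return superposed / len(client_symbols)


# Three clients, each sending one symbol with levels drawn from {-3, -1, 1, 3}.
toy_symbols = [to_symbol(1, -3), to_symbol(3, 1), to_symbol(-1, -1)]
mean_symbol = aggregate_over_mac(toy_symbols)
# The real and imaginary parts are the averages of the I and Q increments.
mean_i, mean_q = mean_symbol.real, mean_symbol.imag
```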

Composed Image Retrieval (CIR) aims to find a target image that aligns with user intent, expressed through a reference image and a modification text. While Zero-shot CIR (ZS-CIR) methods sidestep the need for labeled training data by leveraging pretrained vision-language models, they often rely on a single fused query that merges all descriptive cues of what the user wants, which tends to dilute key information and fails to account for what the user wishes to avoid. Moreover, current CIR benchmarks assume a single correct target per query, overlooking the ambiguity in modification texts. To address these challenges, we propose Soft Filtering with Textual constraints (SoFT), a training-free, plug-and-play filtering module for ZS-CIR. SoFT leverages multimodal large language models (LLMs) to extract two complementary constraints from the reference-modification pair: prescriptive (must-have) and proscriptive (must-avoid) constraints. These serve as semantic filters that reward or penalize candidate images to re-rank results, without modifying the base retrieval model or adding supervision. In addition, we construct a two-stage dataset pipeline that refines CIR benchmarks. We first identify multiple plausible targets per query to construct multi-target triplets, capturing the open-ended nature of user intent. We then guide multimodal LLMs to rewrite the modification text to focus on one target, while referencing contrastive distractors to ensure precision. This enables more comprehensive and reliable evaluation under varying ambiguity levels. Applied on top of CIReVL, a ZS-CIR retriever, SoFT raises R@5 to 65.25 on CIRR (+12.94), mAP@50 to 27.93 on CIRCO (+6.13), and R@50 to 58.44 on FashionIQ (+4.59), demonstrating broad effectiveness.

Soft Filtering: Guiding Zero-shot Composed Image Retrieval with Prescriptive and Proscriptive Constraints
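
Re-ranking with reward and penalty terms can be sketched as follows. The sketch assumes each candidate image already has a base retrieval score plus two similarity scores, one against the must-have constraint text and one against the must-avoid constraint text. The linear combination and the weights `alpha` and `beta` are illustrative assumptions, since the abstract does not give the scoring formula.

```python
def soft_filter_rerank(candidates, alpha=0.5, beta=0.5):
    """Re-rank candidates by rewarding must-have and penalizing must-avoid matches.

    candidates: {image_id: (base_score, prescriptive_sim, proscriptive_sim)}.
    Returns image ids sorted from best to worst adjusted score.
    """
    adjusted = {
        image_id: base + alpha * must_have - beta * must_avoid
        for image_id, (base, must_have, must_avoid) in candidates.items()
    }
    return sorted(adjusted, key=adjusted.get, reverse=True)


# Toy scores: "a" leads on the base score but strongly matches the
# must-avoid constraint, so it drops to last place after filtering.
toy_candidates = {
    "a": (0.80, 0.2, 0.9),
    "b": (0.75, 0.9, 0.1),
    "c": (0.60, 0.5, 0.5),
}
ranking = soft_filter_rerank(toy_candidates)
```

The base retriever's scores are only adjusted, never recomputed, which matches the plug-and-play framing in the abstract.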

Large Language Models (LLMs) are increasingly employed in enterprise question-answering (QA) systems, requiring adaptation to domain-specific knowledge. Among the most prevalent methods for incorporating such knowledge are Retrieval-Augmented Generation (RAG) and fine-tuning (FT). Yet, from a cost–accuracy trade-off perspective, it remains unclear which approach best suits industry scenarios. This study investigates the impact of RAG and FT across two closed, industry-specific datasets, evaluating answer quality and operational costs. We extend the Cost-of-Pass framework proposed by Erol et al. (2025) to jointly assess output quality, generation cost, and user interaction cost. Our findings reveal that while premium models perform best out of the box, open-source models can achieve comparable quality when enhanced with RAG. Overall, RAG emerges as the most effective and cost-efficient adaptation method for both closed and open-source models.

Assessment of RAG and Fine-Tuning for Industrial Question-Answering-Applications
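
The basic cost-of-pass quantity is the expected cost of obtaining one correct answer, that is, the cost of a single attempt divided by the probability that the attempt succeeds. The sketch below shows that ratio and one assumed way of folding a per-attempt user interaction cost into it. The abstract does not specify how the extension is defined, so the parameter `interaction_cost` and the way it is added are hypothetical.

```python
def cost_of_pass(generation_cost, success_rate, interaction_cost=0.0):
    """Expected cost to obtain one correct answer.

    generation_cost: monetary cost of one model attempt.
    success_rate: probability in [0, 1] that one attempt is correct.
    interaction_cost: assumed extra per-attempt cost of user effort.
    Returns infinity when the model never succeeds.
    """
    if success_rate <= 0:
        return float("inf")
    return (generation_cost + interaction_cost) / success_rate


# Toy comparison with made-up numbers: a cheaper model can still cost more
# per correct answer if it succeeds less often.
cheap_but_weak = cost_of_pass(generation_cost=0.01, success_rate=0.2)
pricey_but_strong = cost_of_pass(generation_cost=0.03, success_rate=0.9)
```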

Multi-hop question-answering tasks pose a significant challenge for retrieval-augmented generation (RAG) models, which must effectively retrieve, infer, and reason across multiple documents to provide accurate answers. Despite the support of RAG, large language models (LLMs) often struggle to consistently retrieve the most relevant information from large collections of documents. To address this issue, we propose a novel approach, BrowseNet, involving query-specific traversal within a knowledge graph (KG) for multi-hop information retrieval. This method encodes unstructured text data into a KG, where nodes represent document chunks and edges capture lexical relationships between these chunks. By dynamically traversing the knowledge graph based on the type of multi-hop query, the proposed method enhances retrieval performance by leveraging intrinsic network parameters. We evaluate this method against RAG baselines on publicly available multi-hop query datasets. Experimental results demonstrate that BrowseNet establishes a new state-of-the-art retrieval performance in multi-hop retrieval.

Structured Traversal of Knowledge Graphs for Multi-hop Question Answering
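
To make the graph structure concrete, the sketch below stores document chunks as nodes in an adjacency list and collects every chunk reachable from a set of seed chunks within a hop limit. It uses a plain breadth-first search as a stand-in. BrowseNet's query-specific traversal is not described in enough detail in the abstract to reproduce, and the names `traverse`, `seeds`, and `max_hops` are assumptions.

```python
from collections import deque


def traverse(graph, seeds, max_hops):
    """Breadth-first expansion from seed chunks, limited to `max_hops` edges.

    graph: {chunk_id: [neighbouring chunk ids]}.
    seeds: chunk ids retrieved directly for the query.
    Returns {chunk_id: hop distance from the nearest seed}.
    """
    distance = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(node, []):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return distance


# Toy chain of four chunks linked by shared terms.
toy_graph = {
    "c1": ["c2"],
    "c2": ["c1", "c3"],
    "c3": ["c2", "c4"],
    "c4": ["c3"],
}
reachable = traverse(toy_graph, ["c1"], max_hops=2)
```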

Large language models (LLMs) often produce fluent but factually incorrect statements, even when relevant evidence is available, due to misallocation of attention between contextual inputs and parametric knowledge. Ensuring that models actively reason over context and retrieve relevant information is critical for trustworthy and interpretable AI. We introduce COMPASS (Context-Modulated PID Attention Steering System), a lightweight, interpretable framework that dynamically steers attention to retrieved context during generation. Using the Context Reliance Score (CRS), COMPASS identifies which attention heads are underutilizing context, and a PID controller adjusts them in real time to improve evidence grounding and factual consistency. This mechanism enables the model to demonstrate advanced reasoning by actively returning to context and retrieving supporting information when needed, without retraining or multi-pass decoding. Across benchmarks including HotpotQA, XSum, HaluEval, and RAGTruth, COMPASS reduces hallucinations by 2.8–5.8% absolute while revealing how attention heads contribute to context-aligned reasoning. These results show that feedback-driven, interpretable control can enhance reasoning, retrieval, and evidence-based generation in LLMs.
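
A PID controller of the kind the abstract mentions can be sketched as a standard discrete-time loop. Here the error is assumed to be the difference between a target context reliance value and the measured one, and the controller output is assumed to be a boost applied to the context attention of an underperforming head. The class name `ContextPID`, the gains, and that interpretation of the signals are illustrative assumptions, not COMPASS's actual design.

```python
class ContextPID:
    """Discrete PID controller: output = Kp*e + Ki*sum(e*dt) + Kd*de/dt."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None  # no derivative term on the first step

    def step(self, error, dt=1.0):
        self.integral += error * dt
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


# Toy loop: the head starts below an assumed target reliance of 0.6.
controller = ContextPID(kp=1.0, ki=0.5, kd=0.25)
target = 0.6
boosts = []
for measured in (0.2, 0.4):  # hypothetical per-step reliance measurements
    boosts.append(controller.step(target - measured))
```

The integral term keeps pushing while the head stays under the target, and the derivative term damps the boost once the error starts shrinking.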
