Multimodal large language models (MLLMs) have demonstrated remarkable capabilities across diverse vision-language tasks, including image captioning, visual question answering, and text-image retrieval. However, their computational complexity and memory footprint, particularly in the key-value (KV) cache during inference, pose significant challenges for real-time deployment, especially on resource-constrained devices. In this paper, we propose Dynamic KV Cache Quantization, a novel quantization strategy tailored for MLLMs. Our approach applies per-channel quantization to the keys (K) and per-token quantization to the values (V), leveraging their respective statistical distributions to optimize precision allocation. Additionally, we introduce an adaptive token and channel recording mechanism that dynamically adjusts quantization parameters based on real-time distribution tracking, effectively mitigating the impact of outliers. To further enhance compression efficiency, we implement fine-grained grouping, which partitions the KV tensors into localized subgroups, enabling more adaptive quantization. Experimental results on LLaVA-1.5 (7B/13B) and Qwen-VL across multiple multimodal benchmarks demonstrate that our method significantly outperforms existing KV-cache quantization approaches, achieving a superior trade-off between memory efficiency and model accuracy.
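The core idea of combining per-channel quantization for K with per-token quantization for V, plus fine-grained grouping, can be illustrated with a minimal sketch. The code below is an assumption-laden toy (the function name `quantize_groups`, the group size of 32, the 4-bit setting, and the tensor shapes are all illustrative choices, not the paper's actual implementation): it applies asymmetric uniform quantization in fixed-size groups along the token axis for K (so each channel gets its own scales) and along the channel axis for V (so each token gets its own scales).

```python
import numpy as np

def quantize_groups(x, axis, group_size=32, bits=4):
    """Asymmetric uniform quantization of `x` in subgroups of `group_size`
    along `axis`; returns the dequantized tensor so the error can be inspected.
    Illustrative sketch only, not the paper's implementation."""
    qmax = 2 ** bits - 1
    # Move the quantization axis last, then split into fixed-size groups.
    x_moved = np.moveaxis(x, axis, -1)
    shape = x_moved.shape
    g = x_moved.reshape(-1, group_size)              # (num_groups, group_size)
    lo = g.min(axis=1, keepdims=True)
    hi = g.max(axis=1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / qmax         # per-group step size
    q = np.clip(np.round((g - lo) / scale), 0, qmax) # integer levels in [0, qmax]
    deq = q * scale + lo                             # dequantize for comparison
    return np.moveaxis(deq.reshape(shape), -1, axis)

# Toy KV cache: (num_tokens, num_channels); shapes are hypothetical.
rng = np.random.default_rng(0)
K = rng.normal(size=(128, 64)).astype(np.float32)
V = rng.normal(size=(128, 64)).astype(np.float32)

# Per-channel quantization for K: subgroups run along the token axis (axis 0),
# so each channel's own statistics set its scales.
K_deq = quantize_groups(K, axis=0)
# Per-token quantization for V: subgroups run along the channel axis (axis 1).
V_deq = quantize_groups(V, axis=1)

print("max |K - K_deq|:", np.abs(K - K_deq).max())
print("max |V - V_deq|:", np.abs(V - V_deq).max())
```

Grouping keeps each scale local to a small subgroup, so a single outlier inflates the quantization step for only those 32 entries rather than the whole channel or token, which is the intuition behind the fine-grained grouping described above.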