Vision-Language Models (VLMs) are widely used in tasks such as open-vocabulary object detection and zero-shot classification, owing to their powerful generalization. However, recent research reveals that VLMs exhibit significant performance instability when recognizing concepts at varying granularities (\textit{e.g.}, animal vs. dog). Prevailing methods inject external knowledge from Large Language Models (LLMs), but this unconstrained injection distorts the VLM's inherent hierarchical orthogonal geometry, leading to performance collapse on general concepts. To address this, we introduce \textbf{\textit{GeCoin}}, a \textbf{\textit{Ge}}ometrically \textbf{\textit{Co}}nstra\textbf{\textit{in}}ed framework that safely enhances existing VLMs with external knowledge for improved hierarchical understanding, without additional training. By projecting knowledge into the null-space of a query concept's feature space, \textit{GeCoin} mathematically guarantees the preservation of general knowledge while integrating specialized information. Extensive experiments across large-scale benchmarks, diverse VLMs (\textit{e.g.}, CLIP, SigLIP 2), and knowledge from various LLMs (\textit{e.g.}, GPT-3.5, Claude-3, Gemini-Pro) show that \textit{GeCoin} boosts performance by an average of 3.9\% over the strongest baseline and, crucially, eliminates performance collapse on general concepts. The code link is in the supplementary material.
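To make the geometric constraint concrete, below is a minimal NumPy sketch of the null-space projection idea described in the abstract: an external knowledge vector is stripped of any component lying in the span of a query concept's features, so adding the residual cannot perturb similarities measured within that subspace. The function names, the QR-based construction, and the random data here are illustrative assumptions, not the authors' released implementation.

```python
import numpy as np

def null_space_project(knowledge: np.ndarray, concept_feats: np.ndarray) -> np.ndarray:
    """Remove from `knowledge` any component lying in the span of `concept_feats`.

    concept_feats: (k, d) matrix whose rows span the query concept's feature subspace.
    knowledge:     (d,)   external knowledge embedding (e.g., derived from an LLM).
    """
    # Orthonormal basis for the concept subspace via reduced QR decomposition.
    q, _ = np.linalg.qr(concept_feats.T)      # q: (d, k), orthonormal columns
    # Component of the knowledge vector inside the concept subspace.
    in_span = q @ (q.T @ knowledge)
    # Null-space (orthogonal-complement) component: orthogonal to every concept
    # feature, so fusing it in cannot alter scores within the concept subspace.
    return knowledge - in_span

# Hypothetical toy data to illustrate the guarantee.
rng = np.random.default_rng(0)
d, k = 512, 8
concept_feats = rng.standard_normal((k, d))   # stand-in concept features
knowledge = rng.standard_normal(d)            # stand-in LLM knowledge vector

residual = null_space_project(knowledge, concept_feats)
# The projected vector is (numerically) orthogonal to the concept subspace,
# which is the mechanism behind the claimed preservation of general knowledge.
assert np.allclose(concept_feats @ residual, 0.0, atol=1e-8)
```

Because the residual is orthogonal to the concept subspace by construction, this kind of projection is training-free and leaves the VLM's original geometry on general concepts untouched, consistent with the preservation guarantee the abstract describes.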
