Visible-infrared (RGB-IR) object detection for unmanned aerial vehicles (UAVs) integrates complementary cues from visible and infrared sensors, offering broad application potential. However, sensor parallax introduces weak spatial misalignment between the two modalities, which significantly limits detection performance. Existing methods emphasize strict alignment while overlooking spectral heterogeneity under varying illumination. To address these issues, we propose the Illumination Guided Implicit Alignment Network (IGIANet), which mitigates modality heterogeneity without explicit alignment. Specifically, we integrate three novel modules. First, an illumination-guided frequency modulation module adaptively allocates fusion weights to visible and infrared features based on a global illumination estimate, alleviating modality imbalance under varying lighting conditions. Second, a frequency-guided cross-modality differential enhancement module computes differential cues across frequency bands to enhance complementary information and highlight weakly aligned, low-contrast regions. Third, an implicit alignment-driven dynamic fusion module estimates spatial offsets and generates dynamic, position-adaptive fusion kernels that align and fuse the two modalities. Extensive experiments show that IGIANet outperforms state-of-the-art models on multiple benchmarks, achieving 79.1% mAP on DroneVehicle, 57.1% mAP on VEDAI, and 49.4% mAP on FLIR.
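To make the illumination-guided weighting idea concrete, here is a minimal sketch of how a global illumination estimate might gate the fusion of visible and infrared features. This is a toy illustration under assumed conventions (mean luminance as the illumination score, a sigmoid mapping to a scalar weight); the actual IGIANet module is learned and its details are not given in this abstract.

```python
import numpy as np

def illumination_weight(rgb, steepness=8.0):
    """Hypothetical global illumination estimate mapped to a fusion weight.

    rgb: image array with values in [0, 1]. Brighter scenes yield a weight
    closer to 1, favoring the visible branch; darker scenes favor infrared.
    """
    lum = float(rgb.mean())  # crude global illumination score
    return 1.0 / (1.0 + np.exp(-steepness * (lum - 0.5)))

def fuse(feat_vis, feat_ir, rgb):
    """Blend visible and infrared feature maps with an illumination-driven weight."""
    w = illumination_weight(rgb)
    return w * feat_vis + (1.0 - w) * feat_ir

# Toy usage: in a dark scene the fused features lean toward the infrared branch.
rgb_dark = np.full((4, 4, 3), 0.1)
feat_vis = np.ones((4, 4))
feat_ir = np.zeros((4, 4))
fused = fuse(feat_vis, feat_ir, rgb_dark)
print(fused.mean())  # small value: infrared dominates under low illumination
```

In the paper this weighting is performed in the frequency domain and the weights are predicted rather than hand-set, but the gating principle is the same: the illumination estimate rebalances the two modalities before fusion.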
