Singapore

Image-based feature representation plays a critical role in visual localization, enabling robots to estimate their position and orientation in GPS-denied environments. However, this task is often undermined by significant variations in camera viewpoint and scene appearance. Recently, map-free visual relocalization (MFVR) has emerged as a promising paradigm due to its compatibility with lightweight deployment and privacy isolation on mobile devices. In this paper, we propose the Debiased Multiplex Tokenizer (DeMT), a novel method for versatile and efficient MFVR. Specifically, DeMT performs relative pose regression through an integrated framework built upon a pretrained vision Mamba encoder, comprising three key modules. First, Multiplex Interactive Tokenization yields robust image tokens with non-local affinities and cross-domain descriptions. Second, Debiased Anchor Registration facilitates anchor token matching through proximity graph retrieval and causal pointer debiasing. Third, Orthogonal Pose Regression enhances both pair-wise and multi-view pose regression via Jacobi polynomial parsing of Kolmogorov–Arnold networks. Extensive evaluations across ten public datasets demonstrate that DeMT substantially outperforms existing baselines and ablation variants in diverse indoor and outdoor environments. Our code and models will be released upon paper acceptance.

AAAI 2026

Debiased Multiplex Tokenizer for Efficient Map-Free Visual Relocalization

mamba codec

relative pose regression

visual localization

causal inference

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held at Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.


Vision-Language Models (VLMs) such as GPT-4o now demonstrate a remarkable ability to infer users' locations from publicly shared images, posing a substantial risk to geoprivacy. Although adversarial perturbations offer a potential defense, current methods are ill-suited for this scenario: they often perform poorly on high-resolution images and under low perturbation budgets, and may introduce irrelevant semantic content. To address these limitations, we propose GeoShield, a novel adversarial framework designed for robust geoprivacy protection in real-world scenarios. GeoShield comprises three key modules: a feature disentanglement module that separates geographical and non-geographical information, an exposure element identification module that pinpoints geo-revealing regions within an image, and a scale-adaptive enhancement module that jointly optimizes perturbations at both global and local levels to ensure effectiveness across resolutions. Extensive experiments on challenging benchmarks show that GeoShield consistently surpasses prior methods in black-box settings, achieving strong privacy protection with minimal impact on visual or semantic quality. To our knowledge, this work is the first to explore adversarial perturbations for defending against geolocation inference by advanced VLMs, providing a practical and effective solution to escalating privacy concerns.

GeoShield: Safeguarding Geolocation Privacy from Vision-Language Models via Adversarial Perturbations

Omnidirectional videos (ODVs) provide an immersive visual experience by capturing the 360$^{\circ}$ scene. With the rapid advancements in virtual/augmented reality, the metaverse, and generative artificial intelligence, the demand for high-quality ODVs is surging. However, ODVs often suffer from low resolution due to their wide field of view and limitations in capturing devices and transmission bandwidth. Although video super-resolution (SR) is a capable video quality enhancement technique, the performance ceiling and practical generalization of existing methods are limited when applied to ODVs due to their unique attributes. To alleviate spatial projection distortions and temporal flickering of ODVs, we propose a Spatio-Temporal Distortion Aware Network (STDAN) with joint spatio-temporal alignment and reconstruction. Specifically, we incorporate a spatio-temporal continuous alignment (STCA) to mitigate discrete geometric artifacts in parallel with temporal alignment. Subsequently, we introduce an interlaced multi-frame reconstruction (IMFR) to enhance temporal consistency. Furthermore, we employ latitude-saliency adaptive (LSA) weights to focus on regions with higher texture complexity and human-watching interest. By exploring a joint spatio-temporal framework and real-world viewing strategies, STDAN effectively reinforces spatio-temporal coherence on a novel ODV-SR dataset while keeping computational costs affordable. Extensive experimental results demonstrate that STDAN outperforms state-of-the-art methods in improving visual fidelity and dynamic smoothness of ODVs.
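To make the idea of latitude-dependent weighting concrete, here is a minimal sketch under a stated assumption: it uses the cosine-of-latitude row weights commonly applied to equirectangular frames (as in WS-PSNR), not STDAN's LSA weights, which the abstract says also account for saliency. The function name `latitude_weights` is hypothetical.

```python
import math


def latitude_weights(height: int) -> list[float]:
    """Per-row weights for an equirectangular frame with `height` rows.

    Rows near the poles are stretched by the projection, so they get
    smaller weights; rows near the equator get weights close to 1.
    This is plain cosine-of-latitude weighting, an assumption made for
    illustration rather than the paper's LSA formulation.
    """
    if height <= 0:
        raise ValueError("height must be positive")
    return [
        math.cos((row + 0.5 - height / 2) * math.pi / height)
        for row in range(height)
    ]


weights = latitude_weights(4)
# The weights are symmetric about the equator and largest in the middle rows.
print([round(w, 4) for w in weights])
```

A loss or quality metric could multiply each row's error by its weight so that over-sampled polar regions do not dominate.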

Spatio-Temporal Distortion Aware Omnidirectional Video Super-Resolution

While natural language is commonly used to guide embodied agents, the inherent ambiguity and verbosity of language often hinder the effectiveness of language-guided navigation in complex environments. To this end, we propose Visual Prompt Navigation (VPN), a novel paradigm that guides agents to navigate using only user-provided visual prompts within 2D top-view maps. This visual prompt primarily focuses on marking the visual navigation trajectory on a top-down view of a scene, offering intuitive and spatially grounded guidance without relying on language instructions. It is more friendly for non-expert users and reduces interpretive ambiguity. We build VPN tasks in both discrete and continuous navigation settings, constructing two new datasets, R2R-VP and R2R-CE-VP, by extending existing R2R and R2R-CE episodes with corresponding visual prompts. Furthermore, we introduce VPNet, a dedicated baseline network to handle the VPN tasks, with two data augmentation strategies: view-level augmentation (altering initial headings and prompt orientations) and trajectory-level augmentation (incorporating diverse trajectories from large-scale 3D scenes), to enhance navigation performance. Extensive experiments evaluate how visual prompt forms, top-view map formats, and data augmentation strategies affect the performance of visual prompt navigation.

VPN: Visual Prompt Navigation

Deep hash networks are widely used in tasks such as large-scale image retrieval because binary hash codes offer high search efficiency and low storage costs. With the growing demand for deploying deep hash networks on resource-constrained devices, it is crucial to compress them, and automatic pruning is a preferred option because it preserves efficacy. However, most existing pruning methods are designed primarily for image classification and suffer from efficacy degradation when transplanted to image retrieval. In this paper, we propose a novel Automatic Channel Pruning framework by Searching with Structure Embedding (ACP-SSE). To the best of our knowledge, this is the first study to explore pruning techniques for deep hash networks and the first automatic pruning method that searches based on network topology. Specifically, we first design a structure encoding model based on Graph Convolutional Networks (GCNs), whose graph is constructed from the hash network and whose node features are initialized from pruning strategies. The model is trained efficiently with a contrastive learning loss, without the accuracy supervision that would require fine-tuning pruned models. In addition, we introduce a dynamic pruning search space that accounts for resource constraints. By converting automatic channel pruning into a search for a pruned structure whose effect is similar to that of the unpruned structure, the method can adapt to various network architectures. Finally, the optimal networks are selected from the candidate set according to their performance on specific downstream tasks. Extensive experiments demonstrate that ACP-SSE is effective for automatic channel pruning, outperforming state-of-the-art baselines in hashing-based image retrieval while maintaining competitive accuracy in image classification. Our code is available in the supplementary material.
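The efficiency argument for binary hash codes can be illustrated with a small sketch. It assumes each image has already been mapped to a fixed-length binary code, stored here as a Python integer, and ranks database items by Hamming distance to the query. The names `hamming` and `retrieve` and the 8-bit codes are hypothetical and do not come from ACP-SSE.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two binary codes."""
    return bin(a ^ b).count("1")


def retrieve(query: int, database: list[int], k: int = 3) -> list[int]:
    """Return indices of the k database codes closest to the query.

    Ties keep their original database order because sorted() is stable.
    """
    ranked = sorted(range(len(database)), key=lambda i: hamming(query, database[i]))
    return ranked[:k]


# Toy 8-bit codes standing in for the output of a hash network.
database = [0b10110010, 0b10110011, 0b01001101, 0b10010010]
query = 0b10110110

print(retrieve(query, database, k=2))  # prints [0, 1]
```

Each comparison is an XOR followed by a bit count, which is why retrieval over binary codes is cheap in both time and storage.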

Automatic Channel Pruning by Searching with Structure Embedding for Hash Network

Current sparse autoencoder (SAE) approaches to neural network interpretability assume that activations can be decomposed through linear superposition into sparse, interpretable features. Despite high reconstruction fidelity, SAEs consistently fail to eliminate polysemanticity and exhibit pathological behavioral errors. We propose that neural networks encode information in two complementary spaces compressed into the same substrate: feature identity and feature integration. 
To test this dual encoding hypothesis, we develop sequential and joint-training architectures to capture identity and integration patterns simultaneously. Joint training achieves 41.3% reconstruction improvement and 51.6% reduction in KL divergence errors. This architecture spontaneously develops bimodal feature organization: orthogonal features contributing to integration pathways and the rest contributing directly to the residual. Small nonlinear components (3% of parameters) achieve 16.5% standalone improvements, demonstrating parameter-efficient capture of computational relationships crucial for behavior. Additionally, intervention experiments using 2×2 factorial stimulus designs demonstrated that integration features exhibit selective sensitivity to experimental manipulations and produce systematic behavioral effects on model outputs, including significant interaction effects across semantic dimensions. 
This work provides systematic evidence for (1) dual encoding in neural representations and (2) meaningful nonlinearly encoded feature interactions, and it introduces an architectural paradigm shift from post-hoc feature analysis to integrated computational design, establishing foundations for next-generation SAEs.

Feature Integration Spaces: Joint Training Reveals Dual Encoding in Neural Network Representations

With the rapid growth of visual content in open-world environments, zero-shot hashing image retrieval (ZSHIR) has emerged to tackle the challenge of recognizing novel classes using attribute-level and semantic information. However, existing methods often rely on shallow fusion of multi-source cues (e.g., attributes, labels, and visual features) through external supervision or feature concatenation, failing to capture the underlying semantic structure in a generative way. Particularly, current bridging strategies between modalities suffer from information fragmentation and weak alignment, hindering the model's ability to fully understand complex attribute-visual relations. Moreover, subtle semantic gaps or “semantic drift” between seen and unseen classes further degrade inter-class separability and the scalability of hashing models. To address these issues, we propose a novel framework called Proxy Zero-Shot Hashing with Multimodal Fusion via Stable Diffusion (PZSH), which integrates generative modeling and contrastive learning. PZSH leverages a pre-trained Stable Diffusion (SD) model to synthesize multimodal content, and uses dual BLIP encoders to enhance semantic alignment across modalities. We further design a proxy hashing loss to enforce discriminative binary representations. Extensive experiments on benchmark datasets show that PZSH achieves state-of-the-art performance with stronger generalization to unseen classes. Our code is available in the supplementary material.

Proxy Zero-Shot Hashing with Multimodal Fusion via Stable Diffusion

Gradient perturbation mechanisms, such as differential privacy (DP), aim to defend against gradient inversion attacks (GIA) by injecting noise into the shared gradients. Recent studies have shown that DP-based defenses lack robustness against advanced GIAs. However, existing gradient inversion methods typically rely on iterative refinement and assume static noise, resulting in low efficiency and limited reconstruction fidelity under high-noise conditions. In this paper, we propose Venom, a novel gradient inversion attack method based on a liquid diffusion mechanism. Venom reconstructs private data directly from DP-protected gradients without requiring any prior knowledge of the noise distribution. Specifically, we design a Structural Prior Extraction (SPE) module that analytically extracts deep feature representations from perturbed gradients through energy-based aggregation, enabling stable pre-reconstruction of users' latent data features. We further introduce a Diffusion-driven Liquid Recovery Network (Diff-LRN) for high-fidelity image reconstruction. Unlike traditional diffusion models that rely on iterative sampling with predefined noise schedules, Diff-LRN performs deterministic single-step reconstruction using adaptive liquid neural dynamics to handle spatially heterogeneous noise patterns. Experiments across four benchmarks demonstrate that Venom achieves a speedup of up to 38,343× over state-of-the-art attacks while maintaining high reconstruction fidelity under strong DP settings. These results challenge prevailing assumptions about DP robustness and underscore the need for more resilient privacy-preserving mechanisms in federated learning.
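For readers unfamiliar with the defense being attacked, the sketch below shows the usual shape of a DP-style gradient perturbation: clip the gradient to an L2 norm bound, then add Gaussian noise scaled to that bound. It illustrates the defense only, not Venom. The function name `dp_perturb` and its parameters `clip_norm` and `noise_multiplier` are illustrative assumptions, and the sketch makes no claim about a formal privacy guarantee.

```python
import math
import random


def dp_perturb(grad, clip_norm=1.0, noise_multiplier=1.0, rng=None):
    """Clip `grad` to L2 norm `clip_norm`, then add Gaussian noise.

    The noise standard deviation is noise_multiplier * clip_norm,
    the scaling typically used by the Gaussian mechanism.
    """
    rng = rng or random.Random()
    norm = math.sqrt(sum(g * g for g in grad))
    # Scale down only when the gradient exceeds the clipping bound.
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    sigma = noise_multiplier * clip_norm
    return [g * scale + rng.gauss(0.0, sigma) for g in grad]


# With a fixed seed the shared (noisy) gradient is reproducible.
shared = dp_perturb([3.0, 4.0], clip_norm=1.0, noise_multiplier=0.5,
                    rng=random.Random(0))
print(shared)
```

A gradient inversion attack sees only values like `shared` and tries to recover the private input that produced the original gradient.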

Venom: Liquid Diffusion-Guided Gradient Inversion for Breaking Differential Privacy in Federated Learning

Physical adversarial attacks in driving scenarios can expose critical vulnerabilities in visual perception models. However, developing such attacks remains non-trivial due to diverse real-world environmental influences. Existing approaches either struggle to generalize to dynamic environments or fail to achieve consistent physical attack performance. To address these challenges, we propose MAGIC (Mastering Physical Adversarial Generation In Context), a novel framework powered by multi-modal LLM agents that automatically understands the scene context at test time and generates the adversarial patch through synergistic interaction of language and vision understanding. Specifically, MAGIC orchestrates three specialized LLM agents: the adv-patch generation agent creates deceptive patches via strategic prompt manipulation for text-to-image models; the adv-patch deployment agent ensures contextual coherence by determining optimal deployment strategies based on scene understanding; and the self-examination agent provides critical oversight and iterative refinement of both processes. We validate our approach in both digital and physical scenarios, i.e., nuImage and real-world scenes, where both statistical and visual results show that MAGIC is effective at attacking widely used object detection systems, i.e., the YOLO and DETR series.

MAGIC: Mastering Physical Adversarial Generation in Context Through Collaborative LLM Agents

Existing paradigms for remote sensing change detection are caught in a trade-off: CNNs excel at efficiency but lack global context, while Transformers capture long-range dependencies at a prohibitive computational cost. This paper introduces ChangeRWKV, a new architecture that reconciles this conflict. By building upon the Receptance Weighted Key Value (RWKV) framework, our ChangeRWKV uniquely combines the parallelizable training of Transformers with the linear-time inference of RNNs. The core of our approach features two key innovations: a hierarchical RWKV encoder that builds multi-resolution feature representations, and a novel Spatial-Temporal Fusion Module (STFM) engineered to resolve spatial misalignments across scales while distilling fine-grained temporal discrepancies. ChangeRWKV not only achieves state-of-the-art performance on the LEVIR-CD benchmark, with an 85.46% IoU and 92.16% F1 score, but does so while drastically reducing parameters and FLOPs compared to previous leading methods. This work demonstrates a new, efficient, and powerful paradigm for operational-scale change detection.
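The two LEVIR-CD figures quoted above can be cross-checked against each other. For a single binary class, when IoU and F1 are computed from the same true-positive, false-positive, and false-negative counts, F1 = 2 · IoU / (1 + IoU). The short sketch below applies that identity; the function name `f1_from_iou` is illustrative and not from the paper.

```python
def f1_from_iou(iou: float) -> float:
    """Convert IoU to F1 (Dice) for a single binary class.

    Valid when both metrics share the same counts:
    IoU = TP / (TP + FP + FN) and F1 = 2*TP / (2*TP + FP + FN).
    """
    if not 0.0 <= iou <= 1.0:
        raise ValueError("IoU must lie in [0, 1]")
    return 2.0 * iou / (1.0 + iou)


reported_iou = 0.8546  # IoU quoted in the abstract
print(f"{100 * f1_from_iou(reported_iou):.2f}")  # prints 92.16
```

The result matches the reported 92.16% F1, so the two numbers are consistent with each other.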

Beyond Quadratic: Linear-Time Change Detection with RWKV

Dataset distillation has achieved remarkable progress as an effective approach for data compression. However, real-world data often comes from diverse domains, leading to potential mismatches between the domains of synthesized images and those of the evaluation set. Existing methods primarily assume domain alignment between them, which limits their generalization ability in the above cross-domain scenarios.
In this paper, we aim to ensure that images synthesized from known domains maintain robust performance on unseen domains and propose a novel framework called Channel-masked Asymmetric Distribution Matching (CADM). During asymmetric distribution matching, domain-sensitive channels of real data are selectively masked at different layers to extract domain-invariant features that guide synthetic data optimization.
To further improve synthetic data representation, we introduce a class-focused domain-agnostic regularization to capture class-relevant knowledge while ignoring domain-specific information. Experiments show that our method produces domain-robust synthetic data and substantially improves generalization performance on unseen domains.
