Singapore


AAAI 2026

First Learn, Then Review: Human-Like Continual Learning for Cross-View Geo-Localization with Limited Field of View

limited field of view

cross-view geo-localization

continual learning

This paper addresses cross-view geo-localization in real-world scenarios, where the field of view (FoV) of ground-view images is restricted and their orientation is unknown. This task is extremely challenging due to the large domain gap between views. Existing methods typically treat tasks with different FoVs as independent. These approaches not only require separate retraining for each FoV, but also neglect the strong correlations between different FoVs, leading to poor performance under extremely limited FoV. To overcome these limitations, we propose HCL-Geo, a framework that follows a human-like continual learning paradigm of "first learn, then review" for geo-localization. In the first "learn" stage, tasks are presented to the model in an easy-to-hard sequence to enable gradual learning and knowledge retention, so that their natural correlations can be exploited to facilitate knowledge transfer. In the second "review" stage, expert modules are incorporated to efficiently handle tasks with varying FoVs. This approach eliminates the need to retrain separate models and demonstrates state-of-the-art performance across different FoVs with strong generalization capabilities. Remarkably, with a 70° FoV, the top-1 recall improves from 49.1% to 68.3% on the CVUSA benchmark and from 24.6% to 34.3% on the CVACT benchmark.
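The top-1 recall reported above is a retrieval metric: the fraction of ground-view queries whose matching aerial image is ranked first. The sketch below illustrates how such a metric is commonly computed. It is not the authors' evaluation code. It assumes that embeddings are plain lists of floats, that the i-th ground embedding matches the i-th aerial embedding, and that similarity is measured by cosine similarity; the names `cosine` and `recall_at_k` and the toy vectors are hypothetical.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recall_at_k(ground_emb, aerial_emb, k=1):
    """Fraction of ground queries whose true aerial match ranks in the top k.

    Assumption: ground_emb[i] and aerial_emb[i] depict the same location.
    """
    hits = 0
    for i, query in enumerate(ground_emb):
        sims = [cosine(query, ref) for ref in aerial_emb]
        # Reference indices sorted from most to least similar to the query.
        ranked = sorted(range(len(sims)), key=lambda j: sims[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(ground_emb)


# Toy embeddings (made up for illustration only).
ground = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
aerial = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.7]]

r1 = recall_at_k(ground, aerial, k=1)
r2 = recall_at_k(ground, aerial, k=2)
print(f"recall@1 = {r1:.3f}, recall@2 = {r2:.3f}")
```

In this toy example the second query retrieves the wrong aerial image first, so it counts as a miss at k=1 but as a hit at k=2.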

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held at Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.


Large Language Models (LLMs) have become a crucial tool in Visual Question Answering (VQA) for handling knowledge-intensive questions in few/zero-shot scenarios. However, their reliance on massive training datasets often causes them to inherit language biases during knowledge acquisition. This limitation imposes two key constraints on existing methods: (1) LLM predictions become less reliable due to bias exploitation, and (2) despite strong knowledge reasoning capabilities, LLMs still struggle with out-of-distribution (OOD) generalization. To address these issues, we propose Object Attribute Description Promoter (OAD-Promoter), a novel approach for enhancing LLM-based VQA by mitigating language bias and improving domain-shift robustness. OAD-Promoter comprises three components: the Object-concentrated Example Generation (OEG) module, the Memory Knowledge Assistance (MKA) module, and the OAD Prompt. The OEG module generates global captions and object-concentrated samples, jointly enhancing visual information input to the LLM and mitigating bias through complementary global and regional visual cues. The MKA module assists the LLM in handling OOD samples by retrieving relevant knowledge from stored examples to support questions from unseen domains. Finally, the OAD Prompt integrates the outputs of the preceding modules to optimize LLM inference. Experiments demonstrate that OAD-Promoter significantly improves the performance of LLM-based VQA methods in few/zero-shot settings, achieving new state-of-the-art results. Our code will be made available upon acceptance.

OAD-Promoter: Enhancing Zero-Shot VQA Using Large Language Models with Object Attribute Description

The task of video-to-video human motion editing aims to transfer motion from a specific video to a reference video while preserving the background dynamics and the original protagonist's appearance. Through analysis, we identify critical limitations in existing models, which fail to capture the full complexity of human motions, particularly regarding 1) location changes, 2) orientation variations, and 3) complicated non-upright poses. To address these challenges, we propose a framework that selectively "copies and pastes" 2D and 3D features across spatio-temporal dimensions into a shared representation space for motion guidance. This is achieved through: 1) a mutual distillation mechanism that enhances the robustness and capability of individual encoders, and 2) a selective fusion module that adaptively weights and combines complementary information from spatio-temporal representations. To push the limits of motion editing algorithms with challenging scenarios, we introduce an evaluation dataset comprising real-world video clips from artistic gymnastics and figure skating competitions. These sports disciplines naturally encompass the three aspects of motion complexity listed above. Experiments demonstrate that our approach significantly outperforms existing methods, particularly in handling intricate human motions.

Collaboratively “Copy & Paste” 2D-3D Features for Complex Video-to-Video Motion Editing

Reconstructing a faithful geometric surface from sparse images remains a fundamental challenge in 3D computer vision. While recent methods have achieved remarkable progress, they still struggle to recover reliable geometry due to the lack of multi-view geometric cues, particularly in non-overlapping regions. To address this issue, we introduce VGGS, a Gaussian Splatting (GS) method that exploits multi-view geometric priors from VGGT for efficient and high-fidelity sparse-view surface reconstruction. Our primary contribution is an anchor-calibrated depth estimation scheme, which yields accurate depth maps. The insight is to align the VGGT depth prior to the underlying surface using a sparse set of multi-view consistent anchors, then infer depth for unreliable regions by relative depth estimation. Furthermore, to mitigate misalignment in complex scenes, we propose a relative depth consistency loss that penalizes the rendered depth if its relative depth relationship in local regions is inconsistent with the multi-view prior. Extensive experiments on widely used benchmarks show that VGGS surpasses state-of-the-art methods in both accuracy and efficiency, delivering 4–7× faster optimization while reducing memory consumption compared to previous GS-based approaches.

VGGS: VGGT-guided Gaussian Splatting for Efficient and Faithful Sparse-View Surface Reconstruction

Vision-Language Navigation (VLN) tasks often leverage panoramic RGB and depth inputs to provide rich spatial cues for action planning, but these sensors can be costly or less accessible in real-world deployments. Recent approaches based on Vision-Language Action (VLA) models achieve strong results with monocular input, yet they still lag behind methods using panoramic RGB-D information. We present MonoDream, a lightweight VLA framework that enables monocular agents to learn a Unified Navigation Representation (UNR). This shared feature representation jointly aligns navigation-relevant visual semantics (e.g., global layout, depth, and future cues) and language-grounded action intent, enabling more reliable action prediction. MonoDream further introduces Latent Panoramic Dreaming (LPD) tasks to supervise the UNR, which train the model to predict latent features of panoramic RGB and depth observations at both current and future steps based on only monocular input. Experiments on multiple VLN benchmarks show that MonoDream consistently improves monocular navigation performance and significantly narrows the gap with panoramic-based agents.

MonoDream: Monocular Vision-Language Navigation with Panoramic Dreaming

Large neural networks excel at prediction tasks, but their application to design problems, such as protein engineering or materials discovery, requires solving offline model-based optimization (MBO) problems. While predictive models may not directly translate to effective design, recent MBO algorithms incorporate reinforcement learning and generative modeling approaches. Meanwhile, theoretical work suggests that exploiting the target function’s structure can enhance MBO performance. We present Cliqueformer, a transformer-based architecture that learns the black-box function’s structure through functional graphical models (FGM), addressing distribution shift without relying on explicit conservative approaches. Across various domains, including chemical and genetic design tasks, Cliqueformer demonstrates superior performance compared to existing methods.

Cliqueformer: Model-Based Optimization with Structured Transformers

Although Video Large Language Models (Video-LLMs) have advanced rapidly in recent years, perception hallucination has emerged as a significant bottleneck that hinders their real-world applicability.
While several methods for hallucination mitigation have been proposed, they often compromise the model’s capacity for video understanding and reasoning. In this work, we propose SmartSight, a pioneering step to address this issue in a training-free manner by leveraging the model’s own introspective capabilities. Specifically, SmartSight generates multiple candidate responses to uncover low-hallucinated outputs that are often obscured by standard greedy decoding. It assesses the hallucination of each response using the Temporal Attention Collapse score, which measures whether the model over-focuses on trivial temporal regions of the input video when generating the response. To improve efficiency, SmartSight identifies the Visual Attention Vanishing point, enabling more accurate hallucination estimation and early termination of hallucinated responses, reducing decoding cost by up to 79.6%. Experiments show that SmartSight substantially lowers hallucinations for QwenVL-2.5-7B by 10.59% on VRIPT-HAL, while simultaneously enhancing video understanding and reasoning, boosting performance on VideoMMMU by 8.86% and surpassing the proprietary model Gemini 1.5 Pro. Consistent improvements are observed across 10 diverse Video-LLMs. These results highlight SmartSight’s effectiveness as a general solution for improving the reliability of state-of-the-art open-source Video-LLMs.

SmartSight: Mitigating Hallucination in Video-LLMs Without Compromising Video Understanding via Temporal Attention Collapse

Recent generative unlearning models synthesize high-quality samples while protecting private information by unlearning identities.
However, existing generative identity unlearning methods face two challenges in multi-identity unlearning: 1) identity conflicts, which cause conflicts of model parameters in the continuous erasure of multiple identities; 2) fragile unlearning, where the model's unlearning ability deteriorates or fails under malicious attacks.
In this paper, we introduce a critical yet under-explored task called robust multi-identity unlearning, with the goals of resolving identity conflicts to achieve interference-free unlearning and protecting against malicious attacks to achieve robust unlearning.
To satisfy these goals, we propose a novel framework, RObust generatiVE continual identity unlearning against Relearning attacks (ROVER).
By filtering unlearning requests with latent similarity, our method effectively isolates benign unlearning from malicious attacks to preserve identity removal integrity.
Meanwhile, a residual orthogonal resonator resolves identity conflicts in the continuous erasure of multiple identities, preserving stability in benign continual unlearning.
Moreover, we introduce the phantom guard network to block malicious attacks by absorbing adversarial gradients, ensuring irreversible identity unlearning.
Extensive experiments demonstrate that our proposed method achieves state-of-the-art performance in the task of multi-identity unlearning against relearning attacks.

ROVER: Robust Generative Continual Identity Unlearning Against Relearning Attacks

Reinforcement fine-tuning (RFT) is a proliferating paradigm for LMM training. 
As with high-level reasoning tasks, RFT is also applicable to low-level vision domains, including image quality assessment (IQA). Existing RFT-based IQA methods typically use rule-based output rewards to verify the model's rollouts but provide no reward supervision for the "think" process, leaving its correctness and efficacy uncontrolled. Furthermore, these methods typically fine-tune directly on downstream IQA tasks without explicitly enhancing the model's native low-level visual quality perception, which may constrain its performance upper bound. In response to these gaps, we propose a multi-stage RFT IQA framework (**Refine-IQA**). In **Stage-1**, we build the **Refine-Perception-20K** dataset (with 12 main distortions, 20,907 locally distorted images, and over 55K RFT samples) and design multi-task reward functions to strengthen the model's visual quality perception. In **Stage-2**, targeting the quality scoring task, we introduce a **probability difference reward involved strategy** for "think" process supervision. The resulting **Refine-IQA Series Models** achieve outstanding performance on both perception and scoring tasks. Notably, our paradigm activates a robust "think" (quality interpreting) capability that also attains exceptional results on the corresponding quality interpreting benchmark.

Refine-IQA: Multi-Stage Reinforcement Finetuning for Perceptual Image Quality Assessment

In this paper, we present Ev-iCRF, a novel self-supervised pipeline for high dynamic range (HDR) image reconstruction from a single-exposure low dynamic range (LDR) image, guided by asynchronous event streams generated by a bio-inspired event camera. The highlight of Ev-iCRF lies in its formulation of the inverse camera response function (iCRF) based on Event-LDR Correspondence. By leveraging the HDR properties of event data, the method enables direct iCRF estimation, offering a new perspective for event-guided HDR imaging. The pipeline is trained in a self-supervised manner using formulation-driven iCRF estimation loss and refinement loss, without the need for synchronized HDR supervision. Ev-iCRF adopts a two-stage coarse-to-fine reconstruction pipeline, allowing effective fusion of features from both LDR image and event data. The event information is used to optimize the iCRF, enabling accurate HDR reconstruction from LDR inputs. We evaluate Ev-iCRF on both real-world and synthetic datasets, and results show that it outperforms state-of-the-art methods in HDR reconstruction accuracy. Moreover, the reconstructed images demonstrate improved texture fidelity and structural detail.

Ev-iCRF: Self-supervised Event-guided iCRF Estimation for HDR Image Reconstruction

Existing cross-modal pedestrian detection (CMPD) employs complementary information from RGB and thermal-infrared (TIR) modalities to detect pedestrians in 24h-surveillance systems. RGB captures rich pedestrian details under daylight, while TIR excels at night. However, TIR focuses primarily on the person's silhouette, neglecting critical texture details essential for detection. 
In contrast, near-infrared (NIR) imaging captures texture under low-light conditions, which alleviates both the performance issues of RGB and the detail loss in TIR, thereby reducing missed detections. To this end, we construct a new Triplet RGB–NIR–TIR (TRNT) dataset, comprising 8,281 pixel-aligned image triplets, establishing a comprehensive foundation for algorithmic research.
However, due to the variable nature of real-world scenarios, imaging devices may not always capture all three modalities simultaneously. This results in input data with unpredictable combinations of modal types, which challenge existing CMPD methods that fail to extract robust pedestrian information under arbitrary input combinations, leading to significant performance degradation.
To address these challenges, we propose the Adaptive Uncertainty-aware Network (AUNet) for accurately discriminating modal availability and fully utilizing the available information under uncertain inputs. 
Specifically, we introduce Unified Modality Validation Refinement (UMVR), which includes an uncertainty-aware router to validate modal availability and a semantic refinement to ensure the reliability of information within the modality. 
Furthermore, we design a Modality-Aware Interaction (MAI) module to adaptively activate or deactivate its internal interaction mechanisms per UMVR output, enabling effective complementary information fusion from available modalities. 
AUNet enables accurate modality validation and robust inference without fixed modality pairings, facilitating the effective fusion of RGB, NIR, and TIR information across diverse input configurations. The code and dataset will be made publicly available.
