Self-supervised monocular depth estimation methods suffer severe accuracy degradation on dynamic objects because of their static-scene assumption. Existing approaches for dynamic scenes have two critical shortcomings: 1) they rely on supervised segmentation models (requiring costly annotations) or computationally intensive multi-branch models to isolate moving objects, and 2) they naively integrate 2D/3D motion flow without reliable supervision for dynamic objects. We propose AdaDepth, a two-stage framework that jointly performs unsupervised scene decomposition and dynamic-aware depth learning. In the initial structural stage, our geometry-motion joint scene decomposition (GMoDecomp) module robustly generates a depth prior and simultaneously partitions the scene into multiple regions by fusing geometric and motion cues. In the region-adaptive refinement stage, we exploit the depth prior and the decomposed regions to introduce motion-aware and geometry-consistent constraints, effectively improving depth estimation in dynamic scenes. AdaDepth achieves accurate depth prediction in highly dynamic scenes without relying on external labels or specialized segmentation models. Extensive experiments on KITTI, Cityscapes, and Waymo Open demonstrate its superiority over state-of-the-art approaches.
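
As a rough illustration of the two-stage pipeline the abstract describes, the PyTorch sketch below derives a depth prior, decomposes the scene into static/dynamic regions from fused geometric and motion cues, and applies region-adaptive constraints in a refinement pass. All concrete names (`DepthNet`, `gmodecomp`, `region_adaptive_loss`), the binary static/dynamic split, and the specific loss terms are illustrative assumptions for exposition, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class DepthNet(nn.Module):
    """Toy depth backbone standing in for the paper's depth network."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1), nn.Softplus())  # positive depths

    def forward(self, img):
        return self.net(img)

def gmodecomp(depth_prior, flow, thresh=1.0):
    """Hypothetical stand-in for GMoDecomp: fuse a geometric cue (depth)
    with a motion cue (flow magnitude) into an unsupervised region split."""
    motion = flow.norm(dim=1, keepdim=True)        # 2D motion magnitude
    cue = motion * depth_prior                     # fused geometry-motion cue
    dynamic = (cue > thresh * cue.mean()).float()  # unsupervised hard mask
    return dynamic, 1.0 - dynamic                  # dynamic / static regions

def region_adaptive_loss(depth, depth_prior, photometric_err, dyn, stat):
    """Stage 2 (assumed form): geometry-consistent photometric loss on
    static regions; motion-aware constraint toward the stage-1 prior on
    dynamic regions, where reprojection supervision is unreliable."""
    loss_static = (photometric_err * stat).mean()
    loss_dynamic = ((depth - depth_prior).abs() * dyn).mean()
    return loss_static + loss_dynamic

# Toy usage with random tensors in place of real images and flow.
img = torch.rand(2, 3, 64, 64)
flow = torch.rand(2, 2, 64, 64)           # e.g. from an off-the-shelf flow net
stage1, stage2 = DepthNet(), DepthNet()
with torch.no_grad():
    prior = stage1(img)                   # structural stage: depth prior
dyn, stat = gmodecomp(prior, flow)        # unsupervised scene decomposition
depth = stage2(img)                       # region-adaptive refinement stage
photo_err = torch.rand_like(depth)        # placeholder reprojection error map
loss = region_adaptive_loss(depth, prior, photo_err, dyn, stat)
loss.backward()
```

The design choice this sketch highlights is the division of labor: the stage-1 prior supervises exactly those pixels where photometric reprojection breaks down (moving objects), while the standard self-supervised signal is kept only where the static-scene assumption holds.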