Egocentric gaze prediction serves as a critical indicator for decoding human visual attention and cognitive processes, but its inherently limited field of view creates prediction challenges. Although exo-view data provides supplementary contextual information, it exhibits significant spatial and semantic gaps. Existing methods focus solely on isolated feature encoding in single-view paradigms, neglecting cross-view gaze correlations. To bridge this gap, we present the first exploration of cross-view gaze relationships for egocentric gaze prediction and propose Ego-PMOVE, a novel Prompt-aware Mixture of View Experts network. Unlike prior cross-view studies that forcibly align cross-view features and thereby introduce inference noise, we leverage the popular Mixture-of-Experts (MoE) paradigm and a set of flexible prompts to disentangle features from different views into three parallel experts: a view-shared expert that directly models common semantic relationships, a view-discrepancy expert that adaptively adjusts spatial position, scale, and shift based on view-specific features, and an egocentric expert that extracts independent features to compensate for missing exocentric data. To balance these experts, we further design a soft router that dynamically weights them, mining useful information while suppressing noise. A view-query gaze decoder then generates view-specific gaze attention maps, jointly optimized by gaze-heatmap and cross-view contrastive losses that regularize both shared and divergent features for accurate gaze prediction. Extensive experiments on the multi-view EgoMe dataset and the single-view Ego4D and EGTEA Gaze++ datasets demonstrate the effectiveness and generalizability of our approach. Our code will be released soon.
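The soft-routed mixture of view experts described above can be sketched in a few lines. This is a minimal toy illustration, not the authors' implementation: the expert functions, weight matrices, and feature dimensions are all hypothetical stand-ins, with each expert reduced to a single linear layer and the router to a softmax over a learned projection of the concatenated ego/exo features.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy feature dimension (hypothetical)

# Hypothetical parameters: one weight matrix per expert, plus router weights.
W_shared, W_disc, W_ego = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
W_router = rng.standard_normal((2 * d, 3)) * 0.1

def expert(w, x):
    # A stand-in "expert": one nonlinear layer in place of a real sub-network.
    return np.tanh(x @ w)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mixture_of_view_experts(ego_feat, exo_feat):
    """Soft router weights three parallel view experts per sample."""
    fused = np.concatenate([ego_feat, exo_feat], axis=-1)
    gate = softmax(fused @ W_router)  # (batch, 3) soft routing weights
    outs = np.stack([
        expert(W_shared, ego_feat + exo_feat),  # view-shared cue
        expert(W_disc, exo_feat),               # view-discrepancy cue
        expert(W_ego, ego_feat),                # egocentric-only cue
    ], axis=1)                                  # (batch, 3, d)
    # Gate-weighted fusion of expert outputs.
    return (gate[:, :, None] * outs).sum(axis=1), gate

ego = rng.standard_normal((2, d))
exo = rng.standard_normal((2, d))
out, gate = mixture_of_view_experts(ego, exo)
print(out.shape)  # (2, 8)
```

When exocentric data are missing, a scheme like this could fall back on the egocentric expert by driving its routing weight toward one; the real model additionally conditions the experts on learnable prompts, which this sketch omits.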