Open-Set Domain Generalization (OSDG) aims to generalize to unseen target domains that contain open classes, and its core challenge lies in identifying unknown samples never encountered during training. Recently, CLIP has shown impressive performance in OSDG, yet it remains caught in the trade-off between the structural risk of known classes and the open-space risk of unknown classes, and it is prone to over-confidence, especially when distinguishing known-like unknown samples. To this end, we propose a Semantic-enhanced CLIP (SeeCLIP) framework that leverages fine-grained semantics to strengthen unknown detection, accommodating both risks while enabling precise discrimination among categories. In SeeCLIP, we propose a semantic-aware prompt enhancement module that extracts fine-grained key semantic features and establishes fine-grained vision-language alignment. For prompt learning, we propose duplex contrastive learning, which jointly optimizes duplex losses so that the unknown prompt stays similar to known prompts overall yet exhibits key semantic differences. We also design a semantic-guided diffusion module to capture nuanced semantics during generation: by injecting perturbed key semantics into a diffusion model as control conditions, it generates challenging pseudo-open samples with high similarity yet low belongingness to known classes. We further formulate a generalization bound for OSDG and show that SeeCLIP achieves a lower generalization risk. Extensive experiments on benchmark datasets validate the superiority of SeeCLIP, which outperforms state-of-the-art methods by nearly 3% in accuracy and 5% in H-index.