Although large vision-language models (LVLMs) have demonstrated promising and versatile capabilities on various downstream tasks, they have been shown to be susceptible to adversarial examples. Existing LVLM attacks operate under impractical settings: i) they add digital, global perturbations to the entire input image; ii) they require prior knowledge of the target LVLM for optimization; iii) they do not account for realistic transformations. These assumptions make such attacks difficult to deploy in physical-world scenarios. Motivated by this gap between research practice and real-world conditions, this paper proposes the first practical LVLM attack method, based on a novel adversarial patch design, that works in both physical and digital attack settings without using any LVLM internals. In particular, we introduce adversarial homogeneous constraints in both the spatial and spectral domains to improve the patch's stealthiness against potential real-world defenses. We also develop a new technique for synthesizing reasonably realistic transformations that capture the patch appearance variations expected in daily life. Extensive experiments verify the strong adversarial capability of our proposed attack against prevalent LVLMs across a spectrum of tasks.
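The abstract does not specify how the spatial and spectral homogeneity constraints are formulated, so the following is only a minimal illustrative sketch of one plausible reading: a total-variation-style penalty for spatial smoothness combined with a penalty on high-frequency Fourier energy for spectral homogeneity. The function names, masking scheme, and weights (`lambda_s`, `lambda_f`) are all hypothetical, not the paper's actual design.

```python
# Hypothetical sketch of spatial/spectral homogeneity penalties for patch
# stealthiness; the paper's exact constraints are not given in the abstract.
import torch


def spatial_homogeneity(patch: torch.Tensor) -> torch.Tensor:
    """Total-variation-style penalty encouraging locally smooth patch colors.

    patch: (C, H, W) tensor with values in [0, 1].
    """
    dh = (patch[:, 1:, :] - patch[:, :-1, :]).abs().mean()
    dw = (patch[:, :, 1:] - patch[:, :, :-1]).abs().mean()
    return dh + dw


def spectral_homogeneity(patch: torch.Tensor) -> torch.Tensor:
    """Penalize high-frequency energy in the patch's 2D Fourier spectrum."""
    spectrum = torch.fft.fftshift(torch.fft.fft2(patch), dim=(-2, -1)).abs()
    _, h, w = patch.shape
    # Zero out a low-frequency square at the spectrum center (assumed cutoff);
    # penalize all remaining (high-frequency) coefficients.
    mask = torch.ones(h, w)
    mask[h // 4 : 3 * h // 4, w // 4 : 3 * w // 4] = 0.0
    return (spectrum * mask).mean()


def stealth_loss(patch: torch.Tensor, lambda_s: float = 1e-2,
                 lambda_f: float = 1e-3) -> torch.Tensor:
    """Combined regularizer added to the adversarial objective (weights assumed)."""
    return lambda_s * spatial_homogeneity(patch) + lambda_f * spectral_homogeneity(patch)
```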
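Likewise, the transformation-synthesis technique is only named, not described. As a point of reference, an Expectation-over-Transformation-style sampling loop such as the sketch below is the standard way physical-patch attacks model appearance variation (viewing angle, distance, lighting); the specific transformations and parameter ranges here are illustrative assumptions, not the paper's method.

```python
# EOT-style sampling of realistic patch appearance variations. All ranges
# below are illustrative; the paper's synthesis technique is more elaborate.
import random

import torch
import torchvision.transforms.functional as TF


def sample_transformed_patch(patch: torch.Tensor) -> torch.Tensor:
    """Apply one randomly sampled physical-world-style transformation.

    patch: (C, H, W) tensor with values in [0, 1].
    """
    angle = random.uniform(-20.0, 20.0)      # camera / patch rotation
    scale = random.uniform(0.7, 1.3)         # viewing distance
    brightness = random.uniform(0.8, 1.2)    # lighting changes
    out = TF.rotate(patch, angle)
    out = TF.affine(out, angle=0.0, translate=[0, 0], scale=scale, shear=[0.0])
    out = TF.adjust_brightness(out, brightness)
    return out.clamp(0.0, 1.0)
```

During optimization, the adversarial loss would be averaged over many such samples so the patch remains effective under the transformations it will encounter when printed and photographed.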