Singapore

Membership Inference Attacks (MIAs) test whether a model has memorized training data, and are a key tool for auditing privacy risks in machine learning. Recent papers report near-perfect MIA success against large vision-language models such as CLIP, but almost all evaluations train on one web-scale corpus (e.g. LAION-400M) and treat samples from a different corpus (e.g. COCO or CC12M) as non-members. This turns the task into out-of-distribution (OOD) detection rather than true membership testing, and introduces spurious signals unrelated to memorization.

We revisit the problem with a distribution-matched benchmark built from the CommonPool-L corpus of DataComp. A ViT-B/16 CLIP trained on 400M pairs is accompanied by two 26-shard, i.i.d. splits that serve as member and non-member sets, sharing the exact same acquisition and preprocessing pipeline. Under this strictly in-distribution setting, every published MIA baseline collapses to chance ($\approx$51\% AUC). To explain this collapse, we derive a scaling-law upper bound for similarity-based attacks showing that the expected member vs. non-member similarity gap decays as $\mathcal{O}(T/N)$ for contrastive learning with $T$ epochs over $N$ samples. Empirically, as we vary the training set size while holding all hyper-parameters fixed, the gap follows the predicted linear trend in log–log space, and Cosine Similarity Attack AUC drops from 94\% to 51\%.
Finally, we propose a simple, white-box, gradient-based MIA that outperforms prior attacks for CLIP without relying on OOD cues. We release code, checkpoints, and data to foster comprehensive and reproducible privacy research on multimodal foundation models.
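
To make the evaluation concern concrete, here is a minimal sketch of a cosine-similarity attack scored by AUC. It is not the authors' released code. The `synthetic_pairs` helper and its `alignment` parameter are hypothetical stand-ins for CLIP image and text embeddings, and the numbers it produces are illustrative only. The sketch shows that a non-member set drawn from a differently distributed source can yield a high AUC with no memorization involved, while a distribution-matched non-member set leaves the attack near chance.

```python
import numpy as np

def cosine_scores(img_emb, txt_emb):
    """Row-wise cosine similarity between paired image and text embeddings."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    return np.sum(img * txt, axis=1)

def auc(member_scores, nonmember_scores):
    """Probability that a random member outscores a random non-member.

    Ties count as 0.5, which matches the usual ROC AUC definition.
    """
    m = np.asarray(member_scores, dtype=float)[:, None]
    n = np.asarray(nonmember_scores, dtype=float)[None, :]
    return float(np.mean((m > n) + 0.5 * (m == n)))

def synthetic_pairs(rng, n, dim, alignment):
    """Hypothetical embeddings: text = alignment * image + Gaussian noise."""
    img = rng.normal(size=(n, dim))
    txt = alignment * img + rng.normal(size=(n, dim))
    return img, txt

rng = np.random.default_rng(0)

# Case 1: "non-members" come from a source with weaker image-text alignment,
# so the attack separates the two sets by distribution alone.
members = cosine_scores(*synthetic_pairs(rng, 2000, 64, alignment=1.0))
shifted = cosine_scores(*synthetic_pairs(rng, 2000, 64, alignment=0.5))
auc_shifted = auc(members, shifted)

# Case 2: both sets share one distribution, so similarity carries no signal.
set_a = cosine_scores(*synthetic_pairs(rng, 2000, 64, alignment=0.5))
set_b = cosine_scores(*synthetic_pairs(rng, 2000, 64, alignment=0.5))
auc_matched = auc(set_a, set_b)

print(f"AUC with shifted non-members: {auc_shifted:.3f}")
print(f"AUC with matched non-members: {auc_matched:.3f}")
```

In a real audit, the scores would come from the CLIP encoders under test and the two sets from the released member and non-member splits.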

AAAI 2026

Rethinking Membership Inference Attacks for CLIP

membership inference attacks

clip

contrastive learning

privacy

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.



Understanding the structural dynamics of biomolecules is vital for elucidating biological function. With the increasing availability of molecular dynamics (MD) simulation data, deep generative models have been developed to synthesize realistic MD trajectories. However, existing approaches generate fixed-length trajectories by jointly denoising high-dimensional spatiotemporal representations, which conflicts with MD’s frame-by-frame integration process and fails to capture time-dependent conformational diversity. Motivated by the sequential nature of MD, we introduce a novel probabilistic autoregressive (\textbf{ProAR}) framework for trajectory generation. ProAR employs a dual-network system that explicitly models each frame as a multivariate Gaussian distribution and uses an anti-drifting sampling strategy to mitigate cumulative errors, thereby capturing conformational uncertainty and time-coupled structural changes while flexibly generating trajectories of arbitrary length. Experiments on ATLAS, a large-scale protein MD dataset, show that for the long trajectory generation task, our model achieves a 7.5\% reduction in reconstruction RMSE and an average 25.8\% improvement in conformation change accuracy over previous state-of-the-art methods. Regarding the conformation sampling task, it attains comparable performance to specialized time-independent models, offering a flexible and reliable alternative to conventional MD simulations.

ProAR: Probabilistic Autoregressive Modeling for Molecular Dynamics

Recently, continuous transform-based tensor representation has emerged as a promising tool for multi-dimensional data recovery. 
However, the existing continuous transforms are essentially single-layer linear mappings, which limits their ability to capture the complex relationships inherent in multi-dimensional data.
To overcome this limitation, we propose a Hierarchical Nonlinear Continuous Transform-based Tensor Representation (HiNCoT) for multi-dimensional data recovery. By leveraging the hierarchical nonlinear continuous transform, HiNCoT constructs the recovered tensor from a latent tensor, which is generated by the deep representation module with a low-rank core tensor as input. Compared with the existing continuous transform-based methods, HiNCoT can more effectively capture the complex nonlinear relationships inherent in multi-dimensional data along the third dimension.
To evaluate the effectiveness of the proposed HiNCoT, we propose a HiNCoT-based multi-dimensional data recovery model. Extensive experiments on diverse degeneration scenarios demonstrate the superiority of our hierarchical nonlinear transform-based method over existing single-layer linear transform-based methods.

HiNCoT: Hierarchical Nonlinear Continuous Transform-based Tensor Representation for Multi-Dimensional Data Recovery

Existing end-to-end approaches to robotic manipulation often lack generalization to unseen objects or tasks due to limited data and poor interpretability. While recent Multimodal Large Language Models (MLLMs) demonstrate strong commonsense reasoning, they struggle with the geometric and spatial understanding required for pose prediction. In this paper, we propose RobMRAG, a 3D Gaussian Splatting-Enhanced Multimodal Retrieval-Augmented Generation (MRAG) framework for zero-shot robotic manipulation. Specifically, we construct a multi-source manipulation knowledge base containing object contact frames, task completion frames, and pose parameters. During inference, a Hierarchical Multimodal Retrieval module first employs hybrid semantic search to find task-relevant object prototypes, then selects the geometrically closest reference example based on pixel-level similarity and Instance Matching Distance (IMD). We further introduce a 3D-Aware Pose Refinement module based on 3D Gaussian Splatting into the MRAG framework, which aligns the pose of the reference object to the target object in 3D space. The aligned results are reprojected onto the image plane and used as input to the MLLM to enhance the generation of the final pose parameters. Extensive experiments show that on a test set containing 30 categories of household objects, our method improves the success rate by 7.76% compared to the best-performing zero-shot baseline under the same setting, and by 6.54% compared to the state-of-the-art supervised baseline. Our results validate that RobMRAG effectively bridges the gap between high-level semantic reasoning and low-level geometric execution, enabling robotic systems that generalize to unseen objects while remaining inherently interpretable.

Zero-Shot Robotic Manipulation via 3D Gaussian Splatting-Enhanced Multimodal Retrieval-Augmented Generation

Large language models (LLMs) have shown great promise in automating data science workflows. However, existing models still struggle with multi-step reasoning and tool use, limiting their effectiveness on complex data analysis tasks.
To address this limitation, we propose a scalable pipeline that extracts high-quality, tool-based data analysis tasks and their executable multi-step solutions from real-world Jupyter notebooks and associated data files.
Using this pipeline, we introduce NbQA, a large-scale dataset of standardized task–solution pairs that reflect authentic tool-use patterns in practical data science scenarios.
To further enhance the multi-step reasoning capabilities, we present Jupiter, a framework that formulates data analysis as a search problem and applies Monte Carlo Tree Search (MCTS) to generate diverse solution trajectories for value model learning.
During inference, Jupiter combines the value model and node visit counts to efficiently collect executable multi-step plans with minimal search steps.
Experimental results show that Qwen2.5-7B and 14B-Instruct models on NbQA solve 75.10\% and 84.44\% of tasks on InfiAgent-DABench, respectively—matching or surpassing GPT-4o and advanced agent frameworks. Further evaluations demonstrate improved generalization and stronger tool-use reasoning across diverse multi-step reasoning tasks.

Jupiter: Enhancing LLM Data Analysis Capabilities via Notebook and Inference-Time Value-Guided Search

Large Vision-Language Models (LVLMs) encode visual inputs as dense sequences of patch-level tokens to capture fine-grained semantics. These visual tokens often outnumber their textual counterparts by a large margin, leading to substantial computational overhead and limiting the scalability of LVLMs in practice. Previous efforts have explored visual token reduction either prior to or within the large language models (LLMs). However, most in-LLM reduction approaches rely on text-conditioned interactions, implicitly assuming that textual tokens can reliably capture the importance of visual tokens. In this work, we revisit this assumption and reveal causal, semantic, and spatial forms of cross-modal misalignment. These misalignments undermine the effectiveness of text-guided visual token reduction. To address this, we introduce VisionDrop, a training-free, visual-only pruning framework that selects informative visual tokens based on intra-modal (visual-to-visual) attention, without relying on textual signals. To further suppress redundancy throughout the model hierarchy, we treat the visual encoder and the LLM as a unified system and design a progressive pruning pipeline. Our method performs dominant token selection and lightweight contextual merging at multiple stages, enabling fine-grained visual information to be retained even under aggressive token budgets. Extensive experiments across diverse benchmarks show that VisionDrop achieves consistent improvements over existing approaches, despite requiring no additional training or complex modifications. Notably, when integrated with LLaVA-NeXT-7B, VisionDrop achieves a $2.7\times$ reduction in inference latency and $6\times$ in FLOPs, while retaining 95.71\% of the original performance.

Rethinking Visual Token Reduction in LVLMs Under Cross-Modal Misalignment
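
As a rough illustration of the visual-only selection idea in the VisionDrop abstract above, the sketch below scores each visual token by the attention it receives from other visual tokens and keeps the top-scoring ones, using no text signal. It is a simplified assumption, not the published method: the single-head attention without learned projections, the column-mean scoring rule, and the function names are all hypothetical, and the contextual merging and multi-stage pipeline are omitted.

```python
import numpy as np

def visual_attention(tokens):
    """Single-head scaled dot-product attention among visual tokens only.

    Assumption: no learned query/key projections; a real model would use
    the attention maps already computed inside its encoder or LLM layers.
    """
    d = tokens.shape[1]
    logits = tokens @ tokens.T / np.sqrt(d)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)

def select_dominant_tokens(tokens, keep):
    """Keep the `keep` tokens that receive the most visual-to-visual attention."""
    if not 1 <= keep <= len(tokens):
        raise ValueError("keep must be between 1 and the number of tokens")
    attn = visual_attention(tokens)
    received = attn.mean(axis=0)            # average attention paid to each token
    top = np.argsort(received)[-keep:]      # indices of the highest-scoring tokens
    kept_idx = np.sort(top)                 # restore original spatial order
    return tokens[kept_idx], kept_idx

rng = np.random.default_rng(0)
visual_tokens = rng.normal(size=(16, 8))    # 16 hypothetical patch tokens
kept, kept_idx = select_dominant_tokens(visual_tokens, keep=4)
print(kept.shape, kept_idx)
```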

Spatiotemporal analysis of facial behavior is a crucial method for evaluating the mental state of patients with depression. However, in practice, depressed patients often display facial behaviors similar to those of healthy individuals due to masking tendencies. Additionally, facial expressions often differ among depressed patients, increasing the difficulty of assessment. To address this, we propose Dep-MAP, a video-based automatic depression assessment model for the complex facial behaviors of patients with depression. Dep-MAP adopts a dual-branch architecture to extract visual features of facial behavior and capture the corresponding emotional semantic information. Specifically, the extracted deep semantic features are clustered into semantically distinct prototype sets, where each severity group learns a set of discriminative facial behavior prototype representations, to suppress inter-class semantic confusion. Subsequently, we propose a semantic prototype-supervised contrastive learning method that aligns latent semantics between shallow and deep features, realizing emotional semantic guidance and self-knowledge distillation for the visual feature branch and effectively suppressing intra-class differences. Then, we integrate key depression cues across multiple spatiotemporal scales via a multi-scale weighted fusion strategy, achieving automatic depression assessment. Experimental results demonstrate that Dep-MAP effectively identifies potential key frames in temporal sequences and aggregates key frame representations with semantic consistency, achieving results significantly superior to the state of the art on the AVEC2013 and AVEC2014 public datasets.

Dep-MAP: A Multi-level Alignment Framework with Semantic Prototypes for Video-based Automatic Depression Assessment

Graphical User Interface (GUI) grounding maps natural language instructions to precise interface locations for autonomous interaction.
Current reinforcement learning approaches use binary rewards that treat elements as hit-or-miss targets, creating sparse signals that ignore the continuous nature of spatial interactions.
Motivated by human clicking behavior that naturally forms Gaussian distributions centered on target elements, we introduce GUI Gaussian Grounding Rewards (GUI-G$^2$), a principled reward framework that models GUI elements as continuous Gaussian distributions across the interface plane.
GUI-G$^2$ incorporates two synergistic mechanisms: Gaussian point rewards model precise localization through exponentially decaying distributions centered on element centroids, while coverage rewards assess spatial alignment by measuring the overlap between predicted Gaussian distributions and target regions.
To handle diverse element scales, we develop an adaptive variance mechanism that calibrates reward distributions based on element dimensions.
This framework transforms GUI grounding from sparse binary classification to dense continuous optimization, where Gaussian distributions generate rich gradient signals that guide models toward optimal interaction positions.
Extensive experiments across ScreenSpot, ScreenSpot-v2, and ScreenSpot-Pro benchmarks demonstrate that GUI-G$^2$ substantially outperforms the state-of-the-art method UI-TARS-72B, with the most significant improvement of 28.3\% on ScreenSpot-Pro. Our analysis reveals that continuous modeling provides superior robustness to interface variations and enhanced generalization to unseen layouts, establishing a new paradigm for spatial reasoning in GUI interaction tasks.

GUI-G²: Gaussian Reward Modeling for GUI Grounding
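
The contrast between a binary hit-or-miss reward and a Gaussian point reward, as described in the GUI-G² abstract above, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's exact formulation: the scale factor `alpha`, which ties the standard deviation to the element's width and height, is a hypothetical parameter, and the coverage reward is omitted.

```python
import math

def binary_reward(x, y, box):
    """1.0 if the predicted click lands inside the element box, else 0.0."""
    x1, y1, x2, y2 = box
    return 1.0 if x1 <= x <= x2 and y1 <= y <= y2 else 0.0

def gaussian_point_reward(x, y, box, alpha=0.5):
    """Reward that decays smoothly with distance from the element centroid.

    The per-axis standard deviation scales with element size (assumed rule:
    sigma = alpha * side length), so larger elements tolerate larger offsets.
    """
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    sigma_x = alpha * (x2 - x1)
    sigma_y = alpha * (y2 - y1)
    z = ((x - cx) / sigma_x) ** 2 + ((y - cy) / sigma_y) ** 2
    return math.exp(-0.5 * z)

button = (100.0, 100.0, 200.0, 140.0)   # hypothetical element: x1, y1, x2, y2

for click in [(150.0, 120.0), (190.0, 120.0), (210.0, 120.0)]:
    print(click, binary_reward(*click, button),
          round(gaussian_point_reward(*click, button), 3))
```

A click just outside the box receives zero binary reward but a nonzero Gaussian reward, which is the dense signal the abstract refers to.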

Infrared small target detection often faces significant domain gaps across datasets due to varying sensors and scene distributions. Most existing methods are based on single-domain learning (i.e., training and testing on the same dataset), which requires training a separate detector for each dataset. As a result, they overlook the valuable shared knowledge across domains, which limits their applicability in multiple infrared scenarios. To move beyond single-domain learning and train a single universal detector on multiple datasets simultaneously, we propose, as a first exploration, a cross-domain joint learning task framework with prototype-guided Mixture-of-Experts (CoMoE). Specifically, it uses hyperspherical prototype learning to adaptively maintain both domain-specific prototypes and global prototypes, enhancing cross-domain feature representation. Meanwhile, a domain-aware Mixture-of-Experts with a Top-K routing strategy is proposed to select the optimal domain experts. Moreover, to enhance cross-domain feature alignment, we design an adaptive cross-domain feature modulation with noise-guided contrastive learning. Extensive experiments on a newly constructed benchmark comprising three datasets verify the superiority of CoMoE, even under limited data settings. It often surpasses general joint learning methods and state-of-the-art (SOTA) single-domain ones. Code will be released after acceptance.

Cross-domain Joint Learning with Prototype-guided Mixture-of-Experts for Infrared Moving Small Target Detection
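
For readers unfamiliar with Top-K routing, which the CoMoE abstract above mentions, the following is a generic sketch of the standard technique: a gate scores every expert, only the k best are evaluated, and their outputs are mixed with renormalized softmax weights. It does not reproduce the paper's domain-aware gate or prototype guidance. The linear gate `gate_w` and the toy scaling experts are hypothetical.

```python
import numpy as np

def top_k_route(gate_logits, k):
    """Pick the k highest-scoring experts per sample and softmax their logits."""
    idx = np.argsort(gate_logits, axis=-1)[:, -k:]
    picked = np.take_along_axis(gate_logits, idx, axis=-1)
    picked = picked - picked.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(picked)
    weights /= weights.sum(axis=-1, keepdims=True)
    return idx, weights

def moe_forward(x, gate_w, experts, k=2):
    """Mix the outputs of the selected experts for each input row.

    Assumes every expert maps a vector to a vector of the same size.
    """
    idx, weights = top_k_route(x @ gate_w, k)
    out = np.zeros_like(x)
    for b in range(x.shape[0]):
        for j in range(k):
            out[b] += weights[b, j] * experts[idx[b, j]](x[b])
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))                 # 5 hypothetical feature vectors
gate_w = rng.normal(size=(8, 4))            # linear gate over 4 experts
experts = [lambda v, s=s: s * v for s in (0.5, 1.0, 1.5, 2.0)]  # toy experts
y = moe_forward(x, gate_w, experts, k=2)
print(y.shape)
```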

Current methodologies for incremental object detection (IOD) primarily rely on Faster R-CNN or DETR series detectors; however, these approaches do not accommodate the real-time YOLO detection frameworks. In this paper, we first identify three primary types of knowledge conflicts that contribute to catastrophic forgetting in YOLO-based incremental detectors: foreground-background confusion, parameter interference, and misaligned knowledge distillation. Subsequently, we introduce YOLO-IOD, a real-time IOD framework that is constructed upon the pretrained YOLO-World model, facilitating incremental learning via a stage-wise parameter-efficient finetuning process. Specifically, YOLO-IOD encompasses three principal components: 1) Conflict-Aware Pseudo-Label Refinement (CPR), which mitigates the foreground-background confusion by leveraging the confidence levels of pseudo labels and identifying potential objects relevant to future tasks. 2) Importance-based Kernel Selection (IKS), which identifies and updates the pivotal convolution kernels pertinent to the current task during the current learning stage. 3) Cross-Stage Asymmetric Knowledge Distillation (CAKD), which addresses the misaligned knowledge distillation conflict by transmitting the features of the student target detector through the detection heads of both the previous and current teacher detectors, thereby facilitating asymmetric distillation between existing and newly introduced categories. We further introduce LoCo COCO, a more realistic benchmark that eliminates data leakage across stages. Experiments on both conventional and LoCo COCO benchmarks show that YOLO-IOD achieves superior performance with minimal forgetting. Our code is available in the supplementary material.

YOLO-IOD: Towards Real Time Incremental Object Detection

Multimodal Emotion Recognition in Conversation (MERC) aims to predict speakers’ emotions by integrating textual, acoustic, and visual cues. Existing approaches either struggle to capture complex cross‑modal interactions or experience gradient conflicts and unstable training when using deeper architectures. To address these issues, we propose Cross-Space Synergy (CSS), which couples a representation component with an optimization component. Synergistic Polynomial Fusion (SPF) serves the representation role, leveraging low-rank tensor factorization to efficiently capture high-order cross-modal interactions. Pareto Gradient Modulator (PGM) serves the optimization role, steering updates along Pareto-optimal directions across competing objectives to alleviate gradient conflicts and improve stability. Experiments show that CSS outperforms existing representative methods on IEMOCAP and MELD in both accuracy and training stability, verifying its effectiveness in complex multimodal scenarios.

