Singapore

Human-Object Interaction (HOI) detection aims to localize human-object pairs and recognize their interactions in images. Although DETR-based methods have recently emerged as the mainstream framework for HOI detection, they still suffer from a key limitation: randomly initialized queries lack explicit semantics, leading to suboptimal detection performance. To address this challenge, we propose QueryCraft, a novel plug-and-play HOI detection framework that incorporates semantic priors and guided feature learning through transformer-based query initialization. Central to our approach is **ACTOR** (**A**ction-aware **C**ross-modal **T**ransf**OR**mer), a cross-modal Transformer encoder that jointly attends to visual regions and textual prompts to extract action-relevant features. Rather than merely aligning modalities, ACTOR leverages language-guided attention to infer interaction semantics and produce semantically meaningful query representations. To further enhance object-level query quality, we introduce a **P**erceptual **D**istilled **Q**uery **D**ecoder (**PDQD**), which distills object category awareness from a pre-trained detector to initialize the object queries. This dual-branch query initialization enables the model to generate more interpretable and effective queries for HOI detection. Extensive experiments on the HICO-Det and V-COCO benchmarks demonstrate that our method achieves state-of-the-art performance and strong generalization. Code will be released upon publication.

AAAI 2026

QueryCraft: Transformer-Guided Query Initialization for Enhanced Human-Object Interaction Detection

cv: scene analysis & understanding

hai: human-computer interaction

cv: multi-modal vision

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.


Brain decoding aims to reconstruct video from brain signals. Existing brain decoding frameworks are primarily built on a subject-dependent paradigm, which requires large amounts of brain data for each subject. However, the high cost of collecting brain-video data causes severe data scarcity for brain decoding. Although some cross-subject methods have been introduced, they often focus excessively on subject-invariant information while neglecting subject-specific information, resulting in slow, fine-tuning-based adaptation. To achieve fast and data-efficient new subject adaptation, we propose **MindCross**, a novel cross-subject brain decoding framework. MindCross's *N* specific encoders and one shared encoder are designed to extract subject-specific and subject-invariant information, respectively. Additionally, a Top-*K* collaboration module is adopted to enhance new subject decoding with the knowledge learned from previous subjects' encoders. Extensive experiments on fMRI/EEG-to-video benchmarks demonstrate MindCross's efficacy and efficiency in cross-subject decoding and new subject adaptation using only one model. Code of our framework will be released upon publication.

MindCross: Fast New Subject Adaptation with Limited Data for Cross-subject Video Reconstruction from Brain Signals

Partially linear models (PLMs) have attracted much attention for regression estimation and variable selection because they can use linear and nonlinear approximations jointly. However, theoretical understanding of how they control the false discovery rate (FDR) during variable selection remains limited. To address this issue, we formulate a new integral-based knockoffs (IKO) inference scheme for controlled variable selection in PLMs, where integral-based knockoff statistics are used to measure variable importance and B-splines (or random Fourier features) are employed to approximate the nonlinear components. In theory, FDR control is guaranteed for both the linear and nonlinear parts, and a statistical analysis of the scheme's power is established. Empirical evaluations validate the effectiveness of our proposed approach.

Integral-based Knockoffs Inference for Partially Linear Models
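For readers unfamiliar with knockoff-based selection, the following is a minimal sketch of the standard knockoff+ thresholding step, which picks a data-dependent cutoff on the knockoff statistics so that the estimated false discovery proportion stays below a target level `q`. It is not the paper's IKO procedure: the abstract does not specify how the integral-based statistics are computed, so the statistics `w` are assumed to be given, and the function names `knockoff_plus_threshold` and `select_variables` are hypothetical.

```python
from typing import List, Sequence


def knockoff_plus_threshold(w: Sequence[float], q: float) -> float:
    """Return the smallest t with (1 + #{W_j <= -t}) / max(1, #{W_j >= t}) <= q.

    Returns infinity when no candidate threshold satisfies the bound,
    in which case nothing is selected.
    """
    # Only the nonzero magnitudes of the statistics can change the counts.
    candidates = sorted({abs(x) for x in w if x != 0})
    for t in candidates:
        negatives = sum(1 for x in w if x <= -t)  # proxy for false discoveries
        positives = sum(1 for x in w if x >= t)   # variables selected at t
        if (1 + negatives) / max(1, positives) <= q:
            return t
    return float("inf")


def select_variables(w: Sequence[float], q: float) -> List[int]:
    """Indices of variables whose statistic reaches the knockoff+ threshold."""
    t = knockoff_plus_threshold(w, q)
    return [j for j, x in enumerate(w) if x >= t]


# Illustrative statistics (made up): large positive values suggest real signals.
w_stats = [5.0, 4.0, 3.5, 3.0, 2.5, -0.5, 0.3, -0.2, 2.0, 1.5]
selected = select_variables(w_stats, q=0.2)
```

Large positive statistics indicate that a variable beats its knockoff copy, while negative statistics serve as an estimate of how many false discoveries sit above the threshold; this symmetry is what the FDR guarantee rests on.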

Computing geodesic distances on 3D surfaces is fundamental to many tasks in 3D vision and geometry processing, with deep connections to tasks such as shape correspondence. Recent learning-based methods achieve strong performance but rely on large 3D backbones, leading to high memory usage and latency, which limit their use in interactive or resource-constrained settings. We introduce LiteGE, a lightweight approach that constructs compact, category-aware shape descriptors by applying PCA to unsigned distance field (UDF) samples at informative voxels. This descriptor is efficient to compute and removes the need for high-capacity networks. LiteGE remains robust on sparse point clouds, supporting inputs with as few as 300 points, where prior methods fail. Extensive experiments show that LiteGE reduces memory usage and inference time by up to 300x compared to existing neural approaches. In addition, by exploiting the intrinsic relationship between geodesic distance and shape correspondence, LiteGE enables fast and accurate shape matching. Our method achieves up to 1000x speedup over state-of-the-art mesh-based approaches while maintaining comparable accuracy on non-isometric shape pairs, including evaluations on point-cloud inputs. The source code and trained models are available at https://github.com/yya-111/LiteGE.

LiteGE: Lightweight Geodesic Embedding for Efficient Geodesics Computation and Non-Isometric Shape Correspondence

The rapid advancement of audio generation technologies has escalated the risks of malicious deepfake audio across speech, sound, singing voice, and music, threatening multimedia security and trust. While existing countermeasures (CMs) perform well in single-type audio deepfake detection (ADD), their performance declines in cross-type scenarios. This paper is dedicated to studying the all-type ADD task. We are the first to comprehensively establish an all-type ADD benchmark to evaluate current CMs, incorporating cross-type deepfake detection across speech, sound, singing voice, and music. Then, we introduce the prompt tuning self-supervised learning (PT-SSL) training paradigm, which optimizes the SSL front-end by learning specialized prompt tokens for ADD, requiring 458× fewer trainable parameters than fine-tuning (FT). Considering the auditory perception of different audio types, we propose the wavelet prompt tuning (WPT)-SSL method to capture type-invariant auditory deepfake information from the frequency domain without requiring additional training parameters, thereby enhancing performance over FT in the all-type ADD task. To achieve a universal CM, we utilize all types of deepfake audio for co-training. Experimental results demonstrate that WPT-XLSR-AASIST achieves the best performance, with an average EER of 3.58% across all evaluation sets.

Detect All-Type Deepfake Audio: Wavelet Prompt Tuning for Enhanced Auditory Perception

Developing a multi-modal language model capable of understanding 3D scenes remains challenging due to the limited availability of 3D training data, in contrast to the abundance of 2D datasets used for vision-language models (VLMs). As an alternative, we introduce LLaVA³ (pronounced LLaVA Cube), a novel method that improves the 3D scene understanding capabilities of VLMs using only multi-view 2D images, and without requiring any fine-tuning. Inspired by Cubist painters, who represented multiple viewpoints of a 3D object within a single 2D picture, we propose to describe the 3D scene for the VLM through omnidirectional visual representations of each object. These representations are derived from an intermediate multi-view 3D reconstruction of the scene. Extensive experiments on 3D visual question answering and 3D language grounding show that our approach significantly outperforms previous 2D-based VLM solutions.

LLaVA³: Representing 3D Scenes Like a Cubist Painter to Boost 3D Scene Understanding of VLMs

Existing audio adversarial attack methods suffer from poor transferability, primarily due to insufficient exploration of model decision mechanisms and overreliance on heuristic-driven algorithm design. This paper aims to close this gap. Specifically, through observations across three mainstream audio tasks (Automatic Speech Recognition, Speaker Verification, and Keyword Spotting), we reveal that these models primarily rely on local temporal features: time-shuffled inputs retain 83.7% of the original accuracy. SHAP-based visualization further shows that time shuffling significantly shifts the salient regions of the model while the samples are still correctly identified, indicating the presence of redundant features that can affect decision-making. Inspired by these findings, we propose the Time-Shuffle (TS) adversarial attack (including segment-based TS and phoneme-level TS-p). This method divides audio or phonemes into segments, randomly shuffles them, and computes gradients on the shuffled structure. By forcing perturbations to exploit transferable local temporal features and reducing overfitting to source-specific patterns, TS/TS-p inherently enhances transferability. As a model-agnostic framework, TS/TS-p can seamlessly integrate with existing attack methods. Comprehensive experiments demonstrate that TS-p achieves state-of-the-art results and boosts transferability by about 23%/14.7%/6.3% on ASR/ASV/KWS.

Time Shuffle: A Transferability-Booster for Multiple Audio Adversarial Tasks
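The segment-level shuffling step described in the abstract can be sketched as follows. This is an illustrative sketch only: the function name `time_shuffle`, the equal-length segmentation, and the `seed` parameter are assumptions, and the gradient computation on the shuffled input (which depends on the attacked model) as well as the phoneme-level TS-p variant are not shown.

```python
import random
from typing import List, Optional, Sequence


def time_shuffle(samples: Sequence[float], num_segments: int,
                 seed: Optional[int] = None) -> List[float]:
    """Split a waveform into contiguous segments and randomly reorder them."""
    if num_segments < 1:
        raise ValueError("num_segments must be at least 1")
    rng = random.Random(seed)
    n = len(samples)
    # Segment boundaries, spread as evenly as possible over the waveform.
    bounds = [round(i * n / num_segments) for i in range(num_segments + 1)]
    segments = [list(samples[bounds[i]:bounds[i + 1]])
                for i in range(num_segments)]
    # Shuffle whole segments: ordering inside each segment is preserved,
    # so local temporal structure survives while global order is destroyed.
    rng.shuffle(segments)
    return [x for segment in segments for x in segment]


# Toy "waveform" whose values equal their original positions.
waveform = [float(i) for i in range(12)]
shuffled = time_shuffle(waveform, num_segments=4, seed=0)
```

In an attack loop, one would presumably feed `shuffled` (rather than `waveform`) to the source model when computing gradients, then map the perturbation back to the original sample order.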

Spectral clustering is an important branch of unsupervised learning, and its application in deep learning has attracted wide attention. However, for high-dimensional sparse datasets, the growing network scale leads to parameter explosion, and static Gaussian kernels often impose an incorrect preset data structure. To overcome these challenges, we propose a novel deep clustering model, Deep Clustering Based on Sparse Kolmogorov-Arnold Network (KAN) and Spectral Constraint. It contains a deep sparse clustering framework, in which a sparse KAN and an orthogonal layer are designed to enhance the sparsity of the activation function matrix, reduce the number of parameters, and improve the stability of model convergence. Additionally, we add an adaptively optimized affinity matrix based on the spectral constraint, which overcomes the limitations of static Gaussian kernels and improves the performance and stability of the spectral constraint. Experimental results on both synthetic and real datasets demonstrate that our model outperforms existing methods in clustering performance, computational efficiency, and stability.

Deep Clustering Based on Sparse Kolmogorov-Arnold Network and Spectral Constraint

Latent Diffusion Models have become a powerful tool for generating high-fidelity unrestricted adversarial examples. However, existing methods typically perturb only the initial latent or rely on prompt engineering, which is ill-suited to the iterative nature of the diffusion process; they also suffer from optimization instability caused by external text prompts and from cumulative drift that pushes the adversarial images off the data manifold. In this paper, we propose a hierarchical attack framework that operates in alignment with the model's generative manifold and leverages intermediate denoising states to maximize attack transferability and visual fidelity. Extensive experiments show that the proposed attack improves adversarial transferability by 10-20% against a diverse set of normally-trained models and achieves an over 10.5% higher success rate against adversarially-defended models, while simultaneously enhancing visual quality with a 1.0-1.2 FID reduction and a 16.7% LPIPS improvement.

Beyond Single-Point Perturbation: A Hierarchical, Manifold-Aware Approach to Diffusion Attacks

Learning Curve Extrapolation (LCE) is a critical technique for accelerating automated machine learning by terminating unpromising training runs early. Recent state-of-the-art methods have improved predictive accuracy by incorporating contextual information, such as neural network architecture. However, these approaches, whether context-agnostic or architecture-aware, still operate under the implicit assumption of a uniform task landscape. They overlook a pivotal, complementary factor: the intrinsic difficulty of the learning task itself. This oversight leads to a significant degradation in performance, especially for tasks whose learning dynamics diverge from the model's priors. In this work, we argue that task difficulty is a crucial yet neglected dimension for robust LCE. We introduce a novel framework, Difficulty-Adaptive Learning Curve Extrapolation (DA-LCE), which explicitly conditions its predictions on task complexity. Our core contributions are threefold: (1) We propose a transparent, rule-based method to quantify task difficulty from the early shape of learning curves, eliminating the need for external meta-features. (2) We design a novel data generation pipeline using a conditional diffusion model to create a high-fidelity, difficulty-conditioned synthetic prior for training. (3) We introduce a Conditional Difficulty-aware PFN (CD-PFN) that leverages this information to achieve superior predictive accuracy. Extensive experiments on a wide range of benchmarks demonstrate that our CD-PFN significantly outperforms both difficulty-agnostic baselines and even state-of-the-art architecture-aware models. This result highlights that task difficulty is a powerful, complementary source of information, whose impact can be as significant as, or even greater than, that of the model architecture.

Difficulty-Aware Learning Curve Extrapolation

Spiking Neural Networks (SNNs) are emerging as a promising energy-efficient alternative to Artificial Neural Networks (ANNs) due to their event-driven computation paradigm. However, recent advances toward large-scale high-performance SNNs inevitably lead to substantial memory and computational overhead. While quantization offers a potential solution, many quantization approaches fail to deliver verifiable efficiency gains on resource-constrained hardware platforms. In this paper, we propose a lightweight and hardware-friendly SNN that applies quantization to both weights and membrane potentials, termed HardF-SNN. Specifically, we first build a baseline model that adopts shared-scale quantization and batch normalization (BN) folding to simulate integer-only inference during training, since this baseline model has not been thoroughly discussed in previous SNN work. Although the baseline enables integer-arithmetic-only inference, it suffers from performance degradation and may even lead to training failure. To address these issues, we thoroughly analyze the problems caused by quantization and BN folding, and propose solutions to enhance the baseline’s performance. Specifically, we introduce proportional shared-scale quantization to enhance the representation capability, and propose an integer-only BN method to stabilize training convergence through integer arithmetic and bit-shifting operations. Extensive experiments show that HardF-SNN achieves an optimal balance between performance and efficiency, exhibiting excellent compatibility with mainstream hardware accelerators. To demonstrate its effectiveness on resource-constrained platforms, HardF-SNN is deployed on a dedicated FPGA-based hardware accelerator. Evaluation results indicate that our implementation surpasses current state-of-the-art accelerators.
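As background for the BN folding mentioned above, here is a minimal floating-point sketch of the standard technique: the batch-normalization statistics and affine parameters are merged into the preceding layer's weights and bias so that inference needs only one linear operation. It does not reproduce HardF-SNN's shared-scale quantization or its integer-only BN method, which the abstract does not detail; all function names and the toy values are hypothetical.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def linear(weight: Matrix, bias: Vector, x: Vector) -> Vector:
    """y = W x + b for a fully connected layer stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def batch_norm(y: Vector, gamma: Vector, beta: Vector,
               mean: Vector, var: Vector, eps: float = 1e-5) -> Vector:
    """Inference-mode batch normalization using stored running statistics."""
    return [g * (yi - m) / math.sqrt(v + eps) + bt
            for yi, g, bt, m, v in zip(y, gamma, beta, mean, var)]


def fold_batch_norm(weight: Matrix, bias: Vector, gamma: Vector, beta: Vector,
                    mean: Vector, var: Vector,
                    eps: float = 1e-5) -> Tuple[Matrix, Vector]:
    """Merge BN into the preceding layer: W' = s * W, b' = s * (b - mean) + beta,
    with the per-output-channel scale s = gamma / sqrt(var + eps)."""
    folded_w, folded_b = [], []
    for row, b, g, bt, m, v in zip(weight, bias, gamma, beta, mean, var):
        scale = g / math.sqrt(v + eps)
        folded_w.append([scale * w for w in row])
        folded_b.append(scale * (b - m) + bt)
    return folded_w, folded_b


# Toy layer with two outputs and three inputs (values are made up).
W = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.75]]
b = [0.1, -0.2]
gamma, beta = [1.2, 0.8], [0.05, -0.1]
mean, var = [0.3, -0.4], [0.9, 1.6]
x = [1.0, 2.0, -1.0]

reference = batch_norm(linear(W, b, x), gamma, beta, mean, var)
W_folded, b_folded = fold_batch_norm(W, b, gamma, beta, mean, var)
folded = linear(W_folded, b_folded, x)
```

After folding, `folded` matches `reference` up to floating-point rounding, which is why the BN layer can be dropped at inference time.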
