Transformer models have achieved remarkable success across diverse deep learning fields, including natural language processing (NLP) and computer vision (CV). One drawback of these models is that the computational cost of softmax attention, the core component of the transformer, exhibits quadratic complexity in both time and memory with respect to sequence length. As data scales up, various approaches have been proposed to overcome this bottleneck. The objective of this study is to propose a novel attention mechanism, "Cumulant Attention," that systematically balances efficiency and accuracy. This proposal introduces a statistical-mechanics perspective and a reliable approximation based on the cumulant expansion into the attention layer. The low-order variant reduces the computational complexity to linear order, as in linear attention, while preserving the nonlinearity of softmax attention. We evaluate several variants on CV tasks, including image classification with ViT on ImageNet-100 and video classification with ViViT on UCF-101. Experimental results demonstrate that cumulant attention outperforms linear attention and achieves accuracy comparable to softmax attention. These findings validate the effectiveness of our approach and highlight future directions, including scaling to larger models, extending to other modalities, and optimizing implementations for GPU hardware.
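The abstract contrasts quadratic softmax attention with linear-complexity alternatives but does not spell out the cumulant construction itself. As a hedged illustration of the general idea, the sketch below compares standard softmax attention with a generic second-order truncation of exp(q·k) evaluated in linear time via a feature map. This is one well-known way to retain a nonlinear term beyond first-order linear attention, not the paper's actual method; all function names here are hypothetical.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard attention: O(n^2) time and memory in sequence length n."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def second_order_features(X):
    """Feature map phi such that phi(q) . phi(k) = 1 + q.k + (q.k)^2 / 2,
    i.e. the exponential kernel truncated at second order."""
    n, d = X.shape
    # Outer-product features give the (q.k)^2 / 2 term when dotted.
    quad = np.einsum("ni,nj->nij", X, X).reshape(n, d * d) / np.sqrt(2.0)
    return np.concatenate([np.ones((n, 1)), X, quad], axis=-1)

def truncated_kernel_attention(Q, K, V):
    """Linear-time attention: the n x n weight matrix is never formed.
    Cost is O(n * f) where f is the feature dimension, not O(n^2)."""
    scale = Q.shape[-1] ** 0.25  # split the usual 1/sqrt(d) between Q and K
    phi_q = second_order_features(Q / scale)
    phi_k = second_order_features(K / scale)
    kv = phi_k.T @ V             # (f, d_v) summary of keys and values
    z = phi_k.sum(axis=0)        # (f,) normalizer summary
    return (phi_q @ kv) / (phi_q @ z)[:, None]
```

Note that 1 + x + x²/2 is strictly positive for all real x, so the normalizer never vanishes; for small attention scores the truncated kernel closely tracks the softmax output, which is one way to see how such expansions trade accuracy for linear cost.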
