Singapore

Continual learning (CL) aims to equip models with the ability to learn from a stream of tasks without forgetting previous knowledge. With the progress of vision-language models like Contrastive Language-Image Pre-training (CLIP), their promise for CL has attracted increasing attention due to their strong generalizability. However, the potential of rich textual semantic priors in CLIP in addressing the stability–plasticity dilemma remains underexplored. During backbone training, most approaches transfer past knowledge without considering semantic relevance, leading to interference from unrelated tasks that disrupt the balance between stability and plasticity. Besides, while text-based classifiers provide strong generalization, they suffer from limited plasticity due to the inherent modality gap in CLIP. Visual classifiers help bridge this gap, but their prototypes lack rich and precise semantics. To address these challenges, we propose Semantic-Enriched Continual Adaptation (SECA), a unified framework that harnesses the anti-forgetting and structured nature of textual priors to guide semantic-aware knowledge transfer in the backbone and reinforce the semantic structure of the visual classifier. Specifically, a Semantic-Guided Adaptive Knowledge Transfer (SG-AKT) module is proposed to assess new images' relevance to diverse historical visual knowledge via textual cues, and aggregate relevant knowledge in an instance-adaptive manner as distillation signals. Moreover, a Semantic-Enhanced Visual Prototype Refinement (SE-VPR) module is introduced to refine visual prototypes using inter-class semantic relations captured in class-wise textual embeddings. Extensive experiments on multiple benchmarks validate the effectiveness of our approach. Code is available in the supplementary materials.

AAAI 2026

Harnessing Textual Semantic Priors for Knowledge Transfer and Refinement in CLIP-Driven Continual Learning

continual learning; semantic-enriched; knowledge transfer; prototype refinement
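The instance-adaptive aggregation that the abstract attributes to SG-AKT can be pictured with a small sketch. It is an illustration under assumptions, not the authors' implementation: the cosine-similarity relevance score, the softmax with a `temperature` parameter, and all function names and toy vectors below are hypothetical choices made for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def relevance_weights(image_feat, task_text_feats, temperature=0.1):
    """Softmax over image-to-text similarities, one weight per past task."""
    sims = [cosine(image_feat, t) / temperature for t in task_text_feats]
    peak = max(sims)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate_teacher(teacher_feats, weights):
    """Weighted sum of per-task teacher features, used as a distillation target."""
    dim = len(teacher_feats[0])
    return [sum(w * f[i] for w, f in zip(weights, teacher_feats)) for i in range(dim)]

# Toy inputs (assumed): one image feature, text embeddings for two past tasks,
# and the features that two past-task models produce for the same image.
image_feat = [1.0, 0.0]
task_text_feats = [[0.9, 0.1], [0.0, 1.0]]
teacher_feats = [[1.0, 2.0], [5.0, 6.0]]

weights = relevance_weights(image_feat, task_text_feats)
target = aggregate_teacher(teacher_feats, weights)
```

In this toy case the first past task is far more relevant to the image, so the aggregated target stays close to the first teacher's features and the unrelated task contributes almost nothing.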

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held at Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.

Unrestricted adversarial attacks aim to fool DNNs by generating effective yet photorealistic examples. However, previous methods usually rely on global perturbations to enhance attack performance, which inevitably introduces visual distortions. To reduce visual distortions in the background, we propose a diffusion-based framework that focuses on local perturbations to generate object-level unrestricted adversarial examples (ObjectAdv). First, since the cross-attention maps of Stable Diffusion contain object information, we directly leverage them to localize the semantic region of the object to attack. Second, a prompt-switching strategy is proposed to achieve both imperceptibility and attack capacity. Specifically, to preserve the layout and object shape of the clean image, a prompt of the true category is used at the early denoising steps. At the later steps, we propose a well-designed prompt to guide the diffusion model to generate transferable adversarial examples. This local attack may cause inconsistency between the perturbed object and the background in adversarial examples, so an FFT-based edge smoother is utilized to ensure seamless blending of the edges. ObjectAdv achieves an average ASR of 99.2% in the white-box test on the ImageNet-compatible dataset, and outperforms existing methods on defense performance (+5%) and image quality metrics, e.g., SSIM of 0.9140 (+0.1048) and FID of 25.63 (-19.27).

ObjectAdv: Object-Level Unrestricted Adversarial Attacks via Diffusion Models
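The two mechanisms named in the ObjectAdv abstract, localizing the object from an attention map and switching prompts partway through denoising, can be sketched as below. This is a hedged illustration only: the min-max normalization, the `threshold` and `switch_ratio` parameters, the function names, and the toy arrays are assumptions for this example, and the FFT-based edge smoother is left out.

```python
def object_mask(attention_map, threshold=0.5):
    """Binarize a min-max normalized attention map into an object mask."""
    lo = min(min(row) for row in attention_map)
    hi = max(max(row) for row in attention_map)
    span = (hi - lo) or 1.0  # avoid division by zero on a flat map
    return [[1 if (v - lo) / span >= threshold else 0 for v in row]
            for row in attention_map]

def select_prompt(step, total_steps, true_prompt, adv_prompt, switch_ratio=0.5):
    """Use the true-category prompt early, then switch to the adversarial prompt."""
    return true_prompt if step < int(total_steps * switch_ratio) else adv_prompt

def blend(clean, perturbed, mask):
    """Keep perturbed pixels inside the mask and clean pixels outside it."""
    return [[p if m else c for c, p, m in zip(c_row, p_row, m_row)]
            for c_row, p_row, m_row in zip(clean, perturbed, mask)]

# Toy 2x2 example (assumed values): the object occupies the bottom row.
attention = [[0.1, 0.2], [0.8, 0.9]]
mask = object_mask(attention)
prompts = [select_prompt(t, 10, "true", "adv") for t in range(10)]
result = blend([[0, 0], [0, 0]], [[9, 9], [9, 9]], mask)
```

Restricting the perturbation to the masked region is what leaves the background untouched; the hard 0/1 boundary in this sketch is the kind of seam that an edge-smoothing step would then need to soften.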

Graph classification is a critical task in analyzing graph data, with applications across various domains. While graph neural networks (GNNs) have achieved remarkable results, their ability to generalize across graphs of varying scales remains a challenge. Conventional models often perform well on large-scale graphs but struggle with distributions that are skewed towards small scales. Conversely, models tailored to address scale imbalances frequently prioritize small-scale graphs, leading to diminished performance in more balanced scenarios. To overcome these limitations, we introduce an Unbalanced-Balanced Representation Converter (U2B), which exhibits no explicit bias toward graph scales. U2B employs a two-step workflow: a distillation phase to extract base features from both node-level and graph-level representations, followed by a refinement phase to generate biased representations for improved balance. In the distillation phase, a static constraint guides node-level adjustments, improving the representation of nodes in small graphs. Simultaneously, a dynamic constraint in the graph-level process mitigates biases toward features from large graphs. To ensure harmony between the representations, a consistency alignment loss is introduced, aligning node-level and graph-level features to create more cohesive and balanced graph representations. Extensive experiments with 16 baselines across 8 datasets demonstrate that U2B achieves SOTA performance, with improvements of up to 22.19%. Additionally, we establish its strong compatibility with a range of other models. All associated code is provided in the supplementary material.

U2B: Scale-unbiased Representation Converter for Graph Classification with Imbalanced and Balanced Scale Distributions

Diffusion Transformers (DiTs) have achieved state-of-the-art performance in generative modeling, yet their high computational cost hinders real-time deployment. While feature caching offers a promising training-free acceleration solution by exploiting temporal redundancy, existing methods suffer from two key limitations: (1) uniform caching intervals fail to align with the non-uniform temporal dynamics of DiT, and (2) naive feature reuse with excessively large caching intervals can lead to severe error accumulation. In this work, we analyze the evolution of DiT features during denoising and reveal that both feature changes and error propagation are highly time- and depth-varying. Motivated by this, we propose ProCache, a training-free dynamic feature caching framework that addresses these issues via two core components: (i) a constraint-aware caching pattern search module that generates non-uniform activation schedules through offline constrained sampling, tailored to the model’s temporal characteristics; and (ii) a selective computation module that selectively computes only deep blocks and high-importance tokens within cached segments to mitigate error accumulation with minimal overhead. Extensive experiments on PixArt-$\alpha$ and DiT demonstrate that ProCache achieves up to 1.96$\times$ and 2.90$\times$ acceleration with negligible quality degradation, significantly outperforming prior caching-based methods.

ProCache: Constraint-Aware Feature Caching with Selective Computation for Diffusion Transformer Acceleration
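The basic idea of feature caching with a non-uniform activation schedule, as described in the ProCache abstract, can be illustrated with a toy loop. This is a minimal sketch under assumptions: the schedule `{0, 4, 7, 9}`, the stand-in `block` function, and the helper names are hypothetical, and neither the offline constrained search nor the selective recomputation of deep blocks and important tokens is modeled.

```python
def run_with_cache(num_steps, compute_steps, block):
    """Run a denoising loop, recomputing `block` only on scheduled steps."""
    cached = None
    outputs, n_computed = [], 0
    for step in range(num_steps):
        if cached is None or step in compute_steps:
            cached = block(step)  # full computation, refreshes the cache
            n_computed += 1
        outputs.append(cached)    # on other steps the cached feature is reused
    return outputs, n_computed

def max_interval(compute_steps, num_steps):
    """Largest number of consecutive steps served by one cached feature."""
    points = sorted(compute_steps) + [num_steps]
    return max(b - a for a, b in zip(points, points[1:]))

# Hypothetical non-uniform schedule over 10 steps; `block` stands in for
# an expensive transformer block.
schedule = {0, 4, 7, 9}
outputs, n_computed = run_with_cache(10, schedule, block=lambda t: t * t)
longest = max_interval(schedule, 10)
```

Here 4 of 10 steps are computed in full, and a constraint such as an upper bound on `longest` is one simple way to express the concern that overly large caching intervals accumulate error.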

6-DoF object grasping is a crucial skill for embodied intelligent robots. Previous methods often rely on large-scale networks for feature extraction, followed by grasp pose prediction, which increases the network's parameter count and overlooks the geometric and graph features of the point cloud. To address these challenges, we propose GraphGrasp, a graph-guided 6-DoF grasping pose prediction method. It performs graph analysis from the perspectives of scene, object, and grasping graphs. First, we introduce a graph feature embedding method based on local-global features to model the scene graph effectively. Then, we use a graph transformer strategy to represent spatial relationships between objects in the object graph. Finally, we propose a multi-metric, multi-level grasp pose evaluation algorithm to predict and explore graspable points, enabling effective construction of grasp graphs and accurate grasp pose evaluation. We test GraphGrasp on the GraspNet-1Billion dataset, and the results show that, compared to previous methods, it achieves nearly the same performance with about $\frac{1}{5}$ of the parameters of state-of-the-art methods, significantly improving grasp pose prediction speed. Additionally, in real-world robot grasping scenarios, GraphGrasp outperforms previous methods in practical grasp pose prediction tasks.

GraphGrasp: Lightweight and Efficient Graph-Guided 6-DoF Robotic Grasp Pose Estimation Network

Large Vision-Language Models (VLMs) face an inherent contradiction in image captioning: their powerful single-step generation capabilities often lead to a myopic decision-making process. This makes it difficult to maintain global narrative coherence while capturing rich details, a limitation that is particularly pronounced in tasks that require multi-step and complex scene description. To overcome this fundamental challenge, we redefine image captioning as a goal-oriented hierarchical refinement planning problem, and further propose a novel framework, named Top-Down Semantic Refinement (TDSR), which models the generation process as a Markov Decision Process (MDP). However, planning within the vast state space of a VLM presents a significant computational hurdle. Our core contribution, therefore, is the design of a highly efficient Monte Carlo Tree Search (MCTS) algorithm tailored for VLMs. By incorporating a visual-guided parallel expansion and a lightweight value network, our TDSR reduces the call frequency to the expensive VLM by an order of magnitude without sacrificing planning quality. Furthermore, an adaptive early stopping mechanism dynamically matches computational overhead to the image's complexity. Extensive experiments on multiple benchmarks, including DetailCaps, COMPOSITIONCAP, and POPE, demonstrate that our TDSR, as a plug-and-play module, can significantly enhance the performance of existing VLMs (e.g., LLaVA-1.5, Qwen2.5-VL) by achieving state-of-the-art or highly competitive results in fine-grained description, compositional generalization, and hallucination suppression.

Top-Down Semantic Refinement for Image Captioning
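For readers unfamiliar with Monte Carlo Tree Search, the selection step can be sketched with the standard UCT rule. The TDSR abstract does not state which selection rule is used, so UCT, the exploration constant `c`, and the dictionary-based node layout below are assumptions made purely for illustration.

```python
import math

def uct_score(total_value, visits, parent_visits, c=1.4):
    """Standard UCT: mean value plus an exploration bonus."""
    if visits == 0:
        return float("inf")  # unvisited children are tried first
    return total_value / visits + c * math.sqrt(math.log(parent_visits) / visits)

def select_child(children, parent_visits, c=1.4):
    """Pick the child with the highest UCT score."""
    return max(children,
               key=lambda ch: uct_score(ch["value"], ch["visits"], parent_visits, c))

# Toy candidates (assumed): two partial caption refinements under one parent.
children = [
    {"name": "a", "value": 3.0, "visits": 5},
    {"name": "b", "value": 1.0, "visits": 1},
]
chosen = select_child(children, parent_visits=6)
```

Child `b` is selected here because it has been visited only once, so its exploration bonus outweighs the higher visit count of `a`. In a setting like the one the abstract describes, each evaluation is an expensive model call, which is why reducing the number of such calls matters.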

Embodied navigation is a fundamental capability that enables embodied agents to effectively interact with the physical world in various complex environments. 
However, a significant gap remains between current embodied navigation tasks and real-world requirements, as existing methods often struggle to integrate high-level human instructions with spatial understanding, which is essential for agents to perceive their surroundings, adapt to intricate layouts, and make informed decisions based on spatial relationships.
To address this gap, we propose a new task of embodied navigation called spatial navigation, which encompasses two key components: spatial object navigation (SpON) for object-specific guidance and spatial area navigation (SpAN) for navigating to designated areas. Specifically, SpON guides agents to specific objects by leveraging spatial relationships and contextual understanding, while SpAN focuses on navigating to defined areas within complex environments. Together, these components significantly enhance agents' navigation capabilities, enabling more effective interactions in real-world scenarios.
To support this task, we have generated a spatial navigation dataset consisting of 10,000 trajectories within the AI2THOR simulator, with 5,000 trajectories allocated to each component. This dataset includes high-level human instructions, detailed observations, and corresponding navigation actions, providing a comprehensive resource to enhance agent training and performance. By offering diverse scenarios and rich contextual information, this dataset aims to facilitate improved learning and adaptability for embodied agents in complex environments.
Building on the spatial navigation dataset, we introduce SpNav, a hierarchical navigation framework designed to embody the principle of "What You See is What You Reach." SpNav employs a vision-language model (VLM) to interpret high-level human instructions and accurately identify target objects or areas within the observation range. It subsequently achieves precise point-to-point navigation using a spatial map, thereby successfully completing the spatial navigation task. This framework enhances the agent's ability to operate effectively in complex environments, bridging the gap between perception and action.
Extensive experiments demonstrate that SpNav not only achieves state-of-the-art performance in spatial navigation tasks, surpassing all baseline methods, but also showcases remarkable zero-shot simulation-to-reality transfer capabilities, highlighting its potential for real-world deployment and practical applications in embodied AI.
To support ongoing research in this field, we will release the dataset, benchmark, and source code, enabling the community to build upon our work and explore new avenues for advancement.

What You See Is What You Reach: Towards Spatial Navigation with High-Level Human Instructions

Articulated objects are prevalent in daily life and robotic manipulation tasks. However, compared to rigid objects, pose tracking for articulated objects remains an underexplored problem due to their inherent kinematic constraints. To address these challenges, this work proposes a novel point-pair-based pose tracking framework, termed PPF-Tracker. The proposed framework first performs quasi-canonicalization of point clouds in the SE(3) Lie group space, and then models articulated objects using Point Pair Features (PPF) to predict pose voting parameters by leveraging the invariance properties of SE(3). Finally, semantic information of joint axes is incorporated to impose unified kinematic constraints across all parts of the articulated object. PPF-Tracker is systematically evaluated on both synthetic datasets and real-world scenarios, demonstrating strong generalization across diverse and challenging environments. Experimental results highlight the effectiveness and robustness of PPF-Tracker in multi-frame pose tracking of articulated objects. We believe this work can foster advances in robotics, embodied intelligence, and augmented reality. The complete codebase will be made publicly available.

Exploring Category-level Articulated Object Pose Tracking on SE(3) Manifolds
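As background for the Point Pair Features mentioned in the PPF-Tracker abstract, the classical four-dimensional PPF of two oriented points can be computed as below. This sketch shows the widely used textbook definition (pair distance plus three angles); the abstract does not give the exact formulation the framework uses, so treat the function names and toy points as assumptions.

```python
import math

def _sub(a, b):
    return [x - y for x, y in zip(a, b)]

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _norm(a):
    return math.sqrt(_dot(a, a))

def _angle(a, b):
    """Angle in radians between two non-zero vectors."""
    cos = _dot(a, b) / (_norm(a) * _norm(b))
    return math.acos(max(-1.0, min(1.0, cos)))  # clamp against rounding error

def point_pair_feature(p1, n1, p2, n2):
    """Classical PPF: (|d|, angle(n1, d), angle(n2, d), angle(n1, n2)), d = p2 - p1."""
    d = _sub(p2, p1)
    return (_norm(d), _angle(n1, d), _angle(n2, d), _angle(n1, n2))

def rigid_transform(p, t):
    """Rotate 90 degrees about the z-axis, then translate by t."""
    return [-p[1] + t[0], p[0] + t[1], p[2] + t[2]]

# Toy oriented points (assumed values).
p1, n1 = [0.0, 0.0, 0.0], [0.0, 0.0, 1.0]
p2, n2 = [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]
feature = point_pair_feature(p1, n1, p2, n2)

# Apply the same rigid motion to both points; normals are rotated only.
t, zero = [2.0, -1.0, 3.0], [0.0, 0.0, 0.0]
moved = point_pair_feature(rigid_transform(p1, t), rigid_transform(n1, zero),
                           rigid_transform(p2, t), rigid_transform(n2, zero))
```

The feature is unchanged by the rigid motion, which is the SE(3) invariance property that makes point pairs attractive for pose voting.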

Deep unrolling models (DUMs) have shown great potential in sparse-view CT reconstruction by combining iterative optimization and deep learning. 
However, most DUMs insufficiently account for physical degradation from sparse-view imaging, leading to slow convergence and persistent artifacts.
To address this, we propose PAUM, a Physics-Aware Accelerated Unrolling Model explicitly incorporating CT imaging physics into the iterative reconstruction.
PAUM introduces a Dual-Domain Physics-Aware Extrapolation (DDPE) module.
By modeling dual-domain degradations, it performs row-wise extrapolation in the sinogram domain to improve missing view recovery, and pixel-wise extrapolation in the image domain to address spatially variant degradation from incomplete backprojection.
This physics-aware extrapolation aligns optimization dynamics with underlying physical imaging degradation, significantly accelerating convergence.
Subsequently, we develop a lightweight Block-Attention Deformable Regularization Network (BDRN), leveraging deformable convolutions and block-wise attention to model spatially variant and structured artifact physical characteristics.
This enables spatially adaptive regularization on extrapolated results, effectively improving reconstruction quality.
Extensive experiments demonstrate that PAUM achieves over 1 dB PSNR improvement compared to SOTA methods, while reducing iteration count by 50%. Code will be released.

Physics-Aware Accelerated Unrolling Model for Sparse-View CT Reconstruction

Cardiac magnetic resonance (CMR) imaging is widely used to characterize cardiac morphology and function. To accelerate CMR imaging, various methods have been proposed to recover high-quality spatiotemporal CMR images from highly undersampled $k$-$t$ space data. However, current CMR reconstruction techniques either fail to achieve satisfactory image quality or are restricted by the scarcity of ground truth data, leading to limited applicability in clinical scenarios. 
In this work, we propose MoCo-INR, a new unsupervised method that integrates implicit neural representations (INR) with the conventional motion-compensated (MoCo) framework. Using the explicit motion modeling and the continuous prior of INRs, our MoCo-INR can produce accurate cardiac motion decomposition and high-quality CMR reconstruction. Moreover, we present a new INR network architecture tailored to the CMR problem, which can greatly stabilize model optimization.
Experiments on retrospective (*i.e.*, simulated) datasets demonstrate the superiority of MoCo‑INR over state‑of‑the‑art methods, achieving fast convergence and fine‑detailed reconstructions at ultra‑high acceleration factors (*e.g.*, 20$\times$ in VISTA sampling).
In addition, evaluations on prospective (*i.e.*, real-acquired) free-breathing CMR scans highlight its clinical practicality for real-time imaging. Several ablation studies also confirm the effectiveness of critical components of MoCo-INR. The code will be publicly released to improve reproducibility.

Unsupervised Motion-Compensated Decomposition for Cardiac MRI Reconstruction via Neural Representation

Video-based human pose estimation aims to localize keypoints across frames, enabling robust analysis of human motion in applications such as sports, surveillance, and healthcare. However, existing methods rely solely on visual cues, limiting their robustness in complex scenes involving occlusion, motion blur, or poor lighting. In contrast, dual coding theory from psychology suggests that human cognition is inherently multimodal: we learn by integrating visual perception with linguistic context to form structured, semantic understandings of the world. Visual input provides concrete spatiotemporal grounding, while language offers symbolic abstraction that enhances reasoning and generalization. Motivated by this cognitive principle, we present the first framework that explicitly incorporates language as an auxiliary modality to enhance video-based pose estimation. To address the lack of paired video-text datasets, we first employ a Multimodal Large Language Model (MLLM) to generate textual descriptions of human interactions from videos. We then propose a novel coarse-to-fine multimodal alignment pipeline: a cross-modal semantic interaction module establishes initial grounding between spatiotemporal visual features and textual embeddings, while an optimal transport-based feature matching mechanism enforces fine-grained, geometry-aware alignment. This cognitively inspired design enables more accurate and robust pose estimation, especially in visually challenging scenes like occlusion and motion blur. Extensive experiments on three benchmarks confirm that our method consistently outperforms state-of-the-art approaches. Our code is released and included in the supplementary materials.
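The optimal transport-based matching mentioned in the abstract above is commonly computed with Sinkhorn iterations on an entropically regularized problem. The sketch below shows that generic algorithm. The abstract does not say which solver is used, so the Sinkhorn solver, the uniform marginals, the regularization strength `eps`, and the toy cost matrix are all assumptions for illustration.

```python
import math

def sinkhorn(cost, eps=0.1, n_iters=200):
    """Entropic OT plan between uniform marginals via Sinkhorn iterations."""
    n, m = len(cost), len(cost[0])
    a, b = [1.0 / n] * n, [1.0 / m] * m  # uniform mass on both sides
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

# Toy cost (assumed): e.g. 1 - cosine similarity between two visual tokens
# (rows) and two text tokens (columns); matching pairs have zero cost.
cost = [[0.0, 1.0], [1.0, 0.0]]
plan = sinkhorn(cost)
```

The resulting plan places almost all of its mass on the low-cost diagonal, so each visual token is softly assigned to its best-matching text token while both sides keep their total mass.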
