Singapore

Property-constrained molecular generation and editing are crucial in AI-driven drug discovery but remain hindered by two factors: (i) capturing the complex relationships between molecular structures and multiple properties remains challenging, and (ii) the narrow range and incomplete annotations of molecular properties limit the effectiveness of models that rely heavily on property information. To tackle these limitations, we propose HSPAG, a data-efficient framework featuring hierarchical structure–property alignment. By treating SMILES and molecular properties as complementary modalities, the model learns their relationships at atom, substructure, and whole-molecule levels during pre-training, thereby implicitly encoding property information into molecular representations. Moreover, we select representative samples through scaffold clustering and hard samples via an auxiliary variational auto-encoder (VAE), substantially reducing the required pre-training data. In addition, we incorporate a property relevance-aware masking mechanism and diversified perturbation strategies to enhance generation quality under sparse annotations. Experimental results demonstrate that HSPAG effectively models complex molecular characteristics to capture fine-grained structure–property insights, enabling controllable molecular generation under multiple property constraints. Two real-world case studies further validate the editing capabilities of HSPAG, highlighting its practical potential in lead compound screening and optimization.

AAAI 2026

Hierarchical Structure-Property Alignment for Data-Efficient Molecular Generation and Editing

autoencoders

deep generative models

multimodal learning

self-supervised learning

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.


In few-shot learning, utilizing local and global geometric priors to capture both subtle local class metrics and coarse global structures within the meta-task is important for obtaining discriminative embeddings. However, existing graph-based and curvature-based few-shot approaches focus on only one kind of geometric prior and neglect the other. To effectively combine the strengths of these two paradigms, we propose a novel Dual-Geometry Graph Network (DGGN) to adaptively integrate the local and global geometric priors via two key pathways. Specifically, the local-wise metric modeling pathway utilizes Ollivier-Ricci curvature to capture task-specific local class metrics among the instances, and the global-wise connectivity modeling pathway utilizes resistive embedding to capture global instance distributions and connectivity patterns of the entire meta-task. In addition, we introduce two new regularization loss functions to explicitly enhance the geometric representation ability of the local and global pathways respectively. We validate that DGGN's superior performance stems from its adaptive topological refinements by measuring the graph edit distance, demonstrating its ability to match the underlying data distribution. Extensive experiments show that DGGN sets a new state-of-the-art on standard, cross-domain, and semi-supervised few-shot benchmarks. Code is available in our Supplementary Material.

Dual-Geometry Graph Network: Unifying Local and Global Priors for Few-Shot Learning
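The global pathway in the DGGN abstract relies on resistive embedding. Assuming this refers to the effective-resistance distance from spectral graph theory, the sketch below computes that distance on a toy unit-weight graph by grounding one node and solving the reduced Laplacian system. It is an illustration only, not the authors' implementation; `effective_resistance` and the toy graphs are hypothetical.

```python
def effective_resistance(n, edges, i, j):
    """Effective resistance between nodes i and j of a connected, unit-weight graph."""
    if i == j:
        return 0.0
    # Build the graph Laplacian L = D - A.
    lap = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        lap[u][u] += 1.0
        lap[v][v] += 1.0
        lap[u][v] -= 1.0
        lap[v][u] -= 1.0
    # Ground node j (potential 0): drop its row and column,
    # then inject one unit of current at node i (augmented column).
    keep = [v for v in range(n) if v != j]
    aug = [[lap[r][c] for c in keep] + [1.0 if r == i else 0.0] for r in keep]
    m = len(keep)
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        piv = aug[col][col]
        aug[col] = [x / piv for x in aug[col]]
        for r in range(m):
            if r != col and aug[r][col] != 0.0:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    # The potential at node i equals the effective resistance to the grounded node j.
    return aug[keep.index(i)][m]

# Path 0-1-2 versus a triangle: the extra path lowers the resistance between 0 and 1.
r_path = effective_resistance(3, [(0, 1), (1, 2)], 0, 1)
r_triangle = effective_resistance(3, [(0, 1), (1, 2), (0, 2)], 0, 1)
```

Unlike shortest-path distance, this quantity drops when additional routes connect two nodes, which is why it can reflect global connectivity rather than only direct edges.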

Large Multimodal Models (LMMs) have shown promising in-context learning (ICL) capabilities, but scaling to many-shot settings remains difficult due to limited context length and high inference cost. To address these challenges, task-vector-based methods have been explored by inserting compact representations of many-shot in-context demonstrations into model activations. However, existing task-vector-based methods either overlook the importance of where to insert task vectors or struggle to determine suitable values for each location. To this end, we propose a novel Sensitivity-aware Task Vector insertion framework (STV) to determine where and what to insert. Our key insight is that activation deltas across query-context pairs exhibit consistent structural patterns, providing a reliable cue for insertion. Based on the identified sensitivity-aware locations, we construct a pre-clustered activation bank for each location by clustering the activation values, and then apply reinforcement learning to choose the most suitable one to insert. We evaluate STV across a range of multimodal models (e.g., Qwen-VL, Idefics-2) and tasks (e.g., VizWiz, OK-VQA), demonstrating its effectiveness and showing consistent improvements over previous task-vector-based methods with strong generalization.

Where and What Matters: Sensitivity-Aware Task Vectors for Many-Shot Multimodal In-Context Learning

Effectively capturing multimodal co-occurrence signals, such as hand shapes, facial expressions, and body postures, is critical for semantic understanding in sign language recognition (SLR) and translation (SLT). Although skeleton data offer greater efficiency and robustness than RGB inputs, existing methods typically rely on pairwise graph structures, limiting their ability to model complex high-order interactions across body regions. To address this limitation, we propose HyperSign, a hierarchical hypergraph neural network that systematically captures high-order co-occurrence patterns among diverse body parts. The Co-occurrence Graph Perception Module jointly learns relational structures via three complementary pathways: (1) traditional graph convolutions for modeling physical joint connections, (2) dynamic geometric hypergraphs constructed via k-nearest neighbors to encode local spatial patterns, and (3) soft hypergraphs generated by learnable prototypes to reveal latent semantic associations. To further enhance structural modeling and semantic consistency, a Meta-Part Hypergraph Fusion Module abstracts feature streams from the hands, face, and body into unified hypergraph nodes, while leveraging empirically derived co-occurrence priors to model high-order cross-part dependencies. Moreover, an uncertainty-aware collaborative distillation mechanism guides the model to focus on critical body regions. Extensive experiments on standard SLR and SLT benchmarks (e.g., PHOENIX 2014, PHOENIX 2014T, and CSL Daily) demonstrate that HyperSign not only outperforms existing skeleton-based approaches in both speed and accuracy but also achieves competitive or superior results compared to several state-of-the-art RGB-based methods across multiple evaluation metrics.

HyperSign: Hierarchical Hypergraph-based Co-occurrence Modeling for Sign Language Recognition and Translation
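Pathway (2) of the HyperSign abstract builds geometric hypergraphs from k-nearest neighbors. One common construction, assumed here, forms one hyperedge per node from the node itself and its k nearest neighbors. The sketch below is an illustration under that assumption, not the authors' code; `knn_hyperedges`, `incidence_matrix`, and the toy 2-D joint coordinates are hypothetical.

```python
import math

def knn_hyperedges(points, k):
    """One hyperedge per point: the point itself plus its k nearest neighbors."""
    edges = []
    for i, p in enumerate(points):
        # Sort the other points by distance; ties are broken by index.
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: (math.dist(p, points[j]), j),
        )
        edges.append(frozenset([i] + others[:k]))
    return edges

def incidence_matrix(n_nodes, edges):
    """Rows are nodes, columns are hyperedges; entry is 1 if the node is in the edge."""
    return [[1 if v in e else 0 for e in edges] for v in range(n_nodes)]

# Two tight groups of three joints each (toy coordinates).
joints = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5, 6)]
hyperedges = knn_hyperedges(joints, k=2)
H = incidence_matrix(len(joints), hyperedges)
```

Each hyperedge here joins three joints at once, which a pairwise graph can only approximate with several separate edges.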

The static nature of knowledge within Large Language Models (LLMs) makes it difficult for them to adapt to evolving information, rendering knowledge editing a critical task. However, existing methods struggle with challenges of scalability and retrieval efficiency, particularly when handling complex, multi-hop questions that require multi-step reasoning. To address these challenges, this paper introduces ALEX (A Light Editing-knowledge Extractor), a lightweight knowledge editing framework. The core innovation of ALEX is its hierarchical memory architecture, which organizes knowledge updates (edits) into semantic clusters. This design fundamentally reduces retrieval complexity from a linear $O(N)$ to a highly scalable $O(K+N/C)$. Furthermore, the framework integrates an Inferential Query Synthesis (IQS) module to bridge the semantic gap between queries and facts, and a Dynamic Evidence Adjudication (DEA) engine that executes an efficient two-stage retrieval process. Experiments on the MQUAKE benchmark demonstrate that ALEX significantly improves both the accuracy of multi-hop answers (MultiHop-ACC) and the reliability of reasoning paths (HopWise-ACC). It also reduces the required search space by over 80\%, presenting a promising path toward building scalable, efficient, and accurate knowledge editing systems.

ALEX: A Light Editing-knowledge Extractor
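The retrieval-cost claim in the ALEX abstract can be illustrated with a generic two-stage lookup over clustered edits. This is a minimal sketch, not the authors' implementation: the abstract does not define $K$ or $C$, so the sketch simply assumes that the first stage scans the cluster centroids and the second stage scans only the members of the best-matching cluster. The `retrieve` function, the cluster names, and the toy 2-D vectors are all hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def retrieve(query, centroids, clusters):
    """Two-stage lookup: pick the closest centroid, then search only that cluster.

    Returns (best_edit_id, number_of_similarity_comparisons).
    """
    comparisons = 0
    # Stage 1: compare the query against every cluster centroid.
    best_cluster, best_sim = None, -2.0
    for cluster_id, centroid in centroids.items():
        comparisons += 1
        sim = cosine(query, centroid)
        if sim > best_sim:
            best_cluster, best_sim = cluster_id, sim
    # Stage 2: compare the query against the edits in the chosen cluster only.
    best_edit, best_sim = None, -2.0
    for edit_id, vec in clusters[best_cluster]:
        comparisons += 1
        sim = cosine(query, vec)
        if sim > best_sim:
            best_edit, best_sim = edit_id, sim
    return best_edit, comparisons

# Toy memory: 3 clusters with 3 edits each (9 edits in total).
centroids = {"a": (1.0, 0.0), "b": (0.0, 1.0), "c": (-1.0, 0.0)}
clusters = {
    "a": [("a1", (0.9, 0.1)), ("a2", (0.8, -0.3)), ("a3", (0.7, 0.4))],
    "b": [("b1", (0.1, 0.9)), ("b2", (-0.2, 0.8)), ("b3", (0.3, 0.7))],
    "c": [("c1", (-0.9, 0.1)), ("c2", (-0.8, -0.2)), ("c3", (-0.7, 0.3))],
}
edit, n_comparisons = retrieve((0.95, 0.05), centroids, clusters)
flat_scan_cost = sum(len(members) for members in clusters.values())
```

In this toy setup the clustered lookup makes 6 comparisons where a flat scan over all edits would make 9; the gap widens as the number of edits grows relative to the number of clusters.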

Multi-modal object Re-identification (ReID) aims to retrieve individuals by leveraging complementary information from different modalities. Recent CLIP-based approaches show promising results, but they usually employ prompt-based or hybrid prompt-adapter tuning and still face the problems of heterogeneous domain gap, fine-grained identity discrimination, and noisy instance interference. To address these problems, we introduce a novel Parameter-Efficient Fine-Tuning framework with Bag-of-Adapters (PEFT-BoA) based on the pre-trained CLIP's vision encoder for multi-modal object ReID. Specifically, we first propose a Domain-specific Patch Adapter (DPA) that automatically adapts and aligns visual features across different modalities at the local patch level. Meanwhile, we propose a Task-specific Class Adapter (TCA) that enhances the fine-grained identity discrimination ability by optimizing the global class token. Finally, we propose an Instance-specific Fusion Adapter (IFA) that dynamically selects and combines only the most useful features across different modalities for each instance. Our PEFT-BoA achieves better performance on multi-modal object re-identification benchmarks, while maintaining fewer trainable parameters (6.62M) and a higher training throughput (246.2 fps).

PEFT-BoA: Parameter-Efficient Fine-Tuning with Bag-of-Adapters for Multi-Modal Object Re-identification

Recent advances in extreme image compression have revealed that mapping pixel data into highly compact latent representations can significantly improve coding efficiency. However, most existing methods compress images into 2-D latent spaces via convolutional neural networks (CNNs) or Swin Transformers, which tend to retain substantial spatial redundancy, thereby limiting overall compression performance. In this paper, we propose a novel Mixed RWKV-Transformer (MRT) architecture that encodes images into more compact 1-D latent representations by synergistically integrating the complementary strengths of linear-attention-based RWKV and self-attention-based Transformer models. Specifically, MRT partitions each image into fixed-size windows, utilizing RWKV modules to capture global dependencies across windows and Transformer blocks to model local redundancies within each window. The hierarchical attention mechanism enables more efficient and compact representation learning in the 1-D domain. To further enhance compression efficiency, we introduce a dedicated RWKV Compression Model (RCM) tailored to the structural characteristics of the intermediate 1-D latent features in MRT. Extensive experiments on standard image compression benchmarks validate the effectiveness of our approach. The proposed MRT framework consistently achieves superior reconstruction quality at bitrates below 0.02 bits per pixel (bpp). Quantitative results based on the DISTS metric show that MRT significantly outperforms the state-of-the-art 2-D architecture GLC, achieving bitrate savings of $43.75\%$ and $30.59\%$ on the Kodak and CLIC2020 test datasets, respectively. The source code will be released soon.

MRT: Learning Compact Representations with Mixed RWKV-Transformer for Extreme Image Compression
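For a sense of scale, the 0.02 bpp threshold in the MRT abstract can be converted into a byte budget. The sketch below is plain arithmetic and not part of MRT; `byte_budget` is a hypothetical helper, and 768×512 is the resolution of the Kodak test images.

```python
def byte_budget(width, height, bpp):
    """Compressed size in bytes for an image coded at `bpp` bits per pixel."""
    total_bits = width * height * bpp
    return total_bits / 8

# A 768x512 Kodak image at 0.02 bpp fits in roughly one kilobyte.
kodak_bytes = byte_budget(768, 512, 0.02)
```

At this rate the entire image has to be described in about 983 bytes, which is why spatial redundancy in the latent representation matters so much in the extreme-compression regime.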

In the field of Explainable Constraint Solving, it is common to explain to a user why a problem is unsatisfiable.
A recently proposed method for this is to compute a sequence of explanation steps.
Such a step-wise explanation shows individual reasoning steps involving constraints from the original problem specification, that in the end explain the conflict.
However, computing a step-wise explanation is computationally expensive, limiting the scale of problems on which it can be used.
We investigate how we can use proofs generated by a constraint solver as a starting point for computing step-wise explanations, instead of computing them step-by-step.
More specifically, we define a framework of abstract proofs, in which both proofs and step-wise explanations can be represented.
We then propose several methods for converting a proof to a step-wise explanation sequence, with special attention to trimming and simplification techniques to keep the sequence and its individual steps small.
Our results show our method significantly speeds up the generation of step-wise explanation sequences, while the resulting step-wise explanation has a quality similar to the current state-of-the-art.

Using Certifying Constraint Solvers for Generating Step-wise Explanations

Text-to-Video (T2V) generation has advanced greatly, yet maintaining consistency remains challenging, especially for tuning-free long video generation. 
We attribute the consistency problem to cumulative deviations for long video generation at three levels: 
the random noise lacks correlation, which results in an initial deviation between frames; 
discrepancy in semantic feature tokens between denoising network blocks gradually accumulates as the frame count grows, leading to greater deviations;
attention mechanisms struggle to capture global relationships across distant frames in long videos. 
To address these, we propose FreeMem, a tuning-free framework leveraging hierarchical memory update and injection: 
the noise memory stabilizes consistency by manipulating low and high frequency components in the initial noise space; 
the token memory combats inconsistency through adaptive fusion of historical and current semantic feature tokens between denoising network blocks; 
and the attention memory establishes a persistent cache to model long-range relationships within self-attention layers. 
Evaluated on VBench, FreeMem improves subject and background consistency metrics across various methods, offering a practical solution for low-cost, high-consistency long video generation.

FreeMem: Enhancing Consistency in Long Video Generation via Tuning-Free Memory

Logical reasoning-based recommendation methods formulate logical expressions to characterize user-item interaction patterns, incorporating regularization constraints to ensure consistency with logical rules. However, these methods face two critical challenges: (1) As sequence length increases, they cannot effectively capture the dynamic transfer of user interests across subsequences (i.e., subsequence interest drift), thereby degenerating logical expressions to single-subsequence inference. (2) The time complexity of logical reasoning and rule learning scales quadratically with the sequence length, severely constraining computational efficiency in long-sequence recommendation. To address these challenges, we propose ELECTOR, an intErest-shift-aware long-sequence Logical reasoning for EffiCienT lOng-sequence Recommendation method. Specifically, we design a Subsequence Interest Learning Module (SIL) to model cross-subsequence interest drifts in long sequences. SIL employs a local attention mechanism to extract subsequence interests effectively and a global attention mechanism to capture the correlations among subsequence interests. Subsequently, we propose an Interest-aware Logical Reasoning (ILR) mechanism that performs logical reasoning using a limited set of subsequence and short-term interests, rather than reasoning over the entire sequence, significantly reducing time complexity. Additionally, ILR employs interest logical reasoning contrastive loss to ensure the model simultaneously considers multiple interests. Experiments on four real-world datasets demonstrate that our method significantly outperforms all baselines regarding computational efficiency and recommendation accuracy, confirming its effectiveness.

Interest-Shift-Aware Logical Reasoning for Efficient Long-Sequence Recommendation

Source-Free Object Detection (SFOD) aims to adapt a source-pretrained object detector to a target domain without access to source data. However, existing SFOD methods predominantly rely on internal knowledge from the source model, which limits their capacity to generalize across domains and often results in biased pseudo-labels, thereby hindering both transferability and discriminability. In contrast, Vision Foundation Models (VFMs), pretrained on massive and diverse data, exhibit strong perception capabilities and broad generalization, yet their potential remains largely untapped in the SFOD setting. In this paper, we propose a novel SFOD framework that leverages VFMs as external knowledge sources to jointly enhance feature alignment and label quality. Specifically, we design three VFM-based modules: (1) Patch-weighted Global Feature Alignment (PGFA) distills global features from VFMs using patch-similarity–based weighting to enhance global feature transferability; (2) Prototype-based Instance Feature Alignment (PIFA) performs instance-level contrastive learning guided by momentum-updated VFM prototypes; and (3) Dual-source Enhanced Pseudo-label Fusion (DEPF) fuses predictions from detection VFMs and teacher models via an entropy-aware strategy to yield more reliable supervision. Extensive experiments on six benchmarks demonstrate that our method achieves state-of-the-art SFOD performance, validating the effectiveness of integrating VFMs to simultaneously improve transferability and discriminability.
