Singapore

CLIP (Contrastive Language-Image Pre-training) has attracted widespread attention for its multimodal generalizable knowledge, which is significant for downstream tasks. However, the computational overhead of its large parameter count and large-scale pre-training makes it challenging to pre-train CLIP at different scales. Learngene extracts generalizable components, termed learngene, from an ancestry model and initializes diverse descendant models with them. Previous Learngene paradigms fail to handle generalizable knowledge in multimodal scenarios. In this paper, we put forward the idea of utilizing a multimodal block to extract the multimodal generalizable knowledge, which inspires us to propose MM-LG (Multimodal Learngene), a novel framework designed to extract and leverage generalizable components from CLIP. Specifically, we first establish multimodal and unimodal blocks to extract the multimodal and unimodal generalizable knowledge in a weighted-sum manner. Subsequently, we employ these components to numerically initialize descendant models of varying scales and modalities. Extensive experiments demonstrate MM-LG's effectiveness: it achieves performance gains over existing learngene approaches (e.g., +3.1% on Oxford-IIIT PET and +4.13% on Flickr30k) and comparable or superior results to the pre-training and fine-tuning paradigm (e.g., +1.9% on Oxford-IIIT PET and +3.65% on Flickr30k). Notably, MM-LG requires only around 25% of the parameter storage while reducing pre-training costs by around 2.8× for diverse model scales compared to the pre-training and fine-tuning paradigm, making it particularly suitable for efficient deployment across diverse downstream tasks.
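To make the "weighted-sum" extraction and the "numerical initialization" of descendants more concrete, here is a minimal sketch. It is not the MM-LG implementation: the abstract does not say how the weights are obtained or how blocks map onto descendant layers, so the function names `weighted_sum_block` and `init_descendant`, the toy parameter vectors, and the fixed weights are all assumptions made for illustration.

```python
from typing import List


def weighted_sum_block(layers: List[List[float]], weights: List[float]) -> List[float]:
    """Combine same-sized layer parameter vectors into one block via a weighted sum."""
    if len(layers) != len(weights):
        raise ValueError("need exactly one weight per layer")
    size = len(layers[0])
    if any(len(layer) != size for layer in layers):
        raise ValueError("all layers must have the same number of parameters")
    return [sum(w * layer[i] for w, layer in zip(weights, layers)) for i in range(size)]


def init_descendant(block: List[float], depth: int) -> List[List[float]]:
    """Initialize each layer of a descendant with its own copy of the block (assumed scheme)."""
    return [list(block) for _ in range(depth)]


# Toy ancestor with three layers of two parameters each (made-up numbers).
ancestor_layers = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
weights = [0.5, 0.25, 0.25]  # hypothetical mixing weights

block = weighted_sum_block(ancestor_layers, weights)
small_model = init_descendant(block, depth=2)
large_model = init_descendant(block, depth=6)
```

Under these assumptions only the block is stored, and descendants of different depths are built from it on demand, which is the general storage argument the abstract makes.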

AAAI 2026

Extracting Multimodal Learngene in CLIP: Unveiling the Multimodal Generalizable Knowledge

ml: multimodal learning

cv: multi-modal vision

ml: learning on the edge & model compression


poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held at Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.




While Transformers have revolutionized time series forecasting, they remain trapped by manual architecture design—every model uses the same attention mechanism, normalization, and activation choices. What if we could automatically discover the perfect architectural recipe for each dataset? This work introduces STrans (Spontaneous Transformer), a comprehensive neural architecture search framework for time series Transformers that simultaneously explores attention variants, normalization techniques, activation functions, and encoding operations. Using differentiable architecture search, STrans automatically discovers architectures that outperform manually designed baselines. However, the experiments reveal a surprising and counterintuitive finding: complex searched architectures often fail catastrophically, while simpler configurations generalize better. This "search overfitting" phenomenon challenges fundamental assumptions about automated architecture design in time series domains. The work not only advances automated model design but uncovers critical insights that will reshape how we think about neural architecture search for temporal data.

STrans: Spontaneous Architecture Evolution for Adaptive Time Series Forecasting

Adversarial patch attacks inject localized perturbations into images to mislead deep vision models. These attacks can be physically deployed, posing serious risks to real-world applications. In this paper, we propose CertMask, a certifiably robust defense that constructs a provably sufficient set of binary masks to neutralize patch effects with strong theoretical guarantees. While the state-of-the-art approach (PatchCleanser) requires two rounds of masking and incurs $O(n^2)$ inference cost, CertMask performs only a single round of masking with $O(n)$ time complexity, where $n$ is the cardinality of the mask set to cover an input image. Our proposed mask set is computed using a mathematically rigorous coverage strategy that ensures each possible patch location is covered at least $k$ times, providing both efficiency and robustness. We offer a theoretical analysis of the coverage condition and prove its sufficiency for certification. Experiments on ImageNet, ImageNette, and CIFAR-10 show that CertMask improves certified robust accuracy by up to 13.4% over PatchCleanser, while maintaining clean accuracy nearly identical to the vanilla model.

CertMask: Certifiable Defense Against Adversarial Patches via Theoretically Optimal Mask Coverage
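The coverage condition in the CertMask abstract ("each possible patch location is covered at least $k$ times") can be checked by brute force on a small grid. The sketch below is only an illustration under simplifying assumptions: square masks placed on a sliding window, a square patch, and a mask "covering" a patch when it hides it completely. It is not CertMask's mask construction, and `mask_offsets` and `min_coverage` are hypothetical helper names.

```python
from itertools import product
from typing import List


def mask_offsets(image: int, mask: int, stride: int) -> List[int]:
    """1-D start offsets of a sliding mask, plus a final mask flush with the border."""
    offsets = list(range(0, image - mask + 1, stride))
    if offsets[-1] != image - mask:
        offsets.append(image - mask)
    return offsets


def min_coverage(image: int, patch: int, mask: int, stride: int) -> int:
    """Smallest number of masks that fully hide a patch, over all patch locations."""
    offsets = mask_offsets(image, mask, stride)

    def hides(m: int, p: int) -> bool:
        # Along one axis, a mask starting at m hides a patch starting at p.
        return m <= p and p + patch <= m + mask

    worst = None
    for py, px in product(range(image - patch + 1), repeat=2):
        count = sum(
            1
            for my, mx in product(offsets, repeat=2)
            if hides(my, py) and hides(mx, px)
        )
        worst = count if worst is None else min(worst, count)
    return worst


# 8x8 image, 2x2 patch, 4x4 masks (toy sizes chosen for illustration).
n_masks = len(mask_offsets(8, 4, 3)) ** 2   # size n of the mask set
k_ok = min_coverage(8, 2, 4, stride=3)      # every patch location is hidden at least once
k_bad = min_coverage(8, 2, 4, stride=4)     # some patch location is never fully hidden
```

With a mask set of size `n_masks`, a single round of masking costs one forward pass per mask, which is where the $O(n)$ figure in the abstract comes from; a two-round scheme evaluates pairs of masks and so scales as $O(n^2)$.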

Progress in medical image segmentation is fundamentally constrained by the scarcity of annotated data. While diffusion models offer a promising solution by generating high-fidelity image–mask pairs, their utility for downstream tasks remains underexplored. A key bottleneck lies in the misalignment between generation outputs and task-specific needs—samples are produced independently of their utility for downstream training. To this end, we propose Value-Guided Diffusion (VGD), a lightweight sampling framework that integrates downstream model feedback into the generative inference process. VGD estimates a value score for each sample based on its utility to downstream training, and leverages this signal to iteratively guide the denoising trajectory toward high-reward regions of the data manifold. Crucially, VGD can be seamlessly integrated into existing medical diffusion models without any additional training or architectural modifications. Extensive experiments across multiple diffusion backbones and segmentation benchmarks demonstrate that VGD significantly boosts downstream segmentation performance while maintaining visual fidelity. Our findings highlight a task-aware sampling principle with potential to underpin future synthetic segmentation pipelines.

VGD: Value-Guided Diffusion Toward High-Utility Medical Image Segmentation

Preference learning has gained significant attention in tasks involving subjective human judgments, such as speech emotion recognition (SER) and image aesthetic assessment. While pairwise frameworks such as RankNet offer robust modeling of relative preferences, they are inherently limited to local comparisons and struggle to capture global ranking consistency. To address these limitations, we propose RankList, a novel listwise preference learning framework that generalizes RankNet to structured list-level supervision. Our formulation explicitly models local and non-local ranking constraints within a probabilistic framework. The paper introduces a log-sum-exp approximation to improve training efficiency. We further extend RankList with skip-wise comparisons, enabling progressive exposure to complex list structures and enhancing global ranking fidelity. Extensive experiments demonstrate the superiority of our method across diverse modalities. On benchmark SER datasets (MSP-Podcast, IEMOCAP, BIIC Podcast), RankList achieves consistent improvements in Kendall's Tau and ranking accuracy compared to standard listwise baselines. We also validate our approach on aesthetic image ranking using the Artistic Image Aesthetics dataset, highlighting its broad applicability. Through ablation and cross-domain studies, we show that RankList not only improves in-domain ranking but also generalizes better across datasets. Our framework offers a unified, extensible approach for modeling ordered preferences in subjective learning scenarios.

RankList – a Listwise Preference Learning Framework for Predicting Subjective Preferences
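For readers unfamiliar with the pairwise baseline that RankList generalizes, the sketch below shows the standard RankNet logistic loss on score differences, applied to pairs within a ranked list. It is not the RankList objective: the `max_skip` parameter is an assumed, simplified stand-in for the local versus non-local ("skip-wise") comparisons the abstract mentions, and the paper's log-sum-exp approximation is not reproduced here.

```python
import math
from typing import List


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def ranknet_pair_loss(s_preferred: float, s_other: float) -> float:
    """RankNet loss -log(sigmoid(s_preferred - s_other)) for one ordered pair."""
    return softplus(-(s_preferred - s_other))


def list_loss(scores: List[float], max_skip: int) -> float:
    """Mean pairwise loss over pairs (i, j) with 1 <= j - i <= max_skip.

    `scores` are model outputs for items listed from most to least preferred.
    max_skip=1 uses only adjacent pairs; larger values add non-local pairs.
    """
    n = len(scores)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, min(i + max_skip, n - 1) + 1)]
    if not pairs:
        raise ValueError("need at least two items and max_skip >= 1")
    return sum(ranknet_pair_loss(scores[i], scores[j]) for i, j in pairs) / len(pairs)


well_ordered = [3.0, 2.0, 1.0, 0.0]   # scores agree with the ground-truth order
reversed_order = [0.0, 1.0, 2.0, 3.0]  # scores contradict it

local_good = list_loss(well_ordered, max_skip=1)
local_bad = list_loss(reversed_order, max_skip=1)
global_good = list_loss(well_ordered, max_skip=3)
```

A model whose scores agree with the preference order gets a lower loss than one whose scores contradict it, and raising `max_skip` makes distant pairs contribute as well.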

Large Language Models (LLMs) are increasingly deployed in time-critical systems, such as robotics, autonomous driving, embodied intelligence, and industrial automation, where generating accurate responses within a given time budget is crucial for decision-making, control, or safety-critical tasks. However, the auto-regressive generation process of LLMs makes it challenging to model and estimate the end-to-end execution time. Furthermore, existing efficient inference methods based on a fixed key-value (KV) cache eviction ratio struggle to adapt to varying tasks with diverse time budgets, where an improper eviction ratio may lead to incomplete inference or a drop in response performance. In this paper, we propose TimeBill, a novel time-budgeted inference framework for LLMs that balances the inference efficiency and response performance. To be more specific, we propose a fine-grained response length predictor (RLP) and an execution time estimator (ETE) to accurately predict the end-to-end execution time of LLMs. Following this, we develop a time-budgeted efficient inference approach that adaptively adjusts the KV cache eviction ratio based on execution time prediction and the given time budget. Finally, through extensive experiments, we demonstrate the advantages of TimeBill in improving task completion rate and maintaining response performance under various overrun strategies.

TimeBill: Time-Budgeted Inference for Large Language Models
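The idea of adapting the KV cache eviction ratio to a time budget can be illustrated with a toy cost model. Everything numeric here is an assumption: `estimate_time` is a made-up linear latency model standing in for the paper's execution time estimator, the predicted response length is simply passed in rather than produced by a response length predictor, and the candidate ratios are arbitrary.

```python
from typing import Optional, Sequence


def estimate_time(
    prompt_tokens: int,
    response_tokens: int,
    eviction_ratio: float,
    prefill_ms_per_token: float = 0.5,          # hypothetical constant
    decode_ms_base: float = 10.0,               # hypothetical constant
    decode_ms_per_cached_token: float = 0.01,   # hypothetical constant
) -> float:
    """Toy end-to-end latency in ms: prefill plus one decode step per output token."""
    kept = prompt_tokens * (1.0 - eviction_ratio)  # prompt KV entries kept after eviction
    prefill = prefill_ms_per_token * prompt_tokens
    # Each decode step attends over the kept prompt entries and the tokens generated so far.
    decode = sum(
        decode_ms_base + decode_ms_per_cached_token * (kept + t)
        for t in range(response_tokens)
    )
    return prefill + decode


def pick_eviction_ratio(
    prompt_tokens: int,
    predicted_response_tokens: int,
    budget_ms: float,
    candidates: Sequence[float] = (0.0, 0.25, 0.5, 0.75, 0.9),
) -> Optional[float]:
    """Return the smallest candidate ratio whose estimated time fits the budget, else None."""
    for ratio in sorted(candidates):
        if estimate_time(prompt_tokens, predicted_response_tokens, ratio) <= budget_ms:
            return ratio
    return None


tight = pick_eviction_ratio(4000, 200, budget_ms=9000.0)
loose = pick_eviction_ratio(4000, 200, budget_ms=20000.0)
impossible = pick_eviction_ratio(4000, 200, budget_ms=4000.0)
```

Choosing the smallest ratio that still meets the budget keeps as much context as possible; when no candidate fits, the caller has to fall back to some overrun strategy, which this sketch leaves out.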

Large language models (LLMs) have achieved remarkable performance across a wide range of tasks. However, their substantial parameter sizes pose significant challenges for deployment on edge devices with limited computational and memory resources. Low-rank compression is a promising approach to address this issue, as it reduces both computational and memory costs, making LLMs more suitable for resource-constrained environments. Nonetheless, naïve low-rank compression methods require a significant reduction in the retained rank to achieve meaningful memory and computation savings. For a low-rank model, the ranks need to be reduced by more than half to yield efficiency gains. Such aggressive truncation, however, typically results in substantial performance degradation. To address this trade-off, we propose SkipCat, a novel low-rank compression framework that enables the use of higher ranks while achieving the same compression rates. First, we introduce an intra-layer shared low-rank projection method, where multiple matrices that share the same input use a common projection. This reduces redundancy and improves compression efficiency. Second, we propose a block skipping technique that omits computations and memory transfers for selected sub-blocks within the low-rank decomposition. These two techniques jointly enable our compressed model to retain more effective ranks under the same compression budget. Experimental results show that, without any additional fine-tuning, our method outperforms previous low-rank compression approaches by 7% in accuracy on zero-shot tasks under the same compression rate. These results highlight the effectiveness of our rank-maximized compression strategy in preserving model performance under tight resource constraints.

SkipCat: Rank-Maximized Low-Rank Compression of Large Language Models via Shared Projection and Block Skipping
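The claim that ranks must drop by more than half before low-rank compression saves anything follows from simple parameter counting: a dense `d_in x d_out` matrix has `d_in * d_out` parameters, while a rank-`r` factorization has `r * (d_in + d_out)`, so a square matrix breaks even at `r = d / 2`. The sketch below works through that arithmetic and through one possible reading of a shared projection, in which several matrices with the same input reuse a single down-projection. That reading, the helper names, and the sizes (`d = 4096`, three matrices, a 50% budget) are assumptions, not details taken from SkipCat.

```python
def dense_params(d_in: int, d_out: int) -> int:
    """Parameter count of a dense d_in x d_out weight matrix."""
    return d_in * d_out


def lowrank_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameter count of a rank-r factorization (d_in x r) @ (r x d_out)."""
    return rank * (d_in + d_out)


def shared_lowrank_params(d_in: int, d_out: int, rank: int, n_matrices: int) -> int:
    """One shared d_in x r down-projection plus one r x d_out up-projection per matrix."""
    return rank * d_in + n_matrices * rank * d_out


d, n = 4096, 3                                   # e.g. three projections fed by the same input
break_even_rank = dense_params(d, d) // (d + d)  # rank at which low-rank equals dense cost

budget = n * dense_params(d, d) // 2             # keep 50% of the dense parameters

# Largest rank that fits the budget when each matrix is factorized separately.
separate_rank = budget // (n * (d + d))

# Largest rank that fits the same budget when the down-projection is shared.
shared_rank = budget // (d + n * d)
```

In this toy setting the shared variant affords 1.5 times the rank of the separate factorization at the same parameter budget, which illustrates how sharing can retain more rank at a fixed compression rate.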

Designing molecules with desired properties, aka the oRiented molEcule Design (RED), is a fundamental task in chemistry and materials science. While graph diffusion models (GDMs) and reinforcement learning techniques (RL) show promise in molecule structure generation and property optimization stages individually, their integration in the unified RED task often suffers from poor compatibility. The large variance among candidate molecular structures generated by GDMs can be amplified in the iterative optimization process of RL, leading to slow and unstable convergence. In this work, motivated by the adaptive and divide-and-conquer characteristics of Mixture of Experts (MoE) architecture, we propose a novel framework called MoE-Guided Graph Diffusion Model (MEGD) that incorporates the MoE architecture to guide the orchestration of GDM and RL, promoting faster and more stable convergence in the design process. MEGD is evaluated on benchmark datasets optimizing the physical and chemical properties of AI-generated molecular structures. On all three datasets, our method outperforms the best of 9 alternative models by 7.73% on the target structural properties, while not penalizing other important application-level quality metrics of the generated molecules. A real-world case study on an emerging class of material, i.e., metal-organic framework, is also conducted, which further demonstrates the effectiveness of our method in accomplishing the RED task.

MoE-Guided Graph Diffusion for Oriented Molecule Design

We make three novel contributions to parameter learning and inference in probabilistic sentential decision diagrams (PSDDs). First, rather than traversing the entire PSDD during parameter learning for each dataset example, we pioneer the use of determinism to focus only on the activated partition. Second, we demonstrate how to prune deterministic computation in inference, thereby eliminating the need to propagate probability over every node in the network for each query. Third, we introduce a technique that parallelizes a single circuit evaluation, rather than parallelizing individual multiplications or layer-wise inference. For both learning and inference, experimental results on benchmark PSDDs from various application domains demonstrate state-of-the-art performance.

Paths Not Taken: Structure-Based Pruning in PSDD Learning and Inference

Image geo-localization aims to determine the geographic location of a query image. While Multimodal Large Language Models (MLLMs) show potential for this task due to their rich world knowledge and explainable abilities, they often struggle with confirmation bias, i.e., committing to early, potentially incorrect guesses caused by visual clues with varied geographic likelihoods. In this paper, we propose GeoBayes, a novel training-free framework that formulates geolocalization as a Maximum a Posteriori (MAP) estimation task over multiple geographic hypotheses and performs probabilistic thought via sequential Bayesian reasoning. GeoBayes treats each visual object and its associated geographic clues as probabilistic evidence, integrating them iteratively through a Hypothesize–Verify–Update loop. At each step, it evaluates how new evidence supports existing hypotheses and updates their posterior probabilities, gradually converging on the most probable location. This allows GeoBayes to explicitly quantify and fuse the varied geographic probabilities implied by various visual elements, reducing the risk of overcommitting to misleading clues. Furthermore, considering the natural hierarchy of geographic labels (e.g., country, city), GeoBayes introduces a state memory mechanism that stores hypotheses, inference context, and evidence scores across levels. This design enables the framework to propagate prior knowledge across levels of the geographic hierarchy and incorporate geographic structural constraints into the Bayesian update process, achieving a coarse-to-fine geo-localization. Experiments on IM2GPS3k and YFCC4K show that GeoBayes improves MLLM-based geo-localization accuracy without extra training. This demonstrates the effectiveness of probabilistic reasoning for robust and interpretable geo-localization.

GeoBayes: Probabilistic Image Geo-Localization Inference via Sequential Bayesian Updating
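The sequential Bayesian update at the core of the Hypothesize-Verify-Update loop is ordinary Bayes' rule applied once per piece of evidence, followed by a MAP choice. The sketch below shows that step in isolation. The hypotheses, the two visual clues, and all likelihood values are made up for illustration; in GeoBayes these would come from an MLLM, and the state memory and geographic hierarchy described in the abstract are not modeled here.

```python
from typing import Dict, List


def bayes_update(prior: Dict[str, float], likelihood: Dict[str, float]) -> Dict[str, float]:
    """Posterior over hypotheses: proportional to prior * likelihood, then normalized."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalized.values())
    if total <= 0.0:
        raise ValueError("evidence rules out every hypothesis")
    return {h: v / total for h, v in unnormalized.items()}


def map_estimate(posterior: Dict[str, float]) -> str:
    """Hypothesis with the highest posterior probability."""
    return max(posterior, key=posterior.get)


# Uniform prior over three country-level hypotheses.
belief = {"France": 1 / 3, "Japan": 1 / 3, "Brazil": 1 / 3}

# P(clue | country) for each observed clue (illustrative numbers only).
evidence: List[Dict[str, float]] = [
    {"France": 0.05, "Japan": 0.90, "Brazil": 0.05},  # vehicles drive on the left
    {"France": 0.90, "Japan": 0.30, "Brazil": 0.90},  # Latin-script signage
]

for likelihood in evidence:
    belief = bayes_update(belief, likelihood)

best = map_estimate(belief)
```

The second clue on its own favors France and Brazil, yet the posterior still favors Japan because the first clue was far more discriminative. Weighing clues of different strength this way, instead of committing to the first guess, is the behavior the abstract attributes to the method.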

This paper presents FAMDR, a Feature-Aligned Multimodal Denoising framework for Reliable Diagnostic Reconciliation. Existing approaches suffer from two major limitations: (1) an overemphasis on simplifying observational descriptions and (2) a failure to denoise the misleading content in radiological findings against clinical histories. Current methods often dismiss such cross-modal inconsistencies as noise rather than clinically significant signals. To bridge this gap, the framework integrates four synergistic components: (1) noise-aware multimodal alignment that preserves discriminative discrepancy features while ensuring semantic coherence, (2) cross-modal retrieval augmentation leveraging external medical knowledge to resolve ambiguous cases, (3) granular localization of noises at pixel and phrase levels using adaptive thresholding, and (4) medical noise uncertainty quantification to provide reliable confidence estimates. Evaluated on an extended MIMIC-CXR dataset enriched with expert-annotated noise and longitudinal records, FAMDR achieves superior accuracy in denoising and inconsistency localization while preserving clinical interpretability. Its capability to generate actionable, uncertainty-aware reports advances safer and more reliable integration into diagnostic workflows.


