Singapore

Recent advances in context optimization (CoOp) guided by large language model (LLM)–distilled medical semantic priors offer a scalable alternative to manual prompt engineering and full fine-tuning for adapting biomedical CLIP-based vision-language models (VLMs). However, prompt learning in this context is challenged by semantic misalignment between LLMs and CLIP variants due to divergent training corpora and model architectures; it further lacks scalability across continuously evolving families of foundation models. More critically, pairwise multimodal alignment via conventional Euclidean-space optimization lacks the capacity to model unified representations or apply localized geometric constraints, which tends to amplify modality gaps in complex biomedical imaging and destabilize few-shot adaptation. In this work, we propose vMFCoOp, a framework that inversely estimates von Mises–Fisher (vMF) distributions on a shared Hyperspherical Manifold, aligning semantic biases between arbitrary LLMs and CLIP backbones via Unified Semantic Anchors to achieve robust biomedical prompting and superior few-shot classification. Grounded in three complementary constraints, vMFCoOp demonstrates consistent improvements across 14 medical datasets, 12 medical imaging modalities, and 13 anatomical regions, outperforming state-of-the-art methods in accuracy, generalization, and clinical applicability.
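As a rough illustration of what fitting a vMF distribution to embeddings on a unit hypersphere involves, the sketch below estimates a mean direction and a concentration parameter with the closed-form approximation of Banerjee et al. (2005). It is not the vMFCoOp estimation procedure, which the abstract does not spell out; the function name `fit_vmf` and the toy data are assumptions made only for this example.

```python
import numpy as np

def fit_vmf(x):
    """Estimate a vMF mean direction and concentration from row vectors.

    Uses the approximation kappa ~= r * (d - r**2) / (1 - r**2),
    where r is the norm of the average of the unit-normalized samples.
    """
    x = np.asarray(x, dtype=float)
    # Project every sample onto the unit hypersphere.
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    d = x.shape[1]
    mean_vec = x.mean(axis=0)
    r = np.linalg.norm(mean_vec)   # mean resultant length, between 0 and 1
    mu = mean_vec / r              # mean direction on the sphere
    kappa = r * (d - r**2) / (1.0 - r**2)
    return mu, kappa

# Toy data (hypothetical): two clouds around the same anchor direction.
rng = np.random.default_rng(0)
anchor = np.array([1.0, 0.0, 0.0, 0.0])
tight = anchor + 0.05 * rng.normal(size=(200, 4))
loose = anchor + 0.50 * rng.normal(size=(200, 4))
mu_tight, kappa_tight = fit_vmf(tight)
mu_loose, kappa_loose = fit_vmf(loose)
```

A larger `kappa` means the samples concentrate more tightly around `mu`, so the tight cloud should yield a much larger concentration than the loose one.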

AAAI 2026

vMFCoOp: Towards Equilibrium on a Unified Hyperspherical Manifold for Prompting Biomedical VLMs

unified manifold learning

context optimization

vision-language models

equilibrium

technical paper

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held at Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.



Modern planning systems utilize various plan representations - sequential, parallel, partially ordered (PO), and partial-order causal link (POCL) - each with different models for concurrency.
These formalisms are often implicitly assumed to have the same base properties, particularly regarding makespan.
We challenge this assumption, proving the relationship between them is fundamentally asymmetric. Our analysis shows conversions from plans with rigid concurrency layers (sequential, parallel) to those with flexible partial orders (PO, POCL) can preserve makespan.
However, the reverse generally fails; the flexible orderings in PO/POCL plans can yield shorter makespans for solutions that cannot be represented in parallel plans without serialization.
We prove that finding an optimal parallel representation for a given POCL plan is NP-complete, resolving a key question about their practical interchangeability.
We also provide tight complexity bounds for makespan-bounded plan existence.
Notably, our results disprove a claim in the literature that planning graph-based planners maximize concurrency by minimizing the critical path in derived PO plans.
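To make the makespan comparison concrete, here is a minimal sketch that treats the makespan of a partially ordered plan as the length of its critical path (the longest chain of ordered actions). The plan, the unit durations, and the function name `po_makespan` are hypothetical; the paper's formal definitions may differ in detail.

```python
from graphlib import TopologicalSorter

def po_makespan(durations, orderings):
    """Critical-path length of a partially ordered plan.

    durations: dict mapping action -> duration
    orderings: iterable of (before, after) ordering constraints
    """
    preds = {a: set() for a in durations}
    for before, after in orderings:
        preds[after].add(before)
    finish = {}
    # Visit actions so that every predecessor is processed first.
    for a in TopologicalSorter(preds).static_order():
        start = max((finish[p] for p in preds[a]), default=0)
        finish[a] = start + durations[a]
    return max(finish.values(), default=0)

# Hypothetical plan: two independent chains a1 -> a2 -> a3 and b1 -> b2.
durations = {"a1": 1, "a2": 1, "a3": 1, "b1": 1, "b2": 1}
orderings = [("a1", "a2"), ("a2", "a3"), ("b1", "b2")]
sequential_makespan = sum(durations.values())  # one action per time step
po = po_makespan(durations, orderings)         # chains may run concurrently
```

In this toy case the sequential plan needs five steps, while the partial order only needs as many steps as its longest chain.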

Makespan Investigations of Sequential, Parallel, PO, and POCL Plans

In this paper, we rethink the modeling of agent behaviors in multi-agent reinforcement learning from a geometric-structure perspective.
Modeling agent behaviors is essential for understanding how agents interact and for facilitating effective decisions. The key lies in capturing the dependencies and sequential relationships among agent decisions. Since each decision influences subsequent choices, these interdependencies form a hierarchical, nested, tree-like structure. However, modeling tree-like data in Euclidean space can cause distortion, which results in a loss of information about the structure of agent decisions. Motivated by this, we model agent behaviors in hyperbolic space and propose the Hyperbolic Multi-Agent Representations (HMAR) method, which projects agent behaviors into a Poincaré ball and leverages hyperbolic neural networks to learn agent policy representations. Additionally, we design a contrastive loss function to train this network, minimizing the distance in feature space between different representations of the same agent while maximizing the distance between representations of distinct agents. Experimental results provide empirical evidence for the effectiveness of the HMAR method in cooperative and competitive environments, demonstrating the potential of hyperbolic agent representations for effective decision-making in multi-agent environments.
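For readers unfamiliar with the Poincaré ball, the following sketch shows two standard building blocks: the exponential map at the origin, which sends a Euclidean vector into the ball, and the geodesic distance between two points in the unit ball. These are textbook formulas, not the HMAR implementation; the function names are chosen only for illustration.

```python
import numpy as np

def expmap0(v, c=1.0):
    """Exponential map at the origin of the Poincare ball with curvature -c."""
    v = np.asarray(v, dtype=float)
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return v
    scaled = np.sqrt(c) * norm
    return np.tanh(scaled) * v / scaled

def poincare_distance(x, y):
    """Geodesic distance between two points of the unit Poincare ball (c = 1)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    sq_dist = np.sum((x - y) ** 2)
    denom = (1.0 - np.sum(x ** 2)) * (1.0 - np.sum(y ** 2))
    return np.arccosh(1.0 + 2.0 * sq_dist / denom)

# A Euclidean feature vector of norm 0.5 mapped into the ball.
point = expmap0([0.3, 0.4])
dist_from_origin = poincare_distance(np.zeros(2), point)
```

With `c = 1`, the distance from the origin to `expmap0(v)` equals twice the Euclidean norm of `v`, which is a handy sanity check when wiring these functions into a contrastive loss.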

Exploiting Geometric Structures for Modeling Multi-Agent Behaviors: A New Thinking

Chain-of-Thought (CoT) prompting has recently shown significant promise across various NLP and computer vision tasks by explicitly generating intermediate reasoning steps. However, we find that reinforcement learning (RL)-based fine-tuned CoT reasoning can paradoxically degrade performance in Visual Grounding tasks, particularly as CoT outputs become lengthy or complex. Additionally, our analysis reveals that increased dataset size does not always enhance performance due to varying data complexities. Motivated by these findings, we propose Curriculum-based Relative Policy Optimization (CuRPO), a novel training strategy that leverages CoT length and generalized Intersection over Union (gIoU) rewards as complexity indicators to progressively structure training data from simpler to more challenging examples. Extensive experiments on RefCOCO, RefCOCO+, RefCOCOg, and LISA datasets demonstrate the effectiveness of our approach. CuRPO consistently outperforms existing methods, including Visual-RFT, with notable improvements of up to +15.49 mAP on RefCOCO. Moreover, CuRPO exhibits exceptional efficiency and robustness, delivering strong localization performance even in few-shot learning scenarios, particularly benefiting tasks characterized by ambiguous and intricate textual descriptions.
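The sketch below shows the standard generalized IoU for axis-aligned boxes, followed by one possible way to order samples from simple to hard. The gIoU formula is the usual one; the ordering key (shorter CoT first, then higher gIoU) and the sample records are assumptions made for illustration and are not taken from the CuRPO paper.

```python
def giou(box_a, box_b):
    """Generalized IoU of two boxes given as (x1, y1, x2, y2) with positive area."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    # Intersection rectangle (zero if the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = area_a + area_b - inter
    # Smallest axis-aligned box enclosing both inputs.
    enclose = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return inter / union - (enclose - union) / enclose

# Hypothetical training records with a CoT length and a gIoU reward each.
samples = [
    {"id": "s1", "cot_len": 120, "giou": 0.35},
    {"id": "s2", "cot_len": 40, "giou": 0.90},
    {"id": "s3", "cot_len": 40, "giou": 0.60},
]
# Assumed difficulty proxy: short reasoning and high reward count as easy.
curriculum = sorted(samples, key=lambda s: (s["cot_len"], -s["giou"]))
curriculum_ids = [s["id"] for s in curriculum]
```

Unlike plain IoU, gIoU stays informative for non-overlapping boxes because the enclosing-box term penalizes boxes that are far apart.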

Start Small, Think Big: Curriculum-based Relative Policy Optimization for Visual Grounding

Zero-Shot Relation Triplet Extraction (ZSRTE) aims to extract head-tail entity pairs and their corresponding relations from sentences, where the relations available during inference are not seen during training. Existing methods typically assume that entities are continuous; however, in practice, entities can be discontinuous, which poses challenges to these approaches. To address this issue, we are the first to discuss and study the ZSRTE task involving discontinuous entities, and propose an innovative BoG framework, which is based on our proposed Boundary Token Graph structure. This method first predicts and adds edges between boundary tokens of (dis)continuous entities to construct a token graph, and then innovatively transforms the relation triplet extraction task into a process of finding paths in the graph. Additionally, we design a Boundary Token-Aware Prompt for each relation to further enhance the interaction between boundary tokens and relation semantics. Experimental results on four ZSRTE datasets—with or without discontinuous entities—consistently demonstrate that our method outperforms previous approaches, achieving state-of-the-art results.

A Boundary Token Graph for Zero-Shot Relation Triplet Extraction Involving Discontinuous Entities

Cross-modal hashing has emerged as a pivotal solution for efficient retrieval across diverse modalities, such as images and texts, by mapping them into compact binary hash spaces. However, in real-world scenarios, modality data is often missing or misaligned. Most existing methods rely on fully paired training data and ignore missing or misaligned modality data, resulting in semantic inconsistencies. To address these challenges, we propose an Adaptive Graph Attention-Based Discrete Hashing (AGADH) method, which consists of three parts. First, to solve the problem of missing modalities, AGADH employs a masked completion strategy to reconstruct them. Second, to mitigate semantic misalignment, AGADH leverages a Graph Attention Network (GAT) encoder-decoder architecture with an alignment module to construct features from different modalities. Third, to enhance fusion performance, an adaptive fusion module dynamically adjusts the contributions of the image and text modalities with learnable weighting coefficients. Extensive experiments on three benchmark datasets, MS-COCO, NUS-WIDE, and MIRFlickr-25K, demonstrate that AGADH outperforms state-of-the-art methods in both fully paired and incompletely paired scenarios, showing its robustness and effectiveness in cross-modal retrieval tasks.

Adaptive Graph Attention Based Discrete Hashing for Incomplete Cross-modal Retrieval

Large vision-language models (LVLMs) have demonstrated remarkable capabilities in understanding multimodal data such as images and text. However, the number of visual tokens in these models often far exceeds that of textual tokens, resulting in substantial redundancy and high inference costs. Existing pruning methods primarily rely on either unimodal information or cross-modal attention mechanisms. The former often overlooks the semantic alignment between instructions and visual representations in the multimodal space, while the latter is prone to attention drift and dispersion, leading to significant performance degradation under high pruning ratios. All the above issues stem from the lack of effective textual guidance during the pruning process. To identify effective informational cues for guiding pruning, we conduct an in-depth analysis of the interaction between language instructions and visual features based on the cross-modal information bottleneck attribution (CIBA) method, revealing the presence of noun anchors. Based on this analysis, we propose the Instruction-Guided Cross-Modal Clustering Token Pruning (ICCTP) method, a plug-and-play, training-free pruning paradigm. Specifically, ICCTP first leverages global attention to retain a small set of visual tokens that preserve global context. It then extracts nouns from the instruction as clustering centers to perform cross-modal clustering over the remaining visual tokens. To balance semantic diversity and global relevance while reducing intra-cluster redundancy, we design an importance scoring mechanism. Finally, visual tokens within each cluster are pruned according to a specified pruning ratio. We evaluate ICCTP on multiple VLM architectures, including LLaVA-1.5-7B, LLaVA-1.5-13B, and LLaVA-NeXT-7B. Experimental results show that ICCTP maintains strong performance across various pruning rates without requiring retraining. 
Notably, even under an extreme setting where 94.4% of visual tokens are removed, ICCTP retains 90.02% of the original accuracy while reducing TFLOPs by 82.36%.
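A heavily simplified sketch of the clustering-then-pruning idea: each visual token is assigned to its most similar noun anchor by cosine similarity, and only the best-scoring fraction of every cluster is kept. The global-attention step and the paper's importance score are omitted; the function name, the similarity-based score, and the toy embeddings are assumptions for this example only.

```python
import numpy as np

def prune_by_noun_anchors(visual, anchors, keep_ratio):
    """Return sorted indices of visual tokens kept after per-cluster pruning.

    visual:  (n_tokens, dim) visual token embeddings
    anchors: (n_anchors, dim) embeddings of nouns from the instruction
    """
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    sim = v @ a.T                  # cosine similarity, (n_tokens, n_anchors)
    cluster = sim.argmax(axis=1)   # nearest anchor for each token
    score = sim.max(axis=1)        # stand-in importance score (assumption)
    kept = []
    for k in range(a.shape[0]):
        idx = np.flatnonzero(cluster == k)
        if idx.size == 0:
            continue
        n_keep = max(1, int(round(keep_ratio * idx.size)))
        # Keep the n_keep tokens most similar to this cluster's anchor.
        top = idx[np.argsort(-score[idx])[:n_keep]]
        kept.extend(top.tolist())
    return sorted(kept)

# Toy example: four 2-D tokens and two noun anchors.
visual = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
anchors = np.array([[1.0, 0.0], [0.0, 1.0]])
kept = prune_by_noun_anchors(visual, anchors, keep_ratio=0.5)
```

Pruning inside each cluster, rather than globally, keeps at least one token per instruction noun that attracts any tokens, which is one way to preserve semantic diversity at high pruning ratios.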

Instruction-Guided Cross-Modal Clustering for Training-Free Visual Token Pruning in Vision-Language Models

Urban region embedding, which learns dense vector representations for urban zones, plays a foundational role in data-driven urban intelligence. These representations are critical for downstream applications like public safety management and infrastructure development, which require a nuanced understanding of urban functionality. A core challenge remains the effective fusion of multi-view data (e.g., human mobility flows and static regional attributes) into unified zone representations. To this end, we propose MVJC, a Multi-view Joint Learning and Contrastive Learning framework, which employs (1) a Multi-view Joint Learning (MVJL) layer that models intra-view dependencies to extract view-specific features and (2) a Multi-view Contrastive Learning (MVCL) layer that performs cross-region aggregation to derive consensus representations while capturing regional complementarity. We further introduce a structure-aware contrastive loss that mitigates false negatives by aligning representations through region topology instead of instance identity. Extensive experiments on New York City datasets demonstrate MVJC's superiority: it reduces crime prediction MAE by 9.1% (vs. 66.9 baseline) and improves land use clustering F-measure by 55.6% (vs. 0.45 baseline) over the state-of-the-art method. We attribute this to MVJC's synergy of joint and contrastive learning, which yields representations that are simultaneously predictive and semantically discriminative.

Comprehensive Urban Region Representation Learning via Multi-View Joint Learning and Contrastive Learning

Layer pruning has emerged as a promising technique for compressing large language models (LLMs) while achieving acceleration proportional to the pruning ratio. In this work, we identify that removing any layer induces a significant magnitude gap in hidden states, resulting in substantial performance degradation. To address this issue, we propose Prune&Comp, a novel plug-and-play layer pruning scheme that leverages magnitude compensation to mitigate such gaps in a training-free manner. Specifically, we first estimate the magnitude gap caused by layer removal and then eliminate this gap by rescaling the remaining weights offline, with zero runtime overhead incurred. We further demonstrate the advantages of Prune&Comp through an iterative pruning strategy. When integrated with an iterative prune-and-compensate loop, Prune&Comp consistently enhances existing layer pruning metrics. For instance, when 5 layers of LLaMA-3-8B are pruned with the prevalent Taylor+ metric, Prune&Comp reduces PPL from 512.78 to 16.34 and retains 90.57% of the original performance across 9 question-answering tasks, outperforming the baseline by 24.72%.

Prune&Comp: Free Lunch for Layer-Pruned LLMs via Iterative Pruning with Magnitude Compensation

Tournaments are widely used models to represent pairwise dominance between candidates, alternatives, or teams. We study the problem of providing certified explanations for why a candidate appears among the winners under various tournament rules. To this end, we identify minimal supports: minimal sub-tournaments in which the candidate is guaranteed to win regardless of how the rest of the tournament is completed (that is, the candidate is a necessary winner of the sub-tournament). This notion corresponds to an abductive explanation, a central concept in formal explainable AI, for the question "Why does the winner win the tournament?" We focus on common tournament solutions: the top cycle, the uncovered set, the Copeland rule, the maximin rule, the weighted uncovered set, and the Borda rule. For each rule, we determine the size of the smallest minimal supports, and we present polynomial-time algorithms to compute them for all rules except the weighted uncovered set, for which the problem is NP-complete. Finally, we show how minimal supports can serve to produce compact, certified, and intuitive explanations.
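To illustrate the necessary-winner notion behind supports, the brute-force sketch below checks whether a candidate is a Copeland (co-)winner in every completion of a partially specified tournament. It enumerates all completions, so it is exponential and only suitable for tiny examples; it is not one of the paper's polynomial-time algorithms, and the function names and the four-candidate example are hypothetical.

```python
from itertools import combinations, product

def copeland_scores(candidates, edges):
    """Copeland score = number of pairwise wins; edges are (winner, loser)."""
    scores = {c: 0 for c in candidates}
    for winner, _loser in edges:
        scores[winner] += 1
    return scores

def is_necessary_copeland_winner(candidates, known_edges, target):
    """True if target has a maximal Copeland score in every completion."""
    decided = {frozenset(e) for e in known_edges}
    open_pairs = [p for p in combinations(candidates, 2)
                  if frozenset(p) not in decided]
    # Try both orientations of every undecided pair.
    for choice in product((0, 1), repeat=len(open_pairs)):
        extra = [(p[i], p[1 - i]) for p, i in zip(open_pairs, choice)]
        scores = copeland_scores(candidates, list(known_edges) + extra)
        if scores[target] < max(scores.values()):
            return False
    return True

candidates = ["a", "b", "c", "d"]
# "a" beats everyone: no completion of the other matches can change that.
full_support = [("a", "b"), ("a", "c"), ("a", "d")]
# Without the result against "d", "d" could still beat all three others.
too_small = [("a", "b"), ("a", "c")]
supported = is_necessary_copeland_winner(candidates, full_support, "a")
unsupported = is_necessary_copeland_winner(candidates, too_small, "a")
```

A set of match results that passes this check acts as a certificate: the remaining matches can be ignored when explaining why the candidate wins.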

Explaining Tournament Solutions with Minimal Supports

Signal Temporal Logic (STL) is a powerful formal language for specifying real-time specifications of Cyber-Physical Systems (CPS). Transforming specifications written in natural language into STL formulas automatically has attracted increasing attention. Existing rule-based methods depend heavily on rigid pattern matching and domain-specific knowledge, limiting their generalizability and scalability. Recently, Supervised Fine-Tuning (SFT) of large language models (LLMs) has been successfully applied to transform natural language into STL. However, the lack of fine-grained supervision on atomic proposition correctness, semantic fidelity, and formula readability often leads SFT-based methods to produce formulas misaligned with the intended meaning. To address these issues, we propose RESTL, a reinforcement learning (RL)-based framework for the transformation from natural language to STL. RESTL introduces multiple independently trained reward models that provide fine-grained, multi-faceted feedback from four perspectives, i.e., atomic proposition consistency, semantic alignment, formula succinctness, and symbol matching. These reward models are trained with a curriculum learning strategy to improve their feedback accuracy, and their outputs are aggregated into a unified signal that guides the optimization of the STL generator via Proximal Policy Optimization (PPO). Experimental results demonstrate that RESTL significantly outperforms state-of-the-art methods in both automatic metrics and human evaluations. The code is available in the supplementary material.
