Singapore

Foundation models have revolutionized artificial intelligence across numerous domains, yet their transformative potential remains largely untapped in Extreme Multi-label Classification (XMC). Queries in XMC are associated with relevant labels from extremely large label spaces, where it is critical to strike a balance between efficiency and performance. Therefore, many recent approaches efficiently pose XMC as a maximum inner product search between embeddings learned from small encoder-only transformer architectures. In this paper, we address two important aspects in XMC: how to effectively harness larger decoder-only models, and how to exploit visual information while maintaining computational efficiency. We demonstrate that both play a critical role in XMC separately and can be combined for improved performance. We show that a few billion-size decoder model can deliver substantial improvements while keeping computational overhead manageable. Furthermore, our Vision-enhanced eXtreme Multi-label Learning framework (ViXML) efficiently integrates foundation vision models by pooling a single embedding per image. This limits computational growth while unlocking multi-modal capabilities. Remarkably, ViXML with small encoders outperforms text-only decoder models in most cases, showing that an image is worth billions of parameters. Finally, we present an extension of existing text-only datasets to exploit visual metadata and make them available to the community for future benchmarking. Comprehensive experiments across four public text-only datasets and their corresponding image enhanced versions validate our proposals&#39; effectiveness, surpassing previous state-of-the-art methods by up to +8.21\% in P@1 on the largest dataset.

AAAI 2026

Large Language Models Meet Extreme Multi-label Classification: Scaling and Multi-modal Framework

text embeddings

multi-modal embeddings

extreme classification

contrastive learning

Foundation models have revolutionized artificial intelligence across numerous domains, yet their transformative potential remains largely untapped in Extreme Multi-label Classification (XMC). Queries in XMC are associated with relevant labels from extremely large label spaces, where it is critical to strike a balance between efficiency and performance. Therefore, many recent approaches efficiently pose XMC as a maximum inner product search between embeddings learned from small encoder-only transformer architectures. In this paper, we address two important aspects in XMC: how to effectively harness larger decoder-only models, and how to exploit visual information while maintaining computational efficiency. We demonstrate that both play a critical role in XMC separately and can be combined for improved performance. We show that a few billion-size decoder model can deliver substantial improvements while keeping computational overhead manageable. Furthermore, our Vision-enhanced eXtreme Multi-label Learning framework (ViXML) efficiently integrates foundation vision models by pooling a single embedding per image. This limits computational growth while unlocking multi-modal capabilities. Remarkably, ViXML with small encoders outperforms text-only decoder models in most cases, showing that an image is worth billions of parameters. Finally, we present an extension of existing text-only datasets to exploit visual metadata and make them available to the community for future benchmarking. Comprehensive experiments across four public text-only datasets and their corresponding image enhanced versions validate our proposals' effectiveness, surpassing previous state-of-the-art methods by up to +8.21\% in P@1 on the largest dataset.

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.<br><br>

To access this event page, you need to log in with the **email address you registered with**. <br>Access credentials will be sent to your email from Underline -  subject line "Welcome to AAAI 2026". Please be sure to check your spam email folder if you do not see an email confirmation right away.

Please log in

To access this event page, you are required to register.
Please complete your registration to continue.

We recommend reading [**the registration information**](https://aaai.org/conference/aaai/aaai-26/registration/) first.

**Online Registration Form**: https://aaai.getregistered.net/conference-2026 

Registration Required

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

Prediction of pedestrian behavior is crucial for autonomous driving systems and intelligent transportation.Conventional methods predict the behavior based solely on either the pedestrian intention or the distance-related interactions between the pedestrian and its surroundings. However, these methods overlook the associations between intention and interaction for behavior prediction, in which they should be aligned with each other, thus leading to sub-optimal predictions. To solve this problem, we propose to predict the behavior by learning the association between intention and interaction, enabling them to mutually enhance each other during the prediction. Specifically, we first predict the short-term intention of all objects, including the target pedestrian and its surroundings.Then, instead of using the distance-related interactions, we predict the interactions by learning the correlated intentions. Finally, the intention-driven interactions refine the initial intention prediction, thus ensuring the alignment between intention and interaction for behavior prediction. We evaluate our method on two downstream tasks, the pedestrian trajectory prediction and pedestrian intention estimation, and show that it outperforms all the existing methods.

Understanding Interaction as You Need: Intention-Driven Pedestrian Behavior Prediction

Large-scale vision-language models (VLMs) embedded with expansive representations and visual concepts have showcased significant potential in image and text understanding. Efficiently adapting VLMs such as CLIP to downstream tasks like few-shot image classification has garnered growing attention, with prompt learning emerging as a representative approach. However, most existing prompt-based adaptation methods, which rely solely on coarse-grained textual prompts, suffer from limited performance and interpretability when handling domain tasks that require specific knowledge. This results in a failure to satisfy the stringent trustworthiness requirements of Explainable Artificial Intelligence (XAI) in high-risk scenarios like healthcare. To address this issue, we propose a Knowledge-Enhanced Explainable Prompting (KEEP) framework that leverages fine-grained domain-specific knowledge to enhance the adaptation process of VLMs across various domains and image modalities. By incorporating retrieval augmented generation and domain foundation models, our framework can provide more reliable image-wise knowledge for prompt learning in various domains, alleviating the lack of fine-grained annotations, while offering both visual and textual explanations. Extensive experiments and explainability analyses conducted on eight datasets of different domains and image modalities demonstrate that our method simultaneously achieves superior performance and interpretability, highlighting the effectiveness of the collaboration between foundation models and XAI. The code will be made publicly available.

Knowledge-Enhanced Explainable Prompting for Vision-Language Models

Binary networked public goods games model situations in which players can choose whether to participate in an action at some cost which benefits players in their immediate vicinity within some typically social or infrastructural network.
An important underlying assumption for this model is that participation in an action impacts *the entire* vicinity of participating players.
However, there are numerous natural settings in which participation influences only a subset of the neighbors and is in fact more "interaction-specific''.
In this work, we introduce a type of game that is more appropriate in such settings. 
We initiate the investigation of these games, by studying the complexity of deciding existence of their Nash equilibria in general and with respect to well-motivated structural restrictions on the network.
The outcome is a comprehensive understanding of the complexity of computing Nash equilibria with respect to any combination of three natural properties of the network structure.

Edge-Binary Public Goods Games

Step-wise explanations can explain logic puzzles and other satisfaction problems by showing how to derive decisions step by step. Each step consists of a set of constraints that derive an assignment to one or more decision variables. However, many candidate explanation steps exist, with different sets of constraints and different decisions they derive. To identify the most comprehensible one, a user-defined objective function is required to quantify the quality of each step. However, defining a good objective function is challenging. Here, interactive preference elicitation methods from the wider machine learning community can offer a way to learn user preferences from pairwise comparisons. We investigate the feasibility of this approach for step-wise explanations and address several limitations that distinguish it from elicitation for standard combinatorial problems. First, because the explanation quality is measured using multiple sub-objectives that can vary a lot in scale, we propose two dynamic normalization techniques to rescale these features and stabilize the learning process. We also observed that many generated comparisons involve similar explanations. For this reason, we introduce MACHOP (Multi-Armed CHOice Perceptron), a novel query generation strategy that integrates non-domination constraints with upper confidence bound-based diversification. We evaluate the elicitation techniques on Sudokus and Logic-Grid puzzles using artificial users, and validate them with a real-user evaluation. In both settings, MACHOP consistently produces higher-quality explanations than the standard approach.

Preference Elicitation for Step-Wise Explanations in Logic Puzzles

Stringent regulations like General Data Protection Regulation (GDPR) mandate that an application's code-level data handling must align with its natural-language privacy policy, creating a critical auditing challenge. However, existing methods, predominantly reliant on static analysis, suffer from a critical limitation: in their pursuit of soundness via over-approximation, they exhibit "semantic blindness"—detecting what data flows exist but not why. This leads to an overwhelming volume of false positives, rendering automated auditing impractical. To bridge this gap, we introduce PriAgent, a novel framework that approaches compliance auditing as a multi-stage, AI-driven reasoning task. Instead of a monolithic model, PriAgent deploys a team of specialized agents that execute a divide-and-conquer strategy. They systematically prune the analysis space by abstracting data flows, pinpoint semantic loci critical for inspection, and perform on-demand summarization of large code blocks to ensure scalability. PriAgent leverages Retrieval-Augmented Generation (RAG) with a curated knowledge base of Android APIs, equipping agents to discern potentially non-compliant behavior from benign functionality. By correlating code-level evidence with the app's stated privacy policy, PriAgent delivers a holistic and explainable verdict for each potential violation. Our evaluations demonstrate that PriAgent significantly reduces false positives, enabling a more scalable and precise compliance audit.

PriAgent: A Collaborative Multi-Agent Framework for Auditing Android Privacy Compliance

Property-constrained molecular generation and editing are crucial in AI-driven drug discovery but remain hindered by two factors: (i) capturing the complex relationships between molecular structures and multiple properties remains challenging, and (ii) the narrow range and incomplete annotations of molecular properties limit the effectiveness of models that rely heavily on property information. To tackle these limitations, we propose HSPAG, a data-efficient framework featuring hierarchical structure–property alignment. By treating SMILES and molecular properties as complementary modalities, the model learns their relationships at atom, substructure, and whole-molecule levels during pre-training, thereby implicitly encoding property information into molecular representations. Moreover, we select representative samples through scaffold clustering and hard samples via an auxiliary variational auto-encoder (VAE), substantially reducing the required pre-training data. In addition, we incorporate a property relevance-aware masking mechanism and diversified perturbation strategies to enhance generation quality under sparse annotations. Experimental results demonstrate that HSPAG effectively models complex molecular characteristics to capture fine-grained structure–property insights, enabling controllable molecular generation under multiple property constraints. Two real-world case studies further validate the editing capabilities of HSPAG, highlighting its practical potential in lead compound screening and optimization.

Hierarchical Structure-Property Alignment for Data-Efficient Molecular Generation and Editing

In few-shot learning, utilizing local and global geometric priors to capture both subtle local class metrics and coarse global structures within the meta-task are important to obtain discriminative embeddings. However, existing graph-based and curvature-based few-shot approaches only focus on either one kind of geometric prior but neglect the other. To effectively utilize the pros of these two paradigms, we propose a novel Dual-Geometry Graph Network (DGGN) to adaptively integrate the local and global geometric priors via two key pathways. Specifically, the local-wise metric modeling pathway utilizes Ollivier-Ricci curvature to capture task-specific local class metrics among the instances, and the global-wise connectivity modeling pathway utilizes resistive embedding to capture global instance distributions and connectivity patterns of the entire meta-task. In addition, we introduce two new regularization loss functions to explicitly enhance the geometric representation ability of the local and global pathways respectively. We validate that DGGN's superior performance stems from its adaptively topological refinements by measuring the graph edit distance, demonstrating its ability to match the underlying data distribution. Extensive experiments show that DGGN sets a new state-of-the-art on standard, cross-domain, and semi-supervised few-shot benchmarks. Code is available in our Supplementary Material.

Dual-Geometry Graph Network: Unifying Local and Global Priors for Few-Shot Learning

Large Multimodal Models (LMMs) have shown promising in-context learning (ICL) capabilities, but scaling to many-shot settings remains difficult due to limited context length and high inference cost. To address these challenges, task-vector-based methods have been explored by inserting compact representations of many-shot in-context demonstrations into model activations. However, existing task-vector-based methods either overlook the importance of where to insert task vectors or struggle to determine suitable values for each location. To this end, we propose a novel Sensitivity-aware Task Vector insertion framework (STV) to figure out where and what to insert. Our key insight is that activation deltas across query-context pairs exhibit consistent structural patterns, providing a reliable cue for insertion. Based on the identified sensitive-aware locations, we construct a pre-clustered activation bank for each location by clustering the activation values, and then apply reinforcement learning to choose the most suitable one to insert. We evaluate STV across a range of multimodal models (e.g., Qwen-VL, Idefics-2) and tasks (e.g., VizWiz, OK-VQA), demonstrating its effectiveness and showing consistent improvements over previous task-vector-based methods with strong generalization.

Where and What Matters: Sensitivity-Aware Task Vectors for Many-Shot Multimodal In-Context Learning

Effectively capturing multimodal co-occurrence signals, such as hand shapes, facial expressions, and body postures, is critical for semantic understanding in sign language recognition (SLR) and translation (SLT). Although skeleton data offer greater efficiency and robustness than RGB inputs, existing methods typically rely on pairwise graph structures, limiting their ability to model complex high-order interactions across body regions. To address this limitation, we propose HyperSign, a hierarchical hypergraph neural network that systematically captures high-order co-occurrence patterns among diverse body parts. The Co-occurrence Graph Perception Module jointly learns relational structures via three complementary pathways: (1) traditional graph convolutions for modeling physical joint connections, (2) dynamic geometric hypergraphs constructed via k-nearest neighbors to encode local spatial patterns, and (3) soft hypergraphs generated by learnable prototypes to reveal latent semantic associations. To further enhance structural modeling and semantic consistency, a Meta-Part Hypergraph Fusion Module abstracts feature streams from the hands, face, and body into unified hypergraph nodes, while leveraging empirically derived co-occurrence priors to model high-order cross-part dependencies. Moreover, an uncertainty-aware collaborative distillation mechanism guides the model to focus on critical body regions. Extensive experiments on standard SLR and SLT benchmarks (e.g., PHOENIX 2014, PHOENIX 2014T, and CSL Daily) demonstrate that HyperSign not only outperforms existing skeleton-based approaches in both speed and accuracy but also achieves competitive or superior results compared to several state-of-the-art RGB-based methods across multiple evaluation metrics.

HyperSign: Hierarchical Hypergraph-based Co-occurrence Modeling for Sign Language Recognition and Translation

The static nature of knowledge within Large Language Models (LLMs) makes it difficult for them to adapt to evolving information, rendering knowledge editing a critical task. However, existing methods struggle with challenges of scalability and retrieval efficiency, particularly when handling complex, multi-hop questions that require multi-step reasoning. To address these challenges, this paper introduces ALEX (A Light Editing-knowledge Extractor), a lightweight knowledge editing framework. The core innovation of ALEX is its hierarchical memory architecture, which organizes knowledge updates (edits) into semantic clusters. This design fundamentally reduces retrieval complexity from a linear 
$O(N)$ to a highly scalable $O(K+N/C)$. Furthermore, the framework integrates an Inferential Query Synthesis (IQS) module to bridge the semantic gap between queries and facts , and a Dynamic Evidence Adjudication (DEA) engine that executes an efficient two-stage retrieval process. Experiments on the MQUAKE benchmark demonstrate that ALEX significantly improves both the accuracy of multi-hop answers (MultiHop-ACC) and the reliability of reasoning paths (HopWise-ACC). It also reduces the required search space by over 80\% , presenting a promising path toward building scalable, efficient, and accurate knowledge editing systems.

Downloads

Next from AAAI 2026

Understanding Interaction as You Need: Intention-Driven Pedestrian Behavior Prediction

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES

.css-70qvj9{display:-webkit-box;display:-webkit-flex;display:-ms-flexbox;display:flex;-webkit-align-items:center;-webkit-box-align:center;-ms-flex-align:center;align-items:center;}Downloads

Next from AAAI 2026

Understanding Interaction as You Need: Intention-Driven Pedestrian Behavior Prediction

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES

Downloads