Embedding-based generalized zero-shot learning (GZSL) models typically first build robust latent correlations between visual and attribute features so that knowledge learned on seen classes can transfer to unseen categories. Despite leveraging attributes as priors and learning a shared embedding space, current methods exhibit two critical flaws. First, attributes of heterogeneous granularity are treated uniformly, leading to semantic ambiguity. Second, class-level misclassifications seldom align with attribute-level errors, so models cannot target the specific attributes responsible. To overcome these limitations, we introduce Structured Attribute-Guided Enhancement (SAGE), a unified framework for GZSL. A consensus-aware bidirectional attention module first synchronizes visual–semantic focus regions via a mutual-distillation scheme. Next, we partition all attributes into pairwise-disjoint subsets (Global, Context, and Local) and couple each subset with visual features extracted at the matching spatial scale. Finally, we design a cross-sample, subset-aware distillation mechanism: when a sample is misclassified, SAGE identifies the culpable attribute subset, retrieves high-confidence prototypes from a memory bank, and applies a Kullback–Leibler (KL) divergence constraint to the corresponding feature branch. Comprehensive experiments and ablations on the challenging AwA2, CUB, and SUN benchmarks validate the contribution of each component, with SAGE setting a new state of the art on all three datasets. These findings underscore SAGE’s robustness and versatility, marking a substantial advance in generalized zero-shot learning and paving the way for broader zero-resource recognition.
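The subset-aware distillation step described above can be illustrated with a minimal numpy sketch: given per-attribute predictions for a misclassified sample and a high-confidence class prototype retrieved from a memory bank, the attribute subset with the largest KL divergence is flagged as culpable and would receive the distillation constraint. The subset boundaries, array sizes, and function names here are illustrative assumptions for a toy 12-attribute space, not the paper's actual implementation.

```python
import numpy as np

# Hypothetical partition of a 12-dim attribute vector into the three
# pairwise-disjoint subsets named in the abstract (boundaries are assumed).
ATTRIBUTE_SUBSETS = {
    "global": slice(0, 4),
    "context": slice(4, 8),
    "local": slice(8, 12),
}


def kl_div(p, q, eps=1e-8):
    """KL(p || q) between two non-negative score vectors.

    Each vector is normalized to a probability distribution first;
    eps guards against log(0)."""
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))


def culpable_subset(pred_attrs, prototype_attrs):
    """Return (name, scores): the subset whose predicted attribute
    distribution diverges most from the prototype's, plus per-subset
    KL scores. In SAGE-style training, the KL term for that subset's
    feature branch would then be added to the loss."""
    scores = {
        name: kl_div(pred_attrs[idx], prototype_attrs[idx])
        for name, idx in ATTRIBUTE_SUBSETS.items()
    }
    return max(scores, key=scores.get), scores


# Toy usage: the sample agrees with the prototype on Global and Context
# attributes but diverges sharply on the Local ones.
prototype = np.ones(12)
pred = np.ones(12)
pred[8] = 10.0
pred[9:12] = 0.1
name, scores = culpable_subset(pred, prototype)
print(name)  # the "local" subset carries the largest divergence
```

Computing the KL score per subset, rather than over the full attribute vector, is what lets the constraint be applied only to the feature branch responsible for the error instead of penalizing all branches uniformly.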