Scene recognition (SR) is a fundamental task in computer vision (CV). In recent years, transformer-based methods have achieved remarkable success on scene recognition tasks, yet most existing approaches rely primarily on visual features and fail to model the structural relationships within scenes, which are crucial for accurate recognition. To this end, we propose TANSR, a method that leverages topological relationships from graphs to guide scene recognition. Specifically, the GGM constructs graph representations of the scene, the GAMGN generates topology-aware masks from these representations, and the TAG integrates the masks with patch embeddings, enabling the transformer's attention mechanism to incorporate topological information. Furthermore, we introduce an innovative attention-driven multimodal fusion strategy that combines graph-derived topological cues with visual patch embeddings, substantially enhancing the transformer's ability to capture topological structure and improving performance on complex scene recognition tasks. We evaluate the model on the MIT-67, Scene-15, and SUN397 benchmarks, where it achieves consistent state-of-the-art (SOTA) performance, including 98.58% accuracy on MIT-67.
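The core idea of topology-aware masking, biasing a transformer's attention scores with a graph adjacency over image patches so that structurally related patches attend to each other more strongly, can be sketched as follows. This is a minimal illustration of the general technique, not the paper's implementation: the function name, the additive-bias formulation, and the `mask_weight` parameter are assumptions for exposition.

```python
import numpy as np

def topology_aware_attention(patches, adjacency, mask_weight=1.0):
    """Sketch: bias self-attention logits with a graph-derived mask.

    patches:   (n, d) array of patch embeddings
    adjacency: (n, n) binary adjacency matrix over the patches
    The adjacency acts as a topology-aware mask: connected patch pairs
    receive an additive bonus on their attention logits.
    (Hypothetical interface -- not the TANSR modules themselves.)
    """
    n, d = patches.shape
    scores = patches @ patches.T / np.sqrt(d)       # scaled dot-product logits
    scores = scores + mask_weight * adjacency       # inject topological bias
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=-1, keepdims=True)  # row-wise softmax
    return attn @ patches                           # attended patch features

# Toy example: 4 patches on a path graph (1-2-3-4).
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
out = topology_aware_attention(x, adj)
print(out.shape)
```

In a real multimodal fusion pipeline the adjacency would come from a learned graph-construction step rather than a fixed matrix, and the bias would typically be applied per attention head inside the transformer block.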
