Vision Language Models (VLMs) have demonstrated strong performance in multimodal understanding, offering promise for the circuit-to-netlist translation task. However, the diverse component symbols and complex connections in circuit images challenge VLMs' ability to understand physical layouts and reason about electrical connection logic. To address these challenges, we propose Circuit-Think, the first multimodal reasoning framework for automated circuit-to-netlist translation, which employs a Trajectory-Guided Reinforcement Learning (TGRL) paradigm for structured logical reasoning on circuit images. Circuit-Think initializes reasoning capabilities through supervised fine-tuning (SFT) on image-netlist pairs, then optimizes reasoning trajectories and netlist generation decisions using TGRL. First, TGRL introduces a step-by-step reasoning paradigm that guides the model with stepwise reward functions to simulate the human cognitive trajectory of "identifying ports, recognizing devices, and inferring connections". Second, we customize a multi-level reward that maps reasoning traces and answers into graph structures and node sets, jointly optimizing logical consistency and netlist accuracy via graph similarity and set matching. Third, TGRL contains a reflective learning mechanism for low-scoring samples, which corrects the reasoning trajectory using reference answers as hints, avoiding the local optima caused by sparse reward signals or erroneous reasoning paths. Moreover, we construct a circuit image-netlist reasoning dataset of 3,100 samples with step-by-step annotations for converting circuit images to netlists. Extensive experiments demonstrate that Circuit-Think achieves state-of-the-art (SOTA) netlist accuracy and significantly improves accuracy on downstream tasks. Our circuit image-netlist reasoning dataset is open-source.
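The multi-level reward described above can be illustrated with a minimal sketch: a predicted and a reference netlist are each mapped to a device (node) set and a device-to-net connection (edge) set, then scored by set matching on devices plus graph similarity on connections. The netlist format, parsing rules, Jaccard scoring, and equal weighting here are all illustrative assumptions, not the paper's actual reward implementation.

```python
# Hypothetical multi-level reward sketch: combines node-set matching
# (devices) with edge-overlap graph similarity (connections).
# Netlist syntax assumed: "<device> <net> <net> ... <type>", e.g.
# "M1 out in gnd gnd NMOS". This is an assumption for illustration.

def parse_netlist(lines):
    """Map netlist lines to a node set (devices) and an edge set
    (device-net incidence pairs)."""
    nodes, edges = set(), set()
    for line in lines:
        parts = line.split()
        device, nets = parts[0], parts[1:-1]  # last token = device type
        nodes.add(device)
        for net in nets:
            edges.add((device, net))
    return nodes, edges

def jaccard(a, b):
    """Set-matching score: |intersection| / |union| (1.0 if both empty)."""
    return len(a & b) / len(a | b) if a | b else 1.0

def multilevel_reward(pred_lines, ref_lines, w_node=0.5, w_edge=0.5):
    """Weighted sum of device-set matching and connection-graph similarity."""
    pred_nodes, pred_edges = parse_netlist(pred_lines)
    ref_nodes, ref_edges = parse_netlist(ref_lines)
    return w_node * jaccard(pred_nodes, ref_nodes) + \
           w_edge * jaccard(pred_edges, ref_edges)
```

A perfect prediction scores 1.0, while a netlist with the right devices but a mislabeled net receives partial credit through the edge term, giving the policy a denser signal than exact-match rewards.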
