
AAAI 2026

January 22, 2026

Singapore, Singapore


The rapid advancement of Large Vision Language Models (LVLMs) has demonstrated excellent capabilities across diverse visual tasks. Building on these developments, the "thinking with images" paradigm has emerged, enabling models to dynamically edit and re-encode visual information at each reasoning step, mirroring human visual processing. However, this paradigm also introduces significant challenges, as diverse errors may occur during the reasoning process. This naturally calls for Process Reward Models (PRMs) as an essential mechanism for distinguishing correct from incorrect reasoning steps, yet existing benchmarks for PRMs are predominantly text-centric and lack comprehensive assessment of PRMs' capabilities under this paradigm. To address these gaps, this work introduces \ourbench, the first comprehensive benchmark specifically designed for evaluating PRMs under the thinking-with-images paradigm. Our main contributions are as follows: (1) Through extensive analysis of reasoning trajectories under the thinking-with-images paradigm and guided-search experiments with PRMs, we define 7 fine-grained error types and demonstrate both the necessity of specialized PRMs and the potential for improvement. (2) We construct and curate a comprehensive benchmark comprising 1,134 manually annotated, high-quality thinking-with-images reasoning trajectories spanning 4 categories and 16 subcategories for fine-grained evaluation of PRMs. (3) Our experimental analysis reveals that current LVLMs fall short as effective PRMs: they exhibit limited capability in evaluating visual reasoning processes, significant performance disparities across error types, a consistent positive-evaluation bias, and notable sensitivity to the position of reasoning steps in a trajectory. These findings demonstrate the effectiveness of our benchmark and establish crucial foundations for advancing PRMs in LVLMs.
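To make the PRM role described in the abstract concrete, here is a minimal, illustrative sketch of how a process reward model is typically used: it scores each reasoning step, and those scores can either localize the first faulty step in a trajectory or guide search by picking the best-scored candidate continuation. This is not the paper's implementation; the `Step` class, the threshold, and the step texts are all hypothetical placeholders, and a real PRM would be an LVLM scoring image-plus-text steps rather than a stored float.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One reasoning step plus a (hypothetical) PRM score in [0, 1]."""
    text: str
    score: float


def first_error_index(trajectory, threshold=0.5):
    """PRM-style trajectory evaluation: return the index of the first
    step scored below the threshold, or -1 if every step passes
    (i.e., the trajectory is judged correct)."""
    for i, step in enumerate(trajectory):
        if step.score < threshold:
            return i
    return -1


def prm_guided_select(candidates):
    """PRM-guided search (best-of-N): keep the candidate next step
    that the PRM rates highest."""
    return max(candidates, key=lambda s: s.score)
```

For example, a trajectory whose second step is a visual misreading would be flagged at index 1 by `first_error_index`, while `prm_guided_select` would steer generation away from that step when a better-scored alternative exists.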

