AAAI 2026 Main Conference

January 24, 2026

Singapore, Singapore


Small language models (SLMs) run quickly, consume little memory, and can be deployed on edge devices, making them especially appealing when compute or energy is limited. Because of these advantages, boosting SLMs' reasoning ability has become an important research goal. A common approach is to distill the long chains of thought (long-CoTs) produced by large reasoning models (LRMs) into SLMs, hoping to transfer the larger models' strong reasoning ability. However, SLMs do not always benefit from long-CoT distillation: the lengthy, semantically complex steps and the large amount of self-reflection content in long-CoTs may exceed SLMs' limited learning capacity, and the impact of self-reflection density on SLM performance remains unclear. To resolve this capacity mismatch, we propose MACoT, a multi-agent framework that synthesizes chains of thought (CoTs) better suited to small models, rather than compressing or pruning existing ones. Through interactive collaboration among six types of agents, MACoT synthesizes semantically explicit, logically clear CoTs whose carefully designed output pattern efficiently activates a small model's internal knowledge. At the same time, the synthesized CoTs retain a small amount of self-reflection content, matching the learning capacity of the small model and maximizing its reasoning accuracy. Fine-tuning Qwen2.5-7B-Instruct on only 1,879 synthetic CoTs significantly improves its performance on mathematical reasoning tasks and generalizes well, outperforming models trained on 5x more data. Our experiments show that a modest level of self-reflection boosts small-model performance, whereas excessive reflection sharply degrades it: "teaching SLMs to think" hinges on aligning each CoT's cognitive load with the model's capacity.
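The finding that a modest amount of self-reflection helps while excessive reflection hurts suggests filtering training CoTs by reflection density. A minimal sketch of that idea is below; the marker list, the sentence-level density metric, and the 0.2 threshold are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch: keep a synthetic CoT for SLM fine-tuning only if its
# self-reflection density is modest. Marker list and threshold are assumptions.

REFLECTION_MARKERS = ("wait", "let me reconsider", "on second thought", "hmm")

def reflection_density(cot: str) -> float:
    """Fraction of sentences that contain a self-reflection marker."""
    sentences = [s.strip() for s in cot.split(".") if s.strip()]
    if not sentences:
        return 0.0
    reflective = sum(
        any(marker in s.lower() for marker in REFLECTION_MARKERS)
        for s in sentences
    )
    return reflective / len(sentences)

def keep_for_slm(cot: str, max_density: float = 0.2) -> bool:
    """Accept a CoT only when its reflection density stays below the cap."""
    return reflection_density(cot) <= max_density
```

In practice one would tune `max_density` against downstream accuracy, since the paper's result implies the relationship between reflection density and SLM performance is non-monotonic.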
