Current paradigms for robotic imitation learning face a stark trade-off between the motion fidelity of diffusion models and the data scalability of inverse dynamics models. The latter, while scalable, often learn a latent action space disconnected from physical reality. This flaw leads to critical failures, most notably temporal entanglement: the model cannot distinguish between visually similar states that require distinct actions, e.g., a gripper approaching versus receding from an object. This ambiguity, compounded by discretization artifacts and sensitivity to task-irrelevant dynamics, renders robust planning infeasible. We introduce LatentVLA, a vision-language-action framework designed to overcome these limitations by learning a continuous, spatiotemporally grounded latent action representation. Its progressive three-stage architecture first employs a Temporal-Attentive Latent Action Model (TA-LAM) to resolve these ambiguities through language-guided attention and explicit temporal encoding. A Latent Action Diffusion Transformer (LADT) then plans via diffusion directly in this continuous latent space, preserving motion fidelity without tokenization. Finally, an expert policy head translates the latent plans into precise robot actions. Experiments show that LatentVLA sets a new state of the art across a suite of real-world bimanual tasks, outperforming prior methods and demonstrating superior zero-shot generalization and few-shot efficiency.
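To make the three-stage pipeline concrete, below is a minimal PyTorch sketch of how the components described in the abstract could fit together. The module interfaces, dimensions, conditioning scheme, and action space (a 14-DoF bimanual setup) are illustrative assumptions, not the authors' implementation; only the stage names (TA-LAM, LADT, expert policy head) come from the abstract.

```python
import torch
import torch.nn as nn

class TALAM(nn.Module):
    """Temporal-Attentive Latent Action Model (sketch): encodes an observation
    window, conditioned on a language embedding, into a continuous latent action."""
    def __init__(self, obs_dim=512, lang_dim=512, latent_dim=64, horizon=8):
        super().__init__()
        # Explicit temporal encoding, intended to disambiguate e.g. approach vs. recede.
        self.temporal_pos = nn.Parameter(torch.zeros(horizon, obs_dim))
        self.attn = nn.MultiheadAttention(obs_dim, num_heads=8, batch_first=True)
        self.lang_proj = nn.Linear(lang_dim, obs_dim)
        self.to_latent = nn.Linear(obs_dim, latent_dim)

    def forward(self, obs_seq, lang_emb):
        # obs_seq: (B, T, obs_dim); lang_emb: (B, lang_dim)
        x = obs_seq + self.temporal_pos
        q = self.lang_proj(lang_emb).unsqueeze(1)   # language-guided attention query
        attended, _ = self.attn(q, x, x)            # (B, 1, obs_dim)
        return self.to_latent(attended.squeeze(1))  # continuous latent action (B, latent_dim)

class LADT(nn.Module):
    """Latent Action Diffusion Transformer (sketch): denoises a noisy latent plan
    directly in the continuous latent action space, i.e. no tokenization step."""
    def __init__(self, latent_dim=64, cond_dim=512):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=latent_dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.cond_proj = nn.Linear(cond_dim + 1, latent_dim)

    def forward(self, noisy_plan, t, cond):
        # noisy_plan: (B, H, latent_dim); t: (B,) diffusion step; cond: (B, cond_dim)
        c = self.cond_proj(torch.cat([cond, t[:, None].float()], dim=-1)).unsqueeze(1)
        return self.backbone(noisy_plan + c)        # denoising prediction per plan step

class ExpertHead(nn.Module):
    """Expert policy head (sketch): maps a denoised latent plan to low-level
    robot actions; a 14-dim bimanual action space is assumed here."""
    def __init__(self, latent_dim=64, action_dim=14):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                 nn.Linear(256, action_dim))

    def forward(self, latent_plan):
        return self.mlp(latent_plan)                # (B, H, action_dim)
```

In this reading, TA-LAM provides the grounded latent targets, LADT performs iterative denoising over horizon-length latent plans at inference time, and the expert head decodes each denoised latent into an executable action; the actual noise schedule, conditioning inputs, and training losses are not specified in the abstract.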
