The powerful generalization capabilities of Vision-Language-Action (VLA) models rest on massive, redundant datasets of highly uneven sample value, and this heavy data dependence hinders their widespread application. Existing model-centric remedies fail to address the challenge at the data level: model compression often degrades performance, and policy distillation produces outputs that are tied to a specific model and therefore lack versatility. We instead propose a fundamentally different, data-centric generative data distillation framework, FT-NCFM. Our framework employs a self-contained Fact-Tracing (FT) engine that combines causal attribution with programmatic contrastive verification to assess the intrinsic value of each sample; this assessment guides an NCFM adversarial game that synthesizes a model-agnostic, information-dense, reusable data asset. On several mainstream VLA benchmarks, models trained on just 5% of our distilled coreset achieve over 90% of the task success rate of models trained on the full dataset, while reducing training time by 75.4%. Our work demonstrates that intelligent data distillation is a highly promising path toward building efficient, high-performance VLA models. Code and datasets will be made available upon acceptance.
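To make the value-guided coreset idea concrete, here is a minimal, hypothetical sketch: it assumes each sample has already been assigned a scalar value score (standing in for the FT engine's attribution-plus-verification assessment, whose actual mechanics the abstract does not detail) and selects the top 5% as the coreset. Function names and the scoring proxy are illustrative assumptions, not the paper's implementation.

```python
import random

def sample_value(loss_without: float, loss_with: float) -> float:
    # Hypothetical attribution proxy: how much including the sample
    # reduces held-out loss. The real FT engine combines causal
    # attribution with programmatic contrastive verification.
    return loss_without - loss_with

def select_coreset(scores, fraction=0.05):
    # Keep the top-`fraction` highest-value samples as the coreset
    # (the paper synthesizes new data rather than merely selecting,
    # but ranking by value conveys the data-centric idea).
    k = max(1, int(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

random.seed(0)
scores = [random.random() for _ in range(1000)]
coreset = select_coreset(scores, fraction=0.05)
print(len(coreset))
```

In the full framework, such a value signal would guide a generative adversarial game that synthesizes information-dense samples, rather than simply subsetting the original data.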
