We propose a mesh-free policy iteration framework based on physics-informed neural networks (PINNs) for solving entropy-regularized stochastic control problems. The method alternates between soft policy evaluation and soft policy improvement using automatic differentiation and neural approximation, without relying on spatial discretization. We present a detailed $L^2$ error analysis that decomposes the total approximation error into three sources: iteration error, policy network error, and PDE residual error. The proposed algorithm is validated on a range of challenging control tasks, including high-dimensional linear-quadratic regulation in 5D and 10D, as well as nonlinear systems such as the pendulum and cartpole problems. Numerical results confirm the scalability, accuracy, and robustness of our approach across both linear and nonlinear benchmarks.
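To illustrate the idea, the following is a minimal sketch (not the authors' implementation) of one PINN-style soft policy evaluation step for an entropy-regularized 1D linear-quadratic problem, $dx = (ax + bu)\,dt + \sigma\,dW$ with running cost $qx^2 + ru^2$. The value network is trained by minimizing the squared residual of the soft HJB equation on collocation points via automatic differentiation; all names (`ValueNet`, `soft_hjb_residual`) and parameter values are illustrative assumptions, and the constant entropy term of the Gaussian relaxed policy is dropped.

```python
# Hedged sketch: mesh-free residual minimization for a soft HJB equation.
# Assumed toy model: dx = (a x + b u) dt + sigma dW, cost q x^2 + r u^2,
# discount rho, Gaussian relaxed policy from entropy regularization.
import torch
import torch.nn as nn

torch.manual_seed(0)
a, b, sigma, q, r, rho = -1.0, 1.0, 0.5, 1.0, 1.0, 1.0  # illustrative constants

class ValueNet(nn.Module):
    """Small MLP approximating the value function V(x)."""
    def __init__(self, width=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, width), nn.Tanh(),
            nn.Linear(width, width), nn.Tanh(),
            nn.Linear(width, 1),
        )
    def forward(self, x):
        return self.net(x)

def soft_hjb_residual(model, x):
    # First and second derivatives of V via automatic differentiation
    # (this is the "mesh-free" ingredient: no spatial grid is needed).
    V = model(x)
    Vx = torch.autograd.grad(V.sum(), x, create_graph=True)[0]
    Vxx = torch.autograd.grad(Vx.sum(), x, create_graph=True)[0]
    # Minimizing r u^2 + b u Vx over u gives the policy mean u* = -b Vx / (2r);
    # substituting back yields the soft Hamiltonian (entropy constant omitted).
    ham = q * x**2 + a * x * Vx - (b * Vx)**2 / (4 * r)
    return rho * V - ham - 0.5 * sigma**2 * Vxx

model = ValueNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
x = torch.linspace(-2.0, 2.0, 128).reshape(-1, 1).requires_grad_(True)

losses = []
for _ in range(200):  # one policy-evaluation sweep: drive the PDE residual down
    opt.zero_grad()
    loss = soft_hjb_residual(model, x).pow(2).mean()
    loss.backward()
    opt.step()
    losses.append(loss.item())

print(f"residual loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

In a full policy iteration loop, a separate policy network would be updated toward the Gaussian policy induced by the current value estimate, and the two steps would alternate until convergence, as described in the abstract.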
