Offline reinforcement learning (RL) learns policies from pre-collected datasets without interacting with the environment, but it suffers from the out-of-distribution (OOD) action problem. Recent methods adopt the generative adversarial paradigm to learn policies, but they often fail to resolve the conflict between fooling the discriminator and maximizing expected returns. In this paper, we propose a novel offline RL method named Distribution-Matching Generator-based Diffusion Policies (DMGDP). We first develop a distribution matching-based policy learning method, in which a diffusion model serves as the policy generator, to resolve the conflict between fooling the discriminator and maximizing expected returns. Furthermore, we design a policy confidence mechanism based on discriminator regularization that prevents the agent from taking OOD actions, with the aim of robust generative adversarial learning. We conduct extensive experiments on the D4RL benchmarks, and the results demonstrate that DMGDP outperforms state-of-the-art methods.
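The abstract does not spell out the training objective, but the tension it describes (fooling the discriminator versus maximizing returns) is concrete enough to sketch. Below is a minimal sketch of one policy update under that adversarial distribution-matching idea, assuming a standard GAN-style loss plus a Q-value term; all names here (DiffusionPolicy, discriminator, q_network, eta) are hypothetical placeholders for illustration, not DMGDP's actual implementation.

```python
# Sketch of a distribution-matching policy update with a diffusion generator.
# Assumption: the policy is a diffusion model whose reverse process samples
# actions, and a discriminator scores (state, action) pairs as in-dataset
# vs. generated. None of these objects come from the paper itself.
import torch
import torch.nn.functional as F

def policy_update(diffusion_policy, discriminator, q_network, states, eta=1.0):
    """One policy step: stay close to the dataset action distribution
    (via the discriminator) while also maximizing estimated returns."""
    # Sample actions by running the diffusion model's reverse process.
    actions = diffusion_policy.sample(states)

    # Distribution-matching term: the generator tries to make sampled
    # actions indistinguishable from dataset actions, i.e. fool the
    # discriminator into predicting "real".
    d_fake = discriminator(states, actions)
    matching_loss = F.binary_cross_entropy_with_logits(
        d_fake, torch.ones_like(d_fake))

    # Return-maximization term: raise the Q-value of sampled actions.
    q_loss = -q_network(states, actions).mean()

    # eta trades off staying in-distribution against maximizing returns;
    # the conflict the abstract mentions arises when these two terms
    # pull the generator in opposite directions.
    return matching_loss + eta * q_loss
```

A confidence mechanism of the kind the abstract mentions could, for example, down-weight or reject sampled actions the (regularized) discriminator scores as unlikely under the dataset, but the abstract gives no further detail, so that step is left out of the sketch.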