Cross-Domain Few-Shot Object Detection (CD-FSOD) faces significant challenges due to the dual issues of domain shift and limited labeled samples. One major challenge is style bias, caused by limited support samples that fail to represent the target domain's style diversity. Another is feature confusion, which stems from distribution shifts and limited supervision, manifesting as both object-background ambiguity and object-object confusion. To address these challenges, we propose Style-Augmented Prototype Learning (StyleProto), which constructs style-aware prototypes from support samples with diverse visual styles, and refines them via spatial weighting and discriminative fusion. Specifically, StyleProto consists of three components: (1) Style Generation Augmentation (SGA); (2) Semantic-Focused Prototype Construction (SPC); (3) Hierarchical Prototype Fusion Aggregator (HPFA). SGA synthesizes style-diverse yet semantically consistent training samples by recombining style statistics from the support set, thus improving robustness to unseen styles. SPC aggregates support features using spatial attention to highlight object semantics and suppress background noise, yielding cleaner and more distinctive class prototypes. HPFA leverages query-guided attention to integrate discriminative support features, enhancing prototype representations with richer class-specific details. Extensive experiments on multiple benchmarks demonstrate that StyleProto consistently outperforms existing state-of-the-art methods. The code is included in the supplementary material.
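The core idea behind SGA, "recombining style statistics from the support set," is commonly realized by swapping per-channel feature statistics between samples, as in AdaIN-style stylization. The sketch below illustrates that mechanism only; the function names and the use of raw NumPy feature maps are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def channel_stats(feat, eps=1e-5):
    """Per-channel mean and std of a (C, H, W) feature map."""
    mean = feat.mean(axis=(1, 2), keepdims=True)
    std = feat.std(axis=(1, 2), keepdims=True) + eps
    return mean, std

def restylize(content_feat, style_feat):
    """Keep the content sample's spatial semantics but re-apply the
    style sample's channel statistics (AdaIN-style recombination).
    Both inputs are (C, H, W) arrays from the same backbone layer."""
    c_mean, c_std = channel_stats(content_feat)
    s_mean, s_std = channel_stats(style_feat)
    normalized = (content_feat - c_mean) / c_std
    return normalized * s_std + s_mean
```

After this operation the output carries the style sample's first- and second-order statistics while the object layout comes from the content sample, which is why such augmented samples are "style-diverse yet semantically consistent."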
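SPC's "aggregates support features using spatial attention" step can be pictured as a weighted pooling: each spatial location contributes to the class prototype in proportion to an attention score, so background regions are suppressed. The paper's attention is presumably learned; the energy-based softmax weighting below is a hypothetical stand-in used purely to make the pooling step concrete.

```python
import numpy as np

def spatial_prototype(support_feats):
    """support_feats: (K, C, H, W) features for K support shots of one
    class. Weight each location by a softmax over its activation
    energy (a crude spatial-attention proxy), pool to a (C,) vector
    per shot, then average the K vectors into one class prototype."""
    per_shot = []
    for feat in support_feats:
        energy = (feat ** 2).sum(axis=0)            # (H, W) saliency proxy
        attn = np.exp(energy - energy.max())        # stable softmax
        attn /= attn.sum()
        per_shot.append((feat * attn).sum(axis=(1, 2)))  # (C,)
    return np.mean(per_shot, axis=0)                # (C,) prototype
```

Compared with plain global average pooling, this weighting lets high-activation (object) regions dominate the prototype, which is the effect the abstract attributes to SPC.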