Open-vocabulary multi-object tracking (OV-MOT) aims to track objects from unseen categories beyond the training set. Existing methods rely on pseudo video sequences synthesized from static images and therefore struggle to model realistic motion patterns, resulting in limited association performance in real-world scenarios. To alleviate these issues, we propose SAM2-OV, a novel association-learning-free OV-MOT method that adopts a detection-only tuning paradigm, eliminating the need for synthetic sequences or spatiotemporal supervision and substantially reducing the number of learnable parameters. The core of our method is a Unified Detection Module (UDM), which provides object-level prompts that enable SAM2 to perform OV-MOT. Built on UDM, SAM2-OV is the first method to integrate SAM2 into OV-MOT, fully unleashing its zero-shot cross-frame association ability. To further enhance object association under occlusion and abrupt motion, we introduce a Motion Prior Assistance Module (MPAM) that incorporates motion cues into the mask selection process. In addition, a Semantic Enhancement Adapter (SEA) distilled from CLIP improves classification generalization, and a sparse prompting strategy reduces computational redundancy by triggering detection only on selected keyframes. As only the detection module is tuned on static images, the overall training process remains simple and efficient. Experiments on the TAO dataset demonstrate that SAM2-OV achieves state-of-the-art performance under the TETA metric, particularly on novel categories, and evaluations on the KITTI dataset demonstrate the strong zero-shot cross-domain transferability of SAM2-OV.
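As a rough illustration of how a motion prior can enter mask selection, the minimal sketch below re-ranks an object's candidate mask hypotheses by combining each hypothesis's confidence with its IoU against a constant-velocity box prediction. The helper names (`select_mask`, `predict_box`) and the linear score fusion with weight `alpha` are our illustrative assumptions, not the paper's exact MPAM formulation.

```python
# Minimal sketch of motion-prior-assisted mask selection.
# Assumptions (not from the paper): a constant-velocity motion model and a
# linear fusion of mask confidence with motion-prior IoU.
import numpy as np

def box_iou(a: np.ndarray, b: np.ndarray) -> float:
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-6)

def predict_box(prev_box: np.ndarray, velocity: np.ndarray) -> np.ndarray:
    """Constant-velocity prior: shift the last box by the estimated (dx, dy)."""
    return prev_box + np.concatenate([velocity, velocity])

def select_mask(candidates, prev_box, velocity, alpha=0.5):
    """Re-rank candidate masks for one tracked object with a motion prior.

    candidates: list of (mask_confidence, box) pairs, e.g. the multi-mask
                hypotheses a promptable segmenter emits for one object.
    Returns the index of the best candidate under the combined score.
    """
    predicted = predict_box(prev_box, velocity)
    scores = [
        alpha * conf + (1.0 - alpha) * box_iou(box, predicted)
        for conf, box in candidates
    ]
    return int(np.argmax(scores))
```

Under occlusion or abrupt motion, the IoU term down-weights confident but spatially implausible hypotheses, which is the failure mode the abstract says MPAM is designed to address.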
