Tracking Any Point (TAP) is a foundational task in computer vision with broad applicability. The state-of-the-art self-supervised TAP method leverages a global matching transformer and contrastive random walks to learn point correspondences. However, its dense all-pairs attention and correlation volume computation tend to introduce irrelevant features and produce less informative training signals, degrading both learning efficiency and tracking accuracy. To address these limitations, we introduce LEAP-Track, a self-supervised TAP approach that computes the attention matrices and correlation volume over adaptively selected sparse pairs. It consists of two core designs: (1) Curriculum-based Sparse Attention (CSA), which dynamically focuses on the most relevant keys, promoting the learning of discriminative features; and (2) Progressive k-NN Transition (PkT), which reformulates the contrastive random walk to operate on an increasingly sparse k-NN affinity graph, reinforcing the learning of the most informative correspondences. By integrating these two designs into a two-stage training paradigm, LEAP-Track is shown both theoretically and empirically to boost learning efficiency, achieving superior tracking accuracy over existing self-supervised TAP methods.
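To make the PkT idea concrete, the following is a minimal NumPy sketch (not the authors' implementation; function names, shapes, and the temperature value are illustrative assumptions) of a contrastive-random-walk cycle on a k-NN-sparsified affinity graph: the all-pairs affinity between two frames' point features is truncated to each row's k largest entries before the softmax, and the cycle loss scores how much probability a walk A→B→A returns to its starting point.

```python
import numpy as np

def knn_transition(feats_a, feats_b, k, temperature=0.07):
    """Row-wise k-NN-sparsified softmax transition matrix between
    two frames' point features (hypothetical helper, not LEAP-Track's API)."""
    # Cosine-similarity affinity over all point pairs.
    a = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    b = feats_b / np.linalg.norm(feats_b, axis=1, keepdims=True)
    affinity = a @ b.T / temperature
    # Keep only the k largest entries in each row; mask out the rest.
    kth = np.partition(affinity, -k, axis=1)[:, -k:].min(axis=1, keepdims=True)
    masked = np.where(affinity >= kth, affinity, -np.inf)
    # Softmax over the surviving entries yields a sparse stochastic matrix.
    e = np.exp(masked - masked.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cycle_loss(feats_a, feats_b, k):
    """Contrastive-random-walk-style cycle consistency: walk A -> B -> A
    and penalize walks that fail to return to their starting point."""
    fwd = knn_transition(feats_a, feats_b, k)
    bwd = knn_transition(feats_b, feats_a, k)
    round_trip = fwd @ bwd          # probability of each A -> B -> A path
    return -np.log(np.diag(round_trip) + 1e-9).mean()
```

Shrinking `k` over training, as PkT's progressive schedule suggests, would concentrate the walk (and hence the gradient) on an ever sparser set of candidate correspondences.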