Singapore

Tracking and segmentation play essential roles in video understanding, providing basic positional information and temporal association of objects within video sequences. Despite their shared objective, existing approaches often tackle these tasks using specialized architectures or modality-specific parameters, limiting their generalization and scalability. Recent efforts have attempted to unify multiple tracking and segmentation sub-tasks from the perspective of either any-modality input or multi-task inference. However, these approaches tend to overlook two critical challenges: the distributional gap across different modalities and the feature representation gap across tasks. These issues hinder effective cross-task and cross-modal knowledge sharing, ultimately constraining the development of a true generalist model. To address these limitations, we propose a universal tracking and segmentation framework named SATA, which unifies a broad spectrum of tracking and segmentation subtasks with any modality input. Specifically, a Decoupled Mixture-of-Expert (DeMoE) mechanism is presented to decouple unified representation learning into the modeling of cross-modal shared knowledge and specific information, thus enabling the model to maintain flexibility while enhancing generalization. Additionally, we introduce a Task-aware Multi-object Tracking (TaMOT) pipeline to consolidate all task outputs into a unified set of instances with calibrated ID information, thereby alleviating the degradation of task-specific knowledge during multi-task training. SATA demonstrates superior performance on 18 challenging tracking and segmentation benchmarks, offering a novel perspective for more generalizable video understanding.

AAAI 2026

Tracking and Segmenting Anything in Any Modality

motion & tracking; multi-modal vision; video understanding & activity analysis

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held at Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.

We introduce MAVERIX (Multimodal Audio-Visual Evaluation and Recognition IndeX), a unified benchmark to probe video understanding in multimodal LLMs, encompassing video, audio, and text inputs with human performance baselines. Although recent advancements in models with vision and audio understanding capabilities have shown substantial progress, the field lacks a standardized evaluation framework to thoroughly assess their cross-modality comprehension performance. MAVERIX curates 2,556 questions from 700 videos, in both multiple-choice and open-ended formats, explicitly designed to evaluate multimodal models through questions that necessitate tight integration of video and audio information, spanning a broad spectrum of agentic scenarios. MAVERIX uniquely provides models with audiovisual questions, closely mimicking the multimodal perceptual experiences available to humans during inference and decision-making processes. To our knowledge, MAVERIX is the first benchmark aimed explicitly at assessing comprehensive audiovisual integration in such granularity. Experiments with state-of-the-art models, including Qwen 2.5 Omni and Gemini 2.5 Flash-Lite, show performance around 64% accuracy, while human experts reach near-ceiling performance of 92.8%, exposing a substantial gap to human-level comprehension. With standardized evaluation protocols, a rigorously annotated pipeline, and a public toolkit, MAVERIX establishes a challenging testbed for advancing audiovisual multimodal intelligence.

MAVERIX: Multimodal Audio-Visual Evaluation and Recognition IndeX

As large language models (LLMs) increasingly operate as autonomous agents in social contexts, evaluating their capacity for prosocial behavior is both theoretically and practically critical. However, existing research has primarily relied on static, economically framed paradigms, lacking models that capture the dynamic evolution of prosociality and its sensitivity to structural inequities. To address these gaps, we introduce ProSim, a simulation framework for modeling prosocial behavior in LLM agents across diverse social conditions. We conduct three progressive studies to assess prosocial alignment. First, we demonstrate that LLM agents can exhibit human-like prosocial behavior across a broad range of real-world scenarios and adapt to normative policy interventions. Second, we find that agents engage in fairness-based third-party punishment and respond systematically to variations in inequity magnitude and enforcement cost. Third, we show that policy-induced inequities suppress prosocial behavior and propagate norm erosion through social networks. These findings advance prosocial behavior theory by elucidating how institutional dynamics shape the emergence, decay, and diffusion of prosocial norms in agent-driven societies.

Investigating Prosocial Behavior Theory in LLM Agents Under Policy-Induced Inequities

Protein subcellular localization prediction is essential for understanding protein function and cellular organization. However, existing methods exhibit two major limitations: (1) they overlook the critical role of evolutionarily conserved protein domains, which are fundamental functional and structural units that significantly influence functions and subcellular localization, and (2) they rarely learn residue order and backbone coordinates simultaneously, neglecting the complementary information inherent in multi-modal representations. In this paper, we propose DMVCL, a novel Domain-Aware Multi-View Contrastive Representation Learning framework for protein subcellular localization prediction. First, it devises domain-sequence/structure attention modules, which identify functionally significant regions in protein structures/sequences that critically determine subcellular localization. Second, it introduces a multi-view contrastive learning framework that unites inter-view and intra-view objectives. Inter-view contrastive learning aligns protein sequences with their corresponding structures by maximizing mutual information, thereby capturing the consistency of protein residue order and backbone coordinates. Intra-view contrastive learning enhances the model’s sensitivity to subtle sequence and structural differences by pushing apart the embeddings of proteins located in different cellular compartments while pulling closer those in the same compartment. Extensive experiments demonstrate that DMVCL significantly outperforms existing baselines. Ablation studies and visualizations further highlight the contributions of domain-sequence/structure attention and multi-view contrastive learning in achieving superior predictive performance. Source code can be found at https://anonymous.4open.science/r/DMVCL-C6F0.

Domain-Aware Multi-View Contrastive Representation Learning for Protein Subcellular Localization Prediction

Training data detection is critical for enforcing copyright and data licensing, as Large Language Models are trained on massive text corpora scraped from the internet. We present SPECTRA, a watermarking approach that makes training data reliably detectable even when it comprises less than 0.001% of the training corpus. SPECTRA works by using an LLM to generate semantically equivalent paraphrases of a text and then computing their token log probabilities with a scoring model that was not trained on the text. A paraphrase is then sampled whose score, computed from the token log probabilities, is close to the score of the original text. We compare the token log probabilities of a "suspect" model to those of the scoring model to detect whether the watermarked data was used for training. We demonstrate that SPECTRA achieves a consistent p-value gap of over nine orders of magnitude when detecting data used to train a model versus data not used to train a model. SPECTRA equips data owners with a scalable, deploy-before-release watermark that survives even large-scale LLM training.

Perturb Your Data: Paraphrase-Guided Training Data Watermarking
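The paraphrase selection step described in the SPECTRA abstract above can be illustrated with a minimal sketch. It rests on several assumptions that are not in the abstract: the score is taken to be the mean token log probability, the closest candidate is picked deterministically instead of being sampled, and the function names and log-probability values are hypothetical.

```python
from statistics import fmean


def sequence_score(token_logprobs):
    """Mean token log probability (an assumed scoring rule, not from the abstract)."""
    return fmean(token_logprobs)


def pick_paraphrase(original_logprobs, candidates):
    """Return the candidate text whose score is closest to the original's score.

    `candidates` is a list of (text, token_logprobs) pairs. The abstract says a
    paraphrase is *sampled* near the original score; taking the closest one is a
    simplification made for this sketch.
    """
    target = sequence_score(original_logprobs)
    best = min(candidates, key=lambda c: abs(sequence_score(c[1]) - target))
    return best[0]


# Hypothetical log probabilities, as a scoring model might return them.
original = [-1.0, -2.0, -3.0]             # mean = -2.0
candidates = [
    ("paraphrase A", [-0.5, -0.5]),       # mean = -0.5
    ("paraphrase B", [-2.0, -2.2]),       # mean = -2.1
    ("paraphrase C", [-4.0, -3.0]),       # mean = -3.5
]

chosen = pick_paraphrase(original, candidates)
print(chosen)  # paraphrase B
```

Keeping the released paraphrase close in score to the original is what lets a later comparison between a suspect model and the scoring model be attributed to training exposure; the detection test itself is not sketched here.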

Clustering non-independent and identically distributed (non-IID) data under local differential privacy (LDP) in federated settings presents a critical challenge: preserving privacy while maintaining accuracy without iterative communication. 
Existing one-shot methods rely on unstable pairwise centroid distances or neighborhood rankings, degrading severely under strong LDP noise and data heterogeneity. 
We present Gravitational Federated Clustering (GFC), a novel approach to privacy-preserving federated clustering that overcomes the limitations of distance-based methods under varying LDP.
Addressing the critical challenge of clustering non-IID data with diverse privacy guarantees, GFC transforms privatized client centroids into a global gravitational potential field where true cluster centers emerge as topologically persistent singularities. 
Our framework introduces two key innovations: (1) a client-side compactness-aware perturbation mechanism that encodes local cluster geometry as "mass" values, and (2) a server-side topological aggregation phase that extracts stable centroids through persistent homology analysis of the potential field's superlevel sets. 
Theoretically, we establish a closed-form bound between the privacy budget $\epsilon$ and centroid estimation error, proving the potential field's Lipschitz smoothing properties exponentially suppress noise in high-density regions.
Empirically, GFC outperforms state-of-the-art methods on ten benchmarks, especially under strong LDP constraints ($\epsilon < 1$), while maintaining comparable performance at lower privacy budgets. By reformulating federated clustering as a topological persistence problem in a synthetic physics-inspired space, GFC achieves unprecedented privacy-accuracy trade-offs without iterative communication, providing a new perspective for privacy-preserving distributed learning.

Topological Federated Clustering via Gravitational Potential Fields Under Local Differential Privacy
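To make the potential-field idea in the GFC abstract above more concrete, here is a minimal sketch of evaluating a softened gravitational potential over privatized centroids. The functional form, the `softening` parameter, and all coordinates and masses are assumptions chosen for illustration; the persistent homology step that the abstract uses to extract centers is not reproduced.

```python
import math


def potential(x, centroids, masses, softening=0.5):
    """Softened gravitational potential at point x.

    Each centroid c with mass m contributes m / sqrt(||x - c||^2 + softening^2).
    This form is an assumption for the sketch, not the paper's definition.
    """
    return sum(
        m / math.hypot(math.dist(x, c), softening)
        for c, m in zip(centroids, masses)
    )


# Hypothetical noisy centroids reported by clients: three agree on a cluster
# near the origin, one is an isolated outlier.
centroids = [(0.0, 0.1), (0.1, 0.0), (-0.1, 0.0), (5.0, 5.0)]
masses = [1.0, 1.0, 1.0, 1.0]

phi_cluster = potential((0.0, 0.0), centroids, masses)
phi_outlier = potential((5.0, 5.0), centroids, masses)
print(phi_cluster > phi_outlier)  # True
```

Under these assumptions, locations supported by several nearby centroids accumulate a much higher potential than an isolated report, which is the intuition behind treating persistent peaks of the field as cluster centers.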

The discovery rate of optical transients will explode to 10 million public alerts per night once the Vera C. Rubin Observatory’s Legacy Survey of Space and Time comes online, overwhelming the traditional physics-based inference pipelines. A continuous-time forecasting AI model is of interest because it can deliver millisecond-scale inference for thousands of objects per day, whereas legacy MCMC codes need hours per object. In this paper, we propose a continuous-time variational autoencoder for panels of sparse and irregularly time-sampled (gappy) astrophysical light curves that are nonstationary, heteroscedastic, and inherently dependent. Our model combines a masked GRU-ODE encoder with a latent neural ODE propagator and an interpretable Gaussian-basis decoder. The encoder learns to summarize panels of imbalanced and correlated data even when only a handful of points are observed. The neural ODE then integrates this hidden state forward in continuous time, extrapolating to future unseen epochs. This extrapolated time series is further encoded by deep sets to a latent distribution that is decoded to a weighted sum of Gaussian basis functions, the parameters of which are physically meaningful. Such parameters (e.g., rise time, decay rate, peak flux) directly drive downstream prioritization of spectroscopic follow-up for astrophysical surveys. Beyond astronomy, the architecture offers a generic recipe for interpretable and continuous-time sequence modeling in any time domain where data are multivariate, sparse, heteroscedastic, and irregularly spaced.

SELDON: Supernova Explosions Learned by Deep ODE Networks
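The Gaussian-basis decoder mentioned in the SELDON abstract above can be sketched as a weighted sum of Gaussian bumps over time. The parameter names, the number of components, and the numeric values below are hypothetical; the abstract does not specify how the decoder is parameterized.

```python
import math


def gaussian_basis_flux(t, weights, centers, widths):
    """Flux at time t as a weighted sum of Gaussian basis functions.

    Component k contributes weights[k] * exp(-0.5 * ((t - centers[k]) / widths[k])**2).
    """
    return sum(
        w * math.exp(-0.5 * ((t - mu) / s) ** 2)
        for w, mu, s in zip(weights, centers, widths)
    )


# A single component peaks at its center with height equal to its weight.
single_peak = gaussian_basis_flux(10.0, [2.0], [10.0], [3.0])
print(single_peak)  # 2.0

# A hypothetical two-component light curve: a sharp main peak and a broad,
# fainter later component, evaluated on an irregular time grid.
weights, centers, widths = [2.0, 0.6], [10.0, 25.0], [3.0, 8.0]
times = [0.0, 4.5, 10.0, 13.2, 25.0, 40.0]
curve = [gaussian_basis_flux(t, weights, centers, widths) for t in times]
```

Because each component has an explicit amplitude, center, and width, quantities such as peak time or decay scale can be read off the decoded parameters directly, which is the sense in which such a decoder is interpretable.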

The mixed truck-drone delivery system has attracted increasing attention for its potential to optimize last-mile logistics. While the Flying Sidekick Traveling Salesman Problem (FSTSP) provides a foundation for modeling the truck-drone collaboration, it falls short of capturing real-world complexities by assuming a single truck-drone pair operating on a fully connected graph. We introduce the Multi-Agent FSTSP (MA-FSTSP), which extends FSTSP to handle multiple trucks, each carrying multiple drones operating over real road networks. Trucks must follow roads, while drones can fly directly between locations. To solve this NP-hard problem efficiently, we propose a novel three-phase algorithm that first partitions customers using a set-based distance heuristic, then computes initial truck routes via a Set TSP formulation, and finally optimizes drone deployment patterns by dynamic programming. Through extensive experiments on real-world road networks from Manhattan (1,024 nodes) and Boston (11,000 nodes), we demonstrate that our method achieves more than 30% cost reduction compared to existing approaches while scaling effectively to problems with 150 customers within a 20-minute computational time limit.

Optimization of Multi-Agent Flying Sidekick Traveling Salesman Problem over Road Networks

With the rapid advance of spatial multi-omics technologies, it has become possible to simultaneously profile transcripts, proteins and chromatin states at their native spatial coordinates, thereby uncovering molecular architecture that transcends any single-omics perspective. However, the resulting data matrices are often highly sparse and suffer from unstable dimensionality. Graph-based neural methods capture only local neighborhood information, whereas conventional Transformers, although capable of modelling long-range dependencies, incur prohibitive computational costs on such data. To overcome these limitations, we propose TLAGC, a Taylor-Linear-Attention-Guided Graph Convolutional framework that couples a Taylor-expanded linear attention (TLA) mechanism with graph convolutional networks. By eliminating the softmax operation and linking the LocalGCN via residual connections, TLA preserves local structural information while enabling the integration of global and local contexts, thereby alleviating ineffective information propagation between spatially distant yet transcriptionally similar regions. Theoretical analysis confirms that TLA reduces computational complexity, and extensive experiments on multiple spatial multi-omics benchmarks demonstrate that TLAGC consistently outperforms state-of-the-art baselines in delineating spatial domains.

TLAGC: Taylor Linear Attention-Guided Graph Convolutions for Revealing Spatial Domains in Spatial Multi-Omics Data
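The complexity reduction claimed for Taylor-expanded linear attention in the TLAGC abstract above can be illustrated with a small sketch. It assumes a first-order expansion, exp(q·k) ≈ 1 + q·k, with L2-normalized queries and keys so that the weights stay non-negative; the expansion order, the normalization, and the function names are assumptions, and the paper's actual formulation may differ.

```python
import math


def l2_normalize(v):
    """Scale v to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def taylor_linear_attention(Q, K, V):
    """Attention with weights 1 + q.k, computed in time linear in sequence length.

    Key/value summaries are accumulated once and reused for every query, so no
    n-by-n attention matrix is formed. The denominator is zero only in the
    degenerate case where a query is opposite to every key.
    """
    Q = [l2_normalize(q) for q in Q]
    K = [l2_normalize(k) for k in K]
    n, d, dv = len(K), len(K[0]), len(V[0])
    v_sum = [sum(v[c] for v in V) for c in range(dv)]
    k_sum = [sum(k[a] for k in K) for a in range(d)]
    kv = [[sum(K[j][a] * V[j][c] for j in range(n)) for c in range(dv)]
          for a in range(d)]
    out = []
    for q in Q:
        denom = n + sum(q[a] * k_sum[a] for a in range(d))
        out.append([(v_sum[c] + sum(q[a] * kv[a][c] for a in range(d))) / denom
                    for c in range(dv)])
    return out


def quadratic_reference(Q, K, V):
    """Same weights, computed pairwise (quadratic in sequence length)."""
    Q = [l2_normalize(q) for q in Q]
    K = [l2_normalize(k) for k in K]
    out = []
    for q in Q:
        w = [1.0 + sum(a * b for a, b in zip(q, k)) for k in K]
        total = sum(w)
        out.append([sum(w[j] * V[j][c] for j in range(len(V))) / total
                    for c in range(len(V[0]))])
    return out


# Hypothetical toy inputs: 2 queries, 3 keys/values, feature size 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
fast = taylor_linear_attention(Q, K, V)
slow = quadratic_reference(Q, K, V)
```

The two functions return the same values up to rounding; the linear version simply reorders the sums so that the cost grows with the number of tokens times the square of the feature size instead of with the square of the number of tokens.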

Despite significant advancements in dynamic neural rendering, existing methods fail to address the unique challenges posed by UAV-captured scenarios, particularly those involving monocular camera setups, top-down perspective, and multiple small, moving humans, which are not adequately represented in existing datasets. In this work, we introduce UAV4D, a framework for enabling photorealistic rendering for dynamic real-world scenes captured by UAVs. Specifically, we address the challenge of reconstructing dynamic scenes with multiple moving pedestrians from monocular video data without the need for additional sensors. We use a combination of a 3D foundation model and a human mesh reconstruction model to reconstruct both the scene background and humans. We propose a novel approach to resolve the scene scale ambiguity and place both humans and the scene in world coordinates by identifying human-scene contact points. Additionally, we exploit the SMPL model and background mesh to initialize Gaussian splats, enabling holistic scene rendering. We evaluated our method on three complex UAV-captured datasets: VisDrone, Manipal-UAV, and Okutama-Action, each with distinct characteristics and 10-50 humans. Our results demonstrate the benefits of our approach over existing methods in novel view synthesis, achieving a 1.5 dB PSNR improvement and superior visual sharpness.

UAV4D: Dynamic Neural Rendering of Human-Centric UAV Imagery Using Gaussian Splatting

Autonomous driving systems have achieved remarkable capabilities in real-world deployment, yet ensuring safety under corner cases remains a significant challenge due to the scarcity and constrained diversity of safety-critical scenarios. Existing generation methods either lead to irrational vehicle behaviors or are limited by fixed collision patterns, and both rely heavily on existing map datasets, which restricts diversity. To address these fundamental limitations, we introduce **Any2Critical**, the first framework that can encode arbitrary real-world scenarios and generate contextually relevant safety-critical scenarios with realistic driving behaviors. Specifically, Any2Critical addresses two key challenges: (1) developing comprehensive, diverse map data by leveraging everyday traffic situations as the most abundant source of real-world driving contexts, and (2) proposing an RAG-based Safety-Critical Scenario Generation Strategy based on our curated NHTSA-5K database for achieving an optimal balance between scenario diversity and behavioral rationality. Through comprehensive evaluation, we demonstrate that Any2Critical consistently achieves an average collision rate of 89.69% across diverse scenarios and autonomous driving systems, significantly outperforming current state-of-the-art generation methods.
