Singapore

There is a growing demand for deploying large generative AI models on mobile devices. For recent popular video generative models, however, the Variational AutoEncoder (VAE) is one of the major computational bottlenecks. Both large parameter sizes and mismatched kernels cause out-of-memory errors or extremely slow inference on mobile devices. To address this, we propose a low-cost solution that efficiently transfers widely used video VAEs to mobile devices. (1) We analyze redundancy in existing VAE architectures and derive empirical design insights. By integrating 3D depthwise separable convolutions into our model, we significantly reduce the number of parameters. (2) We observe that the upsampling techniques in mainstream video VAEs are poorly suited to mobile hardware and form the main bottleneck. In response, we propose a decoupled 3D pixel shuffle scheme that sharply reduces end-to-end latency. Building on these two designs, we develop a universal mobile-oriented VAE decoder, Turbo-VAED. (3) We propose an efficient VAE decoder training method. Since only the decoder is used during deployment, we distill it into Turbo-VAED instead of retraining the full VAE, enabling fast mobile adaptation with minimal performance loss. To our knowledge, our method is the first to enable real-time 720p video VAE decoding on mobile devices. The approach is applicable to most video VAEs. When integrated into four representative models, with a training cost as low as $95, it accelerates the original VAEs by up to 84.5× at 720p resolution on GPUs, uses as little as 17.5% of the original parameter count, and retains 96.9% of the original reconstruction quality. Compared to mobile-optimized VAEs, Turbo-VAED achieves a 2.9× speedup in FPS and better reconstruction quality on the iPhone 16 Pro.
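To give a feel for why 3D depthwise separable convolutions cut the parameter count, the sketch below compares weight counts for a standard 3D convolution and its depthwise separable counterpart. The channel widths (128 in, 128 out) and the 3×3×3 kernel are illustrative assumptions, not Turbo-VAED's actual configuration, and biases are ignored.

```python
def conv3d_params(c_in, c_out, k):
    """Weights in a standard 3D convolution with a k*k*k kernel (bias omitted)."""
    return c_in * c_out * k ** 3


def depthwise_separable_conv3d_params(c_in, c_out, k):
    """Weights in a depthwise k*k*k conv followed by a 1*1*1 pointwise conv."""
    depthwise = c_in * k ** 3   # one k*k*k filter per input channel
    pointwise = c_in * c_out    # 1*1*1 conv that mixes channels
    return depthwise + pointwise


# Hypothetical layer sizes chosen only for illustration.
standard = conv3d_params(128, 128, 3)
separable = depthwise_separable_conv3d_params(128, 128, 3)
ratio = separable / standard

print(standard, separable, round(ratio, 4))
```

For these assumed sizes the separable layer needs under 5% of the weights of the standard layer. The overall 17.5% figure reported in the abstract refers to the whole decoder, which this single-layer comparison does not attempt to reproduce.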

AAAI 2026

Turbo-VAED: Fast and Stable Transfer of Video-VAEs to Mobile Devices

video latent diffusion models

cv: diffusion models for vision

mobile deployment

variational autoencoder

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held at Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.

The rapid advancement of large language models (LLMs) has resulted in increasingly sophisticated AI-generated content, posing significant challenges in distinguishing LLM-generated text from human-written language. Existing detection methods, primarily based on lexical heuristics or fine-tuned classifiers, often suffer from limited generalizability and are vulnerable to paraphrasing, adversarial perturbations, and cross-domain shifts. In this work, we propose SentiDetect, a model-agnostic framework for detecting LLM-generated text by analyzing the divergence in sentiment distribution stability. Our method is motivated by the empirical observation that LLM outputs tend to exhibit emotionally consistent patterns, whereas human-written texts display greater emotional variability. To capture this phenomenon, we define two complementary metrics: sentiment distribution consistency and sentiment distribution preservation, which quantify stability under sentiment-altering and semantic-preserving transformations. We evaluate SentiDetect on five diverse domains and a range of advanced LLMs, including Gemini-1.5-Pro, Claude-3, GPT-4-0613, and LLaMa-3.3. Experimental results demonstrate its superiority over state-of-the-art baselines, with over 16% and 11% F1 score improvements on Gemini-1.5-Pro and GPT-4-0613, respectively. Moreover, SentiDetect also shows greater robustness to paraphrasing, adversarial attacks, and text length variations, outperforming existing detectors in challenging scenarios.

Model-Agnostic Sentiment Distribution Stability Analysis for Robust LLM-Generated Texts Detection

Vision-and-Language Navigation (VLN) plays a critical role in tasks of embodied AI, particularly in unseen environments following natural language instructions. Recent advancements leverage large language models (LLMs) to improve the accuracy and generalizability of VLN systems by encoding image sequences as dense token representations. However, this tokenization approach incurs substantial computational overhead due to two key inefficiencies: 1) ego-centric camera views often include navigation-irrelevant regions (e.g., sky or distant backgrounds), and 2) high-frame-rate image sequences introduce temporal redundancy. To address these challenges, we propose Spatial-Temporal Efficient Visual Token Pruning (STEP-Nav), a unified framework that simultaneously prunes redundant visual tokens and fine-tunes VLN models to preserve navigation performance. In particular, STEP-Nav incorporates a distance- and content-aware token evaluation mechanism to remove irrelevant tokens at the spatial level, along with temporal-level similarity-based filtering to reduce redundancy across sequential frames. To ensure pruning does not harm task performance, we introduce a distortion-aware fine-tuning strategy that aligns pruned-token representations with their full-token counterparts while maintaining navigation accuracy. Experiments on the R2R and RxR benchmarks using Navid-CE and NavGPT-2 as base models demonstrate that STEP-Nav preserves over 95% of the performance while reducing 66.7% of tokens, outperforming existing token pruning baselines.
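The idea of similarity-based temporal filtering can be illustrated with a generic sketch. This is not the STEP-Nav implementation: the position-wise comparison against the previous frame, the cosine-similarity criterion, and the `threshold` value are all assumptions made here for illustration.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def prune_temporal(prev_tokens, curr_tokens, threshold=0.95):
    """Return indices of current-frame tokens that differ enough from the
    token at the same position in the previous frame (hypothetical rule)."""
    kept = []
    for idx, (prev, curr) in enumerate(zip(prev_tokens, curr_tokens)):
        if cosine(prev, curr) < threshold:
            kept.append(idx)  # token carries new information, keep it
    return kept


# Toy 2-dimensional "tokens" for two consecutive frames.
prev_frame = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
curr_frame = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.01]]
kept_indices = prune_temporal(prev_frame, curr_frame)
print(kept_indices)
```

In this toy case only the token whose content changed between frames survives; the near-duplicate tokens are dropped.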

STEP-Nav: Spatial-Temporal Efficient Visual Token Pruning for Vision-and-Language Navigation with Large Language Models

Recent advances in editing technologies for 3D Gaussian Splatting (3DGS) have made it simple to manipulate 3D scenes. However, these technologies raise concerns about potential malicious manipulation of 3D content. To guard against such malicious applications, localizing tampered regions becomes crucial. In this paper, we propose GS-Checker, a novel method for locating tampered areas in 3DGS models. Our approach integrates a 3D tampering attribute into the 3D Gaussian parameters to indicate whether the Gaussian has been tampered with. Additionally, we design a 3D contrastive mechanism that compares the similarity of key attributes between 3D Gaussians to seek tampering cues at the 3D level. Furthermore, we introduce a cyclic optimization strategy to refine the 3D tampering attribute, enabling more accurate tampering localization. Notably, our approach does not require expensive 3D labels for supervision. Extensive experimental results demonstrate the effectiveness of our proposed method in locating tampered 3DGS areas.

GS-Checker: Tampering Localization for 3D Gaussian Splatting

A key challenge in graph out-of-distribution (OOD) detection lies in the absence of ground-truth OOD samples during training. Existing methods are typically optimized to capture features within the in-distribution (ID) data and calculate OOD scores, which often limits pre-trained models from representing distributional boundaries, leading to unreliable OOD detection. Moreover, the latent structure of graph data is often governed by multiple underlying factors, an aspect that remains underexplored. To address these challenges, we propose a novel test-time graph OOD detection method, termed BaCa, that calibrates OOD scores using dual dynamically updated dictionaries without fine-tuning the pre-trained model. Specifically, BaCa estimates graphons and applies a mix-up strategy solely with test samples to generate diverse boundary-aware discriminative topologies, eliminating the need to expose auxiliary datasets as outliers. We construct dual dynamic dictionaries via priority queues and attention mechanisms to adaptively capture latent ID and OOD representations, which are then used for boundary-aware OOD score calibration. Extensive experiments on real-world datasets show that BaCa significantly outperforms existing state-of-the-art methods in OOD detection.

Graph Out-of-Distribution Detection via Test-Time Calibration with Dual Dynamic Dictionaries

Recent advances in spatial transcriptomics have enabled the simultaneous measurement of gene expression profiles and spatial location information, offering a more comprehensive and in-depth view for studying the tissue microenvironment. Spatial domain identification is a crucial step in analyzing spatial transcriptomics. However, current methods have poor accuracy and visualization because they lack self-adaptability to different tissue data, and moreover, they cannot effectively extract spatial location information. To address these issues, we propose an adaptive graph contrastive learning framework based on multi-head graph attention networks (GATCL) for spatial domain identification. Specifically, we design a data augmentation module to mask and shuffle the pre-processed gene expression data to generate more differentiated negative samples. In addition, we construct the multi-head graph attention networks (MHGAT) to encode gene expression profiles and spatial location information. More importantly, we design an adaptive graph contrastive learning model that works both with positive and negative samples from spatial transcriptomics. We introduce the attention pooling mechanism to dynamically and adaptively aggregate the spots' neighborhood information, and to improve the model's generalization ability for different spatial transcriptomics data. Furthermore, we design a discriminator that adds spectral normalization to bilinear functions. Experimental results on DLPFC, breast cancer, and mouse somatosensory cortex datasets demonstrate that the average Adjusted Rand Index (ARI) scores are 0.5746, 0.6182, and 0.5319, respectively, significantly outperforming state-of-the-art methods. More importantly, GATCL provides a more detailed visualization of different spatial transcriptomics data.
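For readers unfamiliar with the Adjusted Rand Index (ARI) reported above, the following is a minimal pure-Python sketch of the standard pair-counting definition. It is independent of GATCL; the toy label lists are made up for illustration. An ARI of 1 means the two partitions agree exactly (up to relabeling), while values near 0 indicate chance-level agreement.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two labelings of the same items."""
    n = len(labels_true)
    if n != len(labels_pred):
        raise ValueError("label lists must have the same length")
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: partitions trivially agree

    # Pairs placed together in both partitions (contingency-table cells).
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    # Pairs placed together within each partition separately.
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = sum_true * sum_pred / total_pairs
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions are one cluster
    return (sum_cells - expected) / (max_index - expected)


# Same clustering with swapped label names, then a maximally mismatched one.
same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
mismatch = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(same, mismatch)
```

Because ARI is invariant to how clusters are named, predicted spatial domains can be compared with annotated layers without first matching label identities.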

GATCL: An Adaptive Contrastive Learning Framework Based on MHGAT for Spatial Domain Identification in Spatial Transcriptomics

Membership Inference Attacks (MIAs) test whether a model has memorized training data, and are a key tool for auditing privacy risks in machine learning. Recent papers report near-perfect MIA success against large vision-language models such as CLIP, but almost all evaluations train on one web-scale corpus (e.g. LAION-400M) and treat samples from a different corpus (e.g. COCO or CC12M) as non-members -- thereby turning the task into out-of-distribution (OOD) detection rather than true membership testing, introducing spurious signals unrelated to true memorization.

We revisit the problem with a distribution-matched benchmark built from the CommonPool-L corpus of DataComp. A ViT-B/16 CLIP trained on 400M pairs is accompanied by two 26-shard, i.i.d. splits that serve as member and non-member sets, sharing the exact same acquisition and preprocessing pipeline. Under this strictly in-distribution setting, every published MIA baseline collapses to chance ($\approx$51% AUC). To explain this collapse, we derive a scaling-law upper bound for similarity-based attacks showing that the expected member vs. non-member similarity gap decays as $\mathcal{O}(T/N)$ for contrastive learning with $T$ epochs over $N$ samples. Empirically, as we vary the training set size while holding all hyper-parameters fixed, the gap follows the predicted linear trend in log–log space, and Cosine Similarity Attack AUC drops from 94% to 51%. Finally, we propose a simple, white-box, gradient-based MIA that outperforms prior attacks for CLIP without relying on OOD cues. We release code, checkpoints, and data to foster comprehensive and reproducible privacy research on multimodal foundation models.
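The linear trend in log–log space follows directly from the form of the stated bound. As a sketch, write $\Delta(N)$ for the expected member vs. non-member similarity gap and assume the bound holds with some constant $C$ (both symbols are introduced here for illustration and do not appear in the abstract):

```latex
\Delta(N) \;\le\; C\,\frac{T}{N}
\quad\Longrightarrow\quad
\log \Delta(N) \;\le\; \log C + \log T - \log N .
```

With $T$ and all other hyper-parameters held fixed, the right-hand side is a straight line in $\log N$ with slope $-1$, so each tenfold increase in training set size lowers the bound on the gap by a factor of ten.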

Rethinking Membership Inference Attacks for CLIP

Understanding the structural dynamics of biomolecules is vital for elucidating biological function. With the increasing availability of molecular dynamics (MD) simulation data, deep generative models have been developed to synthesize realistic MD trajectories. However, existing approaches generate fixed-length trajectories by jointly denoising high-dimensional spatiotemporal representations, which conflicts with MD’s frame-by-frame integration process and fails to capture time-dependent conformational diversity. Motivated by the sequential nature of MD, we introduce a novel probabilistic autoregressive (**ProAR**) framework for trajectory generation. ProAR employs a dual-network system that explicitly models each frame as a multivariate Gaussian distribution and uses an anti-drifting sampling strategy to mitigate cumulative errors, thereby capturing conformational uncertainty and time-coupled structural changes while flexibly generating trajectories of arbitrary length. Experiments on ATLAS, a large-scale protein MD dataset, show that for the long trajectory generation task, our model achieves a 7.5% reduction in reconstruction RMSE and an average 25.8% improvement in conformation change accuracy over previous state-of-the-art methods. Regarding the conformation sampling task, it attains comparable performance to specialized time-independent models, offering a flexible and reliable alternative to conventional MD simulations.

ProAR: Probabilistic Autoregressive Modeling for Molecular Dynamics

Recently, continuous transform-based tensor representation has emerged as a promising tool for multi-dimensional data recovery. However, the existing continuous transforms are essentially single-layer linear mappings, which limits their ability to capture the complex relationships inherent in multi-dimensional data. To overcome this limitation, we propose a Hierarchical Nonlinear Continuous Transform-based Tensor Representation (HiNCoT) for multi-dimensional data recovery. By leveraging the hierarchical nonlinear continuous transform, HiNCoT constructs the recovered tensor from a latent tensor, which is generated by the deep representation module with a low-rank core tensor as input. Compared with the existing continuous transform-based methods, HiNCoT can more effectively capture the complex nonlinear relationships inherent in multi-dimensional data along the third dimension. To evaluate the effectiveness of the proposed HiNCoT, we propose a HiNCoT-based multi-dimensional data recovery model. Extensive experiments on diverse degradation scenarios demonstrate the superiority of our hierarchical nonlinear transform-based method over existing single-layer linear transform-based methods.

HiNCoT: Hierarchical Nonlinear Continuous Transform-based Tensor Representation for Multi-Dimensional Data Recovery

Existing end-to-end approaches to robotic manipulation often lack generalization to unseen objects or tasks due to limited data and poor interpretability. While recent Multimodal Large Language Models (MLLMs) demonstrate strong commonsense reasoning, they struggle with the geometric and spatial understanding required for pose prediction. In this paper, we propose RobMRAG, a 3D Gaussian Splatting-Enhanced Multimodal Retrieval-Augmented Generation (MRAG) framework for zero-shot robotic manipulation. Specifically, we construct a multi-source manipulation knowledge base containing object contact frames, task completion frames, and pose parameters. During inference, a Hierarchical Multimodal Retrieval module first employs hybrid semantic search to find task-relevant object prototypes, then selects the geometrically closest reference example based on pixel-level similarity and Instance Matching Distance (IMD). We further introduce a 3D-Aware Pose Refinement module based on 3D Gaussian Splatting into the MRAG framework, which aligns the pose of the reference object to the target object in 3D space. The aligned results are reprojected onto the image plane and used as input to the MLLM to enhance the generation of the final pose parameters. Extensive experiments show that on a test set containing 30 categories of household objects, our method improves the success rate by 7.76% compared to the best-performing zero-shot baseline under the same setting, and by 6.54% compared to the state-of-the-art supervised baseline. Our results validate that RobMRAG effectively bridges the gap between high-level semantic reasoning and low-level geometric execution, enabling robotic systems that generalize to unseen objects while remaining inherently interpretable.

Zero-Shot Robotic Manipulation via 3D Gaussian Splatting-Enhanced Multimodal Retrieval-Augmented Generation

Large language models (LLMs) have shown great promise in automating data science workflows. However, existing models still struggle with multi-step reasoning and tool use, limiting their effectiveness on complex data analysis tasks. To address this limitation, we propose a scalable pipeline that extracts high-quality, tool-based data analysis tasks and their executable multi-step solutions from real-world Jupyter notebooks and associated data files. Using this pipeline, we introduce NbQA, a large-scale dataset of standardized task–solution pairs that reflect authentic tool-use patterns in practical data science scenarios. To further enhance the multi-step reasoning capabilities, we present Jupiter, a framework that formulates data analysis as a search problem and applies Monte Carlo Tree Search (MCTS) to generate diverse solution trajectories for value model learning. During inference, Jupiter combines the value model and node visit counts to efficiently collect executable multi-step plans with minimal search steps. Experimental results show that Qwen2.5-7B and 14B-Instruct models on NbQA solve 75.10% and 84.44% of tasks on InfiAgent-DABench, respectively, matching or surpassing GPT-4o and advanced agent frameworks. Further evaluations demonstrate improved generalization and stronger tool-use reasoning across diverse multi-step reasoning tasks.
