Singapore

Multimodal Large Language Models (MLLMs) have demonstrated impressive progress in single-image grounding and general multi-image understanding. Recently, some methods have begun to address multi-image grounding. However, they are constrained to single-target localization and a limited range of practical tasks, owing to the lack of unified modeling for generalized grounding tasks.
Therefore, we propose GeM-VG, an MLLM capable of Generalized Multi-image Visual Grounding.
To support this, we systematically categorize and organize existing multi-image grounding tasks according to cognitive demands and introduce the MG-Data-240K dataset, addressing the limitations of existing datasets regarding target quantity and image relation.
To tackle the challenges of robustly handling diverse multi-image grounding tasks, we further propose a hybrid reinforcement finetuning strategy that integrates chain-of-thought (CoT) reasoning and direct answering, considering their complementary strengths.
This strategy adopts an R1-like algorithm guided by a carefully designed rule-based reward, effectively enhancing the model’s overall perception and reasoning capabilities.
Extensive experiments demonstrate the superior generalized grounding capabilities of our model. For multi-image grounding, it outperforms the previous leading MLLMs by 2.0% and 9.7% on MIG-Bench and MC-Bench, respectively. In single-image grounding, it achieves a 9.1% improvement over the base model on ODINW. Furthermore, our model retains strong capabilities in general multi-image understanding.
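The abstract does not spell out the rule-based reward, so the following is only a minimal sketch of what a verifiable grounding reward could look like, not the GeM-VG reward itself. It assumes boxes in `(x1, y1, x2, y2)` format, a greedy one-to-one matching between predicted and ground-truth boxes, and an F1-style score at an IoU threshold; the names `iou`, `grounding_reward`, and the threshold `thr` are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed format


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_reward(pred: List[Box], gt: List[Box], thr: float = 0.5) -> float:
    """Hypothetical multi-target reward: F1 of greedily matched boxes."""
    if not pred or not gt:
        # Both empty counts as correct; exactly one empty counts as wrong.
        return 1.0 if not pred and not gt else 0.0
    unmatched = list(gt)
    hits = 0
    for p in pred:
        # Match each prediction to its best remaining ground-truth box.
        best = max(unmatched, key=lambda g: iou(p, g), default=None)
        if best is not None and iou(p, best) >= thr:
            hits += 1
            unmatched.remove(best)
    if hits == 0:
        return 0.0
    precision = hits / len(pred)
    recall = hits / len(gt)
    return 2 * precision * recall / (precision + recall)
```

Scoring with F1 rather than a single best IoU is one way to reward a model for finding all targets without over-predicting, which matters once more than one target per query is allowed.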

AAAI 2026

GeM-VG: Towards Generalized Multi-image Visual Grounding with Multimodal Large Language Models

reinforcement learning with verifiable rewards (rlvr)

large multimodal models (lmms)

visual grounding

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.


The 3D point cloud completion task is to generate complete 3D objects from partial observations. Auto-encoder-based models generalize poorly to 3D data unseen during training. Current diffusion-based models add isotropic noise with the same variance along the $x, y, z$ axes. More importantly, these models ignore the anisotropic evolution of 3D particles from a non-equilibrium state to thermodynamic equilibrium in the real physical world, which is driven by the velocity and energy thermodynamics of the particles; this leads to unstable completions of 3D object topology. This paper presents a novel physically-based anisotropic 3D diffusion model (3DDM) to address these issues. We also present closed-form derivations of the proposed forward and reverse processes and of the loss function, which supports reproducibility. The 3DDM contains anisotropic energy-aware forward and reverse processes with a novel anisotropic quadratic loss function. The forward process adds anisotropic 3D Gaussian noise per axis and mimics the thermal non-equilibrium evolution towards Maxwellian equilibrium based on the velocity and kinetic energy evolution of 3D particles in real physical space. The reverse process learns to denoise anisotropically per axis and per timestep. The anisotropic quadratic loss function penalizes errors along certain axes, yielding a highly flexible and anisotropic reverse diffusion process and a physically realistic generative model. The 3DDM denoises along the $x, y, z$ axes with different velocities from the non-equilibrium evolution, requiring fewer than 20 diffusion steps and generalizing strongly to unseen 3D objects and real-world scenes not seen during training.

3DDM: Physically-based Anisotropic 3D Diffusion Model with 3D Gaussian for Point Cloud Completion
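To make the contrast between isotropic and per-axis noising concrete, here is a minimal sketch of a forward noising step with a separate noise level for each of the $x, y, z$ axes. It is not the 3DDM forward process: the closed-form, Maxwellian-based schedule from the paper is not given in the abstract, so the per-axis values in `alpha_bar` are placeholders, and the standard DDPM form $x_t = \sqrt{\bar\alpha}\,x_0 + \sqrt{1-\bar\alpha}\,\epsilon$ is assumed independently on each axis.

```python
import math
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def anisotropic_forward(
    points: Sequence[Point],
    alpha_bar: Point,  # hypothetical per-axis signal retention in [0, 1]
    rng: random.Random,
) -> List[Point]:
    """Noise each axis with its own variance (1 - alpha_bar[axis])."""
    noisy = []
    for p in points:
        noisy.append(
            tuple(
                math.sqrt(a) * c + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)
                for c, a in zip(p, alpha_bar)
            )
        )
    return noisy
```

Setting all three entries of `alpha_bar` to the same value recovers the isotropic case that the abstract attributes to existing diffusion-based completion models.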

Feature coding has recently emerged as a key technique for efficient transmission of intermediate representations in distributed AI systems. 
Existing approaches largely follow a *transform-quantization-entropy coding* pipeline inherited from image and video coding, where the transform module is used to remove spatial structural redundancies in visual signals. 
However, our analysis indicates that such redundancies have already been removed during feature extraction, which reduces the necessity of the transform module. Building on this insight, we propose a new *vector quantization-entropy coding* pipeline that directly encodes the extracted features via a vector quantization module and an entropy model.
The proposed transform‑free framework jointly learns the quantization codebook and entropy model, enabling end‑to‑end optimization tailored to the inherent feature characteristics. Furthermore, the proposed method inherently avoids the computational complexity of the transform module. Experiments on features from diverse architectures and tasks demonstrate that our method achieves superior rate-distortion performance compared to transform-based baselines, while significantly reducing the encoding and decoding complexity.

Transform-Free Feature Coding via Entropy-Constrained Vector Quantization
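As a rough illustration of the entropy-constrained assignment named in the title, the sketch below applies the classical entropy-constrained vector quantization rule: pick the codeword that minimizes squared distortion plus a rate penalty of $-\log_2 p_k$ bits weighted by a multiplier. This is only a sketch under stated assumptions; the paper learns the codebook and entropy model jointly, whereas here `codebook`, `probs`, and `lam` are fixed, hypothetical inputs.

```python
import math
from typing import Sequence


def ecvq_assign(
    x: Sequence[float],
    codebook: Sequence[Sequence[float]],
    probs: Sequence[float],  # assumed codeword probabilities, each > 0
    lam: float,              # rate-distortion trade-off multiplier
) -> int:
    """Return the index k minimizing ||x - c_k||^2 + lam * (-log2 p_k)."""
    best_k, best_cost = -1, math.inf
    for k, (c, p) in enumerate(zip(codebook, probs)):
        distortion = sum((xi - ci) ** 2 for xi, ci in zip(x, c))
        rate_bits = -math.log2(p)
        cost = distortion + lam * rate_bits
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k
```

With `lam = 0` this reduces to plain nearest-neighbor quantization; a larger `lam` pushes assignments toward frequent codewords that are cheaper to entropy-code, at the expense of distortion.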

Knowledge distillation (KD) is widely recognized as an effective approach for compressing large language models (LLMs). However, standard KD methods often falter when confronted with architectural or tokenization heterogeneity between teacher and student models, which creates a mismatch in their representations. While Optimal Transport (OT) provides a promising solution to align these representations, most OT-based methods rely on a single cost function, which is insufficient to capture the multifaceted discrepancies between models with distinct designs. To address this limitation, we introduce Multi-Cost Wasserstein Knowledge Distillation (MCW-KD), a novel framework that enhances KD by simultaneously optimizing several cost functions within a unified OT formulation. MCW-KD employs specific cost matrices to effectively align both the final hidden states and the output distributions of the models. We also provide a rigorous theoretical foundation for the proposed Multi-Cost Wasserstein Distance, ensuring both mathematical validity and computational tractability. Extensive experiments on instruction-following datasets demonstrate that MCW-KD significantly improves student model performance compared to state-of-the-art KD baselines, especially when teacher and student models have different tokenizers.

MCW-KD: Multi-Cost Wasserstein Knowledge Distillation for Large Language Models

Large language models (LLMs) exhibit diverse response behaviors, costs, and strengths, making it challenging to select the most suitable LLM for a given user query. We study the problem of adaptive multi-LLM selection in an online setting, where the learner interacts with users through multi-step query refinement and must choose LLMs sequentially without access to offline datasets or model internals. A key challenge arises from unstructured context evolution: the prompt dynamically changes in response to previous model outputs via a black-box process, which cannot be simulated, modeled, or learned. To address this, we propose the first contextual bandit framework for sequential LLM selection under unstructured prompt dynamics. We formalize a notion of myopic regret and develop a LinUCB-based algorithm that provably achieves sublinear regret without relying on future context prediction. We further introduce budget-aware and positionally-aware (favoring early-stage satisfaction) extensions to accommodate variable query costs and user preferences for early high-quality responses. Our algorithms are theoretically grounded and require no offline fine-tuning or dataset-specific training. Experiments on diverse benchmarks demonstrate that our methods outperform existing LLM routing strategies in both accuracy and cost-efficiency, validating the power of contextual bandits for real-time, adaptive LLM selection.

Online Multi-LLM Selection via Contextual Bandits Under Unstructured Context Evolution
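The abstract builds on LinUCB, so a minimal sketch of the standard disjoint LinUCB rule may help: each arm (here, one candidate LLM) keeps a ridge-regression estimate of reward given a context vector and is scored by its estimate plus an exploration bonus. This is the textbook algorithm, not the paper's variant; the myopic-regret analysis and the budget-aware and positionally-aware extensions are not modeled, and the class and function names are hypothetical.

```python
import math
from typing import List, Sequence


def _dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def _matvec(m: List[List[float]], v: Sequence[float]) -> List[float]:
    return [_dot(row, v) for row in m]


class LinUCBArm:
    """One arm with A = I + sum(x x^T) and b = sum(reward * x)."""

    def __init__(self, dim: int) -> None:
        # Store A^{-1} directly; it starts as the identity matrix.
        self.a_inv = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
        self.b = [0.0] * dim

    def ucb(self, x: Sequence[float], alpha: float) -> float:
        theta = _matvec(self.a_inv, self.b)  # ridge estimate A^{-1} b
        bonus = alpha * math.sqrt(_dot(x, _matvec(self.a_inv, x)))
        return _dot(theta, x) + bonus

    def update(self, x: Sequence[float], reward: float) -> None:
        # Sherman-Morrison rank-one update of A^{-1} after A += x x^T.
        ax = _matvec(self.a_inv, x)
        denom = 1.0 + _dot(x, ax)
        for i in range(len(x)):
            for j in range(len(x)):
                self.a_inv[i][j] -= ax[i] * ax[j] / denom
            self.b[i] += reward * x[i]


def select_arm(arms: List[LinUCBArm], x: Sequence[float], alpha: float) -> int:
    """Pick the arm with the highest upper confidence bound for context x."""
    return max(range(len(arms)), key=lambda i: arms[i].ucb(x, alpha))
```

Because the rule only needs the current context vector, it fits the setting described in the abstract, where future prompts cannot be simulated or predicted.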

Existing linguistic steganography methods primarily rely on content transformations to conceal secret messages. However, they often cause subtle, innocent-looking deviations between normal and stego texts, posing potential security risks in real-world applications. To address this challenge, we propose a content-preserving linguistic steganography paradigm for perfectly secure covert communication without modifying the cover text. Based on this paradigm, we introduce CLstega (Content-preserving Linguistic steganography), a novel method that embeds secret messages through controllable distribution transformation. CLstega first applies an augmented masking strategy to locate and mask embedding positions, where MLM (masked language model)-predicted probability distributions are easily adjustable for transformation. Subsequently, a dynamic distribution steganographic coding strategy is designed to encode secret messages by deriving target distributions from the original probability distributions. To achieve this transformation, CLstega carefully selects target words for embedding positions as labels to construct a masked sentence dataset, which is used to fine-tune the original MLM, producing a target MLM capable of directly extracting secret messages from the cover text. This approach ensures perfect security of secret messages while fully preserving the integrity of the original cover text. Experimental results demonstrate that CLstega achieves a 100% extraction success rate and outperforms existing methods in security, effectively balancing embedding capacity and security.

A Content-Preserving Secure Linguistic Steganography

Autonomous vehicles must navigate safely in complex driving environments. Imitating a single expert trajectory, as in regression-based approaches, usually does not explicitly assess the safety of the predicted trajectory. Selection-based methods address this by generating multiple trajectory candidates and predicting a safety score for each. However, they face optimization challenges in precisely selecting the best option from thousands of candidates and distinguishing subtle but safety-critical differences, especially in rare and challenging scenarios. We propose DriveSuprim to overcome these challenges and advance the selection-based paradigm through a coarse-to-fine scheme for progressive candidate filtering, a rotation-based augmentation method to improve robustness in out-of-distribution scenarios, and a self-distillation framework to stabilize training. DriveSuprim achieves state-of-the-art performance, reaching 93.5% PDMS in NAVSIM v1 and 87.1% EPDMS in NAVSIM v2 without extra data, with 83.02 Driving Score and 60.00 Success Rate on Bench2Drive, demonstrating superior planning capabilities in various driving scenarios.

DriveSuprim: Towards Precise Trajectory Selection for End-to-End Planning
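The coarse-to-fine filtering idea can be sketched generically: a cheap scorer prunes a large candidate set to a shortlist, and a more discriminative scorer picks the final candidate from that shortlist. The sketch below assumes two scoring callables and a shortlist size; `coarse_score`, `fine_score`, and `keep` are hypothetical stand-ins, not DriveSuprim's learned scoring heads.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def coarse_to_fine_select(
    candidates: Sequence[T],
    coarse_score: Callable[[T], float],  # cheap score over all candidates
    fine_score: Callable[[T], float],    # costlier score over the shortlist
    keep: int,
) -> T:
    """Keep the top `keep` candidates by coarse score, then pick the best by fine score."""
    if not candidates or keep < 1:
        raise ValueError("need at least one candidate and keep >= 1")
    shortlist = sorted(candidates, key=coarse_score, reverse=True)[:keep]
    return max(shortlist, key=fine_score)
```

The second stage only ever compares a handful of near-best candidates, which is where the subtle but safety-critical differences mentioned in the abstract have to be resolved.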

The rapid evolution of generative technologies necessitates reliable methods for detecting AI-generated images. A critical limitation of current detectors is their failure to generalize to images from unseen generative models, as they often overfit to source-specific semantic cues rather than learning universal generative artifacts. To overcome this, we introduce a simple yet remarkably effective *pixel-level mapping* pre-processing step to disrupt the images' pixel value distribution and break the fragile, non-essential semantic patterns that detectors commonly exploit as shortcuts. This forces the detector to focus on more fundamental and generalizable high-frequency traces inherent to the image generation process. Through comprehensive experiments on GAN and diffusion-based generators, we show that our approach significantly boosts the cross-generator performance of state-of-the-art detectors. Extensive analysis further supports our hypothesis that the disruption of semantic cues is the key to generalization.

Beyond Semantic Features: Pixel-level Mapping for Generalized AI-Generated Image Detection
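The abstract does not define the mapping itself, so the following is purely an illustrative guess at what a pixel-level mapping pre-processing step could look like: a fixed, seeded permutation of the 256 intensity values applied as a lookup table to every pixel. The functions `make_pixel_lut` and `apply_lut`, the 8-bit grayscale image format, and the choice of a random permutation are all assumptions rather than the paper's method.

```python
import random
from typing import List

Image = List[List[int]]  # assumed 8-bit grayscale image as rows of 0..255


def make_pixel_lut(seed: int) -> List[int]:
    """Build a fixed bijective mapping over the 256 intensity values."""
    lut = list(range(256))
    random.Random(seed).shuffle(lut)
    return lut


def apply_lut(image: Image, lut: List[int]) -> Image:
    """Remap every pixel value through the lookup table, keeping the layout."""
    return [[lut[v] for v in row] for row in image]
```

A bijective value mapping of this kind scrambles the intensity distribution while keeping the spatial layout intact, so which pixels are equal to which, and where value changes occur, is preserved.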

Existing core-set selection methods predominantly rely on heuristic scoring signals such as training dynamics or model uncertainty, lacking explicit modeling of data likelihood. This omission may prevent the constructed subset from capturing subtle yet critical distributional structures that underpin effective model training. In this work, we propose a novel, theoretically grounded approach that leverages diffusion models to estimate data likelihood via reconstruction deviation induced by partial reverse denoising. Specifically, we establish a formal connection between reconstruction error and data likelihood, grounded in the Evidence Lower Bound (ELBO) of Markovian diffusion processes, thereby enabling a principled, distribution-aware scoring criterion for data selection. Complementarily, we introduce an efficient information-theoretic method to identify the optimal reconstruction timestep, ensuring that the deviation provides a reliable signal indicative of underlying data likelihood. Extensive experiments on ImageNet demonstrate that reconstruction deviation offers an effective scoring criterion, consistently outperforming existing baselines across selection ratios, and closely matching full-data training using only 50% of the data. Further analysis shows that the likelihood-informed nature of our score yields insights into data selection, shedding light on the interplay between data distributional characteristics and model learning preferences.

Diffusion Reconstruction-based Data Likelihood Estimation for Core-Set Selection

Tracking and segmentation play essential roles in video understanding, providing basic positional information and temporal association of objects within video sequences. Despite their shared objective, existing approaches often tackle these tasks using specialized architectures or modality-specific parameters, limiting their generalization and scalability. Recent efforts have attempted to unify multiple tracking and segmentation sub-tasks from the perspectives of any-modality input or multi-task inference. However, these approaches tend to overlook two critical challenges: the distributional gap across different modalities and the feature representation gap across tasks. These issues hinder effective cross-task and cross-modal knowledge sharing, ultimately constraining the development of a true generalist model. To address these limitations, we propose a universal tracking and segmentation framework named SATA, which unifies a broad spectrum of tracking and segmentation sub-tasks with any-modality input. Specifically, a Decoupled Mixture-of-Expert (DeMoE) mechanism is presented to decouple the unified representation learning task into the modeling of cross-modal shared knowledge and modality-specific information, thus enabling the model to maintain flexibility while enhancing generalization. Additionally, we introduce a Task-aware Multi-object Tracking (TaMOT) pipeline that consolidates all task outputs into a unified set of instances with calibrated ID information, thereby alleviating the degradation of task-specific knowledge during multi-task training. SATA demonstrates superior performance on 18 challenging tracking and segmentation benchmarks, offering a novel perspective for more generalizable video understanding.

Tracking and Segmenting Anything in Any Modality

We introduce MAVERIX (Multimodal audiovisual Evaluation and Recognition IndeX), a unified benchmark to probe video understanding in multimodal LLMs, encompassing video, audio, and text inputs with human performance baselines.
Although recent advancements in models with vision and audio understanding capabilities have shown substantial progress, the field lacks a standardized evaluation framework to thoroughly assess their cross-modality comprehension performance. MAVERIX curates 2,556 questions from 700 videos, in the form of both multiple-choice and open-ended formats, explicitly designed to evaluate multimodal models through questions that necessitate tight integration of video and audio information, spanning a broad spectrum of agentic scenarios. MAVERIX uniquely provides models with audiovisual questions, closely mimicking the multimodal perceptual experiences available to humans during inference and decision-making processes. To our knowledge, MAVERIX is the first benchmark aimed explicitly at assessing comprehensive audiovisual integration in such granularity. Experiments with state-of-the-art models, including Qwen 2.5 Omni and Gemini 2.5 Flash-Lite, show performance around 64% accuracy, while human experts reach near-ceiling performance of 92.8%, exposing a substantial gap to human-level comprehension. With standardized evaluation protocols, a rigorously annotated pipeline, and a public toolkit, MAVERIX establishes a challenging testbed for advancing audiovisual multimodal intelligence.
