Singapore

Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities, yet their ability to ground language in complex, interactive environments such as video games remains a critical frontier. Existing benchmarks are inadequate for this purpose: real-world datasets like RefCOCO introduce a domain gap; GUI-centric benchmarks lack the complexity of modern game interfaces; and existing game-specific benchmarks are often too simplistic or narrow, failing to assess fine-grained, generalizable grounding capabilities.
To address this gap, we propose GGBench, a large-scale, cross-genre benchmark designed to probe the grounding capabilities of LVLMs in diverse gaming scenarios. GGBench spans 10 genre categories, including card games, first-person shooters, and role-playing games, with a total of 1335 test images. It focuses on tasks that require connecting natural language instructions to specific in-game objects and UI elements.
Experimental results show that existing models perform poorly on GGBench, with weak grounding abilities, especially in complex game scenarios. Because game data is limited in scale, fine-tuning these models for gaming scenarios is also challenging. To address this, we propose Game-R1, a novel training method centered on the Grounded Reinforcement Policy Optimization (GRPO) algorithm. GRPO makes the most of limited interaction data and enables robust few-shot generalization across games. Extensive experiments show that Game-R1 significantly outperforms existing LVLMs on GGBench, validating our approach.
GGBench provides a comprehensive evaluation platform for subsequent research on agents in gaming environments.
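
The abstract does not state how grounding is scored. As a minimal sketch, assuming predictions and targets are axis-aligned boxes and that a prediction counts as correct when its IoU with the target is at least 0.5 (a convention common on RefCOCO-style benchmarks, not confirmed for GGBench), accuracy could be computed as follows. The names `iou` and `grounding_accuracy` are illustrative.

```python
from typing import Sequence, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_accuracy(
    preds: Sequence[Box], targets: Sequence[Box], threshold: float = 0.5
) -> float:
    """Fraction of predictions whose IoU with the target meets the threshold."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    hits = sum(iou(p, t) >= threshold for p, t in zip(preds, targets))
    return hits / len(preds)


# Hypothetical example: one exact hit and one complete miss.
example_accuracy = grounding_accuracy(
    [(0, 0, 10, 10), (0, 0, 10, 10)],
    [(0, 0, 10, 10), (20, 20, 30, 30)],
)
```

The 0.5 threshold is a parameter here so that stricter or looser criteria can be tried without changing the code.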

AAAI 2026

Game Ground Bench: Probing the Limits of LVLMs in Complex Semantic Grounding Across Game Universes

large vision-language models, LLM for gaming

visual grounding

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.



We propose PFAvatar (Pose-Fusion Avatar), a new method that reconstructs high-quality 3D avatars from "Outfit of the Day" (OOTD) photos, which exhibit diverse poses, occlusions, and complex backgrounds. Our method consists of two stages: (1) fine-tuning a pose-aware diffusion model from few-shot OOTD examples and (2) distilling a 3D avatar represented by a neural radiance field (NeRF). In the first stage, unlike previous methods that segment images into assets (e.g., garments, accessories) for 3D assembly, which is prone to inconsistency, we avoid decomposition and directly model the full-body appearance. By integrating a pre-trained ControlNet for pose estimation and a novel Condition Prior Preservation Loss (CPPL), our method enables end-to-end learning of fine details while mitigating language drift in few-shot training. Our method completes personalization in just 5 minutes, achieving a 48× speed-up compared to previous approaches. In the second stage, we introduce a NeRF-based avatar representation optimized by canonical SMPL-X space sampling and Multi-Resolution 3D-SDS. Compared to mesh-based representations that suffer from resolution-dependent discretization and erroneous occluded geometry, our continuous radiance field can preserve high-frequency textures (e.g., hair) and handle occlusions correctly through transmittance.
Experiments demonstrate that PFAvatar outperforms state-of-the-art methods in terms of reconstruction fidelity, detail preservation, and robustness to occlusions/truncations, advancing practical 3D avatar generation from real-world OOTD albums. In addition, the reconstructed 3D avatars support downstream applications such as virtual try-on, animation, and human video reenactment, further demonstrating the versatility and practical value of our approach.
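
The remark that a radiance field handles occlusions through transmittance can be illustrated with the standard NeRF volume-rendering quadrature, in which each sample along a ray is weighted by how much light survives the samples in front of it. This is a generic sketch, not PFAvatar's implementation; `render_weights` and the sample densities are hypothetical.

```python
import math
from typing import List, Sequence


def render_weights(densities: Sequence[float], deltas: Sequence[float]) -> List[float]:
    """Per-sample compositing weights w_i = T_i * (1 - exp(-sigma_i * delta_i)).

    T_i is the transmittance: the fraction of light that reaches sample i
    without being absorbed by the samples in front of it.
    """
    weights = []
    transmittance = 1.0
    for sigma, delta in zip(densities, deltas):
        step_transmittance = math.exp(-sigma * delta)
        alpha = 1.0 - step_transmittance  # opacity of this segment
        weights.append(transmittance * alpha)
        transmittance *= step_transmittance
    return weights


# Empty space, then a nearly opaque surface, then a sample hidden behind it.
weights = render_weights([0.0, 50.0, 5.0], [1.0, 1.0, 1.0])
```

In this toy ray, the hidden third sample receives a weight close to zero even though its density is nonzero, which is the sense in which transmittance accounts for occlusion.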

PFAvatar: Pose-Fusion 3D Personalized Avatar Reconstruction from Real-World Outfit-of-the-Day Photos

Model merging combines expert models for multitask performance but faces challenges from parameter interference. This has sparked recent interest in controllable model merging, giving users the ability to explicitly balance performance trade-offs. Existing approaches employ a compile-then-query paradigm, performing a costly offline multi-objective optimization to enable fast, preference-aware model generation. This offline stage typically involves iterative search or dedicated training, with complexity that grows exponentially with the number of tasks. To overcome these limitations, we shift the perspective from parameter-space optimization to a direct correction of the model's final representation. Our approach models this correction as an optimal linear transformation, yielding a closed-form solution that replaces the entire offline optimization process with a single-step, architecture-agnostic computation. This solution directly incorporates user preferences, allowing a Pareto-optimal model to be generated on-the-fly with complexity that scales linearly with the number of tasks. Experimental results show our method generates a superior Pareto front with more precise preference alignment and drastically reduced computational cost. Code is available at: https://github.com/CREAHDD/ReACT

From Parameter to Representation: A Closed-Form Approach for Controllable Model Merging

3D Gaussian Splatting (3DGS) has recently demonstrated significant potential for streaming dynamic scenes, enabling the synthesis of photo-realistic, real-time free-viewpoint videos (FVVs). Conventional streaming pipelines optimize each frame independently, i.e., the attributes of the 3D Gaussians (3DGs) responsible for static regions should be identical across all frames but are changed during optimization, causing temporal color inconsistency and visual flickering artifacts in the static regions. To tackle this, we propose CPOStream, which uses a prediction module and an observation module to determine the state of each 3DG. Specifically, the prediction module records the 3DGs that were inactive over the past K frames; these are ignored when optimizing the reconstruction of the current frame. The attributes of those 3DGs are thus kept consistent across the past K frames, guaranteeing temporal consistency. Additionally, the observation module performs motion detection and identifies new 3DGs that are not recorded in the prediction module and were first detected by the observation module within the past K frames. The attributes of those 3DGs are optimized during the current frame reconstruction. Experiments on multiple real-world FVV benchmarks show that CPOStream substantially reduces temporal flickering and improves reconstruction fidelity, achieving state-of-the-art performance.

CPOStream: Collaborating Prediction and Observation for Flicker-Free Streamable Free-Viewpoint Video with 3DGS

Attributing synthetic images to their source generative models is critical for digital forensics and security. While most existing attribution methods can distinguish images produced by known models and reject those from unknown ones, they are unable to verify whether a given image was produced by a specific, previously unseen model. To address this limitation, we formulate an open-set verification problem: determining whether a given image was generated by a specific model. Our key insight is that synthetic images from different models show consistent, content-independent fingerprints in their amplitude spectrum. Based on this insight, we design a dynamic fingerprint simulator capable of simulating over 1.6 trillion generative model architectures. We further train an extractor to capture model-specific fingerprint representations with supervised contrastive learning, enabling accurate attribution of synthetic images, even from previously unseen models. Our method does not rely on any synthetic images; instead, it is trained solely on real images. On DMDetection and AIGCBenchmark, which together comprise dozens of state-of-the-art and in-the-wild generative models, our method improves the attribution performance (AUC) of the prior method from random level to 94.05% and 83.05%, respectively. On the GenImage and OSMA datasets, we obtain 85.08% and 88.48% OSCR, outperforming the SOTA methods by 4.30% and 9.37% under the same settings.
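
For readers unfamiliar with the term, the amplitude spectrum of an image is the magnitude of its 2D discrete Fourier transform. The sketch below computes it with a naive DFT on a tiny grayscale grid; it only illustrates the quantity the abstract refers to and is not the paper's fingerprint simulator or extractor. The function name `amplitude_spectrum` and the test pattern are hypothetical, and a real pipeline would use an FFT library rather than this quartic-time loop.

```python
import cmath
from typing import List, Sequence


def amplitude_spectrum(image: Sequence[Sequence[float]]) -> List[List[float]]:
    """Magnitude of the 2D DFT of a small grayscale image (naive O(N^4) loop)."""
    height, width = len(image), len(image[0])
    spectrum = []
    for u in range(height):
        row = []
        for v in range(width):
            total = 0j
            for y in range(height):
                for x in range(width):
                    angle = -2.0 * cmath.pi * (u * y / height + v * x / width)
                    total += image[y][x] * cmath.exp(1j * angle)
            row.append(abs(total))
        spectrum.append(row)
    return spectrum


# Vertical stripes with a period of two pixels: energy appears at the zero
# frequency and at horizontal frequency index 2 only.
stripes = [[1.0, 0.0, 1.0, 0.0] for _ in range(4)]
spectrum = amplitude_spectrum(stripes)
```

A periodic pattern such as the stripes above concentrates its energy in a few frequency bins, which is why regular traces are easier to see in the spectrum than in pixel space.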

One for All: Synthesis-Free Fingerprint Learning for Attribution of In-the-Wild Synthetic Images

In-context learning (ICL) offers a promising paradigm for universal medical image analysis, enabling models to perform diverse image processing tasks without retraining. However, current ICL models for medical imaging remain limited in two critical aspects: they cannot simultaneously achieve high-fidelity predictions and global anatomical understanding, and there is no unified model trained across diverse medical imaging tasks (e.g., segmentation and enhancement) and anatomical regions. As a result, the full potential of ICL in medical imaging remains underexplored. Thus, we present Medverse, a universal ICL model for 3D medical imaging, trained on 22 datasets covering diverse tasks in universal image segmentation, transformation, and enhancement across multiple organs, imaging modalities, and clinical centers. Medverse employs a next-scale autoregressive in-context learning framework that progressively refines predictions from coarse to fine, generating consistent, full-resolution volumetric outputs and enabling multi-scale anatomical awareness. We further propose a blockwise cross-attention module that facilitates long-range interactions between context and target inputs while preserving computational efficiency through spatial sparsity. Medverse is extensively evaluated on a broad collection of held-out datasets covering previously unseen clinical centers, organs, species, and imaging modalities. Results demonstrate that Medverse substantially outperforms existing ICL baselines and establishes a novel paradigm for in-context learning. Code and model weights will be made publicly available.

Medverse: A Universal Model for Full-Resolution 3D Medical Image Segmentation, Transformation and Enhancement

Egocentric human pose estimation (HPE) plays a crucial role in immersive applications such as virtual and augmented reality. However, existing methods relying on either visual or sparse inertial data alone often suffer from occlusion or ill-posed problems. In this work, we propose SAME, a novel spatial-aware multimodal fusion framework combining the complementary signals from the stereo images and sparse IMUs for accurate and robust egocentric HPE. It adopts a two-stage network based on a dual coordinate frame to mitigate the coordinate inconsistencies among the stereo cameras and the IMUs. In the first stage, the IMU signals are transformed into the local frame and iteratively fused with the stereo images for estimating 3D poses in the local frame. In the second stage, the local poses are transformed into the global frame with the 6DOF head poses provided by the head-mounted display's (HMD) SLAM algorithm and then temporally aggregated via a temporal Transformer network. Meanwhile, to achieve geometric and semantic alignment among multi-modal features, we present a depth-guided spatial-aware deformable stereo attention network and a modality-aware Transformer decoder for cross-view and cross-modal feature fusion. Extensive experiments demonstrate that our approach achieves state-of-the-art performance on the public EMHI multi-modal egocentric pose estimation benchmark.

SAME: Spatial-Aware Multimodal Egocentric Human Pose Estimation

Shapley values are widely recognized as a principled method for attributing importance to input features in machine learning. However, the exact computation of Shapley values scales exponentially with the number of features, severely limiting the practical application of this powerful approach. The challenge is further compounded when the predictive model is probabilistic, as in Gaussian processes (GPs), where the outputs are random variables rather than point estimates, necessitating additional computational effort in modeling higher-order moments. In this work, we demonstrate that for an important class of GPs known as FANOVA GPs, which explicitly model all main effects and interactions, exact Shapley attributions for both local and global explanations can be computed in *quadratic* time. For *local, instance-wise explanations*, we define a stochastic cooperative game over function components and compute the *exact stochastic Shapley value* in quadratic time, capturing both the expected contribution and uncertainty. For *global explanations*, we introduce a deterministic, variance-based value function and compute exact Shapley values that quantify each feature's contribution to the model's overall sensitivity. Our methods leverage a closed-form (stochastic) Möbius representation of the FANOVA decomposition and introduce recursive algorithms, inspired by Newton's identities, to efficiently compute the mean and variance of Shapley values. Our work enhances the utility of explainable AI, as demonstrated by empirical studies, by providing more scalable, axiomatically sound, and uncertainty-aware explanations for predictions generated by structured probabilistic models.
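
To make the exponential cost concrete, the following brute-force sketch evaluates the classical Shapley formula by enumerating every coalition of the other players. It is the generic textbook definition, not the quadratic-time FANOVA GP algorithm described in the abstract; the toy game `toy_value` and its weights are hypothetical.

```python
from itertools import combinations
from math import factorial
from typing import Callable, FrozenSet, List


def exact_shapley(n: int, value: Callable[[FrozenSet[int]], float]) -> List[float]:
    """Exact Shapley values for an n-player game; calls value on all 2^n coalitions."""
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Probability weight of a coalition of this size preceding player i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                phi[i] += weight * (value(coalition | {i}) - value(coalition))
    return phi


def toy_value(coalition: FrozenSet[int]) -> float:
    """Additive main effects plus one pairwise interaction between players 0 and 1."""
    main_effects = [1.0, 2.0, 3.0]
    total = sum(main_effects[j] for j in coalition)
    if 0 in coalition and 1 in coalition:
        total += 2.0
    return total


phi = exact_shapley(3, toy_value)
```

In this toy game the interaction term is split equally between players 0 and 1, and the attributions sum to the value of the full coalition, as the efficiency axiom requires. The nested enumeration is what becomes infeasible as the number of features grows.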

Exact Shapley Attributions in Quadratic-time for FANOVA Gaussian Processes

3D multi-human reconstruction from single images holds significant potential for advancing AR/VR applications. While remarkable progress has been made in single-human reconstruction, existing methods face challenges when reconstructing multiple humans. These challenges include: (1) severe inter-occlusion that disrupts individual body structures, and (2) the absence of physically plausible relative positioning among subjects.
We present DECON, a novel DEcouple-and-reCONstruct framework that systematically addresses these limitations through two technical innovations: (1) a decouple-and-reconstruct pipeline with multi-view synthesis, which separates individuals and reconstructs detailed 3D bodies from a single image, and (2) a Perspective-Aware Position Optimization (PAPO) approach, which ensures realistic positioning by fixing overlaps and gaps between subjects.
Extensive experiments demonstrate our method's capability to reconstruct fully separated, anatomically complete 3D humans with clothed-geometric details and plausible interactions. Quantitative evaluations show a 54% reduction in Chamfer Distance and a 35% reduction in Point-to-Surface Distance compared to state-of-the-art methods. Our source code will be publicly released.
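
For reference, Chamfer Distance compares two point sets by averaging nearest-neighbor distances in both directions. Definitions vary across papers (squared versus unsquared distances, sums versus means), and the abstract does not say which variant is used, so the sketch below is one common form under that assumption. The function name `chamfer_distance` and the sample points are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def chamfer_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric Chamfer distance: mean nearest-neighbor distance, both directions."""
    if not a or not b:
        raise ValueError("both point sets must be non-empty")

    def one_way(src: Sequence[Point], dst: Sequence[Point]) -> float:
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(a, b) + one_way(b, a)


# Two single-point sets one unit apart: each direction contributes 1.0.
example_distance = chamfer_distance([(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)])
```

This brute-force version is quadratic in the number of points; evaluation code for dense meshes typically samples points and uses a spatial index instead.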

DECON: Reconstruction of Clothed-Geometric Multiple Humans from a Single Image via Geometry-Guided Decoupling

Bokeh is used in photography to emphasize the selected subject by smoothly blurring the out-of-focus region with appealing highlights. While recent advances have achieved impressive results in rendering realistic blur, existing frameworks typically rely on disparity maps and bokeh-relevant inputs (e.g., focal distance and blur size), and face significant challenges in video bokeh rendering due to limited temporal consistency. In this paper, we propose BokehCrafter, the first video diffusion framework that generates temporally coherent and visually pleasing bokeh effects from all-in-focus video inputs under user-friendly input conditions. Specifically, we leverage a dual-stream attention mechanism, integrating a reference image branch and a rendering instruction branch. We propose a Bokeh Image Extraction (BIE) module and a CLIP-based text encoder to extract image and text features, respectively, whose outputs are fused via a Text-Image Fusion (TIF) module to enable fine-grained and controllable bokeh rendering. To support the novel capabilities of our model, we construct Video Bokeh Scenes (VBS), a large-scale dataset containing a wide variety of bokeh videos with corresponding rendering instructions, across various scenes and rendering settings. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art methods in both bokeh rendering quality and temporal consistency.

BokehCrafter: Taming Video Diffusion Models for Controllable Bokeh Rendering

Recent advances in vision-language models (VLMs) have demonstrated remarkable potential in embodied navigation tasks. However, existing robot-centric datasets primarily focus on traditional 3D tasks such as perception and prediction, lacking adequate support for vision-language tasks. Vision-language navigation (VLN) is a key capability for achieving human-like and interpretable navigation in complex environments. In this study, we present CoT-VLNBench, the first large-scale benchmark and dataset designed for chain-of-thought (CoT) reasoning in quadruped robot navigation. Our dataset encompasses a diverse range of indoor and outdoor scenes, multi-step navigation trajectories, and rich natural language instructions, all annotated with fine-grained CoT reasoning traces. Specifically, it contains 175K frames, 5.25M 3D bounding boxes, and 875K vision–question–answer (VQA) pairs. This comprehensive resource enables thorough evaluation of embodied agents' perceptual and step-by-step reasoning abilities. Furthermore, we propose CoT-VLN, a state-of-the-art 7B VLN model that integrates visual, linguistic, and reasoning modules to facilitate interpretable and effective navigation. Extensive experiments demonstrate that our approach significantly outperforms existing non-VLM baselines on the new benchmark, underscoring the importance of CoT-VLN in embodied navigation. We hope that CoT-VLNBench will serve as a valuable resource to advance research at the intersection of robotics, vision, language, and reasoning.

