Spatial understanding is a critical capability for Large Vision-Language Models (LVLMs) to advance embodied AI applications. Existing works primarily focus on enhancing spatial understanding within a single frame, i.e., injecting 3D spatial concepts into LVLMs under a single coordinate system. However, such improvements struggle in real-world tasks that require consistent cross-view spatial reasoning. In this paper, we propose \textbf{CVVG-Reasoner} (\textbf{C}ross-\textbf{V}iew \textbf{V}isual \textbf{G}eometries), which lifts single-frame spatial comprehension to unified cross-view spatial understanding by mimicking \textit{\textbf{human-like cross-view reasoning mechanisms}}. First, we introduce \textbf{MV3DSR} (\textbf{M}ulti-\textbf{V}iew \textbf{3D} \textbf{S}patial \textbf{R}easoning), a scalable pipeline for generating cross-view spatial reasoning data, and use it to construct MV3DSR-Dataset, a large-scale dataset covering diverse 3D cross-view reasoning tasks. Building on MV3DSR, we further propose MV3DSR-Bench, a comprehensive benchmark for evaluating cross-view spatial reasoning capabilities. Second, we design a three-stage training strategy: the first two stages progressively equip the model with (1) fundamental spatial knowledge and (2) human-like cross-view reasoning patterns, while the final stage employs reinforcement learning to further boost performance. Extensive experiments demonstrate that our \textbf{CVVG-Reasoner} significantly outperforms existing 3D Large Language Models (LLMs) and advanced LVLMs on cross-view tasks while maintaining robust performance on out-of-domain data. Ablation studies further reveal that injecting human-like reasoning patterns yields a remarkable 44\% performance gain, validating the effectiveness of our design.