Singapore

Score Distillation Sampling (SDS) has achieved remarkable success in text-to-3D content generation. However, SDS-based methods struggle to maintain semantic fidelity to user prompts, particularly when the prompts involve multiple objects with intricate interactions.
While existing approaches often address 3D consistency by fine-tuning multiview diffusion models on 3D datasets, this strategy inadvertently exacerbates text-3D alignment degradation.
The limitation stems from SDS's inherent accumulation of view-independent biases during optimization, which causes the optimization to diverge progressively from the ideal text alignment direction.
To alleviate this limitation, we propose a novel SDS objective, dubbed Textual Coherent Score Distillation (TCSD), which integrates alignment feedback from multimodal large language models (MLLMs). TCSD leverages the cross-modal understanding capabilities of MLLMs to assess and guide text-3D correspondence during optimization. We further develop 3DLLaVA-CRITIC, a fine-tuned MLLM specialized for evaluating multiview text alignment in 3D generations. Additionally, we introduce an LLM-layout initialization that significantly accelerates optimization convergence through semantic-aware spatial configuration.
Our framework, CoherenDream, achieves consistent improvement across multiple metrics on a TIFA subset. As the first study to incorporate MLLMs into SDS optimization, we also conduct extensive ablation studies to explore optimal MLLM adaptations for 3D generation tasks.
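
For orientation, the SDS gradient that the abstract builds on is usually written as below. This is the commonly cited baseline form, not the TCSD objective, whose exact formulation the abstract does not give. The symbols are assumptions introduced here: $g(\theta)$ is a differentiable renderer of the 3D parameters $\theta$, $y$ is the text prompt, $\hat{\epsilon}_\phi$ is the pretrained diffusion model's noise prediction, and $w(t)$ is a timestep weighting.

```latex
x = g(\theta), \qquad x_t = \alpha_t\, x + \sigma_t\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)

\nabla_\theta \mathcal{L}_{\mathrm{SDS}}
  = \mathbb{E}_{t,\epsilon}\!\left[\, w(t)\,\bigl(\hat{\epsilon}_\phi(x_t;\, y,\, t) - \epsilon\bigr)\,
    \frac{\partial x}{\partial \theta} \right]
```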

AAAI 2026

CoherenDream: Boosting Holistic Text Coherence in 3D Generation via Multimodal Large Language Models Feedback

diffusion models for vision

3d computer vision

language and vision


poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held at Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.

Accurate feature matching between image pairs is fundamental for various computer vision applications. In the detector-based pipeline, the feature matcher aims to find the optimal feature correspondences, and the match filter is used to further remove mismatches. However, their connection is rarely exploited since they are usually treated as two separate problems in previous methods, which may lead to suboptimal results. In this paper, we propose an end-to-end collaborative feature matching (CFM) method, which contains a keypoint learning (KL) module and a correspondence learning (CL) module, to bridge the gap between these two lines of work. The former improves the discrimination of keypoints and provides high-quality dynamic matches for the CL module. The latter further captures the rich context of matches and gives effective feedback to the KL module. These two modules reinforce each other in a progressive manner. In addition, we develop an efficient version of CFM, named ECFM, which uses an adaptive sampling strategy to avoid the negative influence of uninformative keypoints. Experimental results indicate that both methods outperform state-of-the-art competitors in the tasks of relative pose estimation and visual localization. The code and a 1-minute video demo are provided in the supplementary materials.

Collaborative Feature Matching with Progressive Correspondence Learning

Recently, 3D Gaussian Splatting for scene rendering has attracted much attention in computer vision and graphics, but it generally suffers from heavy computation and storage burdens when handling large-scale scenes. Some existing works employ a divide-and-conquer strategy to alleviate this issue, where a large input scene is divided into many local blocks and each block is handled separately. However, such a strategy generally leads to limited performance due to the inevitable inconsistency among the 3D Gaussians from different blocks. To address this problem, we propose Consistent Anchor Guided Gaussian Splatting for large-scale scene rendering under the divide-and-conquer strategy, called CAG-GS. In CAG-GS, a set of learnable anchors for each local block is injected with the corresponding semantic features from a pre-trained semantic segmentation model, SAM2, through a semantic mapping module, and these anchors are then used to predict the attributes of 3D Gaussians. Moreover, we explore a coarse-to-fine training strategy for CAG-GS, where each local block is optimized independently while being guided by globally consistent semantics. Extensive experimental results on five large-scale scenes demonstrate the superiority of the proposed method over five state-of-the-art methods in most cases.

CAG-GS: Consistent Anchor Guided Gaussian Splatting for Large-scale Scene Rendering

Model quantization is widely applied for compressing and accelerating deep neural networks (DNNs). However, conventional Quantization-Aware Training (QAT) focuses on training DNNs with a uniform bit-width. Bit-width settings vary across different hardware and transmission demands, which induces considerable training and storage costs. Hence, one-shot joint training of multiple precisions has been proposed to address this issue. Previous works either store a larger FP32 model to switch between different precision models for higher accuracy, or store a smaller INT8 model but compromise accuracy due to using shared quantization parameters. In this paper, we introduce the Double Rounding quantization method, which fully utilizes the quantized representation range to accomplish nearly lossless bit-switching while reducing storage by using the highest integer precision instead of full precision. Furthermore, we observe competitive interference among different precisions during one-shot joint training, primarily due to inconsistent gradients of quantization scales during backward propagation. To tackle this problem, we propose an Adaptive Learning Rate Scaling (AdaScale) technique that dynamically adapts learning rates for various precisions to optimize the training process. Additionally, we extend Double Rounding to one-shot mixed-precision training and develop a Hessian-Aware Stochastic Bit-switching (HessBit) strategy. Experimental results on ImageNet-1K classification demonstrate that our methods offer clear advantages over state-of-the-art one-shot joint QAT methods in both multi-precision and mixed-precision settings. We also validate the feasibility of our method on detection and segmentation tasks, as well as on LLM tasks.

Double Rounding: Nearly Lossless Adaptive Bit Switching for QAT

Current novel view synthesis methods are typically designed for high-quality, clean input images. However, in foggy scenes, scattering and attenuation can significantly degrade rendering quality. Although NeRF-based dehazing approaches have been developed, their reliance on deep fully connected neural networks and per-ray sampling strategies leads to high computational costs. Furthermore, NeRF's implicit representation limits its ability to recover fine-grained details from hazy scenes. To overcome these limitations, we propose learning an explicit Gaussian representation to explain the formation mechanism of foggy images through a physical forward rendering process. Our method, DehazeGS, reconstructs and renders fog-free scenes using only multi-view foggy images as input. Specifically, based on the atmospheric scattering model, we simulate the formation of fog by establishing the transmission function directly on Gaussian primitives via a depth-to-transmission mapping. During training, we jointly learn the atmospheric light and scattering coefficients while optimizing the Gaussian representation of foggy scenes. At inference time, we remove the effects of scattering and attenuation in the Gaussian distributions and directly render the scene to obtain dehazed views. Experiments on both real-world and synthetic foggy datasets demonstrate that DehazeGS achieves state-of-the-art performance.
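
As a reminder of what the atmospheric scattering model refers to, its standard textbook form is given below. This is the generic per-pixel model, not the Gaussian-primitive formulation used by DehazeGS, which the abstract does not spell out. The symbols are assumptions introduced here: $I$ is the observed foggy image, $J$ the fog-free radiance, $A$ the atmospheric light, $\beta$ the scattering coefficient, and $d$ the scene depth.

```latex
I(x) = J(x)\, t(x) + A\,\bigl(1 - t(x)\bigr), \qquad t(x) = e^{-\beta\, d(x)}

% Inverting the model for a known A and t recovers the fog-free radiance:
J(x) = \frac{I(x) - A\,\bigl(1 - t(x)\bigr)}{t(x)}
```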

DehazeGS: Seeing Through Fog with 3D Gaussian Splatting

Diffusion-based and flow-based models have achieved significant progress in video synthesis but require multiple iterative sampling steps, which incurs substantial computational overhead. While many distillation methods based solely on trajectory preservation or distribution matching have been developed to accelerate video generation models, these approaches often suffer from performance breakdown or increased artifacts in few-step settings. To address these limitations, we propose SwiftVideo, a unified and stable distillation framework that combines the advantages of trajectory-preserving and distribution-matching strategies. Our approach introduces continuous-time consistency distillation to ensure precise preservation of ODE trajectories. We then propose a dual-perspective alignment that encompasses distribution alignment between synthetic and real data along with trajectory alignment across different inference steps. Our method maintains high-quality video generation while substantially reducing the number of inference steps.
Quantitative evaluations on the OpenVid-1M benchmark demonstrate that our method significantly outperforms existing approaches in few-step video generation. Code will be available upon paper publication.

SwiftVideo: A Unified Framework for Few-Step Video Generation Through Trajectory-Distribution Alignment

Image retargeting aims to adjust the aspect ratio of images to accommodate various display devices. While existing methods consider both foreground semantics and background inpainting, their seam-carving-based framework is inherently destructive, often compromising the structural integrity of foreground instances. Furthermore, conventional inpainting models struggle to achieve pixel-level accuracy with global-only guidance, leading to local inconsistencies and background distortions.
To address these challenges, we reformulate image retargeting as an instance-level re-layout task. Through Adaptive Instance Relocation and Dual-guidance Repainting (AIR-DR), our method preserves the structural integrity of the foreground and recovers the background with consistent details. Additionally, we introduce an adaptive retargeting decision that maintains robustness across challenging retargeting scenarios and arbitrary aspect ratios.
Extensive experiments on multiple public datasets across various aspect ratios demonstrate that our approach consistently outperforms existing methods in both objective metrics and subjective evaluations. Comprehensive ablation studies further validate the effectiveness of each component.

AIR-DR: Adaptive Image Retargeting with Instance Relocation and Dual-guidance Repainting

Probabilistic forecasting is not only a way to add more information to a prediction of the future; it also addresses weaknesses of point prediction. Sudden changes in a time series can still be captured by a cumulative distribution function (CDF), while a point prediction is likely to miss them entirely. The modeling of CDFs within forecasts has historically been limited to parametric approaches, but due to recent advances, this no longer has to be the case. We aim to advance the fields of probabilistic forecasting and monotonic networks by connecting them, and we propose an approach that permits the forecasting of implicit, complete, and nonparametric CDFs. For this purpose, we propose an adaptation of deep lattice networks (DLNs) for monotonically constrained simultaneous/implicit quantile regression in time series forecasting. By leveraging long short-term memory (LSTM) units as the embedding layer, and spreading quantile inputs to all sub-lattices of a DLN with an extended output size, we can produce a multi-horizon forecast of an implicit CDF, because DLNs can be monotonically constrained. We compare and evaluate our approach's performance against the relevant state of the art in a highly relevant application of time series forecasting: day-ahead, hourly forecasts of solar irradiance observations. Our experiments show that the adapted DLN performs as well as or better than an unconstrained approach. Further comparison of the adapted DLN against a scalable monotonic neural network shows that our approach performs better. With this adaptation of DLNs, we intend to create more interest and crossover investigations in techniques of monotonic neural networks and probabilistic forecasting.
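
To make the quantile-regression setting concrete, the sketch below shows the standard pinball (quantile) loss and a simple check that a set of quantile forecasts is non-decreasing in the quantile level, which is the property a monotonic constraint is meant to guarantee. It is a minimal illustration, not the authors' DLN or training code; the function names `pinball_loss` and `is_monotone` are hypothetical and chosen here for illustration.

```python
def pinball_loss(q, y_true, y_pred):
    """Pinball (quantile) loss for one observation at quantile level q in (0, 1)."""
    if not 0.0 < q < 1.0:
        raise ValueError("q must lie strictly between 0 and 1")
    diff = y_true - y_pred
    # Under-prediction is weighted by q, over-prediction by (1 - q).
    return max(q * diff, (q - 1.0) * diff)


def is_monotone(quantile_forecasts):
    """Return True if forecasts, ordered by increasing quantile level, never decrease."""
    return all(a <= b for a, b in zip(quantile_forecasts, quantile_forecasts[1:]))
```

A set of forecasts that fails `is_monotone` has crossing quantiles and therefore does not define a valid CDF, which is the failure mode that monotonically constrained models are designed to rule out.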

Multi-Horizon Time Series Forecasting of Non-Parametric CDFs with Deep Lattice Networks

Accurately simulating existing 3D objects and a wide variety of materials often demands expert knowledge and time-consuming physical parameter tuning to achieve the desired dynamic behavior. We introduce MotionPhysics, an end-to-end differentiable framework that infers plausible physical parameters from a user-provided natural language prompt for a chosen 3D scene of interest, removing the need for guidance from ground-truth trajectories or annotated videos. Our approach first utilizes a multimodal large language model to estimate material parameter values, which are constrained to be within plausible ranges. We further propose a learnable motion distillation loss, which extracts robust motion priors from pretrained video diffusion models while minimizing appearance and geometry inductive biases to guide the simulation. We evaluate MotionPhysics across more than thirty scenarios, including real-world, human-designed, and AI-generated 3D objects, spanning a wide range of materials such as elastic solids, metals, foams, sand, and both Newtonian and non-Newtonian fluids. We demonstrate that it produces visually realistic dynamic simulations guided by natural language, surpassing the state of the art, with physically plausible parameters that are automatically determined. The code is available in the supplemental material and will be made publicly available upon publication.

MotionPhysics: Learnable Motion Distillation for Text-Guided Simulation

The inherent differences between spike cameras and traditional frame-based cameras lead to more complex and diverse noise characteristics, particularly under extremely low-light conditions. Existing noise modeling approaches for spike cameras predominantly rely on inter-spike intervals (ISI) for noise quantification, which often results in inaccurate noise characterization. Moreover, current datasets for spike camera image reconstruction tasks are either synthetic or lack corresponding high-quality reference images, severely limiting rigorous evaluation of noise modeling methods. To address these limitations, we propose a multimodal noise modeling framework for spike cameras that integrates insights from traditional frame-based imaging into spike imaging. Specifically, we introduce a time-interval-based quantification method inspired by the exposure-time concept used in traditional frame-based cameras, enabling accurate noise characterization for spike cameras. Furthermore, we present the Spike-DSLR Multimodal Dataset (SDMD), the first real-world dataset capturing aligned multimodal data pairs from spike cameras and Digital Single-Lens Reflex (DSLR) cameras, explicitly designed for evaluating spike camera noise models. Experimental results on SDMD demonstrate that our noise modeling approach significantly enhances spike camera image reconstruction quality under low-light conditions, achieving more than 1.6 dB improvement in PSNR compared to existing state-of-the-art methods. This validates both the necessity and effectiveness of adopting a multimodal perspective in spike camera noise modeling. Our code is available at https://github.com/tech-support2/Anonymous_Submission_Code.

Robust Noise Modeling for Spike Camera via Time-Interval Quantification and Spike-DSLR Multimodal Dataset in Low-Light Imaging

Feature dynamics have emerged as a critical topic in open-environment learning due to the instability of feature availability.
While traditional feature evolution targets single-label tasks, multi-label learning is essential to accommodate rapidly growing annotation spaces. However, multi-label classification with incremental and decremental features is a crucial yet underexplored problem, which poses the challenge of preserving feature representations and label correlations from historical instances while simultaneously adapting to newly arriving streaming data. To address these issues, we propose a two-stage, one-pass learning approach termed MLID.
It attempts to compress the informative content of vanished features into the domain of surviving ones, facilitate the propagation of label dependencies via low-rank regularization of the classifier, and incorporate augmented features to construct an adaptive classification mechanism. In addition, we design optimization strategies for each stage and provide theoretical guarantees of convergence. Moreover, we establish the generalization error bound of MLID and demonstrate that the compactness of the trace norm and the reuse of models based on effective features can enhance generalization performance. Finally, we extend MLID to the multi-shot case, and extensive experimental results validate its superiority.

