AAAI 2026

January 22, 2026

Singapore, Singapore


Generating high-quality, controllable, and structurally consistent 3D scenes is a fundamental yet challenging task, especially in complex multi-object environments. We present SceneGenesis, a unified framework for 3D scene synthesis that systematically integrates semantic structural priors with mesh-guided video-geometry fusion. The process begins with a semantic structural initialization module, which leverages large language models to convert textual scene prompts into category-aware object descriptions. These are transformed into structured meshes by combining procedural approximations for large-scale objects with pretrained mesh generators for fine-grained assets, enabling precise layout control and scene scalability. To synthesize rich, style-controllable appearances, we render depth and semantic maps from the initialized scene and condition a pretrained video diffusion model on them to generate geometry-aware multi-view video sequences, where a consistency-guided latent fusion strategy further enhances temporal consistency across long sequences. Crucially, we introduce a mesh-guided video-geometry fusion module that reconstructs coherent 3D Gaussian scenes by aligning mesh priors with video outputs. This module incorporates mesh-conditioned fragment initialization, progressive geometric refinement, and structure-aware optimization, significantly enhancing global geometric fidelity and visual realism. Extensive experiments demonstrate that SceneGenesis enables flexible style variation and object-level editing while achieving superior controllability, scalability, and 3D structural quality, offering an effective solution for 3D scene synthesis.
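The abstract describes a three-stage pipeline: semantic structural initialization, geometry-conditioned video diffusion, and mesh-guided video-geometry fusion. The data flow between these stages can be sketched as below; note that every function name, data shape, and stub heuristic here is an illustrative assumption made for exposition, not the authors' actual API, and the real models (the LLM, mesh generators, and video diffusion model) are replaced by placeholders.

```python
# Hedged sketch of the SceneGenesis pipeline's stage-to-stage data flow.
# All names and structures are assumptions; heavy models are stubbed out.

def semantic_structural_init(prompt: str) -> list[dict]:
    """Stage 1 (stubbed): parse a scene prompt into category-aware object
    descriptions, then build coarse meshes -- procedural approximations
    for large-scale objects, a pretrained generator for fine assets."""
    objects = [
        {"category": word, "scale": "large" if word in {"room", "floor", "wall"} else "fine"}
        for word in prompt.split()
    ]
    return [
        {"object": obj, "mesh_source": "procedural" if obj["scale"] == "large" else "pretrained_generator"}
        for obj in objects
    ]

def video_diffusion_stage(meshes: list[dict], style: str, n_frames: int = 4) -> list[dict]:
    """Stage 2 (stubbed): render depth and semantic maps from the mesh
    layout, then condition a video diffusion model on them to produce a
    multi-view sequence with a chosen appearance style."""
    conditioning = [
        {"depth_map": f"depth({m['object']['category']})",
         "semantic_map": m["object"]["category"]}
        for m in meshes
    ]
    return [{"frame": i, "style": style, "cond": conditioning} for i in range(n_frames)]

def mesh_guided_fusion(meshes: list[dict], frames: list[dict]) -> dict:
    """Stage 3 (stubbed): align mesh priors with the video output to
    reconstruct a 3D Gaussian scene (fragment initialization, progressive
    refinement, and structure-aware optimization are elided)."""
    return {"gaussian_fragments": len(meshes), "frames_used": len(frames)}

# End-to-end flow: prompt -> meshes -> conditioned video -> Gaussian scene.
meshes = semantic_structural_init("room table lamp")
frames = video_diffusion_stage(meshes, style="cozy")
scene = mesh_guided_fusion(meshes, frames)
print(scene)  # {'gaussian_fragments': 3, 'frames_used': 4}
```

The key point the sketch conveys is that the mesh layout produced in stage 1 is reused twice: once as rendering input that conditions the video diffusion model, and again as a geometric prior during the final Gaussian reconstruction.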
