Singapore

Recent advances in video generation have made it possible to produce visually compelling videos, with wide-ranging applications in content creation, entertainment, and virtual reality.
However, most existing diffusion transformer based video generation models are limited to low-resolution outputs ($\leq$720P) due to the quadratic computational complexity of the attention mechanism with respect to
the output width and height.
This computational bottleneck makes native high-resolution video generation (1080P/2K/4K) impractical for both training and inference.
To address this challenge, we present UltraGen, a novel video generation framework that enables i) efficient and ii) end-to-end native high-resolution video synthesis.
Specifically, UltraGen features a hierarchical dual-branch attention architecture based on global-local attention decomposition, which decouples full attention into a local attention branch for high-fidelity regional content and a global attention branch for overall semantic consistency.
We further propose a spatially compressed global modeling strategy to efficiently learn global dependencies, and a hierarchical cross-window local attention mechanism to reduce computational costs while enhancing information flow across different local windows.
Extensive experiments demonstrate that UltraGen can effectively scale pre-trained low-resolution video models to 1080P and even 4K resolution for the first time, outperforming existing state-of-the-art methods and super-resolution based two-stage pipelines in both qualitative and quantitative evaluations.

AAAI 2026

UltraGen: High-Resolution Video Generation with Hierarchical Attention

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.<br><br>

To access this event page, you need to log in with the **email address you registered with**. <br>Access credentials will be sent to your email from Underline -  subject line "Welcome to AAAI 2026". Please be sure to check your spam email folder if you do not see an email confirmation right away.

Please log in

To access this event page, you are required to register.
Please complete your registration to continue.

We recommend reading [**the registration information**](https://aaai.org/conference/aaai/aaai-26/registration/) first.

**Online Registration Form**: https://aaai.getregistered.net/conference-2026 

Registration Required

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

Neural representations (NRs), such as neural fields and 3D Gaussians, effectively model volumetric data in computed tomography (CT) but suffer from severe artifacts under sparse-view settings. To address this, we propose DiffNR, a novel framework that enhances NR optimization with diffusion priors. At its core is SliceFixer, a single-step diffusion model designed to correct artifacts in degraded slices. We integrate specialized conditioning layers into the network and develop tailored data curation strategies to support model finetuning. During reconstruction, SliceFixer periodically generates pseudo-reference volumes, providing auxiliary 3D perceptual supervision to fix underconstrained regions. Compared to prior methods that embed CT solvers into time-consuming iterative denoising, our repair-and-augment strategy avoids frequent diffusion model queries, leading to better runtime performance. Extensive experiments show that DiffNR improves PSNR by 3.99 dB on average, generalizes well across domains, and maintains efficient optimization.

DiffNR: Diffusion-Enhanced Neural Representation Optimization for Sparse-View 3D Tomographic Reconstruction

Recent advances in audio-driven talking-head synthesis have brought lip-sync precision close to human perception, yet emotional fidelity and real-time inference remain open challenges. Existing pipelines typically disentangle lip articulation, facial expression, and head pose in latent space; this rigid factorization ignores the intrinsic coupling between articulation and affect — e.g., downward lip corners when sad—thus limiting expressiveness.
We cast speech-conditioned facial motion as a sample from an emotion-conditioned distribution in a motion latent space. Concretely, we (i) learn a motion dictionary of orthogonal bases with an autoencoder via self-supervision, (ii) construct emotion-conditioned sub-spaces within the latent space, and (iii) design a layer-progressive cross-attention fusion module that modulates a flow-matching sampler with both audio and emotion signals. Only ten reverse ODE steps are required to generate a motion-latent trajectory, enabling real-time end-to-end latency.
Extensive experiments on MEAD and RAVDESS show that our method outperforms recent GAN- and diffusion-based baselines in emotion accuracy while running at around 75 FPS on a single desktop GPU. The proposed framework delivers the first emotionally expressive Audio2Face system that simultaneously achieves lip-sync accuracy, affective realism, and real-time performance.

Emotion-Conditioned Motion Sub-spaces with Flow Matching for Real-Time Audio-Driven Talking Heads

We present 360Explorer, a novel approach for generating 4D controllable panoramic videos conditioned on user-provided 3D instructions for exploring and manipulating dynamic worlds.
Compared to existing perspective-based methods struggle to address spatial consistency during camera rotation in place, we introduce the panoramic view in controllable video generation models to inherently maintain the view recall consistency.
By introducing dynamic point clouds as the 4D scene representations, 360Explorer unifies the modeling of camera transformations and object movements as incomplete renders to describe precise control instructions in 3D worlds.
To tackle the data limitation in acquiring multi-viewpoint panoramic videos, we further propose a reverse warping strategy to construct the training dataset on easily accessible monocular panoramic videos.
Extensive experiments demonstrate that 360Explorer achieves superior performance in creating 4D controllable panoramic videos with camera transformation and object movements aligned with diverse provided instructions.

360Explorer: Exploring 4D Controllable World in Panoramic Videos

Recent advances in differentiable structure learning have framed the combinatorial problem of learning directed acyclic graphs as a continuous optimization problem. Various aspects, including data standardization, have been studied to identify factors that influence the empirical performance of these methods. In this work, we investigate critical limitations in differentiable structure learning methods, focusing on settings where the true structure can be identified up to Markov equivalence classes, particularly in the linear Gaussian case. While recent work highlighted potential non-convexity issues in this setting, we demonstrate and explain why the use of $\ell_1$-penalized likelihood in such cases is fundamentally inconsistent, even if the global optimum of the optimization problem can be found. To resolve this limitation, we develop a hybrid differentiable structure learning method based on $\ell_0$-penalized likelihood with hard acyclicity constraint, where the $\ell_0$ penalty can be approximated by different techniques including Gumbel-Softmax. Specifically, we first estimate the underlying moral graph, and use it to restrict the search space of the optimization problem, which helps alleviate the non-convexity issue. Experimental results show that the proposed method enhances empirical performance both before and after data standardization, providing a more reliable path for future advancements in differentiable structure learning, especially for learning Markov equivalence classes.

Revisiting Differentiable Structure Learning: Inconsistency of L1 Penalty and Beyond

With the rapid advancement of mathematical reasoning capabilities in Large Language Models (LLMs), AI systems are increasingly being adopted in educational settings to support students’ comprehension of problem-solving processes. However, a critical component remains underexplored in current LLM-generated explanations: multimodal explanation. In real-world instructional contexts, human tutors routinely employ visual aids, such as diagrams, markings, and highlights, to enhance conceptual clarity. To bridge this gap, we introduce the multimodal solution explanation task, designed to evaluate whether models can identify visual keypoints, such as auxiliary lines, points, angles, and generate explanations that incorporate these key elements essential for understanding. To evaluate model performance on this task, we propose ME2, a multimodal benchmark consisting of 1,000 math problems annotated with visual keypoints and corresponding explanatory text that references those elements. Our empirical results show that, aside from recent large-scale open-source and closed-source models, most generalist open-source models, and even math-specialist models, struggle with the multimodal solution explanation task. This highlights a significant gap in current LLMs’ ability to perform visually grounded reasoning and provide explanations in educational contexts. We expect that the multimodal solution explanation task and the ME2 dataset will catalyze further research on LLMs in education and promote their use as effective, explanation-oriented AI tutors. Code and data will be released publicly.

Explain with Visual Keypoints Like a Real Mentor! A Benchmark for Multimodal Solution Explanation

Multimodal learning, while contributing to numerous success stories across various fields, faces the challenge of prohibitively expensive manual annotation. To address the scarcity of annotated data, a popular solution is unsupervised domain adaptation, which has been extensively studied in unimodal settings yet remains less explored in multimodal settings. In this paper, we investigate heterogeneous multimodal domain adaptation, where the primary challenge is the varying domain shifts of different modalities from the source to the target domain. We first introduce the information bottleneck method to learn representations for each modality independently, and then match the source and target domains in the representation space with correlation alignment. To balance the domain alignment of all modalities, we formulate the problem as a multi-objective task, aiming for a Pareto optimal solution. By exploiting the properties specific to our model, the problem can be simplified to a quadratic programming problem. Further approximation yields a closed-form solution, leading to an efficient modality-balanced multimodal domain adaptation algorithm. The proposed method features \textbf{B}alanced multi-\textbf{o}bjective \textbf{o}ptimization for \textbf{m}ultimodal \textbf{d}omain \textbf{a}daptation, termed \textbf{Boomda}. Extensive empirical results showcase the effectiveness of the proposed approach and demonstrate that Boomda outperforms the competing schemes. The code is included in the supplementary material.

Boomda: Balanced Multi-objective Optimization for Multimodal Domain Adaptation

Joint rendering and deformation of mesh and 3D Gaussian Splatting (3DGS) has significant value as both representations offer complementary advantages for graphics applications.
However, due to differences in representation and rendering pipelines, existing studies render meshes and 3DGS separately, making it difficult to accurately handle occlusions.
Moreover, the deformed 3DGS still suffers from visual artifacts due to the sensitivity to the topology quality of the proxy mesh.
These issues pose serious obstacles to the joint use of 3DGS and meshes, making it difficult to adapt 3DGS to conventional mesh-oriented graphics pipelines.
We propose UniMGS, the first unified framework for rasterizing mesh and 3DGS in a single-pass anti-aliased manner, with a novel binding strategy for 3DGS deformation based on proxy mesh.
Our key insight is to blend the colors of both triangle and Gaussian fragments by anti-aliased $\alpha$-blending in a single pass, achieving visually coherent results with precise handling of occlusion and transparency.
To improve the visual appearance of the deformed 3DGS, our Gaussian-centric binding strategy employs a proxy mesh and spatially associates Gaussians with the mesh faces, significantly reducing rendering artifacts.
With these two components, UniMGS enables the visualization and manipulation of 3D objects represented by mesh or 3DGS within a unified framework, opening up new possibilities in embodied AI, virtual reality, and gaming.
We will release our source code to facilitate further research.

UniMGS: Unifying Mesh and 3D Gaussian Splatting with Single-Pass Rasterization and Proxy-Based Deformation

Large language model (LLM)-based visual dialogue (VD) systems have made response generation for image-grounded conversations more correct and coherent. However, user engagement - the extent to which a user is interested, emotionally involved, and willing to continue the conversation - remains a challenge. To fully explore engaging VD, we propose: (i) a new task named Audio-enhanced VD (AVD), which introduces additional audio dialogue contexts that can more vividly convey the speaker's emotions as input, with the aim of generating correct but more engaging dialogue responses. Specifically, we employ a text-to-speech model as the modality translator to generate the paired acoustic utterances from the inputting textual utterances; (ii) an accompanying approach named Visually-grounded and Interleaved Text-Audio Dialogue Modeling (VITA-DM), which utilizes both image-grounded information and interleaved text-audio utterances for visual dialogue modeling, differentiating from previous multi-modal LLM (MLLM)-based methods that normally model text and audio modalities separately. We also present three pre-training tasks to better learn multi-modal interactions across language, vision, and audio; (iii) a novel metric named Multi-Modal Engagement (MME), which fills the gap of engagement estimation in VD and can provide a fine-grained assessment along emotional, attentional, and reply engagement dimensions (EE, AE, RE). We experiment on two popular datasets and provide extensive evaluations (automatic, engagement-specific, and human), supporting the validity of our approach. Furthermore, based on empirical results that reveal that emotions contribute the most to engagement, we justify our emphasis on the emotional aspect throughout the definition, solution, and evaluation of our task.

Making Visual Dialogue More Engaging: A New Task, Method, and Metric

As a data-driven learning approach, model-based offline reinforcement learning (MORL) aims to learn a policy by exploiting a dynamics model derived from an existing dataset. Applying conservative quantification to the dynamics model, most existing works on MORL generate trajectories that approximate the real data distribution to facilitate policy learning. However, these methods typically overlook the influence of historical information on environmental dynamics, thus generating unreliable trajectories that fail to align with the true data distribution. In this paper, we propose a new MORL algorithm called Reliability-guaranteed and Reward-seeking Transformer (RT). RT can avoid generating unreliable trajectories through the calculation of cumulative reliability of the trajectories, which is a weighted variational distance between the generated trajectory distribution and the true data distribution.
Moreover, by sampling candidate actions with high rewards, RT can efficiently generate high-reward trajectories from the existing offline data, thereby further facilitating policy learning. We theoretically prove the performance guarantees of RT in policy learning, and empirically demonstrate its effectiveness against state-of-the-art model-based methods on several offline benchmark tasks and a large-scale industrial dataset from an on-demand food delivery platform.

Reliability-Guaranteed and Reward-Seeking Sequence Modeling for Model-Based Offline Reinforcement Learning

RNA inverse folding, designing sequences to form specific 3D structures, is critical for therapeutics, gene regulation, and synthetic biology. Current methods, focused on sequence recovery, struggle to address structural objectives like secondary structure consistency (SS), minimum free energy (MFE), and local distance difference test (LDDT), leading to suboptimal structural accuracy. To tackle this, we propose a reinforcement learning (RL) framework integrated with a latent diffusion model (LDM). Drawing inspiration from the success of diffusion models in RNA inverse folding, which adeptly model complex sequence-structure interactions, we develop an LDM incorporating pre-trained RNA-FM embeddings from a large-scale RNA model. These embeddings capture co-evolutionary patterns, markedly improving sequence recovery accuracy. However, existing approaches, including diffusion-based methods, cannot effectively handle non-differentiable structural objectives. By contrast, RL excels in this task by using policy-driven reward optimization to navigate complex, non-gradient-based objectives, offering a significant advantage over traditional methods. In summary, we propose the Step-wise Optimization of Latent Diffusion Model (SOLD), a novel RL framework that optimizes single-step noise without sampling the full diffusion trajectory, achieving efficient refinement of multiple structural objectives. Experimental results demonstrate SOLD surpasses its LDM baseline and state-of-the-art methods across all metrics, establishing a robust framework for RNA inverse folding with profound implications for biotechnological and therapeutic applications.

Premium content

Next from AAAI 2026

DiffNR: Diffusion-Enhanced Neural Representation Optimization for Sparse-View 3D Tomographic Reconstruction

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES