Singapore

Video generation has been advancing rapidly, and diffusion transformer (DiT) based models have demonstrated remarkable capabilities. However, their practical deployment is often hindered by slow inference speeds and high memory consumption. In this paper, we propose a novel pipelining framework named PipeDiT to accelerate video generation, which is equipped with three main innovations. First, we design a pipelining algorithm (PipeSP) for sequence parallelism (SP) to enable the computation of latent generation and communication among multiple GPUs to be pipelined, thus reducing the inference latency. Second, we propose DeDiVAE to decouple the diffusion module and the VAE module into two GPU groups whose executions can also be pipelined to reduce the memory consumption and inference latency. Third, to better utilize the GPU resources in the VAE group, we propose an attention co-processing (Aco) method to further reduce the overall video generation latency. We integrate our PipeDiT into both OpenSoraPlan and HunyuanVideo, two state-of-the-art open-source video generation frameworks, and conduct extensive experiments on two 8-GPU systems. Experimental results show that, under many common resolution and timestep configurations, our PipeDiT achieves $1.06\times$ to $4.02\times$ speedups over OpenSoraPlan and HunyuanVideo.
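As a rough illustration of the decoupling idea described in the PipeDiT abstract, the sketch below overlaps a "denoise" stage with a "decode" stage through a bounded queue, so that one item is decoded while the next is being denoised. It is a toy, CPU-only analogy and not the authors' implementation: `run_pipeline`, `toy_denoise`, and `toy_decode` are hypothetical names, and threads stand in for the two GPU groups.

```python
import queue
import threading


def run_pipeline(prompts, denoise, decode):
    """Overlap a denoise stage with a decode stage via a bounded queue."""
    latents = queue.Queue(maxsize=1)  # hand-off buffer between the two stages
    videos = []

    def decode_worker():
        while True:
            item = latents.get()
            if item is None:  # sentinel: no more latents will arrive
                break
            videos.append(decode(item))

    worker = threading.Thread(target=decode_worker)
    worker.start()
    for prompt in prompts:
        # Stage 1 produces the next latent while stage 2 decodes the previous one.
        latents.put(denoise(prompt))
    latents.put(None)
    worker.join()
    return videos


# Hypothetical stand-ins for the diffusion module and the VAE decoder.
def toy_denoise(prompt):
    return f"latent({prompt})"


def toy_decode(latent):
    return f"video({latent})"
```

Calling `run_pipeline(["a", "b"], toy_denoise, toy_decode)` returns the decoded items in submission order, because a single consumer drains a FIFO queue.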

AAAI 2026

PipeDiT: Accelerating Diffusion Transformers in Video Generation with Task Pipelining and Model Decoupling

scalability of ml systems, efficient ml

diffusion models for vision

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.

Modeling large-scale landscapes is a foundational yet time-consuming task in many 3D applications, typically requiring substantial expertise. Recently, Text-to-3D techniques have emerged as a promising, beginner-friendly prototyping approach for generating 3D content from textual input. However, existing methods either produce unusable, problematic geometries, or fail to fully capture the user's complex intent from the input text—making it difficult to generate high-quality landscape assets with controllable spatial and geographic features. In this paper, we present LandCraft, a novel AI-assisted authoring tool that enables the rapid creation of high-quality landscape scenes based on user descriptions. Our system employs a coarse-to-fine generation process: Initially, large language and deep generative models concretize textual ideas into abstract representations that capture essential landscape features, such as spatial and geographic characteristics. Then, we leverage a comprehensive procedural generation module to synthesize the detailed, structurally consistent 3D landscapes based on these inferred representations. LandCraft can effectively generate production-ready 3D scene assets that can be seamlessly exported to external game engines or modeling software, enabling immediate practical use.

LandCraft: Designing the Structured 3D Landscapes via Text Guidance

Video-based multimodal large language models (V-MLLMs) have shown vulnerability to adversarial examples in video-text multimodal tasks.
However, the transferability of adversarial videos to unseen models—a common and practical real-world scenario—remains unexplored. 
In this paper, we pioneer an investigation into the transferability of adversarial video samples across V-MLLMs.
We find that existing adversarial attack methods face significant limitations when applied in black-box settings for V-MLLMs, which we attribute to the following shortcomings: (1) lacking generalization in perturbing video features, (2) focusing only on sparse key-frames, and (3) failing to integrate multimodal information.
To address these limitations and deepen the understanding of V-MLLM vulnerabilities in black-box scenarios, we introduce the Image-to-Video MLLM (I2V-MLLM) attack.
In I2V-MLLM, we utilize an image-based multimodal large language model (I-MLLM) as a surrogate model to craft adversarial video samples.
Multimodal interactions and spatiotemporal information are integrated to disrupt video representations within the latent space, improving adversarial transferability.
Additionally, a perturbation propagation technique is introduced to handle different unknown frame sampling strategies.
Experimental results demonstrate that our method can generate adversarial examples that exhibit strong transferability across different V-MLLMs on multiple video-text multimodal tasks. Compared to white-box attacks on these models, our black-box attacks (using BLIP-2 as a surrogate model) achieve competitive performance, with an average attack success rate (AASR) of 57.98% on MSVD-QA and 58.26% on MSRVTT-QA for Zero-Shot VideoQA tasks.

Transferability of Adversarial Attacks in Video-based MLLMs: A Cross-modal Image-to-Video Approach

In open real-world autonomous driving scenarios, challenges such as sensor failure and extreme weather hinder the generalization of current autonomous driving perception models to these unseen domains, due to the domain shifts between the test and training data. As the parameter scale of autonomous driving perception models grows, traditional test-time adaptation (TTA) methods become unstable and often degrade model performance in most scenarios. To address these challenges, this paper proposes two new robust methods to improve Batch Normalization with TTA for object detection in autonomous driving: (1) We introduce a new LearnableBN layer based on Geometric Confidence Maximization and Entropy Minimization. Specifically, we modify the traditional BN layer by incorporating auxiliary learnable parameters, which enable the BN layer to dynamically update its statistics according to the input data. (2) We propose a novel semantic-consistency based dual-stage adaptation strategy, which encourages the model to iteratively search for the optimal solution and eliminates unstable samples during the adaptation process. Extensive experiments on the NuScenes-C dataset show that our method achieves a maximum improvement of about 10\% using BEVFormer as the baseline across six corruption types and three levels of severity. We will make our source code available soon.
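The entropy-minimization objective named in this abstract rests on a simple quantity: the Shannon entropy of the model's predicted class distribution, which is low when the prediction is confident. The sketch below only computes that quantity; it does not reproduce the LearnableBN layer or the Geometric Confidence Maximization term, and the function names are illustrative assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def prediction_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over classes."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0.0)
```

A test-time adaptation loop in this style would update its chosen parameters to reduce `prediction_entropy` on unlabeled test inputs; which parameters are updated, and how unstable samples are filtered, is specific to each method.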

Improving Batch Normalization with Test-Time Adaptation for Robust Object Detection in Self-Driving

LiDAR semantic segmentation is a key task in advanced autonomous driving systems. Projection-based methods exhibit real-time potential due to their efficiency, but suffer from inevitable 3D information loss and rely on time-consuming post-processing, limiting overall performance. To address this, we propose MFINet, a real-time semantic segmentation network based on multi-view fusion and 2D-3D interaction enhancement. It adopts a three-branch architecture that integrates 3D Point View (3D-PV), 2D Bird’s Eye View (2D-BEV) and 2D Range View (2D-RV) to make full use of 2D and 3D representations. From 3D to 2D, we design a 3D Point Feature Projector (3DPFP), which injects 3D features into the 2D BEV and RV pseudo-images to retain effective 3D information. From 2D to 3D, a Feature Enhancement (FE) module is designed to leverage the advantages of 2D information in extracting geometric and semantic features. We also introduce a 2D-3D Fusion Head (FH) to aggregate point features from multiple views. In addition, we incorporate a Multi-Scale Dilated Attention (MSDA) module with a sliding window strategy to enhance feature discrimination. Extensive experiments on the SemanticKITTI and NuScenes benchmarks demonstrate that MFINet outperforms existing methods on SemanticKITTI and the NuScenes validation set, and achieves competitive results on the NuScenes test set.

MFINet: Multi-view Fusion and 2D–3D Interaction Enhancement for Real-Time LiDAR Semantic Segmentation

Algorithms for solving nonlinear fixed-point equations, such as average-reward $Q$-learning and TD-learning, often involve semi-norm contractions. Achieving parameter-free optimal convergence rates for these methods via Polyak–Ruppert averaging has remained elusive, largely due to the non-monotonicity of such semi-norms. We close this gap by (i) recasting the averaged error as a linear recursion involving a nonlinear perturbation, and (ii) taming the nonlinearity by coupling the semi-norm's contraction with the monotonicity of a suitably induced norm. Our main result yields the first parameter-free $\tilde{O}(1/\sqrt{t})$ optimal rates for $Q$-learning in both average-reward and exponentially discounted settings, where $t$ denotes the iteration index. The result applies within a broad framework that accommodates synchronous and asynchronous updates, single-agent and distributed deployments, and data streams obtained either from simulators or along Markovian trajectories.
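For readers unfamiliar with the two notions this abstract combines, the standard definitions can be written as follows. The symbols $x_k$, $T$, $p$, and $\gamma$ are notation assumed here for illustration, not taken from the paper, and the span semi-norm is shown only as a common example from the average-reward setting; the abstract does not say which semi-norm the paper uses.

```latex
% Polyak–Ruppert average of the iterates x_1, ..., x_t
\bar{x}_t = \frac{1}{t} \sum_{k=1}^{t} x_k

% T is a contraction in the semi-norm p with factor \gamma \in [0, 1)
p\bigl(T(x) - T(y)\bigr) \le \gamma \, p(x - y)

% Example semi-norm (span): it vanishes on every constant vector
\operatorname{sp}(x) = \max_i x_i - \min_i x_i
```

Because a semi-norm such as the span can be zero for nonzero vectors, a contraction in $p$ pins down the fixed point only up to the directions on which $p$ vanishes.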

Parameter-free Optimal Rates for Nonlinear Semi-Norm Contractions with Applications to Q-Learning

Binary code analysis is essential for software security across various instruction set architectures. Cross-architecture binary function similarity detection faces significant challenges due to substantial differences in instruction sets and architectural conventions. Existing approaches struggle to capture relationships between code abstraction levels, and lack comprehensive cross-architecture datasets for effective evaluation. Inspired by human cognitive processes of dynamically integrating multi-level information, we propose Binary Dynamic Layer Fusion (BDLF), a novel neural architecture that enhances cross-architecture similarity detection through adaptive layer-wise feature integration. BDLF leverages Qwen3's multilingual code understanding and introduces dynamic weight generation to optimally combine representations from all previous layers. We also construct Cross-Bin, a high quality cross-architecture binary function dataset. BDLF-Qwen3 employs two-stage training: partial fine-tuning with pairwise similarity learning followed by BDLF enhancement with InfoNCE contrastive learning. Experiments demonstrate BDLF-Qwen3 significantly outperforms state-of-the-art methods, achieving 36-65\% improvement in Recall@10 across diverse CPU architectures.
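The InfoNCE contrastive objective mentioned in the BDLF-Qwen3 abstract can be sketched for a single query embedding as below. This is a generic, minimal version under stated assumptions: cosine similarity, a temperature of 0.1, and plain Python lists as embeddings are choices made here for illustration, and nothing in it reflects the paper's actual training setup.

```python
import math


def info_nce(query, positive, negatives, temperature=0.1):
    """InfoNCE loss for one query: negative log-softmax score of the positive."""

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm

    # Index 0 holds the positive pair; the rest are negatives.
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, n) / temperature for n in negatives]

    peak = max(logits)  # subtract the max for numerical stability
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[0]
```

The loss is small when the query is much closer to its positive (for example, the same function compiled for another architecture) than to any negative, and grows as a negative becomes the nearest neighbor.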

BDLF-Qwen3: Enhanced Cross-Architecture Binary Function Similarity Detection Through Binary Dynamic Layer Fusion

Diffusion-based multimodal large language models (Diffusion MLLMs) have recently demonstrated impressive non-autoregressive generative capabilities across vision-and-language tasks. However, Diffusion MLLMs exhibit substantially slower inference than autoregressive models: each denoising step employs full bidirectional self-attention over the entire sequence, resulting in cubic decoding complexity that becomes computationally impractical with thousands of visual tokens. To address this challenge, we propose D$^{3}$ToM, a Decider-guided dynamic token merging method that dynamically merges redundant visual tokens at different denoising steps to accelerate inference in Diffusion MLLMs. At each denoising step, D$^{3}$ToM uses decider tokens (the tokens generated in the previous denoising step) to build an importance map over all visual tokens. Then it maintains a proportion of the most salient tokens and merges the remainder through similarity-based aggregation. This plug-and-play module integrates into a single transformer layer, physically shortening the visual token sequence for all subsequent layers without altering model parameters. Moreover, D$^{3}$ToM employs a merge ratio that dynamically varies with each denoising step, aligning with the native decoding process of Diffusion MLLMs and achieving superior performance under equivalent computational budgets. Extensive experiments show that D$^{3}$ToM accelerates inference while preserving competitive performance.
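The keep-and-merge step described in the D$^{3}$ToM abstract can be sketched as follows: keep the most important tokens, then fold every remaining token into its most similar kept token. This is a simplified illustration, not the authors' method. The importance scores are taken as an input rather than derived from decider tokens, similarity is a plain dot product, merging is an unweighted mean, and `merge_tokens` and `keep_ratio` are hypothetical names.

```python
def merge_tokens(tokens, importance, keep_ratio):
    """Keep the most important tokens; average each other token into its nearest kept one."""
    n_keep = max(1, round(len(tokens) * keep_ratio))
    order = sorted(range(len(tokens)), key=lambda i: importance[i], reverse=True)
    kept = sorted(order[:n_keep])  # preserve the original token order
    groups = {i: [tokens[i]] for i in kept}

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    for i in order[n_keep:]:
        # Assign each dropped token to the kept token it is most similar to.
        target = max(kept, key=lambda k: dot(tokens[i], tokens[k]))
        groups[target].append(tokens[i])

    # Each output token is the element-wise mean of its group.
    return [[sum(col) / len(groups[k]) for col in zip(*groups[k])] for k in kept]
```

The output sequence has `n_keep` tokens, so every later layer attends over a shorter sequence. A step-dependent merge ratio, as the abstract describes, would correspond to passing a different `keep_ratio` at each denoising step.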

D3ToM: Decider-Guided Dynamic Token Merging for Accelerating Diffusion MLLMs

3D Gaussian Splatting (3DGS) and Neural Radiance Fields (NeRF) have advanced novel-view synthesis. Recent methods extend multi-view 2D segmentation to 3D, enabling instance/semantic segmentation for better scene understanding. A key challenge is the inconsistency of 2D instance labels across views, leading to poor 3D predictions. Existing methods use a two-stage approach in which some rely on contrastive learning with hyperparameter-sensitive clustering, while others pre-process labels for consistency. We propose a unified framework that merges these steps, reducing training time and improving performance by introducing a learnable feature embedding for segmentation in Gaussian primitives. This embedding is then efficiently decoded into instance labels through a novel "Embedding-to-Label" process, effectively integrating the optimization. While this unified framework offers substantial benefits, we observed artefacts at the object boundaries. To address the object boundary issues, we propose hard-mining samples along these boundaries. Directly applying hard mining to the feature embeddings proved unstable. Therefore, we apply a linear layer to the rasterized feature embeddings before calculating the triplet loss, which stabilizes training and significantly improves performance. Our method outperforms baselines qualitatively and quantitatively on the ScanNet, Replica3D, and Messy-Rooms datasets.

UniC-Lift: Unified 3D Instance Segmentation via Contrastive Learning

We address the critical gap between the computational demands of vision-language models and the possible ultra-low-bit weight precision (bitwidth <= 2 bits) we can use for higher efficiency. Our work is motivated by the substantial computational cost and memory requirements of VLMs, which restrict their applicability in hardware-constrained environments. We propose Bi-VLM, which separates model weights non-uniformly based on the Gaussian quantiles. Our formulation groups the model weights into outlier and multiple inlier subsets, ensuring that each subset contains a proportion of weights corresponding to its quantile in the distribution. We propose a saliency-aware hybrid quantization algorithm and use it to quantize weights by imposing different constraints on the scaler and binary matrices based on the saliency metric and compression objective. We have evaluated our approach on different VLMs. For the language model part of the VLM, our Bi-VLM outperforms the SOTA by 3%-47% on the visual question answering task in terms of four different benchmarks and three different models. For the overall VLM, our Bi-VLM outperforms the SOTA by 4%-45%.
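To make the idea of quantile-based weight grouping in the Bi-VLM abstract more concrete, the sketch below fits a Gaussian to a list of weights, treats values beyond the two tail quantiles as outliers, and binarizes each group as `scale * sign(w)` with its own scale. It is a heavily simplified assumption-laden toy: the abstract describes multiple inlier subsets and a saliency-aware hybrid algorithm, neither of which is modeled here, and `quantile_binarize` and `tail` are hypothetical names.

```python
from statistics import NormalDist, fmean, pstdev


def quantile_binarize(weights, tail=0.05):
    """Split weights at Gaussian tail quantiles, then binarize each group as scale * sign."""
    # Assumes the weights are not all identical (a zero standard deviation is rejected).
    dist = NormalDist(fmean(weights), pstdev(weights))
    low, high = dist.inv_cdf(tail), dist.inv_cdf(1 - tail)
    is_outlier = [w < low or w > high for w in weights]

    def scale(group):
        # The mean absolute value minimizes the squared error of scale * sign(w).
        return fmean([abs(w) for w in group]) if group else 0.0

    inlier_scale = scale([w for w, o in zip(weights, is_outlier) if not o])
    outlier_scale = scale([w for w, o in zip(weights, is_outlier) if o])

    def sign(w):
        return 1.0 if w >= 0 else -1.0

    return [
        (outlier_scale if o else inlier_scale) * sign(w)
        for w, o in zip(weights, is_outlier)
    ]
```

Giving the tail group its own scale keeps a few large-magnitude weights from inflating the scale shared by the many small ones, which is the intuition behind separating outliers before binarizing.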

Bi-VLM: Binary Post-Training Quantization for Vision-Language Models

We study the problem of learning a policy network to optimize several related objectives simultaneously in reinforcement learning (RL). Given a total of $n$ objectives, we consider finding a small set of $k$ policies, with $k$ much smaller than $n$, that together apply to all the objectives. This problem has broad applications in robotic control and language models. Learning one policy for all the objectives does not scale when the number of objectives becomes very large. Instead, this work introduces a two-stage meta-training and adaptation procedure to tackle this problem. Our procedure works by first training a meta policy based on all the objectives. Then, we adapt this meta policy quickly to multiple subsets of randomly chosen objectives. This adaptation is enabled by a gradient-based approximation property of actor-critic agents, which we have empirically verified to be within a 2% error in a range of RL environments. This overall procedure, named PolicyGradEx, can quickly estimate a task affinity score between every pair of objectives based on the estimated scores for each subset of objectives. Then, based on the estimated affinity scores, we apply a grouping procedure to cluster similar objectives into $k$ groups. Extensive experiments on three classic control benchmarks and the Meta-World benchmark demonstrate that our method outperforms state-of-the-art baselines by 16%, while being up to $26\times$ faster than full training. Ablation studies validate the design of each component of our method. For example, compared to random grouping and gradient-similarity-based grouping, our method outperforms both by 19%.
