Singapore

Large Language Diffusion Models, or dLLMs, have emerged as a significant focus in NLP research, with substantial effort directed toward understanding their scalability and downstream task performance. However, their long-context capabilities remain unexplored, lacking systematic analysis or methods for context extension. In this work, we present the first systematic investigation comparing the long-context performance of diffusion LLMs and traditional auto-regressive LLMs.
We first identify a unique characteristic of dLLMs: unlike auto-regressive LLMs, they maintain remarkably ***stable perplexity*** during direct context extrapolation. Moreover, whereas auto-regressive models fail outright on the Needle-In-A-Haystack task when the context exceeds their pretrained length, we discover that dLLMs exhibit a distinct ***local perception*** phenomenon, enabling successful retrieval from recent context segments. We explain both phenomena through the lens of Rotary Position Embedding (RoPE) scaling theory. Building on these observations, we propose LongLLaDA, a training-free method that integrates LLaDA with NTK-based RoPE extrapolation. Our results validate that established extrapolation scaling laws remain effective for extending the context windows of dLLMs. Furthermore, we identify long-context tasks where dLLMs outperform auto-regressive LLMs and others where they fall short.
Consequently, this study establishes the first length extrapolation method for diffusion LLMs while providing essential theoretical insights and empirical benchmarks critical for advancing future research on long-context diffusion LLMs.
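As a rough illustration of the NTK-based RoPE extrapolation mentioned above, the sketch below uses the commonly cited "NTK-aware" base rescaling, in which the rotary base is multiplied by `scale ** (d / (d - 2))` for head dimension `d`. The function names, the default base of 10000, and the example scale factor are assumptions made for illustration; the abstract does not state LongLLaDA's actual configuration.

```python
import numpy as np

def rope_inv_freq(head_dim, base=10000.0):
    """Standard RoPE inverse frequencies, one per rotated pair of dimensions."""
    return 1.0 / (base ** (np.arange(0, head_dim, 2) / head_dim))

def ntk_scaled_inv_freq(head_dim, scale, base=10000.0):
    """NTK-aware scaling (assumed form): enlarge the rotary base so the lowest
    frequency is stretched by `scale` while the highest stays unchanged."""
    new_base = base * scale ** (head_dim / (head_dim - 2))
    return rope_inv_freq(head_dim, new_base)

def rope_angles(positions, inv_freq):
    """Rotation angle for every (position, frequency) pair."""
    return np.outer(positions, inv_freq)

# Hypothetical setting: stretch a model's usable context by a factor of 4.
original = rope_inv_freq(head_dim=64)
scaled = ntk_scaled_inv_freq(head_dim=64, scale=4.0)
angles = rope_angles(np.arange(16), scaled)  # shape (16, 32)
```

Under this rescaling the fastest-rotating dimension is left untouched and the slowest-rotating dimension rotates `scale` times more slowly, so no retraining is involved; only the position frequencies change at inference time.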

AAAI 2026

LongLLaDA: Unlocking Long Context Capabilities in Diffusion LLMs

long-context language models

diffusion language models

large language models

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held at Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.

Conversion represents an effective approach for obtaining low-power models by transforming Artificial Neural Networks (ANNs) into event-driven Spiking Neural Networks (SNNs) without additional training. However, existing spiking neuron models for conversion introduce substantial conversion errors due to insufficient comparative analysis of ANN activation distributions and SNN spike rate ranges. Here, we first reveal that channel-wise activation distributions exhibit distinct offsets, while spike rates typically lack such offsets and are configured layer-wise, resulting in severe distributional mismatch. To address this limitation, we propose Adaptive Integrate-and-Fire (AIF) neurons with channel-specific characteristics that perceive channel-wise offsets of activation distributions and dynamically adjust spike rates, thereby minimizing conversion errors. Experimental results across multiple vision and natural language processing datasets demonstrate state-of-the-art performance, with a notable achievement of 85.52% accuracy on ImageNet-1K. Furthermore, our approach adds negligible time overhead to the conversion process, offering substantial practical value for conversion applications.

Towards Training-Free and Accurate ANN-to-SNN Conversion via Activation-Aware Redistribution

Spiking neural networks (SNNs) have demonstrated significant potential in real-time multi-sensor perception tasks due to their event-driven and parameter-efficient characteristics. A key challenge is the timestep-wise iterative update of neuronal hidden states (membrane potentials), which complicates the trade-off between accuracy and latency. SNNs tend to achieve better performance with longer timesteps, inevitably resulting in higher computational overhead and latency compared to artificial neural networks (ANNs). Moreover, many recent advances in SNNs rely on architecture-specific optimizations, which, while effective with fewer timesteps, often limit generalizability and scalability across modalities and models. To address these limitations, we propose Activation-wise Membrane Potential Propagation (AMP2), a unified hidden state update mechanism for SNNs. Inspired by the spatial propagation of membrane potentials in biological neurons, AMP2 enables dynamic transmission of membrane potentials among spatially adjacent neurons, facilitating spatiotemporal integration and cooperative dynamics of hidden states, thereby improving efficiency and accuracy while reducing reliance on extended temporal updates. This simple yet effective strategy significantly enhances SNN performance across various architectures, including MLPs and CNNs for point cloud and event-based data. Furthermore, ablation studies integrating AMP2 into Transformer-based SNNs for classification tasks demonstrate its potential as a general-purpose and efficient solution for spiking neural networks.

Activation-wise Propagation: A One-Timestep Strategy for Spiking Neural Networks

Class incremental medical image segmentation (CIMIS) aims to preserve knowledge of previously learned classes while learning new ones without relying on old-class annotations. However, existing methods 1) either adopt one-size-fits-all strategies that treat all spatial regions and feature channels equally, which may hinder the preservation of accurate old knowledge, 2) or focus solely on aligning local prototypes with global ones for old classes while overlooking their local representations in new data, leading to knowledge degradation. To mitigate the above issues, we propose Prototype-Guided Calibration Distillation (PGCD) and Dual-Aligned Prototype Distillation (DAPD) for CIMIS in this paper. Specifically, PGCD exploits prototype-to-feature similarity to calibrate class-specific distillation intensity in different spatial regions, effectively reinforcing reliable old knowledge and suppressing misleading cues from old classes. Complementarily, DAPD aligns the local prototypes of old classes extracted from the current model with both global historical prototypes and local prototypes, further enhancing segmentation performance on old categories. Comprehensive evaluations on two widely used multi-organ segmentation benchmarks demonstrate that our method outperforms current state-of-the-art methods, highlighting its robustness and generalization capabilities.

Class Incremental Medical Image Segmentation via Prototype-Guided Calibration and Dual-Aligned Distillation

Fault Diagnosis (FD) on sequential data suffers from irregular sampling (with missing values), limited training data, and varying underlying environments. In response, this paper proposes FD by adjoint learning in continuous-time model space. Model-Space Learning employs well-fitted models that capture the data's dynamics (i.e., changing information) as more stable and concise representations of the original data. The Continuous-Time Reservoir Computing Network (CT-Res) is first introduced, which embeds an Ordinary Differential Equation (ODE) within the reservoir-based hidden layer to govern continuous-time hidden-state evolution, naturally handling irregular sampling without relying on fixed time steps and effectively capturing intrinsic data dynamics. By fitting each sequence via CT-Res and representing it with the fitted model, the original sequences are mapped from the data space into the continuous-time model space. We further develop an adjoint learning strategy by incorporating a discrete-time "adjoint Echo State Network (ESN)" that shares structure and parameters with CT-Res, thus enabling efficient training by bypassing the computationally intensive ODE solver, with joint optimization of fitting accuracy and class discrimination in the model space. Experiments on multiple FD benchmarks highlight the effectiveness and efficiency of our approach, particularly with missing values and scarce training data.

Fault Diagnosis of Irregular Sequences by Adjoint Learning in Continuous-Time Model Space

Attributed Question Answering (AQA) aims to enhance the reliability of AI-generated answers by including references for each statement, helping users to validate the provided information. However, existing work on AQA has primarily focused on text-only input, and has largely overlooked the role of multimodality. We introduce MAVis, a first benchmark designed to evaluate end-to-end systems on understanding user intent behind visual questions, retrieving evidence from multimodal documents, and generating answers with citations. Our dataset comprises 157K visual QA instances, where each answer is annotated with sentence-level citations referring to multimodal documents. We develop automatic metrics along three dimensions -- informativeness, groundedness, and fluency -- and demonstrate their strong correlation with human judgments. Our key findings are threefold: (1) LVLMs within multimodal RAG generate more informative and fluent answers than unimodal RAG but exhibit weak groundedness for image documents, a gap amplified in multimodal settings. (2) Given the same multimodal documents, there is a trade-off between informativeness and groundedness across different prompting methods. (3) Our proposed method highlights mitigating contextual bias in interpreting image documents as a crucial direction for future research.

MAVIS: A Benchmark for Multimodal Source Attribution in Long-form Visual Question Answering

The self-attention mechanism has been a key factor in the advancement of vision Transformers. However, its quadratic complexity imposes a heavy computational burden in high-resolution scenarios, restricting the practical application. Previous methods attempt to mitigate this issue by introducing handcrafted patterns such as locality or sparsity, which inevitably compromise model capacity. In this paper, we present a novel attention paradigm termed Circulant Attention by exploiting the inherent efficient pattern of self-attention. Specifically, we first identify that the self-attention matrix in vision Transformers often approximates the Block Circulant matrix with Circulant Blocks (BCCB), a kind of structured matrix whose multiplication with other matrices can be performed in $\mathcal{O}(N\log N)$ time. Leveraging this interesting pattern, we explicitly model the attention map as its nearest BCCB matrix and propose an efficient computation algorithm for fast calculation. The resulting approach closely mirrors vanilla self-attention, differing only in its use of BCCB matrices. Since our design is inspired by the inherent efficient paradigm, it not only delivers $\mathcal{O}(N\log N)$ computation complexity, but also largely maintains the capacity of standard self-attention. Extensive experiments on image classification, object detection, and semantic segmentation demonstrate the effectiveness of our approach. These results establish our circulant attention as a promising alternative to self-attention for vision Transformer architectures. Code will be released.
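To make the $\mathcal{O}(N\log N)$ claim concrete, the following sketch shows the standard fact that multiplying a BCCB matrix by a vector is a 2D circular convolution, which the FFT computes without ever forming the matrix. It is not the paper's attention algorithm; `bccb_dense`, `bccb_matvec_fft`, and the choice of a single generating kernel on an `H x W` token grid are assumptions introduced for illustration.

```python
import numpy as np

def bccb_dense(kernel):
    """Build the explicit (H*W, H*W) BCCB matrix generated by an (H, W) kernel.
    Entry [(i, j), (p, q)] equals kernel[(i - p) % H, (j - q) % W]."""
    H, W = kernel.shape
    rows, cols = np.divmod(np.arange(H * W), W)  # row-major grid coordinates
    di = (rows[:, None] - rows[None, :]) % H
    dj = (cols[:, None] - cols[None, :]) % W
    return kernel[di, dj]

def bccb_matvec_fft(kernel, x):
    """Same product via 2D FFT: O(N log N) for N = H * W instead of O(N^2)."""
    return np.real(np.fft.ifft2(np.fft.fft2(kernel) * np.fft.fft2(x)))

# Compare the dense product with the FFT route on a small hypothetical grid.
rng = np.random.default_rng(0)
kernel = rng.standard_normal((4, 6))
x = rng.standard_normal((4, 6))
slow = (bccb_dense(kernel) @ x.ravel()).reshape(4, 6)
fast = bccb_matvec_fft(kernel, x)
```

The dense route stores and multiplies an `N x N` matrix, while the FFT route only touches the `H x W` kernel, which is the property that makes a BCCB-structured attention map cheap to apply.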

Vision Transformers Are Circulant Attention Learners

Recent works have sought to enhance the controllability and precision of text-driven motion generation. Some approaches leverage large language models (LLMs) to produce more detailed texts, while others incorporate global 3D coordinate sequences as additional control signals. However, the former often introduces misaligned details and lacks explicit temporal cues, and the latter incurs significant computational cost when converting coordinates to standard motion representations.
To address these issues, we propose FineXtrol, a novel control framework for efficient motion generation guided by temporally-aware, precise, and fine-grained textual control signals that describe specific body part movements over time. 
In support of this framework, we design a hierarchical contrastive learning module that encourages the text encoder to produce more discriminative embeddings for our novel control signals, thereby improving motion controllability.
Quantitative results show that FineXtrol achieves strong performance in controllable motion generation, while qualitative analysis demonstrates its flexibility in directing specific body part movements. 
Code will be released.

FineXtrol: Controllable Motion Generation via Fine-Grained Text

Multi-view 3D object detection has garnered increasing attention, particularly due to its success in autonomous driving systems. Although multi-view systems possess rich semantic information, their spatial-geometric reasoning capabilities remain limited. Recent studies employ simulated point cloud generation mechanisms to facilitate LiDAR-camera multi-modal knowledge distillation, achieving formal structural consistency. However, these methods still suffer from two major drawbacks: i) alignment challenges arising from significant discrepancies between LiDAR and camera, and ii) prediction errors from the simulated point cloud that may degrade the extracted image semantic information during fusion. Accordingly, we propose adaptive-smooth distillation to optimize the granularity of the alignment based on the feature discrepancy for LiDAR-camera knowledge distillation. Specifically, this work considers both LiDAR-to-camera cross-modal distillation and multi-modal distillation from LiDAR-camera fusion to simulated point cloud-camera fusion. Then, we introduce a heterogeneous fusion module to strategically bias the fusion process toward the extracted camera features, thereby enhancing the robustness of the fused feature. Additionally, soft-weighted response distillation is proposed to help the student model selectively mimic the high-quality output of the teacher model. Extensive experiments have quantified the superiority of our method, achieving statistically significant improvements of 4.9% mAP and 4.5% NDS.

Adaptive-Smooth LiDAR-Camera Knowledge Distillation with Heterogeneous Fusion for Multi-View 3D Object Detection

Large vision-language models (LVLMs) excel at visual understanding, but face efficiency challenges due to quadratic complexity in processing long multi-modal contexts. While token compression can reduce computational costs, existing approaches are designed for single-view LVLMs and fail to consider the unique multi-view characteristics of high-resolution LVLMs with dynamic cropping. Existing methods treat all tokens uniformly, but our analysis reveals that global thumbnails can naturally guide the compression of local crops by providing holistic context for informativeness evaluation.
In this paper, we first analyze the dynamic cropping strategy, revealing both the complementary nature between thumbnails and crops, and the distinctive characteristics across different crops. Based on our observations, we propose "Global Compression Commander" (i.e., **GlobalCom$^2$**), a novel plug-and-play token compression framework for HR-LVLMs. GlobalCom$^2$ leverages the thumbnail as the "commander" to guide the compression of local crops, adaptively preserving informative details while eliminating redundancy.
Extensive experiments show that GlobalCom$^2$ maintains over **90%** performance while compressing **90%** of visual tokens, reducing FLOPs and peak memory to **9.1%** and **60%** respectively. Code is available in supplementary materials.

Global Compression Commander: Plug-and-Play Inference Acceleration for High-Resolution Large Vision-Language Models

Partially View-aligned Clustering (PVC) addresses the challenge of partial view alignment in multi-view learning by leveraging complementary and consistent information. While existing PVC methods show promise, most rely on distance-based strategies that are sensitive to view-specific details and noise, limiting their robustness. In this work, we propose a novel view alignment strategy that reformulates the alignment task as an anomaly detection problem. Rather than learning a view-alignment matrix that enforces strict one-to-one correspondences across views, we adopt a progressive approach to identify well-aligned samples. Specifically, we sample subsets of data by generating random view combinations from unaligned samples and propose an anomaly combination detection module to evaluate the alignment consistency of these combinations. In addition, our progressive training framework alternates between updating model parameters and selecting high-confidence view combinations for subsequent optimization. By reformulating view alignment as an anomaly detection task, our approach provides a more robust and effective solution to partial view alignment. Experiments on benchmark datasets demonstrate that our method outperforms state-of-the-art approaches in the PVC problem.
