Singapore

Large Language Models (LLMs) exhibit pronounced memory-bound characteristics during inference due to High Bandwidth Memory (HBM) bandwidth constraints. In this paper, we propose an L2 cache-oriented asynchronous KV Cache prefetching method that breaks through the memory bandwidth bottleneck in LLM inference by overlapping computation with memory loads. By strategically scheduling idle memory bandwidth during active computation windows, our method proactively prefetches the required KV Cache into the GPU L2 cache, enabling high-speed L2 cache hits for subsequent accesses and effectively hiding HBM access latency within computational cycles. Extensive experiments on NVIDIA H20 GPUs demonstrate that the proposed method achieves a 2.15× improvement in attention kernel efficiency and up to a 1.97× improvement in end-to-end throughput, surpassing the state-of-the-art baseline FlashAttention-3. Notably, our solution is orthogonal to existing optimization techniques and can be integrated with current inference frameworks, providing a scalable latency-hiding solution for next-generation LLM inference engines.

AAAI 2026

Accelerating LLM Inference Throughput via Asynchronous KV Cache Prefetching

prefetch

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.


Source-free domain adaptation (SFDA) aims to transfer knowledge from the source domain to an unlabeled target domain without requiring access to source data. Although previous works have focused on clustering target domain samples through continuous training, some challenges remain: i) more source domain knowledge is forgotten as the number of training epochs grows; ii) achieving better learning results often requires increased computational resources. To solve these problems, we propose a novel Marginal Exploration for Source-Free Domain Adaptation (ME-SFDA) method, a multi-scale information fusion learning method based on our designed Pyramidal Atkinson-Shiffrin memory. Specifically, we design a two-step module that uses sensory memory to split samples into clustered cores and response scatters. Then, a novel technique is proposed for clustering samples in a hierarchical way, utilizing long-term memory to cluster the cores derived from the earlier split and to guide the response scatters. To effectively divide samples of different classes, we propose a method that encourages unambiguous cluster assignments for the samples using multi-scale fusion information. To verify the generality of our approach, we not only discuss the UDA and SFDA tasks but also apply it to semi-supervised domain adaptation (SSDA), which utilizes a few labeled target samples based on UDA. Extensive experiments on four standard benchmarks indicate that our approach outperforms previous SOTA methods.

ME-SFDA: Marginal Exploration with Pyramidal Atkinson-Shiffrin Memory for Source-Free Domain Adaptation

Reflection removal of a single image remains a highly challenging task due to the complex entanglement between target scenes and unwanted reflections. Despite significant progress, existing methods are hindered by the scarcity of high-quality, diverse data and insufficient restoration priors, resulting in limited generalization across various real-world scenarios. In this paper, we propose Dereflection Any Image, a comprehensive solution with an efficient data preparation pipeline and a generalizable model for robust reflection removal. First, we introduce a dataset named Diverse Reflection Removal (DRR) created by randomly rotating reflective mediums in target scenes, enabling variation of reflection angles and intensities, and setting a new benchmark in scale, quality, and diversity. Second, we propose a diffusion-based framework with one-step diffusion for deterministic outputs and fast inference. To ensure stable learning, we design a three-stage progressive training strategy, including reflection-invariant finetuning to encourage consistent outputs across varying reflection patterns that characterize our dataset. Extensive experiments show that our method achieves SOTA performance on both common benchmarks and challenging in-the-wild images, showing superior generalization across diverse real-world scenes.

Dereflection Any Image with Diffusion Priors and Diversified Data

Classes, as fundamental elements of Computer Vision, have been extensively studied within incremental learning frameworks. In contrast, tokens, which play essential roles in many research fields, exhibit similar characteristics of growth, yet investigations into their incremental learning remain significantly scarce. This research gap primarily stems from the holistic nature of tokens in language, which imposes significant challenges on the design of incremental learning frameworks for them. To overcome this obstacle, in this work, we turn to a type of token, gene, for a large-scale biological dataset—single-cell transcriptomics—to formulate a pipeline for gene incremental learning and establish corresponding evaluations. We found that the forgetting problem also exists in gene incremental learning, thus we adapted existing class incremental learning methods to mitigate the forgetting of genes. Through extensive experiments, we demonstrated the soundness of our framework design and evaluations, as well as the effectiveness of the method adaptations. Finally, we provide a complete benchmark for gene incremental learning in single-cell transcriptomics.

Gene Incremental Learning for Single-Cell Transcriptomics

Recent advances in large language models (LLMs) have driven impressive progress in omni-modal understanding and generation. However, training omni-modal LLMs remains a significant challenge due to the heterogeneous model architectures required to process diverse modalities, necessitating sophisticated system design for efficient large-scale training. Existing frameworks typically entangle model definition with parallel logic, resulting in limited scalability and substantial engineering overhead for end-to-end omni-modal training. We present OmniScale, a modular and efficient training framework to accelerate the development of omni-modal LLMs. OmniScale introduces model-centric distributed recipes that decouple communication from computation, enabling efficient 3D parallelism on omni-modal LLMs. OmniScale also features a flexible configuration interface supporting seamless integration of new modalities with minimal code change. Using OmniScale, an omni-modal mixture-of-experts (MoE) model with 30B parameters can be trained with over 2,800 tokens/sec/GPU throughput and scale to 160K context lengths via 3D parallelism on 128 GPUs, showcasing its superior efficiency and scalability for training large omni-modal LLMs.

OmniScale: Scaling Any Modality Model Training with Model-Centric Distributed Recipe Zoo

Anomaly detection on data streams presents significant challenges, requiring methods to maintain high detection accuracy under evolving distributions while ensuring real-time efficiency. Here we introduce $\mathcal{IDK}$-$\mathcal{S}$, a novel $\mathbf{I}$ncremental $\mathbf{D}$istributional $\mathbf{K}$ernel for $\mathbf{S}$treaming anomaly detection that effectively addresses these challenges by creating a new dynamic representation in the kernel mean embedding framework. The superiority of $\mathcal{IDK}$-$\mathcal{S}$ is attributed to two key innovations. First, it inherits the strengths of the Isolation Distributional Kernel, an offline detector that has demonstrated significant performance advantages over foundational methods like Isolation Forest and Local Outlier Factor due to the use of a data-dependent kernel. Second, it adopts a lightweight incremental update mechanism that significantly reduces computational overhead compared to the naive baseline strategy of performing a full model retraining. This reduction is achieved without compromising detection accuracy, a claim supported by its statistical equivalence to the fully retrained model. Our extensive experiments on thirteen benchmarks demonstrate that $\mathcal{IDK}$-$\mathcal{S}$ achieves superior detection accuracy while operating substantially faster, in many cases by an order of magnitude, than existing state-of-the-art methods.

IDK-S: Incremental Distributional Kernel for Streaming Anomaly Detection
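The idea of updating a kernel mean embedding incrementally rather than recomputing it can be illustrated with a generic running-mean update. The sketch below is not the IDK-S mechanism: the one-hot `feature_map`, the `centers`, and the `RunningMeanEmbedding` class are hypothetical stand-ins chosen only to keep the example small. It assumes an explicit finite-dimensional feature map, so the mean embedding is simply the average of the feature vectors seen so far.

```python
def feature_map(x, centers):
    """Hypothetical feature map: one-hot vector marking the nearest center."""
    nearest = min(range(len(centers)), key=lambda j: abs(x - centers[j]))
    return [1.0 if j == nearest else 0.0 for j in range(len(centers))]


class RunningMeanEmbedding:
    """Keeps the mean of all feature vectors seen so far, updated per point."""

    def __init__(self, dim):
        self.count = 0
        self.mean = [0.0] * dim

    def update(self, phi):
        # Running-mean update: mean_n = mean_{n-1} + (phi - mean_{n-1}) / n
        self.count += 1
        self.mean = [m + (p - m) / self.count for m, p in zip(self.mean, phi)]

    def score(self, phi):
        # Similarity of a point to the embedded distribution; low = unusual.
        return sum(m * p for m, p in zip(self.mean, phi))


centers = [0.0, 1.0, 10.0]  # assumed reference points, for illustration only
stream = [0.1, 0.9, 1.2, 0.2, 1.1, 0.0, 0.8, 9.7]

emb = RunningMeanEmbedding(len(centers))
for x in stream:
    emb.update(feature_map(x, centers))

typical_score = emb.score(feature_map(1.0, centers))
outlier_score = emb.score(feature_map(9.7, centers))
print(typical_score, outlier_score)
```

Each update costs time proportional to the feature dimension, independent of how many points have already arrived, and the running mean matches the batch mean over the same points up to floating-point rounding.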

Recent brain decoding studies have primarily emphasized the development of brain decoders, while largely neglecting the segmentation step. Existing methods typically adopt fixed-length segmentation, which might overlook subject- or task-level variability and disrupt intrinsic neural structures within brain signals. To address this gap, we propose $\textbf{S}^\textbf{3}$, which leverages spiking neurons as an isolating segmenter for brain signal decoding. $\textbf{S}^\textbf{3}$ segments brain signals adaptively, considering subject- and task-level variability while preserving intrinsic neural structures in brain signals. It exploits the unique reset mechanism of spiking neurons to enforce temporal pattern isolation for the generation of each segmentation point. To optimize $\textbf{S}^\textbf{3}$ for enhancing task performance in the absence of segmentation labels, we develop an optimization method where pseudo-labels are created with a stochastic-greedy algorithm to optimize them, circumventing gradient blockade between them. Experiments on 10 downstream tasks across 13 public datasets demonstrate that $\textbf{S}^\textbf{3}$ consistently outperforms existing methods, validating its effectiveness, generalizability and interpretability.

S³: Spiking Neurons as an Isolating Segmenter for Brain Signal Decoding

Multi-modal object re-identification (ReID) aims to retrieve specific targets by leveraging complementary cues from different sensing modalities. Despite recent progress, two key challenges remain:
(1) the limited ability to jointly address both modality and viewpoint discrepancies, and
(2) the difficulty of effectively leveraging reliable target-domain data to improve generalization.
To address these challenges, we propose Proxy-driven Test-Time Training (ProxyTTT), a unified framework that enhances both multi-modal identity representation learning and model generalization. During training, we propose a Multi-Proxy Learning (MPL) mechanism to address the representation bias across different views and modalities. MPL disentangles fine-grained modality-specific and modality-common identity proxies as semantic anchors to align identity features across diverse perspectives and sensing modalities. This alignment strategy enables the model to learn robust and discriminative global identity representations under heterogeneous modality conditions.
At test time, to reliably exploit target domain data, we propose Proxy-guided Entropy-based Selective Adaptation (PESA) for test-time training. Specifically, PESA leverages the semantic structure encoded by identity proxies to estimate prediction uncertainty via entropy, and selectively adapts the model using only high-confidence samples. This selective adaptation effectively mitigates the domain shift between training and deployment environments, improving the model’s generalization in real-world scenarios.
Extensive experiments on four public multi-modal ReID benchmarks (RGBNT201, RGBNT100, MSVR310, and WMVeID863) demonstrate the effectiveness of ProxyTTT.

ProxyTTT: Proxy-driven Test-Time Training for Multi-modal Re-identification
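Entropy-based selection of high-confidence samples, as described for PESA, can be sketched in a few lines. This is a minimal illustration under assumptions, not the authors' implementation: it assumes each test sample comes with a list of similarity scores to the identity proxies, and the names `select_confident` and `max_entropy` are hypothetical.

```python
import math


def softmax(scores):
    """Convert raw similarity scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; lower means a more confident prediction."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_confident(similarities, max_entropy):
    """Return indices of samples whose prediction entropy is at most max_entropy."""
    selected = []
    for idx, scores in enumerate(similarities):
        if entropy(softmax(scores)) <= max_entropy:
            selected.append(idx)
    return selected


# Assumed sample-to-proxy similarity scores (one row per test sample).
similarities = [
    [8.0, 0.5, 0.2],  # clearly closest to proxy 0
    [1.0, 1.1, 0.9],  # ambiguous between all proxies
    [0.1, 0.3, 6.0],  # clearly closest to proxy 2
]
print(select_confident(similarities, max_entropy=0.5))  # prints [0, 2]
```

Only the selected samples would then be used for the test-time update; the ambiguous second sample is skipped because its entropy is close to the maximum of ln 3 for three proxies.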

Implicit neural representations (INRs) have achieved remarkable success in image representation and compression, but they require substantial training time and memory. Meanwhile, recent 2D Gaussian Splatting (GS) methods (e.g., GaussianImage) offer promising alternatives through efficient primitive-based rendering. However, these methods require excessive Gaussian primitives to maintain high visual fidelity. To exploit the potential of GS-based approaches, we present GaussianImage++, which utilizes a limited number of Gaussian primitives to achieve impressive representation and compression performance. First, we introduce a distortion-driven densification mechanism that progressively allocates the budget of Gaussian primitives according to signal intensity. Second, we employ context-aware Gaussian filters for each primitive, which assist the densification in optimizing Gaussian primitives based on varying image content. Third, we integrate attribute-separated learnable scalar quantizers and quantization-aware training, enabling efficient compression of primitive attributes. Experimental results demonstrate the effectiveness of our method. In particular, GaussianImage++ outperforms GaussianImage and the INR-based COIN in representation and compression performance while maintaining real-time decoding and low memory usage. Our code will be released soon.

GaussianImage++: Boosted Image Representation and Compression with 2D Gaussian Splatting

We introduce a new notion of deterministic stable solution for non-cooperative games, termed subsidized equilibrium. It assumes that an amount of money can be used as a pool of subsidies to stabilize a strategy profile that otherwise would not be accepted by (some of) the players. Roughly speaking, for a given amount of money, a strategy profile is a subsidized equilibrium if the total payoff loss incurred by players not playing best responses does not exceed that amount, i.e., there is enough money to refund all players experiencing regret. Compared with many other solution concepts in the literature, the notion of subsidized equilibrium has important advantages. Specifically, for a sufficiently large amount of money, a subsidized equilibrium always exists and can even be computed in polynomial time; moreover, the existence of an efficient subsidized equilibrium can be guaranteed. Thus, determining for which amounts of money existence, polynomial-time computability, and efficiency can or cannot be achieved becomes an intriguing question. We provide initial results in this direction for some widely studied classes of games.

Compensate to Not Deviate: On Subsidised Equilibria
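The informal definition in the abstract, that a profile is a subsidized equilibrium when the total regret of all players fits within the available money, can be checked directly on a small normal-form game. The sketch below follows that informal reading only; the function names, the payoff-table layout, and the Prisoner's Dilemma payoffs are assumptions made for illustration and are not taken from the paper.

```python
def total_regret(payoffs, profile, strategies):
    """Sum over players of (best-response payoff minus current payoff)."""
    regret = 0
    for i, own_strategies in enumerate(strategies):
        current = payoffs[profile][i]
        # Best payoff player i could get by deviating alone.
        best = max(
            payoffs[profile[:i] + (s,) + profile[i + 1:]][i]
            for s in own_strategies
        )
        regret += best - current
    return regret


def is_subsidized_equilibrium(payoffs, profile, strategies, budget):
    """True if the budget is enough to refund every player's regret."""
    return total_regret(payoffs, profile, strategies) <= budget


# Prisoner's Dilemma with illustrative payoffs: C = cooperate, D = defect.
strategies = [("C", "D"), ("C", "D")]
payoffs = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}

print(total_regret(payoffs, ("C", "C"), strategies))  # prints 4
print(is_subsidized_equilibrium(payoffs, ("C", "C"), strategies, budget=4))  # prints True
print(is_subsidized_equilibrium(payoffs, ("C", "C"), strategies, budget=3))  # prints False
```

With a budget of 0 the check reduces to the usual pure Nash condition, which in this example holds only for mutual defection. Mutual cooperation becomes stable once the subsidy pool covers the combined regret of 4.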

Assessing the strength of arguments is essential for determining the outcomes of any argument-based system. A wide range of semantics has been proposed in the literature. These take as input a set of arguments—each assigned a basic weight and potentially subject to attacks from others—and compute a single strength value for each argument. Despite the diversity of argument types (or schemes), existing semantics apply uniform evaluation criteria across all arguments. In this paper, we advocate for type-dependent evaluations, acknowledging that the impact of attacks can vary across types. Given that many argument-based systems involve heterogeneous types of arguments, we propose a broad family of hybrid semantics that combine distinct base semantics, each tailored to specific argument types. We investigate their theoretical properties, present concrete instances within this family, and examine their computational complexity.
