Singapore

Resolution generalization in image generation tasks enables the production of higher-resolution images with lower training resolution overhead. However, a significant challenge in resolution generalization, particularly in the widely used Diffusion Transformers, lies in the mismatch between the positional encodings encountered during testing and those used during training. While existing methods have employed techniques such as interpolation, extrapolation, or their combinations, none have fully resolved this issue. In this paper, we propose a novel two-dimensional randomized positional encodings (RPE-2D) framework that focuses on learning the positional order of image patches instead of the specific distances between them, enabling seamless high- and low-resolution image generation without requiring training on high- and low-resolution images. Specifically, RPE-2D independently selects positions over a broader range along both the horizontal and vertical axes, ensuring that all position encodings used during the inference phase have been trained, thus improving resolution generalization. Additionally, we propose a random data augmentation technique to enhance the modeling of position order. To address the issue of image cropping caused by the augmentation, we introduce corresponding micro-conditioning to enable the model to perceive the specific cropping patterns. On the ImageNet dataset, our proposed RPE-2D achieves state-of-the-art resolution generalization performance, outperforming existing competitive methods when trained at a resolution of $256 \times 256$ and inferred at $384 \times 384$ and $512 \times 512$, as well as when scaling from $512 \times 512$ to $768 \times 768$ and $1024 \times 1024$. It also exhibits outstanding capabilities in low-resolution image generation, multi-stage training acceleration, and multi-resolution inheritance.
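The per-axis position selection described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `max_position` parameter, and the choice to draw a sorted random subset of indices (so that patch order is preserved while distances vary) are all assumptions made for illustration, chosen to be consistent with the abstract's emphasis on positional order.

```python
import random


def sample_axis_positions(num_patches, max_position, rng):
    """Draw num_patches distinct indices from range(max_position), in increasing order.

    Sorting keeps the relative order of patches while the gaps between
    neighbouring indices vary from sample to sample (assumed behaviour).
    """
    if num_patches > max_position:
        raise ValueError("num_patches must not exceed max_position")
    return sorted(rng.sample(range(max_position), num_patches))


def sample_rpe_2d_grid(height, width, max_position, seed=None):
    """Return a height x width grid of (row_index, col_index) position pairs.

    Rows and columns are sampled independently, as the abstract describes.
    max_position is a hypothetical upper bound chosen to cover the largest
    grid expected at inference time.
    """
    rng = random.Random(seed)
    rows = sample_axis_positions(height, max_position, rng)
    cols = sample_axis_positions(width, max_position, rng)
    return [[(r, c) for c in cols] for r in rows]


if __name__ == "__main__":
    # A 4 x 4 patch grid with indices drawn from a wider range of 16 per axis.
    for row in sample_rpe_2d_grid(height=4, width=4, max_position=16, seed=0):
        print(row)
```

Because every training step draws a fresh subset, indices across the whole `0 .. max_position - 1` range are visited over time, so a larger inference grid can reuse indices the model has already seen. How the paper selects positions at inference is not stated in the abstract.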

AAAI 2026

Boosting Resolution Generalization of Diffusion Transformers with Randomized Positional Encodings

randomized positional encodings

diffusion transformers

resolution generalization

image generation

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held at Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.

We recommend reading [**the registration information**](https://aaai.org/conference/aaai/aaai-26/registration/) first.

**Online Registration Form**: https://aaai.getregistered.net/conference-2026 

Multi-view automatic translational correction (ATC) in coronary angiography (CAG) is a fundamental step for intraoperative diagnosis and downstream 3D reconstruction. However, learning-based ATC methods require large-scale annotated datasets, which are difficult to obtain due to heartbeat-induced vascular deformation and high labeling costs. Synthetic datasets have been widely adopted as a supplement, but they fail to provide sufficient supervision for clinical models because of a significant gap in both style and structure. To address this, we propose a novel annotation-free framework for high-quality CAG data synthesis and robust ATC training. Our approach generates a fully labeled, high-fidelity dataset by simulating realistic, dense, continuous-view CAG sequences without manual annotation. Furthermore, to mitigate cross-view matching errors caused by non-rigid motion, we introduce an evolutionary epipolar optimization algorithm that refines geometric consistency under large viewpoint variations. Meanwhile, theoretical analysis shows that our proposed neighboring-view error propagation strategy leads to reduced matching error compared to conventional cross-view computation. Extensive experiments on real clinical datasets demonstrate that our annotation-free approach significantly outperforms weakly supervised baselines and achieves performance on par with fully supervised models trained on real annotations. The method also generalizes well to multi-center datasets, highlighting its robustness and clinical potential. Code is available in Supplementary Material.

Automatic Translational Correction of Multi-View Coronary Angiography Based on Auto-Annotation Data Generation

Large Language Models (LLMs) exhibit remarkable code generation capabilities but falter when adapting to frequent updates in external library APIs. This critical limitation, stemming from reliance on outdated API knowledge from their training data, even with access to current documentation, impedes reliable code generation in dynamic environments. To tackle this issue, we propose ReCode (rule-based Reinforcement learning for Code Update), a novel framework that mimics human programmer adaptation to API changes. Specifically, we construct a dataset of approximately 2,000 data entries to train the LLMs to perform version migration based on updated information. Then, we introduce a modified string similarity metric for code evaluation as the reward for reinforcement learning. Our experiments demonstrate that ReCode substantially boosts LLMs' code generation performance in dynamic API scenarios, especially on the unseen CodeUpdateArena task. Crucially, compared to supervised fine-tuning, ReCode has less impact on LLMs' general code generation abilities. We apply ReCode on various LLMs and reinforcement learning algorithms (GRPO and DAPO), all achieving consistent improvements. Notably, after training, Qwen2.5-Coder-7B outperforms both the 32B-parameter code instruction-tuned model and the reasoning model with the same architecture.

ReCode: Updating Code API Knowledge with Reinforcement Learning
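As an illustration of using a string similarity score as a reinforcement learning reward, the sketch below compares generated code with a reference solution. The abstract does not specify ReCode's modified metric, so this swaps in the standard library's `difflib.SequenceMatcher` ratio as a stand-in; the function name `similarity_reward` and the whitespace normalization step are assumptions, not the paper's method.

```python
import difflib


def similarity_reward(generated, reference):
    """Return a reward in [0, 1] based on character-level similarity.

    Stand-in metric: difflib's SequenceMatcher ratio. Runs of whitespace
    are collapsed first so that pure formatting differences are not
    penalized (an assumed design choice, not taken from the paper).
    """
    gen = " ".join(generated.split())
    ref = " ".join(reference.split())
    return difflib.SequenceMatcher(None, gen, ref).ratio()


if __name__ == "__main__":
    # Hypothetical API migration: the reference uses the updated call name.
    reference = "result = lib.new_call(data, axis=0)"
    print(similarity_reward("result = lib.new_call(data, axis=0)", reference))
    print(similarity_reward("result = lib.old_call(data, axis=0)", reference))
```

A rule-based reward of this kind needs no code execution, which keeps it cheap to compute during training; the trade-off is that it measures textual closeness to the reference rather than functional correctness.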

Spatial-temporal prediction plays a crucial role in various domains, including intelligent transportation and environmental monitoring. Although large language models have shown advantages in long-range dependency modeling and excellent generalization ability for forecasting, they have limited understanding of spatial-temporal features. Especially for spatial features, most existing methods still simplify the spatial-temporal prediction task into multiple independent temporal prediction tasks, failing to effectively encode the dynamic evolution of spatial relations. To address these problems, we propose ST-VLM (Spatial-Temporal Forecasting with Vision-Language Model), a novel framework that leverages visual representations to encode the dynamic spatial dependencies within spatial-temporal data and integrates multi-modal information to enhance prediction. This framework transforms spatial-temporal features into three modalities (vision, text, and time series), enhances cross-modal fusion through an attention-aware fusion mechanism in the first layer of the Vision-Language Model (VLM), and optimizes multi-modal feature interaction via adaptive fine-tuning strategies. After fusion, the multi-modal embeddings are used for the final spatial-temporal prediction task. Extensive experiments demonstrate that ST-VLM achieves state-of-the-art performance across various datasets. In particular, the framework exhibits promising results in few-shot scenarios, verifying its strong generalization ability.

ST-VLM: A Spatial-to-Image Multimodal Spatial-Temporal Prediction Framework with Vision-Language Model

Test-time compute has led to remarkable success in the large language model (LLM) community, particularly for complex tasks, where longer chains of thought (CoTs) are generated to enhance reasoning capabilities. However, growing evidence reveals that such reasoning models often produce CoTs plagued by excessive redundancy, including unnecessary verification steps and repetitive reasoning shifts. The root cause lies in their post-training, which relies too heavily on outcome reward paradigms, because data for process reward paradigms, which regulate intermediate reasoning steps, is difficult to construct at scale. To address this, we propose **PI**, a novel framework for Test-time **P**rompt **I**ntervention. PI provides an interface to dynamically guide and regulate reasoning paths during inference through timely (**When** module) and proper (**How** module) interventions and post-intervention sampling (**Which** module). This allows human problem-solving expertise and cognitive science principles to be seamlessly integrated into LLMs’ reasoning processes, enhancing controllability and interpretability. Extensive experiments across multiple models and datasets demonstrate that PI significantly shortens CoTs while reducing hallucination, yielding more concise and reliable reasoning.

Test-time Prompt Intervention

Contrastive Language–Image Pre-training (CLIP) has demonstrated strong generalization across a wide range of visual tasks by leveraging large-scale English–image pairs. However, its extension to low-resource languages remains limited due to the scarcity of high-quality multilingual image–text data. Existing multilingual vision–language models exhibit consistently low retrieval performance in underrepresented languages—including Czech, Finnish, Croatian, Hungarian, Romanian—on the Crossmodal-3600 (XM3600) benchmark. To address this, we propose a lightweight and data-efficient framework for multilingual vision–language alignment. Our approach requires no image–text pairs or text-text pairs and freezes both the pretrained image encoder and multilingual text encoder during training. Only a compact 1.7M-parameter projection module is trained, using a contrastive loss over English representations as semantic anchors. This minimal training setup enables robust multilingual alignment even for languages with limited supervision. Extensive evaluation across multiple multilingual retrieval benchmarks confirms the effectiveness of our method, showing significant gains in five underrepresented languages where existing models typically underperform. These findings highlight the effectiveness of our pivot-based, parameter-efficient alignment strategy for inclusive multimodal learning.

uCLIP: Parameter-Efficient Multilingual Extension of Vision-Language Models with Unpaired Data

The application of federated domain generalization in person re-identification (FedDG-ReID) aims to enhance the model's generalization ability in unseen domains while protecting client data privacy. However, existing mainstream methods typically rely on global feature representations and simple averaging operations for model aggregation, leading to two limitations in domain generalization: (1) Using only global features makes it difficult to capture subtle, domain-invariant local details (such as accessories or textures); (2) Uniform parameter averaging treats all clients as equivalent, ignoring their differences in robust feature extraction capabilities, thereby diluting the contributions of high-quality clients. To address these issues, we propose a novel federated learning framework—Federated Aggregation via Robust and Discriminative Knowledge Selection and Integration (FedARKS)—comprising two mechanisms: RK (Robust Knowledge) and KS (Knowledge Selection). In our design, each client employs a dual-branch network of RK: the Global Feature Processing Branch serves as the primary component, extracting overall representations for model aggregation and server-side updates; while the Body Part Processing Branch acts as an auxiliary component, focusing on extracting domain-invariant local details to supplement and guide the local training process during global feature learning. Additionally, our KS mechanism adaptively assigns corresponding aggregation weights to clients based on their ability to extract domain-invariant knowledge, enabling the server to better integrate cross-domain invariant knowledge extracted by clients. Extensive experiments validate that FedARKS achieves state-of-the-art generalization results on the FedDG-ReID benchmark, demonstrating that learning subtle body part features can effectively assist and reinforce global representations, thereby enabling robust cross-domain person ReID capabilities.

FedARKS: Federated Aggregation via Robust and Discriminative Knowledge Selection and Integration for Person Re-identification

Class imbalance has been extensively studied in single-view scenarios; however, addressing this challenge in multi-view contexts remains an open problem, with even scarcer research focusing on trustworthy solutions. In this paper, we tackle a particularly challenging class imbalance problem in multi-view scenarios: long-tailed classification. We propose TMLC, a Trusted Multi-view Long-tailed Classification framework, which makes contributions on two critical aspects: opinion aggregation and pseudo-data generation. Specifically, inspired by Social Identity Theory, we design a group consensus opinion aggregation mechanism that guides decision-making toward the direction favored by the majority of the group. In terms of pseudo-data generation, we introduce a novel distance metric to adapt SMOTE for multi-view scenarios and develop an uncertainty-guided data generation module that produces high-quality pseudo-data, effectively mitigating the adverse effects of class imbalance. Extensive experiments on long-tailed multi-view datasets demonstrate that our model is capable of achieving superior performance.

Trusted Multi-view Learning for Long-tailed Classification

Reconstructing high-fidelity MR images from undersampled k-space data remains a challenging problem in accelerated MRI. While Mamba variants for vision tasks offer promising long-range modeling capabilities with linear-time complexity, their direct application to MRI reconstruction inherits two key limitations: (1) insensitivity to high-frequency anatomical details; and (2) reliance on redundant multi-directional scanning. To address these limitations, we introduce HiFi-Mamba, a novel dual-stream Mamba-based architecture comprising stacked 𝓌-Laplacian (WL) and HiFi-Mamba blocks. Specifically, the WL block performs fidelity-preserving spectral decoupling, producing complementary low- and high-frequency streams. This separation enables the HiFi-Mamba Block to focus on low-frequency structures, enhancing global feature modeling. Concurrently, the HiFi-Mamba block selectively integrates high-frequency features through adaptive state-space modulation, preserving comprehensive spectral details. To eliminate the scanning redundancy, the HiFi-Mamba Block adopts a streamlined unidirectional traversal strategy that preserves long-range modeling capability with improved computational efficiency. Extensive experiments on standard MRI reconstruction benchmarks demonstrate that HiFi-Mamba consistently outperforms state-of-the-art CNN-based, Transformer-based, and other Mamba-based models in reconstruction accuracy while maintaining a compact and efficient model design.

HiFi-Mamba: Dual-Stream 𝓌-Laplacian Enhanced Mamba for High-Fidelity MRI Reconstruction

Multimodal change detection (MCD) has important applications in disaster assessment, but the nonlinear distortion of features and spatial misalignment caused by sensor imaging differences make it difficult to obtain changes through direct comparison. To overcome the above problems, this study aims to realize MCD by capturing the modality-independent structural commonality features between Multimodal Remote Sensing Images (MRSIs). To achieve this, we devise a basic Graph Kolmogorov-Arnold Network (GKAN) to excavate spatial structural relationships and cross-modal nonlinear mappings simultaneously. Based on this, we propose a Dual-branch GKAN (DGKAN) for unsupervised MCD, which can capture spatial-spectral structural commonality features and compare them directly to detect changes. Concretely, the GKAN is used within the DGKAN to build two autoencoders consisting of a Siamese encoder and two independent decoders to learn spatial-spectral structural commonality features through feature reconstruction. Besides, we introduce a Covariance Structural Commonality Loss (CSCL), which guides the network in extracting spatial-spectral structural commonality features between MRSIs by unsupervised constraints on the distributional consistency of cross-modal features. Experiments on several MCD datasets show that the proposed DGKAN can achieve convincing results, and ablation studies verify the effectiveness of the GKAN and CSCL. The code will be available.

DGKAN: Dual-branch Graph Kolmogorov-Arnold Network for Unsupervised Multimodal Change Detection

Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by integrating up-to-date external knowledge, yet real-world web environments pose two key challenges: pervasive misinformation, which introduces unreliable or misleading content that can degrade retrieval accuracy, and the underutilization of web tools, which, if effectively employed, could enhance query precision and help mitigate this noise, ultimately improving retrieval results in RAG systems. To address these issues, we propose WebFilter, a novel RAG framework that generates source-restricted queries and filters out unreliable content. This approach combines a retrieval filtering mechanism with a behavior- and outcome-driven reward strategy, optimizing both query formulation and retrieval outcomes. Extensive experiments demonstrate that WebFilter improves answer quality and retrieval precision, outperforming existing RAG methods on both in-domain and out-of-domain benchmarks. Code is available at: https://github.com/GuoqingWang1/WebFilter.
