Algorithms for solving nonlinear fixed-point equations, such as average-reward $Q$-learning and TD-learning, often involve semi-norm contractions. Achieving parameter-free optimal convergence rates for these methods via Polyak–Ruppert averaging has remained elusive, largely due to the non-monotonicity of such semi-norms. We close this gap by (i) recasting the averaged error as a linear recursion involving a nonlinear perturbation, and (ii) taming the nonlinearity by coupling the semi-norm's contraction with the monotonicity of a suitably induced norm. Our main result yields the first parameter-free $\tilde{O}(1/\sqrt{t})$ optimal rates for $Q$-learning in both the average-reward and exponentially discounted settings, where $t$ denotes the iteration index. The result applies within a broad framework that accommodates synchronous and asynchronous updates, single-agent and distributed deployments, and data streams obtained either from simulators or along Markovian trajectories.
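To make the setting concrete, below is a minimal Python sketch of one instance covered by the abstract: synchronous, exponentially discounted $Q$-learning driven by a simulator, with Polyak–Ruppert averaging of the iterates. It is an illustration only, not the paper's implementation; the random MDP, the polynomial step-size schedule $\alpha_t = t^{-0.7}$, and all variable names are assumptions made for this sketch.

```python
# Illustrative sketch (assumed setup, not the paper's code): synchronous
# discounted Q-learning on a small random MDP, with Polyak-Ruppert
# averaging of the iterates.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma, T = 5, 3, 0.9, 10_000  # illustrative sizes

# Random MDP: P[s, a] is a distribution over next states; R holds
# expected rewards.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.uniform(size=(n_states, n_actions))

Q = np.zeros((n_states, n_actions))   # raw iterate Q_t
Q_avg = np.zeros_like(Q)              # Polyak-Ruppert average of Q_1..Q_t

for t in range(1, T + 1):
    # Synchronous update: every (s, a) pair is refreshed using a next
    # state sampled from the simulator.
    s_next = np.array([[rng.choice(n_states, p=P[s, a])
                        for a in range(n_actions)]
                       for s in range(n_states)])
    target = R + gamma * Q[s_next].max(axis=-1)
    alpha = 1.0 / t**0.7              # assumed polynomial step size
    Q += alpha * (target - Q)
    # Running average: Q_avg_t = Q_avg_{t-1} + (Q_t - Q_avg_{t-1}) / t
    Q_avg += (Q - Q_avg) / t

print("averaged Q-values:\n", Q_avg)
```

The running-average update is the standard incremental form of the sample mean, so `Q_avg` after iteration $t$ equals $\frac{1}{t}\sum_{k=1}^{t} Q_k$; it is this averaged iterate, rather than the raw one, whose error the analysis above controls.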