Singapore

Large-scale alignment pipelines typically pair a policy model with a separately trained reward model whose parameters remain frozen during reinforcement learning (RL). This separation creates a complex, resource-intensive pipeline and suffers from a performance ceiling due to a static reward signal. We propose a novel framework, Unified Reward & Policy Optimization (URPO), that unifies instruction-following ("player") and reward modeling ("referee") within a single model and a single training phase. Our method recasts all alignment data (preference pairs, verifiable reasoning, and open-ended instructions) into a unified generative format optimized by a single Group-Relative Policy Optimization (GRPO) loop. This enables the model to learn from ground-truth preferences and verifiable logic while simultaneously generating its own rewards for open-ended tasks. Experiments on the Qwen2.5-7B model demonstrate URPO's superiority. Our unified model significantly outperforms a strong baseline using a separate generative reward model, boosting the instruction-following score on AlpacaEval from 42.24 to 44.84 and achieving a 36% relative improvement on the challenging AIME reasoning benchmark. Furthermore, URPO cultivates a superior internal evaluator as a byproduct of training, achieving a RewardBench score of 85.15 and surpassing the dedicated reward model it replaces (83.55). By eliminating the need for a separate reward model and fostering a co-evolutionary dynamic between generation and evaluation, URPO presents a simpler, more efficient, and more effective path towards robustly aligned language models.
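As a rough illustration of the group-relative step that a GRPO loop relies on, the sketch below normalizes the rewards of several responses sampled for one prompt against the mean and standard deviation of that group. The function name `group_relative_advantages`, the `eps` stabilizer, and the example reward values are assumptions made for illustration; the abstract does not specify URPO's exact reward formulation.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the sampled group
    # eps (an assumed stabilizer) avoids division by zero for uniform groups.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four responses sampled for a single prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

In the setting the abstract describes, such rewards would come from ground-truth preference labels or verifiable answers where they exist, and from the model's own judgments for open-ended instructions.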

AAAI 2026

URPO: A Unified Reward & Policy Optimization Framework for Large Language Models

unified alignment

reinforcement learning from human feedback (rlhf)

large language model


poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held at Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.


Event cameras have gained increasing popularity in computer vision due to their ultra-high dynamic range and temporal resolution. However, event networks rely heavily on task-specific designs because of the unstructured data distribution and spatial-temporal (S-T) inhomogeneity, making it hard to reuse existing architectures for new tasks. We propose OmniEvent, the first unified event representation learning framework that achieves SOTA performance across diverse tasks, fully removing the need for task-specific designs. Unlike previous methods that treat event data as 3D point clouds with manually tuned S-T scaling weights, OmniEvent proposes a decouple-enhance-fuse paradigm, in which local feature aggregation and enhancement are performed independently in the spatial and temporal domains to avoid inhomogeneity issues. Space-filling curves are applied to enable large receptive fields while improving memory and compute efficiency. The features from the individual domains are then fused by attention to learn S-T interactions. The output of OmniEvent is a grid-shaped tensor, which enables standard vision models to process event data without architectural changes. With a unified framework and similar hyper-parameters, OmniEvent outperforms task-specific SOTA methods by up to 68.2% across 3 representative tasks and 10 datasets (Fig. 1). Code will be released upon acceptance.

OmniEvent: Unified Event Representation Learning

Low-Rank Adaptation (LoRA) has emerged as a powerful parameter-efficient fine-tuning (PEFT) method for adapting large language models to downstream tasks. While recent work integrates mixture-of-experts (MoE) mechanisms with multiple LoRA modules to handle multi-task or complex scenarios, existing approaches face two key limitations: restricted cross-expert knowledge sharing and subsequent expert homogenization. To address these challenges, we propose a novel diversity-regulated asymmetric LoRA decomposition framework for efficient complex-task adaptation, which enables flexible knowledge sharing through asymmetric expert decomposition and guarantees expert diversity through a dual orthogonality regularization. Extensive experiments on eight public benchmarks, spanning both multi-task and single-task settings, demonstrate the superiority of our approach over existing methods.

D2MoRA: Diversity-Regulated Asymmetric MoE-LoRA Decomposition for Efficient Multi-Task Adaptation

Rectification flow Transformers (RFTs) have shown promising performance in diffusion-based image synthesis, but are typically confined to lower-resolution scenarios, limiting their ability to generate high-resolution images. Existing resolution extrapolation approaches often suffer from excessive computational overhead, resulting in prolonged inference times. We propose LookFlow, a training-free high-resolution synthesis framework that accelerates inference while preserving visual quality. Building on pretrained text-to-image RFTs, LookFlow employs a dynamic lookahead guidance flow mechanism to refine high-resolution velocity predictions by leveraging multi-timestep lookahead information extracted from a low-resolution flow. Additionally, reusing temporally similar features across consecutive timesteps drastically reduces computation and significantly decreases inference time overhead. Extensive experiments on COCO demonstrate that LookFlow robustly scales resolutions from $4\times$ to $25\times$, achieving a speedup of up to $2.01\times$ while maintaining competitive visual fidelity.

LookFlow: Training-Free and Efficient High-Resolution Image Synthesis via Dynamic Lookahead Guidance Flow

We present a novel framework for high-fidelity novel view synthesis (NVS) from sparse images, addressing key limitations in recent feed-forward 3D Gaussian Splatting (3DGS) methods built on Vision Transformer (ViT) backbones. While ViT-based pipelines offer strong geometric priors, they are often constrained by low-resolution inputs due to computational costs. Moreover, existing generative enhancement methods tend to be 3D-agnostic, resulting in inconsistent structures across views, especially in unseen regions.
To overcome these challenges, we design a Dual-Domain Detail Perception Module, which enables handling high-resolution images without being limited by the ViT backbone, and endows Gaussians with additional features to store high-frequency details. We develop a feature-guided diffusion network, which can preserve high-frequency details during the restoration process. We introduce a unified training strategy that enables joint optimization of the ViT-based geometric backbone and the diffusion-based refinement module. Experiments demonstrate that our method can maintain superior generation quality across multiple datasets.

One-Shot Refiner: Boosting Feed-forward Novel View Synthesis via One-Step Diffusion

Robust medical image classification under input corruption and bag-level annotation remains a critical challenge in clinical AI applications. We propose QAPNet, a Quantum-Attentive Patchwise Network that integrates quantum neural encoding, additive attention-based instance reweighting, and prototype-contrastive regularization for reliable diagnosis from degraded inputs. Our framework uses a sliding-window strategy to divide each MRI image into overlapping patches, each of which is encoded via an 8-qubit quantum circuit using $RY$-based noise-sensitive layers, yielding expressive low-dimensional representations without classical CNNs. A lightweight additive attention mechanism computes instance-wise importance weights that enable interpretable and noise-aware bag-level aggregation. To enhance robustness, we apply a contrastive loss that aligns clean and noisy embeddings and enforce prototype-guided clustering via class-wise centroids. We evaluate QAPNet across seven benchmark medical imaging datasets under three levels of additive Gaussian noise ($\sigma \in \{5\%, 10\%, 30\%\}$). QAPNet consistently outperforms eight strong baselines, achieving up to $+20.8\%$ higher accuracy on OASIS (with $30\%$ noise) and $+17.7\%$ on PathMNIST, and maintains stable performance ($<4\%$ degradation) in all settings. Ablation studies confirm the critical role of quantum encoding, attention-based aggregation, and prototype contrastive learning. These results suggest that QAPNet offers a scalable and interpretable architecture for real-world noisy medical imaging tasks, bridging quantum representation learning and robust clinical prediction.
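The attention-based bag-level aggregation mentioned above can be sketched in a generic form: each patch embedding receives a score from a small additive (tanh) scoring function, the scores are passed through a softmax, and the bag embedding is the weighted sum of the patches. This is a minimal sketch of standard additive attention pooling, not QAPNet's actual layer; the weights `V` and `w`, the two-dimensional embeddings, and the function name are all hypothetical, and the plain lists stand in for the quantum-encoded patch features.

```python
import math

def additive_attention_pool(instances, V, w):
    """Pool instance embeddings into one bag embedding with additive attention.

    instances: list of embeddings, each a list of d floats
    V: hypothetical hidden projection, h rows of length d
    w: hypothetical scoring vector of length h
    """
    scores = []
    for x in instances:
        hidden = [math.tanh(sum(v * xi for v, xi in zip(row, x))) for row in V]
        scores.append(sum(wk * hk for wk, hk in zip(w, hidden)))
    # Softmax over instances, shifted by the max score for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(instances[0])
    bag = [sum(a * x[j] for a, x in zip(weights, instances)) for j in range(dim)]
    return weights, bag

# Three hypothetical patch embeddings forming one bag.
patches = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
V = [[1.0, -1.0], [0.5, 0.5]]
w = [1.0, 0.2]
weights, bag = additive_attention_pool(patches, V, w)
```

The returned `weights` are what makes this kind of aggregation inspectable: they show how much each patch contributed to the bag-level representation.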

QAPNet: A Quantum-Attentive Patchwise Network for Robust Medical Image Classification Under Noisy Inputs

Large language models (LLMs) exhibit strong generative capabilities and have shown great potential in code generation. Existing chain-of-thought (CoT) prompting methods enhance model reasoning by eliciting intermediate steps, but suffer from two major limitations. First, their uniform application tends to induce overthinking on simple tasks. Second, they lack intention abstraction in code generation, such as explicitly modeling core algorithmic design and efficiency, leading models to focus on surface-level structures while neglecting the global problem objective. Inspired by the cognitive economy principle of engaging structured reasoning only when necessary to conserve cognitive resources, we propose RoutingGen, a novel difficulty-aware routing framework that dynamically adapts prompting strategies for code generation. For simple tasks, it adopts few-shot prompting; for more complex ones, it invokes a structured reasoning strategy, termed Intention Chain-of-Thought (ICoT), which we introduce to guide the model in capturing task intention, such as the core algorithmic logic and its time complexity. Experiments across three models and six standard code generation benchmarks show that RoutingGen achieves state-of-the-art performance in most settings, while reducing total token usage by 46.37% on average across settings. Furthermore, ICoT outperforms six existing prompting baselines on challenging benchmarks.
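The routing idea can be pictured with a small sketch: a difficulty estimate decides whether a task gets a few-shot prompt or a prompt that first asks for the intention (core algorithmic idea and time complexity). Everything concrete here is an assumption for illustration: `estimate_difficulty` is a keyword placeholder rather than the paper's router, and the prompt wording is invented, since the abstract does not give the actual ICoT template.

```python
def estimate_difficulty(task: str) -> str:
    """Placeholder difficulty estimate; the real router is not described here."""
    hard_markers = ("optimal", "minimum", "shortest", "dynamic", "graph")
    return "complex" if any(m in task.lower() for m in hard_markers) else "simple"

def build_prompt(task: str, examples: list[str]) -> str:
    """Route simple tasks to few-shot prompting and complex ones to an ICoT-style prompt."""
    if estimate_difficulty(task) == "simple":
        shots = "\n\n".join(examples)
        return f"{shots}\n\nTask: {task}\nCode:"
    # Hypothetical intention-first template for harder tasks.
    return (
        f"Task: {task}\n"
        "First state the intention: the core algorithmic idea and its "
        "expected time complexity.\n"
        "Then write the code."
    )

examples = ["Task: Return the sum of two numbers.\nCode: def add(a, b): return a + b"]
simple_prompt = build_prompt("Reverse a string.", examples)
complex_prompt = build_prompt("Find the shortest path in a weighted graph.", examples)
```

Keeping the router separate from the prompt builder means the difficulty estimate can be swapped for a learned classifier without touching the templates.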

Intention Chain-of-Thought Prompting with Dynamic Routing for Code Generation

Self-supervised monocular depth estimation methods lose substantial accuracy on dynamic objects because they assume a static scene. Existing approaches for dynamic scenes suffer from two critical shortcomings: 1) reliance on supervised segmentation models (requiring costly annotations) or computationally intensive multi-branch models to isolate moving objects, and 2) simple integration of 2D/3D motion flow without reliable supervision for dynamic objects. We propose AdaDepth, a two-stage framework that jointly performs unsupervised scene decomposition and dynamic-aware depth learning. In the initial structural stage, our geometry-motion joint scene decomposition (GMoDecomp) module robustly generates a depth prior and simultaneously partitions the scene into multiple regions by fusing geometric and motion cues. In the region-adaptive refinement stage, we exploit the depth prior and the decomposed regions to introduce motion-aware and geometry-consistent constraints, effectively improving depth estimation in dynamic scenes. AdaDepth achieves accurate depth prediction in highly dynamic scenes without relying on external labels or specialized segmentation models. Extensive experiments on KITTI, Cityscapes, and Waymo Open demonstrate its superiority over state-of-the-art approaches.

AdaDepth: Exploiting Inherent Scene Information for Self-Supervised Depth Estimation in Dynamic Scenes

Reinforcement Fine-tuning (RFT) methods such as Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO) have demonstrated strong capabilities in aligning Large Language Models (LLMs) with human preferences. However, these approaches often suffer from limited data efficiency, necessitating extensive on-policy rollouts to maintain competitive performance. We propose PSPO (Prompt-Level Prioritization and Experience-Weighted Smoothing for Efficient Policy Optimization), a lightweight yet effective enhancement to GRPO that improves training stability and sample efficiency through two complementary techniques. First, we introduce an experience-weighted reward smoothing mechanism, which uses exponential moving averages to track group-level reward statistics for each prompt. This enables more stable advantage estimation across training steps without storing entire trajectories, allowing the model to capture historical reward trends in a lightweight and memory-efficient manner. Second, we adopt a prompt-level prioritized sampling strategy, an online data selection method inspired by prioritized experience replay. It dynamically emphasizes higher-impact prompts based on their relative advantages, thereby improving data efficiency. Experiments on multiple mathematical reasoning benchmarks and models show that PSPO achieves comparable or better accuracy than GRPO while significantly accelerating convergence and maintaining low computational and memory overhead.
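One way to picture the experience-weighted smoothing is a tracker that keeps an exponential moving average of each prompt's group reward mean and variance, then normalizes new rewards against those smoothed statistics instead of the current group alone. This is a minimal sketch under stated assumptions: the class name `PromptRewardTracker`, the `decay` value of 0.9, the choice to initialize from the first group, and the normalization formula are illustrative and are not taken from the paper.

```python
class PromptRewardTracker:
    """Track exponential moving averages of per-prompt group reward statistics."""

    def __init__(self, decay=0.9):
        self.decay = decay  # assumed smoothing factor
        self.mean = {}
        self.var = {}

    def update(self, prompt_id, rewards):
        """Fold one group's reward mean and variance into the running averages."""
        batch_mean = sum(rewards) / len(rewards)
        batch_var = sum((r - batch_mean) ** 2 for r in rewards) / len(rewards)
        if prompt_id not in self.mean:
            # First visit: initialize from the current group (an assumption).
            self.mean[prompt_id], self.var[prompt_id] = batch_mean, batch_var
        else:
            d = self.decay
            self.mean[prompt_id] = d * self.mean[prompt_id] + (1 - d) * batch_mean
            self.var[prompt_id] = d * self.var[prompt_id] + (1 - d) * batch_var
        return self.mean[prompt_id], self.var[prompt_id]

    def advantages(self, prompt_id, rewards, eps=1e-8):
        """Normalize rewards against the smoothed statistics for this prompt."""
        mu, var = self.mean[prompt_id], self.var[prompt_id]
        return [(r - mu) / (var ** 0.5 + eps) for r in rewards]

tracker = PromptRewardTracker(decay=0.9)
tracker.update("p1", [1.0, 0.0, 1.0, 0.0])  # first visit initializes the averages
smoothed_mean, smoothed_var = tracker.update("p1", [1.0, 1.0, 1.0, 1.0])
adv = tracker.advantages("p1", [1.0, 1.0, 1.0, 1.0])
```

Only two floats are stored per prompt, so no trajectories need to be kept. In this sketch, a group whose rewards are all identical still receives a nonzero advantage, because it is compared against the prompt's reward history rather than against itself.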

PSPO: Prompt-Level Prioritization and Experience-Weighted Smoothing for Efficient Policy Optimization

While neural solvers have shown remarkable performance on Vehicle Routing Problems (VRPs), two key challenges persist. First, it remains difficult to determine which parts of the input graph are most critical for making optimal routing decisions during the decoding stage. Second, current neural models are typically trained on smaller problem instances (50-100 nodes), and their ability to generalize to large-scale scenarios is underexplored. To address these challenges, we introduce a novel U-Net architecture that captures multi-level information, enhancing the decision-making process in the decoder. Building on this, we propose a unified neural solver for a wide range of Vehicle Routing Problems. Our extensive experiments demonstrate the effectiveness of this framework on both small and large-scale problem instances, showcasing its superior performance and generalization capabilities.

Scale-Net: A Hierarchical U-Net Framework for Cross-Scale Generalization in Multi-Task Vehicle Routing

Recent advancements in large language models (LLMs) have greatly improved their ability to perform complex reasoning tasks through Long Chain-of-Thought (CoT). However, this approach often results in substantial redundancy, impairing computational efficiency and causing significant delays in real-time applications. To improve efficiency, current methods often rely on human-defined difficulty priors, which do not align with the LLM's self-perceived difficulty, leading to inefficiencies. In this paper, we introduce the Dynamic Reasoning-Boundary Self-Awareness Framework (DR. SAF), which enables LLMs to dynamically assess and adjust their reasoning depth in response to problem complexity. DR. SAF integrates three key components: Boundary Self-Awareness Alignment, Adaptive Reward Management, and a Boundary Preservation Mechanism. These components allow models to optimize their reasoning processes, balancing efficiency and accuracy without compromising performance. Our experimental results demonstrate that DR. SAF achieves a 49.27% reduction in total response tokens with minimal loss in accuracy. The framework also delivers a 6.59x gain in token efficiency and a 5x reduction in training time, making it well-suited to resource-limited settings. During extreme training, DR. SAF can even surpass traditional instruction-based models in token efficiency with more than 16% accuracy improvement.
