Singapore

Future superhuman models will surpass the ability of humans and humans will only be able to \textit{weakly} supervise superhuman models.
To alleviate the issue of lacking high-quality data for model alignment, some works on weak-to-strong generalization (W2SG) finetune a strong pretrained model with a weak supervisor so that it can generalize beyond weak supervision.
However, the invariable use of weak supervision in existing methods exposes issues in robustness, with a proportion of weak labels proving harmful to models.
In this paper, we propose a selective W2SG framework to avoid using weak supervision when unnecessary.
We train a binary classifier P(IK) to identify questions that a strong model can answer and use its self-generated labels for alignment.
We further refine weak labels with a graph smoothing method.
Extensive experiments on three benchmarks show that our method consistently outperforms competitive baselines.
Further analyses show that P(IK) can generalize across tasks and difficulties, which indicates selective W2SG can help superalignment.

AAAI 2026

Selective Weak-to-Strong Generalization

superintelligence

scalable oversight

value alignment

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.<br><br>

To access this event page, you need to log in with the **email address you registered with**. <br>Access credentials will be sent to your email from Underline -  subject line "Welcome to AAAI 2026". Please be sure to check your spam email folder if you do not see an email confirmation right away.

Please log in

To access this event page, you are required to register.
Please complete your registration to continue.

We recommend reading [**the registration information**](https://aaai.org/conference/aaai/aaai-26/registration/) first.

**Online Registration Form**: https://aaai.getregistered.net/conference-2026 

Registration Required

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

While deep generative models have significantly advanced representation learning, they may inherit or amplify biases and fairness issues by encoding sensitive attributes alongside predictive features. Enforcing strict independence in disentanglement is often unrealistic when target and sensitive factors are naturally correlated. To address this challenge, we propose CAD-VAE(Correlation-Aware Disentangled VAE), which introduces a correlated latent code to capture the information shared between the target and sensitive attributes. Given this correlated latent, our method effectively separates overlapping factors without extra domain knowledge by directly minimizing the conditional mutual information between target and sensitive codes. A relevance-driven optimization strategy refines the correlated code by efficiently capturing essential correlated features and eliminating redundancy. Extensive experiments on benchmark datasets demonstrate that CAD-VAE produces fairer representations, realistic counterfactuals, and improved fairness-aware image editing.

CAD-VAE: Leveraging Correlation-Aware Latents for Comprehensive Fair Disentanglement

Contrastive pre-trained vision-language models, such as CLIP, demonstrate strong generalization abilities in zero-shot classification by leveraging embeddings extracted from image and text encoders. This paper aims to robustly fine-tune these vision-language models on in-distribution (ID) data without compromising their generalization abilities in out-of-distribution (OOD) and zero-shot settings. Current robust fine-tuning methods tackle this challenge by reusing contrastive learning, which was used in pre-training, for fine-tuning. However, we found that these methods distort the geometric structure of the embeddings, which plays a crucial role in the generalization of vision-language models, resulting in limited OOD and zero-shot performance. To address this, we propose Difference Vector Equalization (DiVE), which preserves the geometric structure during fine-tuning. The idea behind DiVE is to constrain difference vectors, each of which is obtained by subtracting the embeddings extracted from the pre-trained and fine-tuning models for the same data sample. By constraining the difference vectors to be equal across various data samples, we effectively preserve the geometric structure.
Therefore, we introduce two losses: average vector loss (AVL) and pairwise vector loss (PVL). AVL preserves the geometric structure globally by constraining difference vectors to be equal to their weighted average. PVL preserves the geometric structure locally by ensuring a consistent multimodal alignment. Our experiments demonstrate that DiVE effectively preserves the geometric structure, achieving strong results across ID, OOD, and zero-shot metrics.

Difference Vector Equalization for Robust Fine-tuning of Vision-Language Models

The proliferation of large language models (LLMs) has significantly advanced intelligent systems. Unfortunately, LLMs often face knowledge conflicts between internal memory and retrieved external information, arising from misinformation, biases, or outdated knowledge. These conflicts undermine response reliability and introduce uncertainty in decision-making. In this work, we analyze how LLMs navigate knowledge conflicts from an information-theoretic perspective and reveal that when conflicting and supplementary information exhibit significant differences, LLMs confidently resolve their preferences and alleviate the uncertainty during their response generation. When this difference is ambiguous, LLMs experience considerable uncertainty about their generation. Based on this insight, we propose Swin-VIB, a novel framework that integrates a pipeline of variational information bottleneck models to adapt the retrieved information difference, facilitating robust response generation of LLMs even in conflicting contexts. Extensive experiments confirm our theoretical analysis and demonstrate the performance of Swin-VIB. Notably, Swin-VIB outperforms all competitive baselines in terms of the accuracy of the multiple-choice task, while improving the EM values in the open-ended QA task by at least 11.14%.

Accommodate Knowledge Conflicts in Retrieval-augmented LLMs: Towards Robust Response Generation in the Wild

This paper presents a novel strategy for a multi-agent pursuit-evasion game involving multiple faster pursuers with heterogenous speeds and a single slower evader. We define a geometric region, the evader's safe-reachable set, as the intersection of Apollonius circles derived from each pursuer-evader pair. The capture strategy is formulated as a zero-sum differential game where the pursuers cooperatively minimize the area of this set, while the evader seeks to maximize it, effectively playing a game of spatial containment. By deriving the analytical gradients of the safe-reachable set's area with respect to agent positions, we obtain closed-form, instantaneous optimal control laws for the heading of each agent. These strategies are computationally efficient, allowing for real-time implementation. Simulations demonstrate that the gradient-based controls effectively steer the pursuers to systematically shrink the evader’s safe region, leading to guaranteed capture. This area-minimization approach provides a clear geometric objective for cooperative capture.

Area-Optimal Control Strategies for Heterogeneous Multi-Agent Pursuit

The recent introduction of 3D Gaussian Splatting (3DGS) has significantly advanced novel view synthesis. Several studies have further improved the rendering quality of 3DGS, yet they still exhibit noticeable visual discrepancies when synthesizing views at sampling rates unseen during training. Specifically, they suffer from (i) erosion-induced blurring artifacts when zooming in and (ii) dilation-induced staircase artifacts when zooming out. We speculate that these artifacts arise from the fundamental limitation of the alpha blending adopted in 3DGS methods. Instead of the conventional alpha blending that computes alpha and transmittance as scalar quantities over a pixel, we propose to replace it with our novel Gaussian Blending that treats alpha and transmittance as spatially varying distributions. Thus, transmittances can be updated considering the spatial distribution of alpha values across the pixel area, allowing nearby background splats to contribute to final rendering. Our Gaussian Blending is applicable to any 3DGS variants, and maintain real-time rendering speed with no extra memory cost. Extensive experiments demonstrate that Gaussian Blending effectively captures fine details at various sampling rates unseen during training, consistently outperforming existing novel view synthesis models across both unseen and seen sampling rates. We will publicly release our code upon acceptance.

Gaussian Blending: Rethinking Alpha Blending in 3D Gaussian Splatting

Sound effect editing—modifying audio by adding, removing, or replacing elements—remains constrained by existing approaches that rely solely on low-level signal processing or coarse text prompts, often resulting in limited flexibility and suboptimal audio quality. To address this, we propose AV-Edit, a generative sound effect editing framework that enables fine-grained editing of existing audio tracks in videos by jointly leveraging visual, audio, and text semantics. Specifically, the proposed method employs a specially designed contrastive audio-visual masking autoencoder (CAV-MAE-Edit) for multimodal pre-training, learning aligned cross-modal representations. These representations are then used to train an editorial Multimodal Diffusion Transformer (MMDiT) capable of removing visually irrelevant sounds and generating missing audio elements consistent with video content through a correlation-based training strategy. Furthermore, we construct a dedicated video-based sound editing dataset as an evaluation benchmark. Experiments demonstrate that the proposed AV-Edit generates high-quality audio with precise modifications based on visual content and achieves state-of-the-art performance in the field of sound effect editing and audio generation.

AV-Edit: Multimodal Generative Sound Effect Editing via Audio-Visual Semantic Joint Control

3D open-vocabulary scene understanding, which accurately perceives complex semantic properties of objects in space, has gained significant attention in recent years. In this paper, we propose GAGS, a framework that distills 2D CLIP features into 3D Gaussian splatting, enabling open-vocabulary queries for renderings on arbitrary viewpoints. The main challenge of distilling 2D features for 3D fields lies in the multiview inconsistency of extracted 2D features, which provides unstable supervision for the 3D feature field. GAGS addresses this challenge with two novel strategies. First, GAGS associates the prompt point density of SAM with the camera distances, which significantly improves the multiview consistency of segmentation results. Second, GAGS further decodes a granularity factor to guide the distillation process and this granularity factor can be learned in a unsupervised manner to only select the multiview consistent 2D features in the distillation process. Experimental results on two datasets show that GAGS improves visual grounding accuracy by an average of 10.5\% and semantic segmentation accuracy by an average of 7.4\%, with an inference speed 2× faster than baseline methods.
Please refer to the supplementary material for more visualizations and source code.

GAGS: Granularity-Aware Feature Distillation for Language Gaussian Splatting

Autonomous driving holds transformative potential but remains fundamentally constrained by the limited perception and isolated decision-making with standalone intelligence. While recent multi-agent approaches introduce cooperation, they often focus merely on perception-level tasks, overlooking the alignment with downstream planning and control, or fall short in leveraging the full capacity of the recent emerging end-to-end autonomous driving. In this paper, we present UniMM-V2X, a novel end-to-end multi-agent framework that enables hierarchical cooperation across perception, prediction, and planning. At the core of our framework is a multi-level fusion strategy that unifies perception and prediction cooperation, allowing agents to share queries and reason cooperatively for consistent and safe decision-making. To adapt to diverse downstream tasks and further enhance the quality of multi-level fusion, we incorporate a Mixture-of-Experts (MoE) architecture to dynamically enhance the BEV representations. We further extend MoE into the decoder to better capture diverse motion patterns. Extensive experiments on the DAIR-V2X dataset demonstrate our approach achieves state-of-the-art (SOTA) performance with a 39.7\% improvement in perception accuracy, a 7.2\% reduction in prediction error, and a 33.2\% improvement in planning performance compared with UniV2X, showcasing the strength of our MoE-enhanced multi-level cooperative paradigm.

UniMM-V2X: MoE-Enhanced Multi-Level Fusion for End-to-End Cooperative Autonomous Driving

The field of video diffusion generation faces critical bottlenecks in sampling efficiency, especially for large-scale models and long sequences. 
Existing video acceleration methods adopt image-based techniques but suffer from fundamental limitations: they neither model the temporal coherence of video frames nor provide single-step distillation for large-scale video models. 
To bridge this gap, we propose POSE (Phased One-Step Equilibrium), a distillation framework that reduces the sampling steps of large-scale video diffusion models, enabling the generation of high-quality videos in a single step.
POSE employs a carefully designed two-phase process to distill video models:
(i) stability priming: a warm-up mechanism to stabilize adversarial distillation that adapts the high-quality trajectory of the one-step generator from high to low signal-to-noise ratio regimes, optimizing the video quality of single-step mappings near the endpoints of flow trajectories.
(ii) unified adversarial equilibrium: a flexible self-adversarial distillation mechanism that promotes stable single-step adversarial training towards a Nash equilibrium within the Gaussian noise space, generating realistic single-step videos close to real videos.
For conditional video generation, we propose (iii) conditional adversarial consistency, a method to improve both semantic consistency and frame consistency between conditional frames and generated frames. 
Comprehensive experiments demonstrate that POSE outperforms other acceleration methods on VBench-I2V by average 7.15\% in semantic alignment, temporal conference and frame quality, reducing the latency of the pre-trained model by 100$\times$, from 1000 seconds to 10 seconds, while maintaining competitive performance.

Phased One-Step Adversarial Equilibrium for Video Diffusion Models

The automation of diagram generation has gained significant attention in recent years.In previous work on automated diagram generation, most studies focused on generating diagrams from natural language descriptions. Many of these methods produced diagrams that were not user-friendly for editing, lacking drag-and-drop functionality. This paper proposes a novel task of generating editable, high-fidelity diagrams from either text descriptions or raster images (non-editable images), and it may be the first to introduce the tasks of diagram restoration and style transfer in this context. To address these tasks, we created the Diagram-mxGraph dataset, covering three main tasks: diagram restoration, text-to-diagram generation, and diagram style transfer. We introduced two key innovations: Fine-grained Adaptive Background Suppression (FABS) and Component-Aware Adaptive Loss (CAAL). By leveraging the capabilities of pre-trained Vision Transformers (ViTs) and the Diagram Adapter module, we effectively align the features of diagrams with a Large Language Model (LLM) to generate mxGraph format textual descriptions, ultimately producing editable diagrams.

Downloads

Next from AAAI 2026

CAD-VAE: Leveraging Correlation-Aware Latents for Comprehensive Fair Disentanglement

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES