United States

Iterative preference optimization has recently become one of the de facto training paradigms for large language models (LLMs), but its performance remains underwhelming because the loop yields too much noisy preference data. To combat this issue, we present an Uncertainty-enhanced Preference Optimization (UPO) framework that lets the LLM self-evolve with reliable feedback. The key idea is to mitigate the noisy preference pairs derived from the current policy and reward models by performing pair-wise uncertainty estimation and judicious sampling of reliable feedback. To this end, we introduce an estimator model, which incorporates Monte Carlo (MC) dropout in a Bayesian neural network (BNN) to perform uncertainty estimation for the batch of preference pairs. Compared to existing methods that directly filter generated responses based on the reward score, the estimator focuses on model uncertainty in a pair-wise manner and effectively bypasses the confirmation bias problem of the reward model. Additionally, we propose an uncertainty-enhanced self-evolution algorithm that helps the LLM align robustly with these reliable feedback data. Extensive experiments over multiple benchmarks demonstrate that our framework substantially improves the performance of iterative preference optimization.
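To make the idea of pair-wise uncertainty estimation with MC dropout concrete, here is a minimal sketch. It is not the paper's estimator: the toy linear reward model, the function names, the number of passes, and the variance threshold `max_var` are all assumptions chosen for illustration. Each stochastic pass samples one dropout mask, scores both responses under it, and records the probability that the chosen response beats the rejected one. Pairs whose probability varies little across passes are treated as reliable.

```python
import math
import random


def sample_mask(size, drop_prob, rng):
    """Sample one dropout mask: 1.0 keeps a unit, 0.0 drops it."""
    return [0.0 if rng.random() < drop_prob else 1.0 for _ in range(size)]


def masked_reward(features, weights, mask, drop_prob):
    """Toy linear reward under a dropout mask (hypothetical stand-in model)."""
    keep = 1.0 - drop_prob
    # Inverted-dropout scaling keeps the expected reward unchanged.
    return sum(x * w * m for x, w, m in zip(features, weights, mask)) / keep


def pair_uncertainty(chosen, rejected, weights, passes=200, drop_prob=0.2, seed=0):
    """Mean and variance of P(chosen beats rejected) over MC-dropout passes."""
    rng = random.Random(seed)
    probs = []
    for _ in range(passes):
        # One mask per pass, shared by both responses of the pair.
        mask = sample_mask(len(weights), drop_prob, rng)
        margin = (masked_reward(chosen, weights, mask, drop_prob)
                  - masked_reward(rejected, weights, mask, drop_prob))
        probs.append(1.0 / (1.0 + math.exp(-margin)))  # sigmoid of reward margin
    mean = sum(probs) / len(probs)
    var = sum((p - mean) ** 2 for p in probs) / len(probs)
    return mean, var


def select_reliable(pairs, weights, max_var=0.02):
    """Keep only pairs whose preference probability is stable across passes."""
    return [pair for pair in pairs
            if pair_uncertainty(pair[0], pair[1], weights)[1] <= max_var]


weights = [1.0, 1.0, 1.0, 1.0]
clear = ([5.0, 5.0, 5.0, 5.0], [0.0, 0.0, 0.0, 0.0])       # consistent margin
ambiguous = ([3.0, 0.0, 0.0, 0.0], [0.0, 3.0, 0.0, 0.0])   # margin flips with the mask
reliable = select_reliable([clear, ambiguous], weights)
```

In this toy setup the ambiguous pair is filtered out because its preference depends on which units happen to be dropped, whereas a filter on the deterministic reward score alone would see a margin of zero and could not tell instability from a genuine tie.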

AAAI 2025

Self-Evolutionary Large Language Models Through Uncertainty-Enhanced Preference Optimization

snlp

language models

poster

We are pleased to announce the Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25), which will be held in Philadelphia, Pennsylvania at the Pennsylvania Convention Center from February 25 to March 4, 2025.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-25 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.

### [Invited Speakers](https://aaai.org/conference/aaai/aaai-25/aaai-25-invited-speakers/)

Register [here](https://aaai.org/conference/aaai/aaai-25/registration/)




Multi-view tensor clustering (MVTC) has gained much attention for its effectiveness in capturing global high-order correlations across views. However, current MVTC methods suffer from two limitations: 1) adopting a two-stage process to learn the latent features for clustering, and 2) either ignoring local similarities within views or treating local similarities and global high-order correlations equally.
In this paper, we propose a smooth low-rank MVTC (SLR-MVTC) method, which aims to extract latent features that are smooth within each view and low-rank across views, enhancing clustering performance. Specifically, we first learn latent features from each view using orthogonal projection and then construct the latent feature tensor by concatenation and rotation. Then, we introduce a new smooth tensor nuclear norm to depict the low-rank components of the low-frequency parts in the feature tensor. Benefiting from the fast Fourier transform along the sample dimension, the obtained low-frequency components effectively capture local smoothness within views, while their low-rank parts further explore global correlations across views. Experimental results on six multi-view datasets demonstrate that SLR-MVTC outperforms state-of-the-art algorithms in terms of clustering performance and CPU time.

SLR-MVTC: Smooth Low-Rank Multi-View Tensor Clustering

The diffusion model has lately been shown to achieve remarkable performance through its ability to generate high-quality images. However, current diffusion model studies consider only the static learning of a data distribution from a single data source, which results in catastrophic forgetting when attempting to learn new data. In this paper, we explore a more realistic learning scenario where the training data is continuously acquired and the model is trained on it successively. We first define a dynamic diffusion model under the challenging Online Task-Free Continual Learning (OTFCL) paradigm, and then propose the Dynamic Expansion Diffusion Model (DEDM) to address catastrophic forgetting and data distribution shifts under OTFCL. We propose to add new diffusion components to a mixture of diffusion models following the evaluation of a criterion that compares the probabilistic representation of the new data with the DEDM's existing knowledge. In addition, to maintain an optimal architecture, we propose a component-discarding approach that ensures knowledge diversity while minimizing the total number of parameters in the DEDM. Furthermore, we show how the proposed DEDM can be implemented as a teacher module in a unified framework for representation learning, in which a knowledge distillation approach is proposed for training a student module that compresses the teacher's knowledge into a latent space. Our model can be trained in a completely unsupervised fashion while ensuring continual data generation and representation learning.

Dynamic Expansion Diffusion Learning For Lifelong Generative Modelling

Point cloud data labeling is considered a time-consuming and expensive task in autonomous driving, whereas unsupervised learning can avoid it by learning point cloud representations from unannotated data. In this paper, we propose UOV, a novel 3D unsupervised framework assisted by 2D open-vocabulary segmentation models. It consists of two stages: In the first stage, we innovatively integrate high-quality textual and image features of 2D open-vocabulary models and propose the Tri-Modal contrastive Pre-training (TMP). In the second stage, spatial mapping between point clouds and images is utilized to generate pseudo-labels, enabling cross-modal knowledge distillation. Besides, we introduce the Approximate Flat Interaction (AFI) to address the noise during alignment and label confusion. To validate the superiority of UOV, extensive experiments are conducted on multiple related datasets. We achieved a record-breaking 47.73% mIoU on the annotation-free 3D segmentation task in nuScenes, surpassing the previous best model by 3.13% mIoU. Meanwhile, the performance of fine-tuning with 1% data on nuScenes and SemanticKITTI reached a remarkable 51.75% mIoU and 48.14% mIoU, outperforming all previous pre-trained models.

3D Annotation-Free Learning by Distilling 2D Open-Vocabulary Segmentation Models for Autonomous Driving

Diffusion models have exhibited impressive prowess in the text-to-image task. Recent methods add image-level structure controls, e.g., edge and depth maps, to manipulate the generation process together with text prompts to obtain desired images. This controlling process operates globally on the entire image, which limits the flexibility of control regions. In this paper, we explore a novel and practical task setting: local control. It focuses on controlling a specific local region according to user-defined image conditions, while the remaining regions are conditioned only by the original text prompt. However, it is non-trivial to achieve this goal. The naive approach of directly adding local conditions may lead to the local control dominance problem, which forces the model to focus on the controlled region and neglect object generation in other regions. To mitigate this problem, we propose Regional Discriminate Loss to update the noised latents, aiming at enhanced object generation in non-control regions. Furthermore, the proposed Focused Token Response suppresses weaker attention scores which lack the strongest response to enhance object distinction and reduce duplication. Lastly, we adopt Feature Mask Constraint to reduce quality degradation in images caused by information differences across the local control region. All proposed strategies operate at the inference stage. Extensive experiments demonstrate that our method can synthesize high-quality images aligned with the text prompt under local control conditions.

Local Conditional Controlling for Text-to-Image Diffusion Models

Refusal-Aware Instruction Tuning (RAIT) enables Large Language Models (LLMs) to refuse to answer unknown questions. By modifying responses to unknown questions in the training data into refusal responses such as "I don't know", RAIT enhances the reliability of LLMs and reduces their hallucination.
Generally, RAIT modifies training samples based on the correctness of the initial LLM's response. However, this crude approach can cause LLMs to excessively refuse questions they could have answered correctly, a problem we call over-refusal.
In this paper, we explore two primary causes of over-refusal:
Static conflict emerges when RAIT data is constructed solely on correctness criteria, causing similar samples in the LLM's feature space to be assigned different labels (original vs. modified "I don't know").
Dynamic conflict occurs due to changes in the LLM's knowledge state during fine-tuning, which turn previously unknown questions into known ones, while the training data fixed by the initial LLM remains unchanged, resulting in conflicts.
These conflicts cause the trained LLM to misclassify known questions as unknown, resulting in over-refusal.
To address this issue, we introduce Certainty Represented Knowledge Flow for Refusal-Aware Instructions Construction (CRaFT). CRaFT centers on two main contributions: First, we additionally incorporate response certainty to selectively filter and modify data, reducing static conflicts. Second, we implement preliminary rehearsal training to characterize changes in the LLM's knowledge state, which helps mitigate dynamic conflicts during the fine-tuning process.
We conducted extensive experiments on open-ended question answering and multiple-choice question tasks. Experimental results show that CRaFT can improve the LLM's reliability during the RAIT process. Source code and training data will be released on GitHub.
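The difference between correctness-only data construction and a certainty-aware variant can be sketched as follows. This is an illustration under stated assumptions, not the CRaFT procedure: the abstract does not give the selection rule, so the thresholds `tau` and `min_certainty`, the definition of certainty as the share of the most frequent sampled answer, and the choice to drop mixed cases are all hypothetical.

```python
from collections import Counter

IDK = "I don't know"


def correctness_and_certainty(samples, gold):
    """Score answers sampled from the initial LLM for a single question."""
    n = len(samples)
    correctness = sum(answer == gold for answer in samples) / n
    # Assumed certainty measure: share of the most frequent sampled answer.
    certainty = Counter(samples).most_common(1)[0][1] / n
    return correctness, certainty


def vanilla_rait(dataset, tau=0.5):
    """Correctness-only rule: refuse whenever correctness falls below tau."""
    out = []
    for question, gold, samples in dataset:
        correctness, _ = correctness_and_certainty(samples, gold)
        out.append((question, gold if correctness >= tau else IDK))
    return out


def certainty_aware_rait(dataset, tau=0.5, min_certainty=0.6):
    """Hypothetical rule: keep confident-correct, refuse unsure-wrong, drop the rest."""
    out = []
    for question, gold, samples in dataset:
        correctness, certainty = correctness_and_certainty(samples, gold)
        if correctness >= tau and certainty >= min_certainty:
            out.append((question, gold))   # treated as known
        elif correctness < tau and certainty < min_certainty:
            out.append((question, IDK))    # treated as unknown
        # Mixed cases are filtered out rather than forced into either label.
    return out


dataset = [
    ("q1", "Paris", ["Paris"] * 5),
    ("q2", "1889", ["1887", "1889", "1901", "1875", "1890"]),
    ("q3", "7", ["7", "7", "7", "8", "9", "10"]),
]
vanilla = vanilla_rait(dataset)
filtered = certainty_aware_rait(dataset)
```

Here `q3` sits at the correctness threshold with scattered answers. The correctness-only rule gives it a hard label, while the certainty-aware rule leaves it out, which is one way borderline samples with near-identical features could avoid receiving conflicting labels.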

Utilize the Flow Before Stepping into the Same River Twice: Certainty Represented Knowledge Flow for Refusal-Aware Instruction Tuning

Understanding bimanual hand-object interaction plays an important role in robotics and virtual reality. Recently, great progress has been made on this problem. However, due to significant occlusions between hands and objects as well as the high degree-of-freedom motions, it is still challenging to collect and annotate a high-quality, large-scale dataset, which prevents further improvement on tasks related to bimanual hand-object interaction. In this work, we propose a new 3D Gaussian Splatting (3DGS)-based data augmentation framework for bimanual hand-object interaction, which is capable of augmenting existing datasets into large-scale photorealistic data with various hand-object poses and viewpoints. First, we use mesh-based 3DGS to model objects and hands, and we design a super-resolution module to deal with the rendering blur caused by multi-resolution input images. Second, we extend the single-hand grasping pose optimization module to the bimanual setting to generate various poses of bimanual hand-object interaction, which can significantly expand the pose distribution of the dataset. Third, we analyze the impact of different aspects of the proposed data augmentation on the understanding of bimanual hand-object interaction. We perform our data augmentation on two benchmarks, H2O and Arctic, and verify that our method can improve the performance of the baselines. We will release our code and dataset upon acceptance.

HOGSA: Bimanual Hand-Object Interaction Understanding with 3D Gaussian Splatting Based Data Augmentation

Simulating fuel sloshing within aircraft tanks during flight is crucial for aircraft safety research. Traditional methods based on Navier-Stokes equations are computationally expensive. In this paper, we treat fluid motion as point cloud transformation and propose the first neural network method specifically designed for simulating fuel sloshing in aircraft. This model is also the first deep learning model capable of stably modeling fluid particle dynamics in such complex scenarios. Our triangle feature fusion design achieves an optimal balance among fluid dynamics modeling, momentum conservation constraints, and global stability control. Additionally, we constructed the Fueltank dataset, the first dataset for aircraft fuel surface sloshing. It comprises 320,000 frames across four typical tank types and covers a wide range of flight maneuvers, including multi-directional rotations. We conducted comprehensive experiments on both our dataset and the take-off scenario of the aircraft. Compared to existing neural network-based fluid simulation algorithms, our method significantly improves accuracy while maintaining high computational speed. Compared to traditional SPH methods, it is approximately 10 times faster. Furthermore, compared to traditional fluid simulation software such as Flow3D, it is more than 300 times faster.

A Pioneering Neural Network Method for Efficient and Robust Fuel Sloshing Simulation in Aircraft

Machine unlearning without access to the real data distribution is challenging. The existing method based on data-free distillation achieves unlearning by filtering out synthetic samples containing forgetting information, but struggles to efficiently distill the retaining-related knowledge. In this work, we show that this problem stems from over-filtering, which reduces the amount of synthesized retaining-related information. We propose a novel method, Inhibited Synthetic PostFilter (ISPF), to tackle this challenge from two perspectives: 1) Inhibited Synthetic, by reducing the synthesized forgetting information, and 2) PostFilter, by fully utilizing the retaining-related information in synthesized samples. Experimental results demonstrate that the proposed ISPF effectively tackles the challenge and outperforms existing methods.

Toward Efficient Data-Free Unlearning

Formal verification provides critical security assurances for neural networks, yet its practical application suffers from long verification times. This work introduces a novel method for training verification-friendly neural networks, which are robust, easy to verify, and relatively accurate. Our method integrates neuron behavior consistency into the training process, making neuron activation states consistent across different inputs in a nearby domain, which reduces the number of unstable neurons and tightens the bounds of neurons, thereby enhancing neural network verifiability. We evaluated our method using the MNIST, Fashion-MNIST, and CIFAR-10 datasets across various network architectures. The experimental results demonstrate that networks trained using our method are verification-friendly across different radii and different model architectures, whereas other tools fail to maintain verifiability as the radius increases. We also show that our method can be combined with existing methods to further improve the verifiability of networks.
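As a small illustration of what an unstable neuron is, the sketch below bounds the pre-activations of one linear layer over an input box of a given radius and counts the ReLU neurons whose sign is not fixed. It uses plain interval arithmetic as a stand-in bounding technique and is only a diagnostic; it is not the training method described above, and the layer weights and function names are made up for the example.

```python
def interval_affine(lower, upper, weights, bias):
    """Bound y = W x + b for every x in the box [lower, upper]."""
    out_lo, out_hi = [], []
    for row, b in zip(weights, bias):
        lo = hi = b
        for w, l, u in zip(row, lower, upper):
            if w >= 0:
                lo += w * l
                hi += w * u
            else:
                # A negative weight swaps which end of the interval is extreme.
                lo += w * u
                hi += w * l
        out_lo.append(lo)
        out_hi.append(hi)
    return out_lo, out_hi


def count_unstable(x, radius, weights, bias):
    """Count ReLU neurons whose pre-activation bounds straddle zero."""
    lower = [v - radius for v in x]
    upper = [v + radius for v in x]
    lo, hi = interval_affine(lower, upper, weights, bias)
    return sum(1 for l, h in zip(lo, hi) if l < 0.0 < h)


# Hypothetical 2-input, 3-neuron layer.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [0.0, 0.0, 0.0]
x = [1.0, -1.0]

small_radius = count_unstable(x, 0.5, W, b)
large_radius = count_unstable(x, 2.0, W, b)
```

With the small radius only the third neuron can be either active or inactive within the box. With the larger radius all three can, which matches the observation that verification gets harder as the radius grows.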

Training Verification-Friendly Neural Networks via Neuron Behavior Consistency

Audio-visual semantic segmentation (AVSS), which aims to segment the video objects producing specific sounds in the corresponding audio, has garnered significant interest in the multi-modal domain. Despite notable progress, existing methods struggle with new classes not present in the original training set. To address this issue, we introduce few-shot incremental learning (FSIL) to the AVSS task, whose goal is to seamlessly integrate new classes with limited incremental samples while preserving existing knowledge of old classes. Two challenges emerge in this new setting: (1) To reduce labeling expenses, old classes within the incremental samples are treated as background, the same as silent objects. Training the model directly with background annotations may exacerbate the loss of distinctive knowledge of old classes, such as outlines and sounds. (2) Most existing models adopt early cross-modal fusion with a single-tower design. This brings more class characteristics into class representations, impeding similarity-based knowledge transfer between classes. To address the above issues, we propose a Few-shot Incremental learning framework via class-centric foregrouNd aggreGation and dual-tower knowlEdge tRansfer (FINGER) for the AVSS task, which comprises two targeted modules: (1) The class-centric foreground aggregation assembles class-specific features for each foreground class and disregards background features. The background class is thus excluded during training and inferred based on the foreground predictions. (2) The dual-tower knowledge transfer delays the cross-modal fusion to conduct knowledge transfer separately for each modality. Extensive experiments demonstrate the effectiveness of the FINGER model, which significantly surpasses state-of-the-art methods. The code is available (https://anonymous.4open.science/r/FINGER).
