
Agents significantly enhance the capabilities of standalone Large Language Models (LLMs) by perceiving environments, making decisions, and executing actions. However, LLM agents still face challenges in tasks that require multiple decision-making steps. Estimating the value of actions in specific tasks is difficult when intermediate actions are neither appropriately rewarded nor penalized. In this paper, we propose leveraging a task-relevant Q-value model to guide action selection. Specifically, we first collect decision-making trajectories annotated with step-level Q values via Monte Carlo Tree Search (MCTS) and construct preference data. We then use another LLM to fit these preferences through step-level Direct Preference Optimization (DPO), which serves as the Q-value model. During inference, at each decision-making step, LLM agents select the action with the highest Q value before interacting with the environment. We apply our method to various open-source and API-based LLM agents, demonstrating that Q-value models significantly improve their performance. Notably, the performance of the agent built with Phi-3-mini-4k-instruct improved by 103% on WebShop and 75% on HotPotQA when enhanced with Q-value models, even surpassing GPT-4o-mini. Additionally, Q-value models offer several advantages, such as generalization to different LLM agents and seamless integration with existing prompting strategies.
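The inference-time step described in this abstract (score each candidate action, then act on the highest-scoring one) can be sketched as follows. This is only an illustration, not the authors' implementation: `select_action` and `toy_q_value` are hypothetical names, and the toy scorer is a stand-in for a trained Q-value model.

```python
from typing import Callable, Sequence


def select_action(
    state: str,
    candidates: Sequence[str],
    q_value: Callable[[str, str], float],
) -> str:
    """Return the candidate action with the highest estimated Q value."""
    if not candidates:
        raise ValueError("at least one candidate action is required")
    return max(candidates, key=lambda action: q_value(state, action))


def toy_q_value(state: str, action: str) -> float:
    """Toy stand-in for a learned Q-value model (assumption for illustration).

    It counts how many words of the action also appear in the task description.
    """
    task_words = set(state.lower().split())
    return float(sum(word in task_words for word in action.lower().split()))


state = "buy red running shoes"
candidates = ["search blue sandals", "search red running shoes", "click back"]
best = select_action(state, candidates, toy_q_value)
print(best)
```

In a real agent, the candidates would be sampled from the LLM at each step and the scorer would be the fitted Q-value model; only the argmax selection is shown here.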

AAAI 2025

Enhancing Decision-Making for LLM Agents via Step-Level Q-Value Models

snlp

language models

poster

We are pleased to announce the Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25), which will be held in Philadelphia, Pennsylvania at the Pennsylvania Convention Center from February 25 to March 4, 2025.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-25 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.

### [Invited Speakers](https://aaai.org/conference/aaai/aaai-25/aaai-25-invited-speakers/)

Register [here](https://aaai.org/conference/aaai/aaai-25/registration/)

Automatic Speech Recognition (ASR) transcripts exhibit recognition errors and various spoken language phenomena such as disfluencies, ungrammatical sentences, and incomplete sentences, and hence suffer from poor readability. To improve readability, we propose a Contextualized Spoken-to-Written conversion (CoS2W) task that addresses ASR and grammar errors and also transfers the informal text into a formal style with content preserved, utilizing contexts and auxiliary information. This task naturally matches the in-context learning capabilities of Large Language Models (LLMs). To facilitate comprehensive comparisons of various LLMs, we construct a document-level Spoken-to-Written conversion of ASR Transcripts Benchmark (SWAB) dataset. Using SWAB, we study the impact of different granularity levels on CoS2W performance, and propose methods to exploit contexts and auxiliary information to enhance the outputs. Experimental results reveal that LLMs have the potential to excel in the CoS2W task, particularly in grammaticality and formality, and that our methods enable LLMs to effectively understand contexts and auxiliary information. We further investigate the effectiveness of using LLMs as evaluators and find that LLM evaluators show strong correlations with human evaluations on rankings of faithfulness and formality, which validates the reliability of LLM evaluators for the CoS2W task.

Recording for Eyes, Not Echoing to Ears: Contextualized Spoken-to-Written Conversion of ASR Transcripts

We initiate the study of computing envy-free allocations of indivisible items in the extension setting, i.e., when some part of the allocation is fixed and the task is to allocate the remaining items. In view of the NP-hardness of the problem, we investigate whether - and under which conditions - one can obtain fixed-parameter algorithms for computing a solution in settings where most of the allocation is already fixed. Our results provide a broad complexity-theoretic classification of the problem which includes: (a) fixed-parameter algorithms tailored to settings with few distinct types of agents or items; (b) lower bounds which exclude the generalization of these positive results to more general settings. We conclude by showing that - unlike when computing allocations from scratch - the non-algorithmic question of whether more relaxed EF1 or EFX allocations exist can be completely resolved in the extension setting.
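To make the extension setting concrete, here is a small brute-force sketch: part of the allocation is fixed, and the remaining items are assigned so that the completed allocation is envy-free. It assumes additive valuations, which the abstract does not state, and the exhaustive search is only for illustration (it is exponential in the number of remaining items and is not one of the paper's algorithms). All function names are hypothetical.

```python
from itertools import product
from typing import Dict, List, Optional, Set

Valuation = Dict[str, Dict[str, int]]  # agent -> item -> value (additive, assumed)
Allocation = Dict[str, Set[str]]       # agent -> bundle of items


def bundle_value(valuation: Valuation, agent: str, bundle: Set[str]) -> int:
    return sum(valuation[agent][item] for item in bundle)


def is_envy_free(valuation: Valuation, allocation: Allocation) -> bool:
    """No agent values another agent's bundle strictly more than its own."""
    return all(
        bundle_value(valuation, a, allocation[a])
        >= bundle_value(valuation, a, allocation[b])
        for a in allocation
        for b in allocation
    )


def is_ef1(valuation: Valuation, allocation: Allocation) -> bool:
    """Any envy disappears after removing some single item from the envied bundle."""
    for a in allocation:
        own = bundle_value(valuation, a, allocation[a])
        for b in allocation:
            other = bundle_value(valuation, a, allocation[b])
            if a == b or own >= other:
                continue
            if not any(own >= other - valuation[a][i] for i in allocation[b]):
                return False
    return True


def extend_envy_free(
    valuation: Valuation, fixed: Allocation, remaining: List[str]
) -> Optional[Allocation]:
    """Try every assignment of the remaining items on top of the fixed part."""
    agents = sorted(fixed)
    for owners in product(agents, repeat=len(remaining)):
        allocation = {a: set(fixed[a]) for a in agents}
        for item, owner in zip(remaining, owners):
            allocation[owner].add(item)
        if is_envy_free(valuation, allocation):
            return allocation
    return None


valuation = {"A": {"x": 2, "y": 1, "z": 3}, "B": {"x": 1, "y": 3, "z": 1}}
fixed = {"A": {"x"}, "B": set()}  # item x is already assigned to agent A
result = extend_envy_free(valuation, fixed, ["y", "z"])
print(result)
```

In this toy instance the only envy-free completion gives `z` to agent A and `y` to agent B.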

The Complexity of Extending Fair Allocations of Indivisible Goods

Multi-objective Bayesian optimization (MOBO) has shown promising performance on various expensive multi-objective optimization problems (EMOPs). However, effectively modeling complex distributions of the Pareto optimal solutions is difficult with limited function evaluations. Existing Pareto set learning algorithms may exhibit considerable instability in such expensive scenarios, leading to significant deviations between the obtained solution set and the Pareto set (PS). In this paper, we propose a novel Composite Diffusion Model based Pareto Set Learning algorithm, namely CDM-PSL, for expensive MOBO. CDM-PSL includes both unconditional and conditional diffusion models for generating high-quality samples efficiently. In addition, we introduce a weighting method based on information entropy to balance different objectives. This method is integrated with the guiding strategy, ensuring that all the objectives are appropriately balanced and given due consideration during the optimization process. Extensive experimental results on both synthetic benchmarks and real-world problems demonstrate that CDM-PSL attains superior performance compared with various state-of-the-art MOBO algorithms.
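For readers unfamiliar with the term, the Pareto set consists of the solutions whose objective vectors are not dominated by any other solution. The sketch below filters a handful of already-evaluated candidates under the assumption that every objective is minimized; it illustrates the definition only and is unrelated to the diffusion-based generation in CDM-PSL. The function names are hypothetical.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if `a` is no worse than `b` everywhere and strictly better somewhere.

    Assumes all objectives are to be minimized.
    """
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_indices(objectives: Sequence[Sequence[float]]) -> List[int]:
    """Indices of the candidates that no other candidate dominates."""
    return [
        i
        for i, p in enumerate(objectives)
        if not any(dominates(q, p) for j, q in enumerate(objectives) if j != i)
    ]


# Objective vectors (f1, f2) of five evaluated candidates.
evaluated = [(1.0, 5.0), (2.0, 2.0), (3.0, 3.0), (5.0, 1.0), (4.0, 4.0)]
front = pareto_indices(evaluated)
print(front)
```

Here candidates 2 and 4 are dominated by candidate 1, so only candidates 0, 1, and 3 remain.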

Expensive Multi-Objective Bayesian Optimization Based on Diffusion Models

Graph similarity computation (GSC) calculates the similarity between a pair of graphs and is a fundamental problem with many applications in the graph community. In GSC, graph edit distance (GED) and maximum common subgraph (MCS) are the two most widely adopted similarity metrics, both of which are NP-hard to compute. Instead of calculating the exact values, state-of-the-art solutions leverage graph neural networks (GNNs) to learn data-driven models for the estimation of GED and MCS. Most of them are built on components involving node-level interactions across graphs, which incur substantial computational overhead but contribute little to effectiveness. Motivated by this, in this paper, we present GraSP, a simple yet effective GSC approach for GED and MCS prediction. More concretely, GraSP achieves high result efficacy through several key components: enhanced node features via positional encoding and a GNN model augmented by a gating mechanism, residual connections, and multi-scale pooling. Theoretically, GraSP can surpass the 1-WL test, indicating its high expressiveness. Empirically, extensive experiments comparing GraSP against 10 competitors on multiple widely adopted benchmark datasets showcase the superiority of GraSP over prior art in terms of both effectiveness and efficiency. The source code is available at https://anonymous.4open.science/r/GraSP-2024.

GraSP: Simple yet Effective Graph Similarity Predictions

Following natural instructions is crucial for the effective application of Retrieval-Augmented Generation (RAG) systems. Despite recent advancements in Large Language Models (LLMs), research on assessing and improving instruction-following (IF) alignment within the RAG domain remains limited. To address this issue, we propose VIF-RAG, the first automated, scalable, and verifiable synthetic pipeline for instruction-following alignment in RAG systems. We start by manually crafting a minimal set of atomic instructions (<100) and developing combination rules to synthesize and verify complex instructions for a seed set. We then use supervised models for instruction rewriting while simultaneously generating code to automate the verification of instruction quality via a Python executor. Finally, we integrate these instructions with extensive RAG and general data samples, scaling up to a high-quality VIF-RAG-QA dataset (>100k) through automated processes. To address the gap in instruction-following auto-evaluation for RAG systems, we introduce FollowRAG Benchmark, which includes approximately 3K test samples, covering 22 categories of general instruction constraints and 4 knowledge-intensive QA datasets. Due to its robust pipeline design, FollowRAG can seamlessly integrate with different RAG benchmarks. Using FollowRAG and 8 widely-used IF and foundational abilities benchmarks for LLMs, we demonstrate that VIF-RAG markedly enhances LLM performance across a broad range of general instruction constraints while effectively leveraging its capabilities in RAG scenarios. Further analysis offers practical insights for achieving IF alignment in RAG systems.
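The idea of verifying instructions with executable code can be illustrated with a few atomic checks that are combined into a complex constraint. The sketch below is hypothetical: the specific checks (`max_words`, `must_include`, `ends_with`) and the `combine` rule are assumptions chosen for illustration, not the atomic instructions or combination rules used by VIF-RAG.

```python
from typing import Callable

Verifier = Callable[[str], bool]


def max_words(limit: int) -> Verifier:
    """Atomic check: the response has at most `limit` whitespace-separated words."""
    return lambda response: len(response.split()) <= limit


def must_include(keyword: str) -> Verifier:
    """Atomic check: the response mentions `keyword` (case-insensitive)."""
    return lambda response: keyword.lower() in response.lower()


def ends_with(suffix: str) -> Verifier:
    """Atomic check: the response ends with `suffix`, ignoring trailing spaces."""
    return lambda response: response.rstrip().endswith(suffix)


def combine(*verifiers: Verifier) -> Verifier:
    """A complex instruction is satisfied only if every atomic check passes."""
    return lambda response: all(check(response) for check in verifiers)


# "Answer in at most 10 words, mention Paris, and end with a period."
check = combine(max_words(10), must_include("Paris"), ends_with("."))

ok = check("The capital of France is Paris.")
too_long = check("Well, after thinking about it for a while, I believe it is Paris.")
print(ok, too_long)
```

Because each check is an ordinary function, a response can be accepted or rejected automatically without human judgment, which is what makes such constraints verifiable.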

Toward Verifiable Instruction-Following Alignment for Retrieval Augmented Generation

Dense video captioning (DVC) aims to describe multiple events within a video, and its performance is greatly affected by the accuracy of video event detection. Video event detection involves predicting the proposal boundaries (start and end times) and the classification score of each event in a video. Recently, a few methods have applied diffusion models originally designed for image object detection to detect events in DVC. These methods add noise to the ground-truth event proposal boundaries, and subsequently learn the denoising process. However, these methods often overlook the fundamental differences between videos and images. We observe that, whereas in images the important information for object classification is normally around the boundaries of the ground truth boxes, in videos the key information for event classification is typically centered in the middle of ground-truth event proposals. As a result, the classification module in these existing diffusion models becomes insensitive to boundary changes introduced by the added noise, leading to sub-optimal performance. This paper introduces DiffDVC, an innovative diffusion model for DVC. The core of DiffDVC is a boundary-sensitive detector. The detector increases the sensitivity of the classification module to boundary changes by focusing on frames within a specific range around the start and end times of noisy event proposals. Additionally, this range is dynamically adjusted to suit different event proposals. Comprehensive experiments on ActivityNet-1.3, ActivityNet Captions and YouCook2 datasets show DiffDVC achieving superior performance.

DiffDVC: Accurate Event Detection for Dense Video Captioning via Diffusion Models

The ability of zero-shot translation emerges when we train a multilingual model with certain translation directions; the model can then directly translate in unseen directions. Alternatively, zero-shot translation can be accomplished by pivoting through a third language (e.g., English). In our work, we observe that both direct and pivot translations are noisy and achieve less satisfactory performance. We propose EBBS, an ensemble method with a novel bi-level beam search algorithm, where each ensemble component explores its own prediction step by step at the lower level but all components are synchronized by a "soft voting" mechanism at the upper level. Results on two popular multilingual translation datasets show that EBBS consistently outperforms direct and pivot translations, as well as existing ensemble techniques. Further, we can distill the ensemble's knowledge back to the multilingual model to improve inference efficiency; notably, our EBBS-distilled model can even outperform EBBS as it learns from the ensemble knowledge.

EBBS: An Ensemble with Bi-Level Beam Search for Zero-Shot Machine Translation

Meta-learning, or "learning to learn," aims to enable models to quickly adapt to new tasks with minimal data. While traditional methods like Model-Agnostic Meta-Learning (MAML) optimize parameters in Euclidean space, they often struggle to capture complex learning dynamics, particularly in few-shot learning scenarios. To address this limitation, we propose Stiefel-MAML, which integrates Riemannian geometry by optimizing within the Stiefel manifold, a space that naturally enforces orthogonality constraints. By leveraging the geometric structure of the Stiefel manifold, we improve parameter expressiveness and enable more efficient optimization through Riemannian gradient calculations and retraction operations. We also introduce a novel kernel-based loss function defined on the Stiefel manifold, further enhancing the model’s ability to explore the parameter space. Experimental results on benchmark datasets—including Omniglot, Mini-ImageNet, FC-100, and CUB—demonstrate that Stiefel-MAML consistently outperforms traditional MAML, achieving superior performance across various few-shot learning tasks. Our findings highlight the potential of Riemannian geometry to enhance meta-learning, paving the way for future research on optimizing over different geometric structures.
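The two geometric operations named in this abstract, the Riemannian gradient and the retraction, can be sketched with NumPy for a single update step. The formulas used here are the standard ones for the Stiefel manifold with the embedded metric (tangent-space projection followed by a QR-based retraction); whether Stiefel-MAML uses exactly these choices is an assumption, and the function names are hypothetical.

```python
import numpy as np


def project_to_tangent(x: np.ndarray, grad: np.ndarray) -> np.ndarray:
    """Project a Euclidean gradient onto the tangent space at x (with x^T x = I)."""
    sym = (x.T @ grad + grad.T @ x) / 2.0
    return grad - x @ sym


def qr_retraction(x: np.ndarray, step: np.ndarray) -> np.ndarray:
    """Map x + step back onto the manifold using the Q factor of a QR decomposition."""
    q, r = np.linalg.qr(x + step)
    # Fix column signs so that the diagonal of R is positive (unique Q factor).
    signs = np.where(np.diag(r) < 0, -1.0, 1.0)
    return q * signs


def riemannian_sgd_step(x: np.ndarray, grad: np.ndarray, lr: float) -> np.ndarray:
    """One descent step that keeps the columns of x orthonormal."""
    return qr_retraction(x, -lr * project_to_tangent(x, grad))


rng = np.random.default_rng(0)
x0, _ = np.linalg.qr(rng.standard_normal((5, 2)))  # a point on the Stiefel manifold
grad = rng.standard_normal((5, 2))                 # stand-in Euclidean gradient
x1 = riemannian_sgd_step(x0, grad, lr=0.1)

print(np.allclose(x1.T @ x1, np.eye(2)))
```

The updated matrix still has orthonormal columns, so the orthogonality constraint is maintained by construction rather than by an added penalty term.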

Riemannian Geometric-based Meta Learning

Large Vision-Language Models (LVLMs) often fail to align with human preferences, leading to issues like generating misleading content without proper visual context (also known as hallucination). A promising solution to this problem is to use human-preference alignment techniques, such as best-of-n sampling and reinforcement learning. However, these techniques are hampered by the scarcity of the visual preference data required to train a visual reward model (VRM). In this work, we continue this line of research. We present a Robust Visual Reward Model (RoVRM) which improves human-preference alignment for LVLMs. RoVRM leverages auxiliary textual preference data through a three-phase progressive training approach and optimal transport-based preference data selection to effectively mitigate the scarcity of visual preference data. We experiment with RoVRM on the commonly used vision-language tasks based on the LLaVA-1.5-7B and -13B models. Experimental results demonstrate that RoVRM consistently outperforms traditional VRMs. Furthermore, our three-phase progressive training and preference data selection approaches can yield consistent performance gains over ranking-based alignment techniques, such as direct preference optimization.
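Best-of-n sampling, one of the alignment techniques mentioned in this abstract, draws several candidate responses and keeps the one that a reward model scores highest. The sketch below shows only that selection logic; `toy_generate` and `toy_reward` are hypothetical stand-ins for an LVLM and a visual reward model, and the toy reward simply penalizes objects that are not in an assumed set of visible objects.

```python
from typing import Callable


def best_of_n(
    prompt: str,
    generate: Callable[[str], str],
    reward: Callable[[str, str], float],
    n: int = 4,
) -> str:
    """Sample n responses and return the one with the highest reward."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda response: reward(prompt, response))


# Toy generator: returns canned captions in order (stand-in for sampling an LVLM).
_canned = iter(
    ["A dog and a frisbee on grass.", "A dog on grass.", "A cat on a sofa."]
)


def toy_generate(prompt: str) -> str:
    return next(_canned)


VISIBLE = {"dog", "grass"}  # objects assumed to be in the image
KNOWN = {"dog", "cat", "frisbee", "grass", "sofa"}


def toy_reward(prompt: str, response: str) -> float:
    """Reward mentions of visible objects and penalize hallucinated ones."""
    words = {w.strip(".,").lower() for w in response.split()}
    mentioned = words & KNOWN
    return float(len(mentioned & VISIBLE) - len(mentioned - VISIBLE))


best = best_of_n("Describe the image.", toy_generate, toy_reward, n=3)
print(best)
```

The quality of the selected response depends entirely on the reward model, which is why a more robust reward model directly benefits this technique.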

RoVRM: A Robust Visual Reward Model Optimized via Auxiliary Textual Preference Data

Despite being equipped with alignment mechanisms, large language models (LLMs) are increasingly vulnerable to emerging jailbreak attacks that can compromise those mechanisms. This vulnerability poses significant risks to real-world applications. Existing approaches (e.g., Reinforcement Learning from Human Feedback and red-teaming) face challenges in both training efficiency and generalization capabilities. Developing effective strategies that enable LLMs to resist continuously evolving jailbreak attempts remains a significant challenge. To address this challenge, we propose a novel defensive paradigm called GuidelineLLM, which assists LLMs in recognizing queries that may have harmful content. Before LLMs respond to a query, GuidelineLLM first identifies potential risks associated with the query, summarizes these risks into guideline suggestions, and then feeds these guidelines to the responding LLMs. Importantly, our approach eliminates the need for additional safety fine-tuning of the LLMs themselves; only the GuidelineLLM requires fine-tuning. This characteristic enhances the general applicability of GuidelineLLM across various LLMs. Experimental results demonstrate that GuidelineLLM can significantly reduce the attack success rate (ASR) against the LLMs (an average reduction of 34.17% ASR) while maintaining the helpfulness of the LLMs in handling benign queries. Code is available at Anonymous.
