United States

Singing voice synthesis has achieved remarkable progress in generating natural and high-quality voices. However, existing methods rarely provide precise control over vocal techniques such as intensity, mixed voice, falsetto, bubble, and breathy tones, thereby limiting the expressive potential of synthetic voices. We introduce TechSinger, an advanced system for controllable singing voice synthesis that supports five languages and seven vocal techniques. TechSinger leverages a flow-matching-based generative model to produce singing voices with enhanced expressive control over techniques. To enhance the diversity of training data, we develop a technique detection model that annotates datasets with phoneme-level technique labels. Additionally, our prompt-based technique prediction model enables users to specify desired vocal attributes through natural language, offering fine-grained control over the synthesized singing. Experimental results demonstrate that TechSinger significantly enhances the expressiveness and realism of synthetic singing voices, outperforming existing methods in terms of audio quality and technique-specific control. Audio samples and additional details are available at https://tech-singer.github.io.

AAAI 2025

TechSinger: Technique Controllable Multilingual Singing Voice Synthesis via Flow Matching

poster

We are pleased to announce the Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25), which will be held in Philadelphia, Pennsylvania at the Pennsylvania Convention Center from February 25 to March 4, 2025.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-25 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.

### [Invited Speakers](https://aaai.org/conference/aaai/aaai-25/aaai-25-invited-speakers/)

Register [here](https://aaai.org/conference/aaai/aaai-25/registration/)

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-25 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.



We initiate the study of computing envy-free allocations of indivisible items in the extension setting, i.e., when some part of the allocation is fixed and the task is to allocate the remaining items. In view of the NP-hardness of the problem, we investigate whether - and under which conditions - one can obtain fixed-parameter algorithms for computing a solution in settings where most of the allocation is already fixed. Our results provide a broad complexity-theoretic classification of the problem which includes: (a) fixed-parameter algorithms tailored to settings with few distinct types of agents or items; (b) lower bounds which exclude the generalization of these positive results to more general settings. We conclude by showing that - unlike when computing allocations from scratch - the non-algorithmic question of whether more relaxed EF1 or EFX allocations exist can be completely resolved in the extension setting.

The Complexity of Extending Fair Allocations of Indivisible Goods

Multi-objective Bayesian optimization (MOBO) has shown promising performance on various expensive multi-objective optimization problems (EMOPs). However, effectively modeling complex distributions of the Pareto optimal solutions is difficult with limited function evaluations. Existing Pareto set learning algorithms may exhibit considerable instability in such expensive scenarios, leading to significant deviations between the obtained solution set and the Pareto set (PS). In this paper, we propose a novel Composite Diffusion Model based Pareto Set Learning algorithm, namely CDM-PSL, for expensive MOBO. CDM-PSL includes both unconditional and conditional diffusion model for generating high-quality samples efficiently. Besides, we introduce a weighting method based on information entropy to balance different objectives. This method is integrated with the guiding strategy, ensuring that all the objectives are appropriately balanced and given due consideration during the optimization process. Extensive experimental results on both synthetic benchmarks and real-world problems demonstrates that CDM-PSL attains superior performance compared with various state-of-the-art MOBO algorithms.

Expensive Multi-Objective Bayesian Optimization Based on Diffusion Models

Graph similarity computation (GSC) is to calculate the similarity between one pair of graphs, which is a fundamental problem with fruitful applications in the graph community. In GSC, graph edit distance (GED) and maximum common subgraph (MCS) are the two most adopted similarity metrics, both of which are NP-hard to compute. Instead of calculating the exact values, state-of-the-art solutions resort to leveraging graph neural networks (GNNs) to learn data-driven models for the estimation of GED and MCS. Most of them are built on components involving node-level interactions crossing graphs, which engender vast computation overhead but are of little avail in effectiveness. Motivated by this, in the paper, we present GraSP, a simple yet effective GSC approach for GED and MCS prediction. More concretely, GraSP achieves high result efficacy through several key instruments: enhanced node features via positional encoding and a GNN model augmented by a gating mechanism, residual connections, as well as multi-scale pooling. Theoretically, GraSP can surpass the 1-WL test, indicating its high expressiveness. Empirically, extensive experiments comparing GraSP against 10 competitors on multiple widely adopted benchmark datasets showcase the superiority of GraSP over prior arts in terms of both effectiveness and efficiency. The source code is available at https://anonymous.4open.science/r/GraSP-2024.

GraSP: Simple yet Effective Graph Similarity Predictions

Following natural instructions is crucial for the effective application of Retrieval-Augmented Generation (RAG) systems. Despite recent advancements in Large Language Models (LLMs), research on assessing and improving instruction-following (IF) alignment within the RAG domain remains limited. To address this issue, we propose VIF-RAG, the first automated, scalable, and verifiable synthetic pipeline for instruction-following alignment in RAG systems. We start by manually crafting a minimal set of atomic instructions ($<$100) and developing combination rules to synthesize and verify complex instructions for a seed set. We then use supervised models for instruction rewriting while simultaneously generating code to automate the verification of instruction quality via a Python executor. Finally, we integrate these instructions with extensive RAG and general data samples, scaling up to a high-quality VIF-RAG-QA dataset ($>$100k) through automated processes. To address the gap in instruction-following auto-evaluation for RAG systems, we introduce FollowRAG Benchmark, which includes approximately 3K test samples, covering 22 categories of general instruction constraints and 4 knowledge-intensive QA datasets. Due to its robust pipeline design, FollowRAG can seamlessly integrate with different RAG benchmarks. Using FollowRAG and 8 widely-used IF and foundational abilities benchmarks for LLMs, we demonstrate that VIF-RAG markedly enhances LLM performance across a broad range of general instruction constraints while effectively leveraging its capabilities in RAG scenarios. Further analysis offers practical insights for achieving IF alignment in RAG systems.

Toward Verifiable Instruction-Following Alignment for Retrieval Augmented Generation

Dense video captioning (DVC) aims to describe multiple events within a video, and its performance is greatly affected by the accuracy of video event detection. Video event detection involves predicting the proposal boundaries (start and end times) and the classification score of each event in a video. Recently, a few methods have applied diffusion models originally designed for image object detection to detect events in DVC. These methods add noise to the ground-truth event proposal boundaries, and subsequently learn the denoising process. However, these methods often overlook the fundamental differences between videos and images. We observe that, whereas in images the important information for object classification is normally around the boundaries of the ground truth boxes, in videos the key information for event classification is typically centered in the middle of ground-truth event proposals. As a result, the classification module in these existing diffusion models becomes insensitive to boundary changes introduced by the added noise, leading to sub-optimal performance. This paper introduces DiffDVC, an innovative diffusion model for DVC. The core of DiffDVC is a boundary-sensitive detector. The detector increases the sensitivity of the classification module to boundary changes by focusing on frames within a specific range around the start and end times of noisy event proposals. Additionally, this range is dynamically adjusted to suit different event proposals. Comprehensive experiments on ActivityNet-1.3, ActivityNet Captions and YouCook2 datasets show DiffDVC achieving superior performance.

DiffDVC: Accurate Event Detection for Dense Video Captioning via Diffusion Models

The ability of zero-shot translation emerges when we train a multilingual model with certain translation directions; the model can then directly translate in unseen directions. Alternatively, zero-shot translation can be accomplished by pivoting through a third language (e.g., English). In our work, we observe that both direct and pivot translations are noisy and achieve less satisfactory performance. We propose EBBS, an ensemble method with a novel bi-level beam search algorithm, where each ensemble component explores its own prediction step by step at the lower level but all components are synchronized by a "soft voting" mechanism at the upper level. Results on two popular multilingual translation datasets show that EBBS consistently outperforms direct and pivot translations, as well as existing ensemble techniques. Further, we can distill the ensemble's knowledge back to the multilingual model to improve inference efficiency; profoundly, our EBBS-distilled model can even outperform EBBS as it learns from the ensemble knowledge.

EBBS: An Ensemble with Bi-Level Beam Search for Zero-Shot Machine Translation

Meta-learning, or "learning to learn," aims to enable models to quickly adapt to new tasks with minimal data. While traditional methods like Model-Agnostic Meta-Learning (MAML) optimize parameters in Euclidean space, they often struggle to capture complex learning dynamics, particularly in few-shot learning scenarios. To address this limitation, we propose Stiefel-MAML, which integrates Riemannian geometry by optimizing within the Stiefel manifold, a space that naturally enforces orthogonality constraints. By leveraging the geometric structure of the Stiefel manifold, we improve parameter expressiveness and enable more efficient optimization through Riemannian gradient calculations and retraction operations. We also introduce a novel kernel-based loss function defined on the Stiefel manifold, further enhancing the model’s ability to explore the parameter space. Experimental results on benchmark datasets—including Omniglot, Mini-ImageNet, FC-100, and CUB—demonstrate that Stiefel-MAML consistently outperforms traditional MAML, achieving superior performance across various few-shot learning tasks. Our findings highlight the potential of Riemannian geometry to enhance meta-learning, paving the way for future research on optimizing over different geometric structures.

Riemannian Geometric-based Meta Learning

Large Vision-Language Models (LVLMs) often fail to align with human preferences, leading to issues like generating misleading content without proper visual context (also known as \textit{hallucination}).
A promising solution to address this problem is using human-preference alignment techniques, such as best-of-$n$ sampling and reinforcement learning.
However, these techniques face the difficulty arising from the scarcity of visual preference data required to train a visual reward model (VRM).
In this work, we continue the line of research.
We present a $\textbf{Ro}$bust $\textbf{V}$isual $\textbf{R}$eward$ \textbf{M}$odel (RoVRM) which improves human-preference alignment for LVLMs.
RoVRM leverages auxiliary textual preference data through a three-phase progressive training approach and optimal transport-based preference data selection to effectively mitigate the scarcity of visual preference data.
We experiment with RoVRM on the commonly used vision-language tasks based on the LLaVA-1.5-7B and -13B models.
Experimental results demonstrate that RoVRM consistently outperforms traditional VRMs. 
Furthermore, our three-phase progressive training and preference data selection approaches can yield consistent performance gains over ranking-based alignment techniques, such as direct preference optimization.

RoVRM: A Robust Visual Reward Model Optimized via Auxiliary Textual Preference Data

Despite being empowered with alignment mechanisms, large language models (LLMs) are increasingly vulnerable to emerging jailbreak attacks that can compromise their alignment mechanisms. 
This vulnerability poses significant risks to the real-world applications.
Existing work faces challenges in both training efficiency and generalization capabilities (i.e., Reinforcement Learning from Human Feedback and Red-Teaming). 
Developing effective strategies to enable LLMs to resist continuously evolving jailbreak attempts represents a significant challenge. 
To address this challenge, we propose a novel defensive paradigm called GuidelineLLM, which assists LLMs in recognizing queries that may have harmful content. 
Before LLMs respond to a query, GuidelineLLM first identifies potential risks associated with the query, summarizes these risks into guideline suggestions, and then feeds these guidelines to the responding LLMs.
Importantly, our approach eliminates the necessity for additional safety fine-tuning of the LLMs themselves; only the GuidelineLLM requires fine-tuning. This characteristic enhances the general applicability of GuidelineLLM across various LLMs. 
Experimental results demonstrate that GuidelineLLM can significantly reduce the attack success rate (ASR) against the LLMs (an average reduction of 34.17\% ASR) while maintaining the helpfulness of the LLMs in handling benign queries. 
Code is available at $\texttt{Anonymous}$.

Look Before You Leap: Enhance Attention and Vigilance Regarding Harmful Content with GuidelineLLM

Time series forecasting is an important application in various domains such as energy management, traffic planning, financial markets, meteorology, and medicine. However, real-time series data often present intricate temporal variability and sharp fluctuations, which pose significant challenges for time series forecasting. Previous models that rely on 1D time series representations usually struggle with complex temporal variations. To address the limitations of 1D time series, this study introduces the Times2D method that transforms the 1D time series into 2D space. Times2D consists of three main parts: first, a Periodic Decomposition Block (PDB) that captures temporal variations within a period and between the same periods by converting the time series into a 2D tensor in the frequency domain. Second, the First and Second Derivative Heatmaps (FSDH) capture sharp changes and turning points, respectively. Finally, an Aggregation Forecasting Block (AFB) integrates the 2D tensors from PDB and FSDH for accurate forecasting. This 2D transformation enables the utilization of 2D convolutional operations to effectively capture long and short characteristics of the time series. Comprehensive experimental results across large-scale data in the literature demonstrate that the proposed Times2D model achieves state-of-the-art performance in both short-term and long-term forecasting.

Premium content

Next from AAAI 2025

The Complexity of Extending Fair Allocations of Indivisible Goods

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES