United States

Singing Voice Synthesis (SVS) aims to generate singing voices of high fidelity and expressiveness. Conventional SVS systems usually use an acoustic model to transform a music score into acoustic features, followed by a vocoder to reconstruct the singing voice. End-to-end modeling was recently shown to be effective in both SVS and Text-to-Speech (TTS). In this work, we therefore present a fully end-to-end SVS method together with chunkwise streaming inference to address the latency issue in practical use. Note that this is the first attempt to fully implement end-to-end streaming audio synthesis using latent representations in a VAE. We have made specific improvements to enhance the performance of streaming SVS using latent representations. Experimental results demonstrate that the proposed method synthesizes audio with high expressiveness and pitch accuracy in both streaming SVS and TTS tasks.

AAAI 2025

CSSinger: End-to-End Chunkwise Streaming Singing Voice Synthesis System Based on Conditional Variational Autoencoder

snlp

speech synthesis

poster

We are pleased to announce the Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25), which will be held in Philadelphia, Pennsylvania at the Pennsylvania Convention Center from February 25 to March 4, 2025.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-25 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.

### [Invited Speakers](https://aaai.org/conference/aaai/aaai-25/aaai-25-invited-speakers/)





Drug response prediction (DRP) is a longstanding challenge in modern oncology that underpins personalized treatment. Early DRP methods, trained on label-rich cell line samples, suffer from performance degradation when applied to label-scarce patient samples due to the distribution shift. Recently, a few transfer learning efforts have addressed this issue by aligning cell line (source domain) and patient (target domain) data via unsupervised domain adaptation (UDA). However, these efforts often treat each drug's response prediction as an isolated task, requiring model retraining when the drug changes; and focus only on aligning data distributions as a whole, neglecting the category (e.g., different cancers or tissues) confusion problem. To address these limitations, we propose a knowledge-guided domain adaptation model to transfer the DRP from cell lines to patients, named TransDRP. Specifically, TransDRP operates in two phases: pre-training and adaptation. In the first phase, we pre-train a multi-label graph neural network using molecular knowledge, to simultaneously predict responses for various drugs and capture their interdependencies. In the second phase, we implement a global-local domain adversarial strategy with clinical knowledge, to encourage representation alignment within same cancer categories and separation among different cancer categories across domains. Extensive experiments demonstrate that TransDRP outperforms state-of-the-art UDA methods in both transfer efficiency and precision for the patient DRP.

Knowledge-guided domain adaptation model for transferring drug response prediction from cell lines to patients

Model counting is a fundamental task that involves determining the number of satisfying assignments to a logical formula, typically in conjunctive normal form (CNF). While CNF model counting has received extensive attention over recent decades, interest in Pseudo-Boolean (PB) model counting is emerging, partly due to the greater flexibility of PB formulas. However, we observed feature gaps in existing PB counters, such as a lack of support for projected and incremental settings, which could hinder adoption.

In this work, our main contribution is the introduction of the PB model counter PI-PBC, the first exact PB model counter with support for projected and incremental model counting. Our counter, PI-PBC, uses our Least Occurrence Weighted Min Degree (LOW-MD) computation ordering heuristic to support projected model counting and a cache mechanism to enable incremental model counting. In our evaluations, PI-PBC completed at least 1.40x the number of benchmarks of competing methods for projected model counting and at least 1.18x of competing methods in incremental model counting.

Towards Projected and Incremental Pseudo-Boolean Model Counting
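To make the terms in the abstract above concrete, the following is a minimal brute-force sketch of exact and projected PB model counting. It is not PI-PBC and does not use the LOW-MD heuristic or any caching; the function names, the constraint encoding (a coefficient dictionary plus a lower bound), and the example formula are all assumptions chosen for illustration.

```python
from itertools import product

def satisfies(assignment, constraints):
    """Check every PB constraint of the form sum(coeff * x_var) >= bound."""
    return all(
        sum(coeff * assignment[var] for var, coeff in terms.items()) >= bound
        for terms, bound in constraints
    )

def count_models(variables, constraints, projection=None):
    """Count satisfying assignments by exhaustive enumeration.

    If `projection` is given, count the distinct restrictions of the
    satisfying assignments to the projection variables instead.
    """
    key_vars = variables if projection is None else projection
    seen = set()
    for values in product((0, 1), repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if satisfies(assignment, constraints):
            seen.add(tuple(assignment[v] for v in key_vars))
    return len(seen)

# Hypothetical example formula over three Boolean variables.
variables = ["x1", "x2", "x3"]
constraints = [
    ({"x1": 2, "x2": 1, "x3": 1}, 2),  # 2*x1 + x2 + x3 >= 2
    ({"x1": -1, "x2": -1}, -1),        # x1 + x2 <= 1, rewritten as >=
]

total = count_models(variables, constraints)
projected = count_models(variables, constraints, projection=["x1", "x2"])
print(total, projected)  # prints "3 2"
```

The formula has three satisfying assignments, but two of them differ only in `x3`, so projecting onto `x1` and `x2` leaves two distinct models. Enumeration is exponential in the number of variables, which is why practical counters rely on ordering heuristics and caching.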

Large vision-language models show tremendous potential in understanding visual information through human languages. However, they are prone to suffer from object hallucination, i.e., the generated image descriptions contain objects that do not exist in the image. In this paper, we reveal that object hallucination can be attributed to overconfidence in irrelevant visual features when soft visual tokens map to the LLM's word embedding space. Specifically, by figuring out the semantic similarity between visual tokens and LLM's word embedding, we observe that the smoothness of similarity distribution strongly correlates with the emergence of object hallucinations. To mitigate hallucinations, we propose using the Variational Information Bottleneck (VIB) to alleviate overconfidence by introducing stochastic noise, facilitating the constraining of irrelevant information. Furthermore, we propose an entropy-based noise-controlling strategy to enable the injected noise to be adaptively constrained regarding the smoothness of the similarity distribution. We adapt the proposed AdaVIB across distinct model architectures. Experimental results demonstrate that the proposed AdaVIB mitigates object hallucinations by effectively alleviating the overconfidence in irrelevant visual features, with consistent improvements on two object hallucination benchmarks.

Mitigating Hallucinations in Large Vision-Language Models by Adaptively Constraining Information Flow

Learned image compression (LIC) has achieved state-of-the-art rate-distortion performance, deemed promising for next-generation image compression techniques. However, pre-trained LIC models usually suffer from significant performance degradation when applied to out-of-training-domain images, implying their poor generalization capabilities. To tackle this problem, we propose a few-shot domain adaptation method for LIC by integrating plug-and-play adapters into pre-trained models. Drawing inspiration from the analogy between latent channels and frequency components, we examine domain gaps in LIC and observe that out-of-training-domain images disrupt pre-trained channel-wise decomposition. Consequently, we introduce a method for channel-wise re-allocation using convolution-based adapters and low-rank adapters, which are lightweight and compatible to mainstream LIC schemes. Extensive experiments across multiple domains and multiple representative LIC schemes demonstrate that our method significantly enhances pre-trained models, achieving comparable performance to H.266/VVC intra coding with merely 25 target-domain samples. Additionally, our method matches the performance of full-model finetune while transmitting fewer than 2% of the parameters.

Few-Shot Domain Adaptation for Learned Image Compression

Online to batch conversion involves constructing a new batch learner from a series of models generated by an existing online learning algorithm, in order to achieve generalization guarantees under the i.i.d. assumption. However, when applied to real-world streaming applications such as streaming recommender systems, the data stream may be sampled from time-varying distributions instead of persistently being i.i.d. This poses a challenge in terms of out-of-distribution (OOD) generalization. Existing approaches employ fixed conversion mechanisms that are unable to adapt to novel testing distributions, hindering the testing accuracy of the batch learner. To address these issues, we propose AdaO2B, an adaptive online to batch conversion approach under the bandit setting. AdaO2B is designed to be aware of the distribution shifts in the testing data and achieves OOD generalization guarantees. Specifically, AdaO2B can dynamically combine the sequence of models learned by a contextual bandit algorithm and determine appropriate combination weights using a context-aware weighting function. This approach converts a sequence of models into a batch learner that facilitates OOD generalization. Theoretical analysis justifies why and how the learned adaptive batch learner can achieve OOD generalization error guarantees. Experimental results demonstrate that AdaO2B significantly outperforms state-of-the-art baselines on both synthetic data and real-world data.

AdaO2B: Adaptive Online to Batch Conversion for Out-of-Distribution Generalization

In recent years, agents have become capable of communicating seamlessly via natural language and navigating in environments that involve cooperation and competition, a fact that can introduce social dilemmas. Due to the interleaving of cooperation and competition, understanding agents' decision-making in such environments is challenging, and humans can benefit from obtaining explanations. However, such environments and scenarios have rarely been explored in the context of explainable AI. While some explanation methods for cooperative environments can be applied in mixed-motive setups, they do not address inter-agent competition, cheap-talk, or implicit communication by actions. In this work, we design explanation methods to address these issues. Then, we proceed to establish generality and demonstrate the applicability of the methods to three games with vastly different properties.
Lastly, we demonstrate the effectiveness and usefulness of the methods for humans in two mixed-motive games. The first is a challenging 7-player game called no-press Diplomacy. The second is a 3-player game inspired by the Prisoner's Dilemma, featuring communication in natural language.

Explaining Decisions of Agents in Mixed-Motive Games

The objective of Composed Image Retrieval (CIR) is to identify a target image that meets the requirement based on a multimodal query (including the reference image and the modification text) provided by the user. Despite the notable success of existing approaches, they fail to adequately address the modification relation between visual entities and modification actions. This limitation is non-trivial due to three challenges: 1) irrelevant factor perturbation, 2) vague semantic boundaries, and 3) implicit modification relations. To address the above challenges, we propose an Entity miNing and modifiCation relatiOn binDing nEtwoRk (ENCODER), which is designed to mine visual entities and modification actions, and then bind modification relations. First, we design the Latent Factor Filter (LFF) module to filter visual and textual latent factors related to modification semantics based on a threshold gating mechanism. Second, we propose Entity-Action Binding (EAB), which comprises modality-shared Learnable Relation Queries (LRQ) that are capable of mining visual entities and modification actions, as well as learning implicit modification relations for entity-action binding. Finally, the Multi-scale Composition module is introduced to achieve multi-scale feature composition, with guidance provided by entity-action binding. Extensive experiments on four benchmark datasets demonstrate the superiority of our proposed method. Our code and checkpoints are released in the supplementary material.

ENCODER: Entity Mining and Modification Relation Binding for Composed Image Retrieval

Dependency parsing is crucial in natural language processing for analyzing syntactic structures. Dependency parsers enhanced by pre-trained language models have achieved outstanding performance in high-resource languages. In contrast, cross-language dependency parsing is an effective strategy to learn useful knowledge from high-resource languages and compensate for the deficiency of low-resource languages. However, the key challenge for cross-language dependency parsing is to reduce distributional biases and excavate in-depth commonalities. To address this issue, we propose the novel dynamic syntactic feature filtering and injecting networks based on the traditional shared-private model which utilizes a shared and two private encoders to separate features from source or target languages. Concretely, a language-specific syntactic feature filtering network (LSFN) on private encoders emphasizes helpful information and ignores irrelevant or harmful information from the source language. Meanwhile, the language-invariant syntactic feature injection network (LIIN) on the shared encoder can combine the advantages of BiLSTM and improved transformer encoders to transcend language boundaries, thus amplifying syntactic commonalities across languages. We perform experiments on seven benchmark datasets to measure the efficacy of our proposed model and observe an average absolute gain of $1.84$ UAS and $3.43$ LAS compared with the shared-private model. Comparative experiments validate that both LSFN and LIIN components are complementary in transferring beneficial knowledge from source to target languages. Detailed analyses highlight that our model can effectively capture linguistic commonalities and minimize differences, showcasing its robustness and efficacy. Our code will be publicly available at https://flamelunar.github.io/.

Dynamic Syntactic Feature Filtering and Injecting Networks for Cross-lingual Dependency Parsing

Recent advancements in Multimodal Large Language Models (MLLMs) have generated significant interest in their ability to autonomously interact with and interpret Graphical User Interfaces (GUIs). A major challenge in these systems is grounding—accurately identifying critical GUI components such as text or icons based on a GUI image and a corresponding text query. Traditionally, this task has relied on fine-tuning MLLMs with specialized training data to predict component locations directly. However, in this paper, we propose a novel Tuning-free Attention-driven Grounding (TAG) method that leverages the inherent attention patterns in pretrained MLLMs to accomplish this task without the need for additional fine-tuning. Our method involves identifying and aggregating attention maps from specific tokens within a carefully constructed query prompt. Applied to MiniCPM-Llama3-V 2.5, a state-of-the-art MLLM, our tuning-free approach achieves performance comparable to tuning-based methods, with notable success in text localization. Additionally, we demonstrate that our attention map-based grounding technique significantly outperforms direct localization predictions from MiniCPM-Llama3-V 2.5, highlighting the potential of using attention maps from pretrained MLLMs and paving the way for future innovations in this domain.

Attention-Driven GUI Grounding: Leveraging Pretrained Multimodal Large Language Models Without Fine-Tuning

Missing values in multivariate time series data can harm machine learning performance and introduce bias. These gaps arise from sensor malfunctions, blackouts, and human error. Previous work has addressed missing at random, complete blackouts, and forecasting scenarios. This paper addresses a more general missing pattern, termed **partial blackout**, where a subset of features is missing for consecutive time steps. This scenario is more common in real-world applications. We introduce a two-stage imputation process using self-attention and diffusion processes to model feature and temporal correlations. Our model outperforms state-of-the-art models in partial blackout scenarios and offers better scalability, promising practical data imputation solutions.
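The partial blackout pattern described above can be illustrated with a small observation mask. This is only a sketch of the missing pattern itself, not of the paper's imputation model; the function name, its parameters, and the example dimensions are assumptions made for illustration.

```python
def partial_blackout_mask(n_steps, n_features, missing_features, start, length):
    """Build an observation mask (True = observed, False = missing).

    The listed features are missing for `length` consecutive time steps
    beginning at `start`; all other entries stay observed.
    """
    mask = [[True] * n_features for _ in range(n_steps)]
    for t in range(start, min(start + length, n_steps)):
        for f in missing_features:
            mask[t][f] = False
    return mask

# Hypothetical series: 8 time steps, 4 features; features 1 and 3
# are blacked out for time steps 2, 3 and 4.
mask = partial_blackout_mask(
    n_steps=8, n_features=4, missing_features=[1, 3], start=2, length=3
)
n_missing = sum(not observed for row in mask for observed in row)
```

Setting `missing_features` to every feature index yields a complete blackout, so the complete-blackout case mentioned in the abstract is a special case of this mask.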
