United States

Humans can perceive speakers’ characteristics (e.g., identity, gender, personality and emotion) from their appearance, which is generally aligned with their voice style. Recently, vision-driven text-to-speech (TTS) researchers have grounded their investigations in real-person faces, which prevents effective speech synthesis from being applied to the many potential usage scenarios with diverse characters and image styles. To solve this issue, we introduce a novel FaceSpeak approach. It extracts salient identity characteristics and emotional representations from a wide variety of image styles. Meanwhile, it mitigates extraneous information (e.g., background, clothing, and hair color), resulting in synthesized speech closely aligned with a character’s persona. Furthermore, to overcome the scarcity of multi-modal TTS data, we have devised an innovative dataset, namely Expressive Multi-Modal TTS (EM2TTS), which is diligently curated and annotated to facilitate research in this domain. The experimental results demonstrate that our proposed FaceSpeak can generate portrait-aligned voice with satisfactory naturalness and quality. Demos are released at https://facespeak.github.io/

AAAI 2025

FaceSpeak: Expressive and High-Quality Speech Synthesis from Human Portraits of Different Styles

poster

We are pleased to announce the Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25), which will be held in Philadelphia, Pennsylvania at the Pennsylvania Convention Center from February 25 to March 4, 2025.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-25 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.

### [Invited Speakers](https://aaai.org/conference/aaai/aaai-25/aaai-25-invited-speakers/)

Register [here](https://aaai.org/conference/aaai/aaai-25/registration/)

Unsupervised domain adaptation (UDA) refers to a domain adaptation framework in which a learning model is trained on labeled samples from the source domain and unlabeled samples from the target domain. The dominant existing methods in the field, which rely on the classical covariate shift assumption to learn domain-invariant feature representations, have yielded suboptimal performance under label distribution shift. In this paper, we propose a novel Conditional Adversarial SUpport ALignment (CASUAL) method that minimizes the conditional symmetric support divergence between the feature representation distributions of the source and target domains, aiming at a more discriminative representation for the classification task. We also introduce a novel theoretical target risk bound, which justifies the merits of aligning the supports of conditional feature distributions compared to the existing marginal support alignment approach in UDA settings. We then provide a complete training process in which the optimization objectives are based directly on the proposed target risk bound. Our empirical results demonstrate that CASUAL outperforms other state-of-the-art methods on different UDA benchmark tasks under different label shift conditions.

CASUAL: Conditional Support Alignment for Domain Adaptation with Label Shift

The introduction of the Feature Pyramid Network (FPN) has significantly improved object detection performance. However, substantial challenges persist when detecting tiny objects. The features of tiny objects occupy a very small proportion of the feature maps. Although FPN integrates multi-scale features, it does not directly enhance or enrich the features of tiny objects. Furthermore, FPN lacks spatial perception ability. To address these issues, we propose a novel High Frequency and Spatial Perception Feature Pyramid Network (HS-FPN) containing two innovative modules. First, we design a high frequency perception module (HFP) that generates high frequency responses through high pass filters. These high frequency responses are used as mask weights from both spatial and channel perspectives to enrich and highlight the features of tiny objects in the original feature maps. Second, we develop a spatial dependency perception module (SDP) to capture the spatial dependencies that FPN lacks. Our experiments demonstrate that detectors based on HS-FPN exhibit competitive advantages over state-of-the-art models on the AI-TOD dataset for tiny object detection.

HS-FPN: High Frequency and Spatial Perception FPN for Tiny Object Detection
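The high-frequency masking idea in the HS-FPN abstract can be illustrated with a small sketch. This is not the paper's HFP module: the blur-subtraction high-pass filter, the peak normalization, and the function names `box_blur`, `high_pass`, `spatial_mask`, and `highlight` are all assumptions made for illustration, and only the spatial perspective (not the channel one) is shown.

```python
def box_blur(fmap):
    """3x3 mean filter with edge replication; acts as a simple low-pass filter."""
    h, w = len(fmap), len(fmap[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            total = 0.0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr = min(max(r + dr, 0), h - 1)  # clamp row index at the border
                    cc = min(max(c + dc, 0), w - 1)  # clamp column index at the border
                    total += fmap[rr][cc]
            out[r][c] = total / 9.0
    return out


def high_pass(fmap):
    """High-frequency response: the input minus its low-pass version (assumed filter)."""
    low = box_blur(fmap)
    return [[v - l for v, l in zip(row, low_row)] for row, low_row in zip(fmap, low)]


def spatial_mask(fmap):
    """Per-position weights in [0, 1], proportional to the high-frequency magnitude."""
    hp = high_pass(fmap)
    peak = max(abs(v) for row in hp for v in row)
    if peak < 1e-12:  # flat map: no high-frequency content to highlight
        return [[0.0] * len(row) for row in hp]
    return [[abs(v) / peak for v in row] for row in hp]


def highlight(fmap):
    """Re-weight the original features so high-frequency positions are amplified."""
    mask = spatial_mask(fmap)
    return [[v * (1.0 + m) for v, m in zip(row, mask_row)]
            for row, mask_row in zip(fmap, mask)]
```

On a feature map that is zero everywhere except for one isolated activation, the mask peaks at that position, which is the behavior one would want for a tiny object that covers only a few cells.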

Text-to-image diffusion models significantly enhance the efficiency of artistic creation with high-fidelity image generation. However, in typical application scenarios like comic book production, they can neither place each subject into its expected spot nor maintain the consistent appearance of each subject across images. To address these issues, we pioneer a novel task, *Layout-to-Consistent-Image* (L2CI) generation, which produces consistent and compositional images in accordance with the given layout conditions and text prompts. To accomplish this challenging task, we present a new formalization of *dual energy guidance* with optimization in a dual semantic-latent space and thus propose a training-free pipeline, __SpotActor__, which features a layout-conditioned backward update stage and a consistent forward sampling stage. In the backward stage, we devise a nuanced layout energy function to mimic the attention activations with a sigmoid-like objective. In the forward stage, we design *Regional Interconnection Self-Attention* (RISA) and *Semantic Fusion Cross-Attention* (SFCA) mechanisms that allow mutual interactions across images. To evaluate the performance, we present __ActorBench__, a specialized benchmark with hundreds of reasonable prompt-box pairs stemming from object detection datasets. Comprehensive experiments are conducted to demonstrate the effectiveness of our method. The results show that SpotActor fulfills the expectations of this task and showcases the potential for practical applications with superior layout alignment, subject consistency, prompt conformity and background diversity.

SpotActor: Training-Free Layout-Controlled Consistent Image Generation

Identifying the Markov properties or conditional independencies of a collection of random variables is a fundamental task in statistics for modeling and inference. Existing approaches often learn the structure of a probabilistic graph, which encodes these dependencies, by assuming the variables follow a distribution with a simple parametric form. Moreover, the computational cost of many algorithms scales poorly for high-dimensional distributions, as they need to estimate all the edges in the graph simultaneously. In this work, we propose a scalable algorithm to infer the conditional independence relationships of each variable by exploiting the local Markov property. The proposed method, named Localized Sparsity Identification for Non-Gaussian Distributions (SING), estimates the graph without restricting the family of conditional distributions for each variable. We show that the localized SING algorithm includes existing approaches, such as neighborhood selection with Lasso, as a special case. We demonstrate the effectiveness of our algorithm in both Gaussian and non-Gaussian settings compared to existing methods. Lastly, we show the scalability of the proposed approach by applying it to high-dimensional non-Gaussian examples, including a biological dataset with more than 150 variables.

Learning Local Neighborhoods of Non-Gaussian Graphical Models

Time series forecasting aims at predicting future values of a time series and plays a crucial role in many real-world applications, e.g., finance, disease spread, or weather prediction. However, it is also a very challenging task, especially for long-term forecasting. In this paper, we introduce WaveletMixer, an iterative multi-level, multi-resolution and multi-phase approach that effectively captures long-term dependencies of multivariate time series from both global and local perspectives to improve forecasting accuracy. WaveletMixer fundamentally differs from existing works in the following key aspects. First, it exploits the multi-level properties of the wavelet transform to create multiple forecasting models for different frequency domains at different levels of resolution. Second, the relationships among different frequency domains are exploited to iteratively adjust all prediction models at all levels simultaneously, from both local and global perspectives, to reduce prediction errors and biases, thus significantly improving the final prediction accuracy. Third, while WaveletMixer is a general framework that can be used to boost the performance of any deep-learning architecture (e.g., MLP, LSTM or Transformer), we additionally introduce TS-Learner, an MLP-based model, to further enhance the performance in long-term forecasting. Extensive experiments have been conducted on nine real-world datasets to demonstrate the performance of WaveletMixer compared to SOTA methods and to reveal its important characteristics. Code and extended experimental results are available in the supplementary material.

WaveletMixer: A Multi-Resolution Wavelets Based MLP-Mixer for Multivariate Long-Term Time Series Forecasting
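The multi-level decomposition that the WaveletMixer abstract builds on can be sketched as follows. The abstract does not name a wavelet family, so the orthonormal Haar wavelet used here is an assumption, as are the function names `haar_step`, `haar_decompose`, and `haar_reconstruct`. The sketch covers only the split into frequency bands, not the per-band forecasting models or their iterative adjustment.

```python
import math


def haar_step(signal):
    """One level of the orthonormal Haar transform: returns (approximation, detail)."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


def haar_decompose(signal, levels):
    """Return [detail_1, ..., detail_levels, approx_levels], finest band first."""
    bands = []
    approx = list(signal)
    for _ in range(levels):
        approx, detail = haar_step(approx)
        bands.append(detail)
    bands.append(approx)
    return bands


def haar_reconstruct(bands):
    """Invert haar_decompose by merging bands from coarsest to finest."""
    s = math.sqrt(2.0)
    approx = list(bands[-1])
    for detail in reversed(bands[:-1]):
        merged = []
        for a, d in zip(approx, detail):
            merged.append((a + d) / s)
            merged.append((a - d) / s)
        approx = merged
    return approx
```

Each band has a different length and resolution, so a separate forecaster could in principle be fitted per band and the forecasts recombined with `haar_reconstruct`; how WaveletMixer actually couples the levels is described only at a high level in the abstract.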

Continual Learning (CL) is a highly relevant setting gaining traction in recent machine learning research. Among CL works, architectural and hybrid strategies are particularly effective due to their potential to adapt the model architecture as new tasks are presented. However, many existing solutions do not efficiently exploit model sparsity, and are prone to capacity saturation due to their inefficient use of available weights, which limits the number of learnable tasks. In this paper, we propose TinySubNets (TSN), a novel architectural CL strategy that addresses these issues through the unique combination of pruning with different sparsity levels, adaptive quantization, and weight sharing. Pruning identifies a subset of weights that preserve model performance, making less relevant weights available for future tasks. Adaptive quantization allows a single weight to be separated into multiple parts which can be assigned to different tasks. Weight sharing between tasks boosts the exploitation of capacity and task similarity, allowing for the identification of a better trade-off between model accuracy and capacity. These features allow TSN to efficiently leverage the available capacity, enhance knowledge transfer, and reduce computational resource consumption. Experimental results involving common benchmark CL datasets and scenarios show that our proposed strategy achieves better accuracy than existing state-of-the-art CL strategies. Moreover, our strategy is shown to provide significantly improved exploitation of model capacity.

TinySubNets: An Efficient and Low Capacity Continual Learning Strategy
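The pruning step described in the TinySubNets abstract, where each task keeps a subset of weights and frees the rest for later tasks, can be sketched as below. The magnitude-based selection criterion and the helper names `magnitude_mask` and `assign_tasks` are assumptions; the abstract does not state how weights are ranked, and retraining, adaptive quantization, and weight sharing are left out.

```python
def magnitude_mask(weights, sparsity, available=None):
    """Keep the largest-magnitude weights among the available indices.

    sparsity is the fraction of available weights to prune (0.0 to 1.0).
    Returns the set of kept indices.
    """
    if available is None:
        available = set(range(len(weights)))
    n_keep = round(len(available) * (1.0 - sparsity))
    # Sort indices first so that ties are broken deterministically.
    ranked = sorted(sorted(available), key=lambda i: abs(weights[i]), reverse=True)
    return set(ranked[:n_keep])


def assign_tasks(weights, sparsities):
    """Give each task a disjoint subnetwork, one sparsity level per task.

    Returns (masks, free), where masks[t] is the index set owned by task t
    and free is the capacity still unassigned after all tasks.
    """
    free = set(range(len(weights)))
    masks = []
    for sparsity in sparsities:
        kept = magnitude_mask(weights, sparsity, free)
        masks.append(kept)
        free -= kept  # weights pruned for this task stay available for later tasks
    return masks, free


weights = [0.9, -0.05, 0.4, 0.01, -0.7, 0.02, 0.3, -0.1]
masks, free = assign_tasks(weights, [0.5, 0.5])
```

In this toy run the first task takes the four largest weights and the second task takes half of what remains, leaving two weights unassigned. In a real continual learning setup the freed weights would be retrained before the next task is pruned; that step is omitted here.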

We propose FreeCap, a novel hybrid calibration-free method, to accurately capture global multi-person motions in open environments. Our system combines a single LiDAR with expandable moving cameras, allowing for flexible and precise motion estimation in a unified world coordinate system. In particular, we introduce a local-to-global pose-aware cross-sensor human-matching module that predicts the alignment among sensors, even in the absence of calibration. Additionally, our coarse-to-fine sensor-expandable pose optimizer further optimizes the 3D human key points and the alignments, and it can incorporate additional cameras to enhance accuracy. Extensive experiments on the Human-M3 and FreeMotion datasets demonstrate that our method significantly outperforms state-of-the-art single-modal methods, offering an expandable and efficient solution for multi-person motion capture across various applications.

FreeCap: Hybrid Calibration-Free Motion Capture in Open Environments

We study neural network training (NNT): optimizing a neural network's parameters to minimize the training loss over a given dataset. NNT has been studied extensively from a theoretical perspective, mainly on two-layer networks with linear or ReLU activation functions where the parameters can take any real value (here referred to as continuous NNT (C-NNT)). However, less is known about deeper neural networks, which exhibit substantially stronger capabilities in practice. In addition, the complexity of the discrete variant of the problem (D-NNT for short), in which the parameters are taken from a given finite set of options, has remained less explored despite its theoretical and practical significance.


In this work, we show that the hardness of NNT is dramatically affected by the network depth. Specifically, we show that, under standard complexity assumptions, there is no bounded-error probabilistic polynomial time (BPP) algorithm for D-NNT even on instances with fixed dimensions and dataset size, having a deep architecture. As it is generally assumed that NP $\subseteq$ BPP, our result indicates that D-NNT is unlikely to be in NP. Furthermore, using a polynomial reduction we show that the above result also holds for C-NNT, albeit with more structured instances. We complement these results with a comprehensive list of NP-hardness lower bounds for D-NNT on two-layer networks, showing that fixing the number of dimensions, the dataset size, or the number of neurons in the hidden layer leaves the problem challenging. Finally, we obtain a pseudo-polynomial algorithm for D-NNT on a two-layer network with a fixed dataset size.

On the Hardness of Training Deep Neural Networks Discretely
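To make the D-NNT problem statement concrete, the sketch below solves a tiny instance by exhaustive search. It is only an illustration of the problem definition, not an algorithm from the paper: the bias-free two-layer ReLU architecture, the squared loss, and the names `forward` and `train_discrete` are assumptions.

```python
from itertools import product


def relu(z):
    return z if z > 0 else 0.0


def forward(hidden_weights, output_weights, x):
    """Two-layer network without biases: sum_j a_j * relu(w_j . x)."""
    return sum(
        a * relu(sum(w_i * x_i for w_i, x_i in zip(w, x)))
        for w, a in zip(hidden_weights, output_weights)
    )


def train_discrete(data, dim, width, options):
    """Try every assignment of parameters from `options`; return (loss, params).

    The search visits len(options) ** (width * dim + width) assignments,
    so it is only feasible for very small networks.
    """
    n_params = width * dim + width
    best_loss, best_params = None, None
    for flat in product(options, repeat=n_params):
        hidden = [flat[j * dim:(j + 1) * dim] for j in range(width)]
        output = flat[width * dim:]
        loss = sum((forward(hidden, output, x) - y) ** 2 for x, y in data)
        if best_loss is None or loss < best_loss:
            best_loss, best_params = loss, (hidden, output)
    return best_loss, best_params


# One input dimension, one hidden neuron, parameters restricted to {-1, 0, 1}.
data = [((1.0,), 1.0), ((2.0,), 2.0), ((-1.0,), 0.0)]
loss, params = train_discrete(data, dim=1, width=1, options=[-1, 0, 1])
```

The exponential size of the search space is what makes the question of efficient algorithms, and the hardness results stated in the abstract, meaningful.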

Identifying the causal pathways of unfairness is a critical objective in improving policy design and algorithmic decision-making. However, prior work in causal fairness analysis requires knowledge of the causal graph, hindering practical applications in complex or low-knowledge domains. Moreover, relying on global discovery methods to learn causal structure from data can result in unstable performance with finite samples, potentially leading to contradictory fairness conclusions. To mitigate these issues, we introduce *local discovery for direct discrimination* (LD3): an algorithm tailored to uncover structural evidence of direct discrimination by identifying the causal parents of an outcome variable. LD3 performs a linear number of conditional independence tests relative to variable set size, and allows for latent confounding under the sufficient condition that no parent of the outcome is latent. LD3 prevents unnecessary adjustment, resulting in more interpretable adjustment sets for assessing unfairness. We introduce a graphical criterion for identifying the *weighted controlled direct effect* (WCDE), a qualitative indicator of direct discrimination, and show that the knowledge returned by LD3 satisfies this criterion. We deploy LD3 for causal fairness analyses of two complex decision systems: criminal recidivism prediction and liver transplant allocation. Results on real-world data demonstrate more plausible causal relations than baselines, which took 46$\times$ to 5870$\times$ longer to execute.

Local Causal Discovery for Structural Evidence of Direct Discrimination

Beginner musicians often struggle to identify specific errors in their performances, such as playing incorrect notes or rhythms. There are two limitations in existing tools for music error detection: (1) Existing approaches rely on automatic alignment, which is error-prone due to small deviations between alignment targets; (2) There is a lack of sufficient data to train music error detection models, resulting in over-reliance on heuristics. To address (1), we propose a novel transformer model, *Muse*, that takes audio inputs and outputs annotated music scores. This model can be trained end-to-end to implicitly align and compare performance audio with music scores through latent space representations. To address (2), we present a novel data generation technique capable of creating large-scale synthetic music error datasets. Our approach achieves a 64.1% average Error Detection F1 score, improving upon prior work by 40 percentage points across 14 instruments. Compared with existing transcription methods repurposed for music error detection, our model can handle multiple instruments. This allows the model to scale across multiple instruments and generalize across datasets like MAESTRO and CocoChorales.
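For readers unfamiliar with the metric reported above, an error-detection F1 score can be computed as in the sketch below. The abstract does not say how a predicted error is matched to a ground-truth error, so treating each error as an exact `(error_type, note_index)` pair is an assumption, as is the function name `error_detection_f1`.

```python
def error_detection_f1(true_errors, predicted_errors):
    """F1 over sets of (error_type, note_index) pairs, using exact matching."""
    true_set, pred_set = set(true_errors), set(predicted_errors)
    true_positives = len(true_set & pred_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_set)  # share of predictions that are real errors
    recall = true_positives / len(true_set)     # share of real errors that were found
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotations for one performance.
truth = [("extra", 3), ("missed", 7), ("wrong", 10)]
predicted = [("extra", 3), ("wrong", 10), ("wrong", 12), ("missed", 15)]
score = error_detection_f1(truth, predicted)
```

Here two of the four predictions are correct and two of the three true errors are found, giving precision 1/2, recall 2/3, and an F1 score of 4/7. An average over instruments, as reported in the abstract, would be the mean of such per-instrument scores under whatever matching rule the authors use.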
