United States

The language model (LM) approach based on acoustic and linguistic prompts, such as VALL-E, has achieved remarkable progress in the field of zero-shot audio generation. However, existing methods still have some limitations: 1) repetitions, transpositions, and omissions in the synthesized speech due to limited alignment constraints between audio and phoneme tokens; 2) challenges of fine-grained control over the synthesized speech with an autoregressive (AR) language model; 3) infinite silence generation due to the nature of AR-based decoding, especially under the greedy strategy. To alleviate these issues, we propose ELLA-V, a simple but efficient LM-based zero-shot text-to-speech (TTS) framework, which enables fine-grained control over synthesized audio at the phoneme level. The key to ELLA-V is interleaving sequences of acoustic and phoneme tokens, where phoneme tokens appear ahead of the corresponding acoustic tokens. The experimental findings reveal that our model outperforms baselines in terms of accuracy and delivers more stable results using both greedy and sampling-based decoding strategies. Demo and code can be found in supplementary material.

AAAI 2025

ELLA-V: Stable Neural Codec Language Modeling with Alignment-Guided Sequence Reordering

technical paper
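The interleaving idea described in the abstract can be illustrated with a minimal sketch. It assumes a phoneme-level alignment is already available as a number of acoustic frames per phoneme; the function name `interleave`, the `("ph", ...)` / `("ac", ...)` tagging, and the token values are hypothetical, and the actual ELLA-V sequence format may include additional marker tokens not shown here.

```python
def interleave(phonemes, frames_per_phoneme, acoustic_tokens):
    """Place each phoneme token directly ahead of its aligned acoustic tokens.

    Hypothetical helper: the alignment format and token tags are assumptions
    made for illustration, not the paper's implementation.
    """
    if len(phonemes) != len(frames_per_phoneme):
        raise ValueError("need one frame count per phoneme")
    if sum(frames_per_phoneme) != len(acoustic_tokens):
        raise ValueError("frame counts must cover all acoustic tokens")

    sequence = []
    position = 0
    for phoneme, n_frames in zip(phonemes, frames_per_phoneme):
        # The phoneme token comes first ...
        sequence.append(("ph", phoneme))
        # ... followed by the acoustic tokens aligned to it.
        for token in acoustic_tokens[position:position + n_frames]:
            sequence.append(("ac", token))
        position += n_frames
    return sequence


example = interleave(["HH", "AY"], [2, 1], [11, 12, 13])
```

In this toy example, `example` holds the phoneme `HH`, its two acoustic tokens, then `AY` and its single acoustic token, so every acoustic token sits right after the phoneme it belongs to.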

We are pleased to announce the Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25), which will be held in Philadelphia, Pennsylvania at the Pennsylvania Convention Center from February 25 to March 4, 2025.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-25 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.

### [Invited Speakers](https://aaai.org/conference/aaai/aaai-25/aaai-25-invited-speakers/)

Register [here](https://aaai.org/conference/aaai/aaai-25/registration/)




Randomized search heuristics have been applied successfully to a plethora of problems. This success is complemented by a large body of theoretical results. Unfortunately, the vast majority of these results regard problems with binary or continuous decision variables -- the theoretical analysis of randomized search heuristics for unbounded integer domains is almost nonexistent. To resolve this shortcoming, we start the runtime analysis of multi-objective evolutionary algorithms, which are among the most successful randomized search heuristics, for unbounded integer search spaces. We analyze single- and full-dimensional mutation operators with three different mutation strengths, namely changes by plus/minus one (unit strength), random changes following a law with exponential tails, and random changes following a power-law. The performance guarantees we prove on a recently proposed natural benchmark problem suggest that unit mutation strengths can be slow when the initial solutions are far from the Pareto front. When setting the expected change right (depending on the benchmark parameter and the distance of the initial solutions), the mutation strength with exponential tails yields the best runtime guarantees in our results -- however, with a wrong choice of this expectation, the performance guarantees quickly become highly uninteresting. With power-law mutation, which is an essentially parameter-less mutation operator, we obtain good results uniformly over all problem parameters and starting points. We complement our mathematical findings with experimental results that suggest that our bounds are not always tight. Most prominently, our experiments indicate that power-law mutation outperforms the one with exponential tails even when the latter uses a near-optimal parametrization. Hence, we suggest favoring power-law mutation for unknown problems in integer spaces.

Runtime Analysis for Multi-Objective Evolutionary Algorithms in Unbounded Integer Spaces
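The three mutation strengths named in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's definitions: the exponential-tail law is modeled as a geometric magnitude with parameter `q`, the power law is truncated at `max_magnitude` so it can be sampled from a finite list, and all function names and default values (`beta=1.5`, `max_magnitude=1000`) are hypothetical. Only the single-dimensional variant, which changes one uniformly chosen coordinate, is shown.

```python
import random


def unit_step(rng):
    """Unit strength: change by +1 or -1 with equal probability."""
    return rng.choice((-1, 1))


def exponential_tail_step(rng, q):
    """Random sign times a geometric magnitude (assumed exponential-tail law).

    The magnitude k >= 1 has probability q * (1 - q) ** (k - 1),
    so the expected magnitude is 1 / q.
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must lie in (0, 1]")
    magnitude = 1
    while rng.random() >= q:
        magnitude += 1
    return rng.choice((-1, 1)) * magnitude


def power_law_step(rng, beta=1.5, max_magnitude=1000):
    """Random sign times a magnitude k drawn with probability ~ k ** -beta.

    Truncating at max_magnitude is a simplification made for this sketch.
    """
    magnitudes = list(range(1, max_magnitude + 1))
    weights = [k ** -beta for k in magnitudes]
    magnitude = rng.choices(magnitudes, weights=weights, k=1)[0]
    return rng.choice((-1, 1)) * magnitude


def mutate_single_dimension(x, step, rng):
    """Copy x and apply one step to a single uniformly chosen coordinate."""
    y = list(x)
    index = rng.randrange(len(y))
    y[index] += step(rng)
    return y


rng = random.Random(0)
parent = [5, -3, 0]
child = mutate_single_dimension(parent, power_law_step, rng)
```

The exponential-tail operator needs `q` chosen to match the distance to cover, whereas the power-law operator takes no problem-specific parameter, which mirrors the trade-off the abstract describes.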

Multimodal recommendation (MMRec) aims to integrate multimodal information of items to address the inherent data sparsity issue in collaborative-based recommendation. Traditional MMRec methods typically capture the structure-level item representations from the observed user behaviors within the multimodal graph, overlooking the potential impact of negative instances for personalized preference understanding. In light of the outstanding generative ability and step-by-step inference characteristic of Diffusion Models (DMs), we propose a Curriculum Conditioned Diffusion framework for Multimodal Recommendation (CCDRec), which precisely excavates the modality-aware distribution-level correlation among multi-modalities and elegantly integrates the reverse phase of DMs into negative sampling to highlight the most suitable instances in a curricular manner. Specifically, CCDRec proposes the Diffusion-controlled Multimodal Aligning module (DMA) to align multimodal knowledge with collaborative signals by capturing the fine-grained relationships among multi-modalities in the probabilistic distribution space. Furthermore, CCDRec designs the Negative-sensitive Diffusive Inferring module (NDI) to progressively synthesize the negative sample pool with diverse hardness to support the following knowledge-aware negative sampling. To gradually ramp up the training complexity, CCDRec further introduces a Curricular Negative Sampler (CNS) to tally the curriculum learning paradigm with the reverse phase of DMA, thereby adaptively sampling the gold-standard negative instances to enhance optimization. Extensive experiments on three datasets with four diverse backbones demonstrate the effectiveness and robustness of our CCDRec. The visualization analyses also clarify the underlying mechanism of our DMA in multimodal representation alignment and CNS in curricular negative discovery. The code and the corresponding dataset will be uploaded in the Appendix.

Curriculum Conditioned Diffusion for Multimodal Recommendation

In this paper, we consider the classic fair division problem of allocating $m$ divisible items to $n$ agents with linear valuations over the items. We define novel notions of fair shares from the perspective of individual agents via the cake-cutting process. These shares generalize the notion of proportionality by taking into account the valuations of other agents via constraints capturing envy. We study what fraction (approximation) of these shares is achievable in the worst case, and present tight and non-trivial approximation bounds as a function of $n$ and $m$. In particular, we show a tight approximation bound of $\Theta(\sqrt{n})$ for various notions of such shares. We show this bound via a novel application of dual fitting, which may be of independent interest. We also present a bound of $O(m^{2/3})$ for a strict notion of share, with an almost matching lower bound. We further develop weaker notions of shares whose approximation bound interpolates smoothly between proportionality and the shares described above. We finally present empirical results showing that our definitions lead to more reasonable shares than the standard fair share notion of proportionality.

Fair Division via the Cake-Cutting Share
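For orientation, the standard proportionality baseline that the abstract generalizes can be written down in a few lines. The sketch below covers only that baseline (each agent receives at least a 1/n fraction of their value for all items) under linear valuations over divisible items; it does not implement the cake-cutting shares from the paper, and the function names and example numbers are hypothetical.

```python
def proportional_share(valuations, agent):
    """An agent's proportional share: 1/n of their total value for all items."""
    n_agents = len(valuations)
    return sum(valuations[agent]) / n_agents


def bundle_value(valuations, allocation, agent):
    """Linear valuation: sum over items of value times the fraction received."""
    return sum(v * x for v, x in zip(valuations[agent], allocation[agent]))


def is_proportional(valuations, allocation, tolerance=1e-12):
    """Check that every agent receives at least their proportional share."""
    return all(
        bundle_value(valuations, allocation, agent)
        >= proportional_share(valuations, agent) - tolerance
        for agent in range(len(valuations))
    )


# Two agents, two divisible items; rows are agents, columns are items.
valuations = [[6, 2], [2, 6]]
matched = [[1, 0], [0, 1]]   # each agent gets the item they value more
swapped = [[0, 1], [1, 0]]   # each agent gets the item they value less

matched_ok = is_proportional(valuations, matched)
swapped_ok = is_proportional(valuations, swapped)
```

Here each agent's proportional share is 4, so the matched allocation (value 6 each) is proportional while the swapped one (value 2 each) is not.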

Large Language Models (LLMs) are often English-centric due to the disproportionate distribution of languages in their pre-training data. Enhancing non-English language capabilities through post-pretraining often results in catastrophic forgetting of high-resource languages. Previous methods either achieve good expansion with severe forgetting or slight forgetting with poor expansion, indicating the challenge of balancing language expansion while preventing forgetting. In this paper, we propose a method called MoE-LPR (Mixture-of-Experts with Language Priors Routing) to alleviate this problem. MoE-LPR employs a two-stage training approach to enhance the multilingual capability. First, the model is post-pretrained into a Mixture-of-Experts (MoE) architecture by upcycling, where all the original parameters are frozen and new experts are added. In this stage, we focus on improving the ability on expanded languages, without using any original language data. Then, the model reviews the knowledge of the original languages with replay data amounting to less than 1% of post-pretraining, where we incorporate language priors routing to better recover the abilities of the original languages. Evaluations on multiple benchmarks show that MoE-LPR outperforms other post-pretraining methods. Freezing original parameters preserves original language knowledge while adding new experts preserves the learning ability. Reviewing with LPR enables effective utilization of multilingual knowledge within the parameters. Additionally, the MoE architecture maintains the same inference overhead while increasing total model parameters. Extensive experiments demonstrate MoE-LPR’s effectiveness in improving expanded languages and preserving original language proficiency with superior scalability.

MoE-LPR: Multilingual Extension of Large Language Models Through Mixture-of-Experts with Language Priors Routing

As an essential technique for Graph Contrastive Learning (GCL), Graph Augmentation (GA) enhances the generalization capability of the model by introducing diverse forms of the same graph. To pursue information completeness, the majority of GCLs have devised augmentation strategies that simultaneously target the two types of information available in graphs: attributes and topology. Nonetheless, these strategies invariably ignore the correlation between these two types of graph information, which limits the representation ability of the model. To overcome this drawback, this paper proposes a novel GCL framework, named Joint spectrAl augMentation (GCL-JAM). The main idea is to transform the original graph into an attribute-interpolated graph to align node attributes with graph topology and then perform spectral augmentation on this newly constructed graph for their joint augmentation. Theoretically, the ability of the proposed graph transformation to harmonize node attributes and graph topology, and the superiority of the proposed joint spectral augmentation over existing augmentations are demonstrated, respectively. Extensive experiments on homophily and heterophily graphs validate the effectiveness of GCL-JAM.

Graph Contrastive Learning with Joint Spectral Augmentation of Attribute and Topology

Personalized federated learning (PFL) studies effective model personalization to address the data heterogeneity issue among clients in traditional federated learning (FL). Existing PFL approaches mainly generate personalized models by relying solely on the clients' latest updated models while ignoring their previous updates, which may result in suboptimal personalized model learning. To bridge this gap, we propose a novel framework termed pFedSeq, designed for personalizing adapters to fine-tune a foundation model with FL. In pFedSeq, the server maintains and trains a sequential learner, which processes a sequence of past adapter updates from clients and generates calibrations for personalized adapters. To optimally capture the cross-client and cross-step relations hidden in previous updates and generate high-performing personalized adapters, pFedSeq adopts the powerful selective state space model (SSM) as the architecture of the sequential learner. Through extensive experiments on four public benchmark datasets, we demonstrate the superiority of pFedSeq over state-of-the-art PFL methods.

Look Back for More: Harnessing Historical Sequential Updates for Personalized Federated Adapter Tuning

Many manufacturing companies are facing an acute shortage of qualified workers. Deploying robotic cells is a potential solution to address this challenge. Historically, robots have been deployed only in mass production applications in manufacturing. A large fraction of manufacturing is classified as high-mix manufacturing where a large variety of products are produced. Manually programming robots is not a viable solution in high-mix manufacturing applications. Robotic cells need to be powered by embodied AI to make them useful in high-mix manufacturing applications. This paper aims to build a bridge between smart manufacturing and AI communities to enable AI researchers to develop methods and tools that can be successfully deployed to realize smart robotic cells for high-mix manufacturing applications. This paper highlights key requirements for developing embodied AI for powering robotic cells for high-mix manufacturing applications. It also makes the case for approaches that combine model-based and data-driven methods to meet the needs of embodied AI in manufacturing applications and describes the role of generative AI approaches in smart manufacturing applications. Finally, it describes how AI can be used to enhance digital twins and augment human-machine interfaces in manufacturing applications.

Embodied AI for Smart Robotic Cells in Manufacturing Applications

Dental disease is a prevalent chronic condition associated with substantial financial burden, personal suffering, and increased risk of systemic diseases. Despite widespread recommendations for twice-daily tooth brushing, adherence to recommended oral self-care behaviors remains sub-optimal due to factors such as forgetfulness and disengagement. To address this, we developed Oralytics, an mHealth intervention system designed to complement clinician-delivered preventative care for marginalized individuals at risk for dental disease. Oralytics incorporates an online reinforcement learning algorithm to determine optimal times to deliver intervention prompts that encourage oral self-care behaviors. We have deployed Oralytics in a registered clinical trial. The deployment required careful design to manage challenges specific to the clinical trials setting in the U.S. In this paper, we (1) highlight key design decisions of the RL algorithm that address these challenges and (2) conduct a re-sampling analysis to evaluate algorithm design decisions. A second phase (randomized control trial) of Oralytics is planned to start in spring 2025.

A Deployed Online Reinforcement Learning Algorithm in an Oral Health Clinical Trial

This talk presents my ongoing research to develop a reliable and adaptive learning framework tailored for edge intelligence, addressing key challenges in resource-constrained environments such as smart agriculture, autonomous retail, and smart homes. Edge computing requires efficient and reliable models that can dynamically adapt to changing conditions while making accurate predictions. To achieve this, I propose a multi-layered framework that integrates sparse learning for reliability, adaptive computation for efficient resource management, and generalization strategies, enabling robust AI deployment at the edge.

Advancing Reliable Edge Intelligence

Machine Learning (ML) algorithms are increasingly used in our daily lives, yet often exhibit discrimination against protected groups. In this talk, I discuss the growing concern of bias in ML and overview existing approaches to address fairness issues. Then, I present three novel approaches developed by my research group. The first leverages generative AI to eliminate biases in training datasets, the second tackles non-convex problems that arise in fair learning, and the third introduces a matrix decomposition-based post-processing approach to identify and eliminate unfair model components.
