United States

Dialogue serves as the most natural manner of human-computer interaction (HCI). Recent advancements in speech language models (SLMs) have significantly enhanced speech-based conversational AI. However, these models are limited to turn-based conversation and lack the ability to interact with humans in real-time spoken scenarios, for example, being interrupted when the generated content is not satisfactory. To address these limitations, we explore full duplex modeling (FDM) in interactive speech language models (iSLM), focusing on enhancing real-time interaction and, more explicitly, exploring the quintessential ability of interruption. We introduce a novel model design, namely the listening-while-speaking language model (LSLM), an end-to-end system equipped with both listening and speaking channels. Our LSLM employs a token-based decoder-only TTS for speech generation and a streaming self-supervised learning (SSL) encoder for real-time audio input. LSLM fuses both channels for autoregressive generation and detects turn-taking in real time. Three fusion strategies (early fusion, middle fusion, and late fusion) are explored, with middle fusion achieving an optimal balance between speech generation and real-time interaction. Two experimental settings, command-based FDM and voice-based FDM, demonstrate LSLM's robustness to noise and sensitivity to diverse instructions. Our results highlight LSLM's capability to achieve duplex communication with minimal impact on existing systems. This study aims to advance the development of interactive speech dialogue systems, enhancing their applicability in real-world contexts.
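
The control flow implied by "listening while speaking" can be illustrated with a toy loop: at every step the speaking channel produces one token while also consuming one frame from the listening channel, and generation stops as soon as a turn-taking signal is detected. This is only a sketch under stated assumptions. The `IRQ` and `EOS` tokens, the `duplex_decode` function, and the `toy_step` stand-in for the fused model are hypothetical names introduced here for illustration and are not taken from the abstract.

```python
from typing import Callable, List, Sequence, Tuple

IRQ = "<irq>"  # hypothetical interruption token; not named in the abstract
EOS = "<eos>"  # hypothetical end-of-speech token


def duplex_decode(
    step_fn: Callable[[List[str], float], str],
    listen_frames: Sequence[float],
) -> Tuple[List[str], bool]:
    """Emit one token per incoming listening frame until EOS or IRQ.

    Returns the tokens spoken so far and whether the speaker was interrupted.
    """
    spoken: List[str] = []
    for frame in listen_frames:
        # The model sees its own history plus the current listening frame.
        token = step_fn(spoken, frame)
        if token == IRQ:
            return spoken, True  # stop speaking: turn-taking detected
        if token == EOS:
            break  # finished speaking without interruption
        spoken.append(token)
    return spoken, False


def toy_step(spoken: List[str], frame: float) -> str:
    """Stand-in for the fused model: interrupts when the frame is loud."""
    if frame > 0.5:
        return IRQ
    if len(spoken) >= 4:
        return EOS
    return f"tok{len(spoken)}"
```

Calling `duplex_decode(toy_step, [0.0, 0.1, 0.9, 0.0])` stops at the third frame and reports an interruption. A real system would replace `toy_step` with a model that fuses the streaming encoder output into the decoder.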

AAAI 2025

Language Model Can Listen While Speaking

speech signal processing

snlp

poster

We are pleased to announce the Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25), which will be held in Philadelphia, Pennsylvania at the Pennsylvania Convention Center from February 25 to March 4, 2025.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-25 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.

### [Invited Speakers](https://aaai.org/conference/aaai/aaai-25/aaai-25-invited-speakers/)

Register [here](https://aaai.org/conference/aaai/aaai-25/registration/)




Cross-domain few-shot learning (CDFSL) requires the model to transfer knowledge from a data-abundant source domain to data-scarce target domains for fast adaptation, where the large domain gap makes CDFSL a challenging problem. The Masked Autoencoder (MAE) excels at using unlabeled data and learning the global structure of images, enhancing model generalization and robustness. However, in the CDFSL task with significant domain shifts, we find that MAE performs even worse than baseline supervised models. In this paper, we first investigate this phenomenon to find an interpretation. We find that MAE tends to focus on low-level domain information when reconstructing pixels, while changing the reconstruction target to token features can mitigate this problem. However, not all features are beneficial: reconstructing high-level features hardly improves the model's transferability, indicating a trade-off between filtering domain information and preserving the image's global structure. In all, the reconstruction target matters for the CDFSL task. Based on these findings and interpretations, we further propose Domain-Agnostic Masked Image Modeling (DAMIM) for the CDFSL task. DAMIM includes an Aggregated Feature Reconstruction module that automatically aggregates features for reconstruction, balancing the learning of domain-agnostic information and the images' global structure, and a Lightweight Decoder module to further benefit the encoder's generalizability. Experiments on four CDFSL datasets demonstrate that our method achieves state-of-the-art performance.

Reconstruction Target Matters in Masked Image Modeling for Cross-Domain Few-Shot Learning

Label correction methods are popular for their simple architecture in learning with noisy labels. However, they suffer severely from false label correction and achieve subpar performance compared with state-of-the-art methods. In this paper, we revisit label correction methods through a theoretical analysis of gradient scaling and demonstrate that sample-wise dynamics and class-wise uniformity of the interpolation weight prevent memorization of mislabeled samples. We then propose DULC, a simple yet effective label correction method that uses the normalized Jensen-Shannon divergence (JSD) metric as the interpolation weight to promote sample-wise dynamics and class-wise uniformity. Additionally, we provide theoretical evidence that sharpening predictions in label correction facilitates memorization of the true class, and we achieve this by employing an augmentation strategy along with a sharpening function. Extensive experiments on the CIFAR-10, CIFAR-100, TinyImageNet, WebVision and Clothing1M datasets demonstrate substantial improvements over state-of-the-art methods. Our code is available in the supplementary material.

Revisiting Interpolation for Noisy Label Correction
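
To make the idea of a JSD-based interpolation weight in the abstract above concrete, here is a minimal sketch. It computes the Jensen-Shannon divergence between the given label and the model prediction, normalizes it by ln 2 so that it lies in [0, 1], and uses it to mix the two distributions. The function names and the direction of the interpolation (higher divergence puts more weight on the prediction) are assumptions made for illustration. The abstract does not specify DULC's exact formula.

```python
import math
from typing import List, Sequence


def normalized_jsd(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence divided by ln 2, so the result is in [0, 1]."""

    def kl(a: Sequence[float], b: Sequence[float]) -> float:
        # Terms with a_i == 0 contribute nothing to the KL divergence.
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]  # mixture distribution
    return (0.5 * kl(p, m) + 0.5 * kl(q, m)) / math.log(2)


def correct_label(label: Sequence[float], pred: Sequence[float]) -> List[float]:
    """Interpolate label and prediction with a per-sample JSD weight.

    Assumption: the weight w is placed on the prediction, so a label that
    disagrees strongly with the model is corrected more.
    """
    w = normalized_jsd(label, pred)
    return [(1 - w) * l + w * p for l, p in zip(label, pred)]
```

The weight is computed per sample, and because JSD is symmetric and bounded, it does not depend on which class index carries the probability mass.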

Generative retrieval constitutes an innovative approach in information retrieval, leveraging generative language models (LMs) to generate a ranked list of document identifiers (docids) for a given query. It simplifies the retrieval pipeline by replacing the large external index with model parameters. However, existing work learns only the relationship between queries and document identifiers, which cannot directly represent the relevance between queries and documents. To address this problem, we propose a novel and general generative retrieval framework, namely Leveraging Document-Oriented Contrastive Learning in Generative Retrieval (DOGR), which leverages contrastive learning to improve generative retrieval tasks. It adopts a two-stage learning strategy that captures the relationship between queries and documents comprehensively through direct interactions. Furthermore, negative sampling methods and corresponding contrastive learning objectives are implemented to enhance the learning of semantic representations, thereby promoting a thorough comprehension of the relationship between queries and documents. Experimental results demonstrate that DOGR achieves state-of-the-art performance compared to existing generative retrieval methods on two public benchmark datasets. Further experiments show that our framework is generally effective for common identifier construction techniques.

DOGR: Leveraging Document-Oriented Contrastive Learning in Generative Retrieval
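
The abstract above refers to contrastive learning objectives with negative sampling but does not state them. As a generic illustration, the sketch below implements the widely used InfoNCE loss for one query, one positive document, and a list of negative documents. The function names, the dot-product similarity, and the temperature value are assumptions for this example and should not be read as DOGR's actual objective.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot-product similarity between two embedding vectors."""
    return sum(x * y for x, y in zip(a, b))


def info_nce(
    query: Sequence[float],
    positive: Sequence[float],
    negatives: List[Sequence[float]],
    temperature: float = 0.1,
) -> float:
    """InfoNCE loss: negative log-probability of the positive document."""
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, n) / temperature for n in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]
```

The loss is small when the query is much closer to the positive document than to any negative, and grows when a negative scores higher than the positive.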

Incomplete multi-view clustering is a challenging task that partitions multi-view data with missing views. Previous methods directly recover missing instances or fill in incomplete similarity matrices. However, most of them suffer from the following limitations: (1) multi-view fusion and completion are conducted before spectral analysis, resulting in sub-optimal clustering performance; (2) the complex optimization process imposes a heavy computational burden, limiting their scalability to large-scale problems. In this paper, we propose a highly efficient rotation-invariant spectral embedding (RISE) method for scalable incomplete multi-view clustering. Specifically, RISE learns incomplete view-specific embeddings from bipartite graphs with incomplete structural information. Meanwhile, the complete consensus representation with a second-order rotation-invariant property is recovered from the incomplete view-specific embeddings. Moreover, we design a fast alternating optimization algorithm with linear complexity and promising convergence to solve the proposed formulation. Extensive experiments on multiple datasets demonstrate the effectiveness, scalability and efficiency of RISE compared to state-of-the-art methods.

Highly Efficient Rotation-Invariant Spectral Embedding for Scalable Incomplete Multi-View Clustering
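
One way to read the "second-order rotation-invariant property" in the abstract above is through the Gram matrix of an embedding: spectral embeddings are only determined up to a rotation, but the product of an embedding with its own transpose does not change when the embedding is rotated. The sketch below checks this numerically for a small example. Interpreting "second-order" as the Gram matrix is an assumption of this example, and the helper names are hypothetical. It does not reproduce the RISE formulation.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two nested-list matrices."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(a: Matrix) -> Matrix:
    return [list(row) for row in zip(*a)]


def gram(z: Matrix) -> Matrix:
    """Second-order representation Z Z^T of an embedding Z (n x d)."""
    return matmul(z, transpose(z))


def rotation(theta: float) -> Matrix:
    """2 x 2 rotation matrix, an orthogonal matrix R with R R^T = I."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


# Three samples embedded in two dimensions.
Z = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]]
Z_rotated = matmul(Z, rotation(0.7))

# (Z R)(Z R)^T = Z R R^T Z^T = Z Z^T, so the two Gram matrices agree.
G, G_rotated = gram(Z), gram(Z_rotated)
```

Comparing `G` and `G_rotated` entry by entry shows agreement up to floating-point error, even though `Z` and `Z_rotated` differ.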

Point-based interactive colorization techniques allow users to effortlessly colorize grayscale images using user-provided color hints. However, point-based methods often face challenges when different colors are given to semantically similar areas, leading to color intermingling and unsatisfactory results—an issue we refer to as color collapse. The fundamental cause of the color collapse is the inadequacy of points for defining the boundaries for each color. To mitigate color collapse, we introduce a lasso tool that can control the scope of each color hint. Additionally, we design a framework that leverages the user-provided lassos to localize the attention masks. The experimental results show that using a single lasso is as effective as applying 4.18 individual color hints and can achieve the desired outcomes in 30% less time than using points alone.

Enabling Region-Specific Control via Lassos in Point-Based Colorization
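
The abstract above describes using a user-drawn lasso to limit the scope of a color hint. A simple way to picture this is to rasterize the lasso into a binary mask over the image grid, which could then restrict where a hint is allowed to have an effect. The sketch below uses the standard ray-casting point-in-polygon test. The function names and the use of pixel centers are assumptions for illustration, and how the paper applies the mask inside its attention layers is not specified in the abstract.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def point_in_polygon(x: float, y: float, polygon: Sequence[Point]) -> bool:
    """Ray casting: count edge crossings of a ray going right from (x, y)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can cross.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def lasso_mask(width: int, height: int, polygon: Sequence[Point]) -> List[List[int]]:
    """Binary mask with 1 where the pixel center lies inside the lasso."""
    return [
        [1 if point_in_polygon(c + 0.5, r + 0.5, polygon) else 0 for c in range(width)]
        for r in range(height)
    ]
```

For example, `lasso_mask(4, 4, [(1, 1), (3, 1), (3, 3), (1, 3)])` marks the central 2 x 2 block of a 4 x 4 grid.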

We propose a transformation diffusion model for point cloud registration to balance precision and efficiency. Our method formulates point cloud registration as a denoising diffusion process from a noisy transformation to the object transformation, which is represented by a quaternion and a translation. Specifically, in the training stage, the object transformation diffuses from the ground-truth transformation to a random distribution, and the model learns to reverse this noising process. In the sampling stage, the model progressively refines a randomly generated transformation into the optimal transformation. We derive the variational bound in closed form for training and provide an instantiation of the model. Our diffusion model maps the transformation into latent space and splits it into two components (rotation and translation), based on the fact that they belong to different solution spaces. In addition, our work provides the following crucial findings: (i) Point cloud registration, a representative discriminative task, can be solved in a generative way and mapped into latent space to obtain a new unified probabilistic formulation. (ii) Our model, the Transformation Diffusion Model (TDM), can be a plug-and-play agent for point cloud registration, making our method applicable to different deep registration networks. Experimental results on synthetic and real-world datasets demonstrate that, in both correspondence-free and correspondence-based scenarios, TDM achieves performance improvements exceeding 60% together with higher efficiency.

Where Precision Meets Efficiency: Transformation Diffusion Model for Point Cloud Registration
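
To illustrate the representation described in the abstract above, the following sketch encodes a rigid transformation as a unit quaternion plus a translation, applies it to a point, and perturbs it with a DDPM-style forward noising step. The noising formula, the renormalization of the quaternion after adding noise, and all function names are assumptions for this example. The paper derives its own variational bound and latent-space formulation, which are not reproduced here.

```python
import math
import random
from typing import Tuple

Vec3 = Tuple[float, float, float]
Quat = Tuple[float, float, float, float]  # (w, x, y, z), assumed unit norm


def cross(a: Vec3, b: Vec3) -> Vec3:
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def apply_transform(q: Quat, t: Vec3, v: Vec3) -> Vec3:
    """Rotate v by unit quaternion q, then translate by t."""
    w, u = q[0], (q[1], q[2], q[3])
    uv = cross(u, v)
    uuv = cross(u, uv)
    # Standard identity: v' = v + 2w (u x v) + 2 u x (u x v).
    return tuple(v[i] + 2 * w * uv[i] + 2 * uuv[i] + t[i] for i in range(3))


def noise_transform(
    q0: Quat, t0: Vec3, alpha_bar: float, rng: random.Random
) -> Tuple[Quat, Vec3]:
    """Assumed forward step: x_t = sqrt(a) x_0 + sqrt(1 - a) eps."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    q = [a * c + b * rng.gauss(0.0, 1.0) for c in q0]
    norm = math.sqrt(sum(c * c for c in q))
    q_unit = tuple(c / norm for c in q)  # project back to a valid rotation
    t = tuple(a * c + b * rng.gauss(0.0, 1.0) for c in t0)
    return q_unit, t
```

With `alpha_bar` close to 1 the noisy transformation stays near the ground truth, and with `alpha_bar` close to 0 it approaches a random rotation and translation, which matches the training-stage description only in spirit.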

Personalized text-to-image generation methods, which generate customized images based on reference images, have garnered wide research interest. Recent methods propose a finetuning-free approach with a decoupled cross-attention mechanism to generate personalized images without test-time finetuning. However, when multiple reference images are provided, the current decoupled cross-attention mechanism encounters the object confusion problem and fails to map each reference image to its corresponding object, thereby seriously limiting its scope of application. To address the object confusion problem, in this work we investigate the relevance of different positions of the latent image features to the target object in the diffusion model, and accordingly propose a weighted-merge method to merge multiple reference image features into the corresponding objects. Next, we integrate this weighted-merge method into existing pre-trained models and continue to train the model on a multi-object dataset constructed from the open-source SA-1B dataset. To mitigate object confusion and reduce training costs, we propose an object quality score to estimate image quality for the selection of high-quality training samples. Furthermore, our weighted-merge training framework can be employed for single-object generation when a single object has multiple reference images. The experiments verify that our method achieves superior performance to the state of the art on the Concept101 and DreamBooth datasets for multi-object personalized image generation, and remarkably improves performance on single-object personalized image generation. Our code will be made available soon.

Resolving Multi-Condition Confusion for Finetuning-Free Personalized Image Generation

Trajectory generation has garnered significant attention from researchers in the field of spatio-temporal analysis, as it can generate substantial synthesized human mobility trajectories that enhance user privacy and alleviate data scarcity. However, existing trajectory generation methods often focus on improving trajectory generation quality from a singular perspective, lacking a comprehensive semantic understanding across various scales. Consequently, we are inspired to develop a HOlistic SEmantic Representation (HOSER) framework for navigational trajectory generation. Given an origin and destination (OD) pair and the starting time point of a latent trajectory, we first propose a Road Network Encoder to expand the receptive field of road- and zone-level semantics. Second, we design a Multi-Granularity Trajectory Encoder to integrate the spatio-temporal semantics of the generated trajectory at both the point and trajectory levels. Finally, we employ a Destination-Oriented Navigator to seamlessly integrate destination-oriented guidance. Extensive experiments on three real-world datasets demonstrate that HOSER outperforms state-of-the-art baselines by a significant margin. The excellent performance demonstrated with minimal training data further verifies the effectiveness of our holistic semantic representation. Our code will be made publicly available.

Holistic Semantic Representation for Navigational Trajectory Generation

Large language models (LLMs) have achieved significant progress in mathematical reasoning, especially in elementary math. However, they still struggle with complex questions at the high-school or college level, which place a more advanced requirement on mastering the relevant mathematical theorems. For humans, selecting the appropriate theorems for a given question is a crucial factor affecting the quality of the final solution, yet this has been neglected by previous research on LLM reasoning. In this paper, we propose a novel approach to enhance the LLM's capability of applying mathematical theorems to specific problems, which we refer to as Theorem Rationale (TR). To this end, a new dataset of problem-theorem-solution triples is established for transferring the principles of TR. Furthermore, we develop an evolving strategy to boost hierarchical instructions oriented on the theorems, to alleviate the difficulty of acquiring curated data and to facilitate the digestion of theorem application from various perspectives. Evaluations on a wide range of public datasets show that the model fine-tuned with our dataset achieves consistent improvements at varying mathematical levels compared to the backbone. Further ablation studies illustrate the effectiveness of our proposed evolutionary strategies in enhancing the model's capability of math problem-solving. Overall, extensive experiments reveal the potential of our proposed method, which highlights the significance of aligning problems with concrete theorems for LLMs to alleviate hallucination and improve the models' mathematical reasoning capabilities.

Learning Theorem Rationale for Improving the Mathematical Reasoning Capability of Large Language Models

Lyric-to-melody generation aims to automatically create melodies based on given lyrics, requiring the capture of complex and subtle correlations between them. However, previous works usually suffer from two main challenges: 1) lyric-melody alignment modeling, which is often simplified to one-syllable/word-to-one-note alignment, while other approaches suffer from low alignment accuracy; 2) lyric-melody harmony modeling, which usually relies heavily on intermediates or strict rules, limiting the model's capabilities and generative diversity. In this paper, we propose SongGLM, a lyric-to-melody generation system that leverages 2D alignment encoding and multi-task pre-training based on the General Language Model (GLM) to guarantee the alignment and harmony between lyrics and melodies. Specifically, 1) we introduce a unified symbolic song representation for lyrics and melodies with word-level and phrase-level (2D) alignment encoding to capture the lyric-melody alignment; 2) we design a multi-task pre-training framework with hierarchical blank infilling objectives (n-gram, phrase, and long span), and incorporate lyric-melody relationships into the extraction of harmonized n-grams to ensure the lyric-melody harmony. We also construct a large-scale lyric-melody paired dataset comprising over 200,000 English song pieces for pre-training and fine-tuning. The objective and subjective results indicate that SongGLM can generate melodies from lyrics with significant improvements in both alignment and harmony, outperforming all previous baseline methods.
