
Music-to-dance generation aims to synthesize human dance motion conditioned on music input. Despite recent progress, significant challenges remain because of the semantic gap between music and dance motion: music offers only abstract cues and lacks explicit descriptions of physical movement. The challenge is further amplified by the scarcity of paired music and dance data, which restricts a model’s ability to learn diverse dance patterns. These limitations highlight the need for additional semantic guidance beyond the musical signal. In this paper, we propose DanceChat, a novel framework that uses a Large Language Model (LLM) as a choreographer to generate high-level textual instructions from structured music descriptions. These instructions serve as semantic guidance that bridges the gap between music and motion. DanceChat integrates music, beat, and text features into a unified representation and employs a diffusion-based motion generator trained with a proposed multi-modal alignment loss. Extensive experiments on the AIST++ dataset show that DanceChat outperforms state-of-the-art methods both qualitatively and quantitatively.
