3D human motion generation has seen a substantial rise in interest in recent years, and while considerable progress has been made performance-wise, many state-of-the-art approaches still struggle to generate complex, detailed motions unseen in the original data. This is commonly attributed to the scarcity of available motion datasets and the prohibitive cost of producing more training examples. Motivated by these challenges, we introduce CoMA, a multimodal framework for complex human motion generation, editing, and comprehension. CoMA employs multiple independent agents, powered by large language and vision models, together with a mask-transformer-based motion generator that uses body-part-specific encoders and codebooks for fine-grained, detailed generation. This design enables generating short and long motion sequences from detailed instructions, editing generated motions with user-provided text instructions, and self-correcting output sequences to further improve motion quality. We evaluate our method on the two most popular benchmark human motion datasets, using novel splits that separate them into basic and complex actions, and compare CoMA's performance with state-of-the-art methods.
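To make the "body-part-specific encoders and codebooks" idea concrete, the sketch below shows one plausible way to tokenize motion per body part so a mask transformer can operate on fine-grained discrete tokens. This is a minimal illustration under assumed names, dimensions, and part groupings; it is not CoMA's actual implementation.

```python
# Hypothetical sketch of body-part-specific encoders and codebooks.
# All class names, feature splits, and sizes here are assumptions for
# illustration, not the paper's architecture.
import torch
import torch.nn as nn


class PartVQEncoder(nn.Module):
    """One encoder plus one codebook for a single body part."""

    def __init__(self, in_dim: int, latent_dim: int = 64, codebook_size: int = 512):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim)
        )
        self.codebook = nn.Embedding(codebook_size, latent_dim)

    def forward(self, x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        z = self.encoder(x)                           # (T, latent_dim)
        # Nearest-neighbor lookup: map each latent to its closest code.
        dists = torch.cdist(z, self.codebook.weight)  # (T, codebook_size)
        indices = dists.argmin(dim=-1)                # discrete motion tokens
        return self.codebook(indices), indices


class BodyPartTokenizer(nn.Module):
    """Splits a per-frame pose feature vector into body-part groups and
    tokenizes each group with its own encoder and codebook, yielding one
    discrete token stream per part for a downstream mask transformer."""

    def __init__(self, part_dims: dict[str, int]):
        super().__init__()
        self.parts = nn.ModuleDict(
            {name: PartVQEncoder(dim) for name, dim in part_dims.items()}
        )
        self.part_dims = part_dims

    def forward(self, motion: torch.Tensor) -> dict[str, torch.Tensor]:
        tokens, offset = {}, 0
        for name, dim in self.part_dims.items():
            chunk = motion[:, offset:offset + dim]    # this part's features
            _, tokens[name] = self.parts[name](chunk)
            offset += dim
        return tokens


# Usage: 60 frames of pose features with an assumed per-part split.
tokenizer = BodyPartTokenizer(
    {"torso": 24, "left_arm": 27, "right_arm": 27, "left_leg": 27, "right_leg": 27}
)
motion = torch.randn(60, 24 + 27 * 4)
part_tokens = tokenizer(motion)  # one token sequence per body part
print({name: t.shape for name, t in part_tokens.items()})
```

The design intuition, as the abstract suggests, is that separate codebooks per body part let the generator edit or refine one limb's motion without disturbing the others, which is what makes fine-grained editing and self-correction tractable.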