AAAI 2026

January 22, 2026

Singapore, Singapore

Recent advances in transformer-based text-to-motion generation have significantly improved motion quality. However, achieving both real-time performance and long-horizon scalability remains an open challenge. In this paper, we present MOGO (Motion Generation with One-pass), a novel autoregressive framework for efficient and scalable 3D human motion generation. MOGO consists of two key components. First, we introduce MoSA-VQ, a motion scale-adaptive residual vector quantization module that hierarchically discretizes motion sequences through learnable scaling parameters, which dynamically regulate the information flow at each layer to produce compact yet expressive multi-level representations. Second, to fully exploit these high-quality motion representations, we design the RQHC-Transformer, a residual quantized hierarchical causal transformer that structurally aligns with the multi-level latent hierarchy produced by MoSA-VQ. Each level is decoded by a dedicated transformer block, enabling efficient multi-scale generation in a single forward pass. Compared to diffusion-based and LLM-based approaches, MOGO achieves lower inference latency while maintaining high motion quality. Notably, the synergy of MoSA-VQ and RQHC-Transformer in our hierarchical latent modeling equips MOGO for seamless and coherent infinite-length generation. By iteratively extending motion from any given frame and allowing control signals to be updated at arbitrary points, the model produces stable transitions and responds adaptively to new conditions, enabling real-time, controllable long-horizon synthesis with strong temporal consistency. Extensive experiments on HumanML3D and KIT-ML validate the quality and efficiency of our approach, while evaluation on the unseen CMP dataset demonstrates strong zero-shot generalization.
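As a concrete illustration of the quantization scheme described in the abstract, the snippet below sketches a minimal residual vector quantizer with a learnable scale per level, in the spirit of MoSA-VQ. The paper does not specify its layer counts, codebook sizes, loss weights, or the exact form of the scaling mechanism, so every name and hyperparameter here (ResidualVQWithScale, num_levels=6, the exp-parameterized scales, the 0.25 commitment weight) is an illustrative assumption, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualVQWithScale(nn.Module):
    """Residual VQ with a learnable scale per level (illustrative sketch)."""

    def __init__(self, num_levels=6, codebook_size=512, dim=256):
        super().__init__()
        # One codebook per residual level.
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, dim) for _ in range(num_levels)
        )
        # Learnable per-level scales, kept positive via log-space
        # parameterization; an assumed form of the abstract's
        # "learnable scaling parameters".
        self.log_scales = nn.Parameter(torch.zeros(num_levels))

    def forward(self, x):
        # x: (batch, frames, dim) continuous motion features from an encoder.
        residual = x
        quantized = torch.zeros_like(x)
        codes = []
        vq_loss = x.new_zeros(())
        for level, codebook in enumerate(self.codebooks):
            scale = self.log_scales[level].exp()
            # Nearest codeword for each frame of the scaled residual.
            dists = ((residual * scale).unsqueeze(-2)
                     - codebook.weight).pow(2).sum(-1)   # (batch, frames, K)
            idx = dists.argmin(dim=-1)                   # (batch, frames)
            q = codebook(idx) / scale
            # Codebook + commitment terms train codebooks and scales
            # (0.25 is a conventional, assumed commitment weight).
            vq_loss = vq_loss + F.mse_loss(q, residual.detach()) \
                              + 0.25 * F.mse_loss(q.detach(), residual)
            quantized = quantized + q
            residual = residual - q.detach()
            codes.append(idx)
        # Straight-through estimator: decoder gradients reach the encoder.
        quantized = x + (quantized - x).detach()
        return quantized, torch.stack(codes), vq_loss

A hypothetical usage: quantized, codes, vq_loss = ResidualVQWithScale()(torch.randn(2, 64, 256)) returns the reconstructed latent for the decoder, one code index per frame at each of the six levels, and the auxiliary training loss. The per-level index tensors in codes are exactly the kind of multi-level token hierarchy that a level-aligned causal transformer such as the described RQHC-Transformer could then predict, one dedicated block per level, in a single forward pass.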

