Recent advances in transformer-based text-to-motion generation have significantly improved motion quality. However, achieving both real-time performance and long-horizon scalability remains an open challenge. In this paper, we present MOGO (Motion Generation with One-pass), a novel autoregressive framework for efficient and scalable 3D human motion generation. MOGO consists of two key components. First, we introduce MoSA-VQ, a motion scale-adaptive residual vector quantization module that hierarchically discretizes motion sequences through learnable scaling parameters, which dynamically regulate the information flow at each layer to produce compact yet expressive multi-level representations. Second, to fully exploit these high-quality motion representations, we design the RQHC-Transformer, a residual quantized hierarchical causal transformer that structurally aligns with the multi-level latent hierarchy produced by MoSA-VQ. Each level is decoded by a dedicated transformer block, enabling efficient multi-scale generation in a single forward pass. Compared to diffusion-based and LLM-based approaches, MOGO achieves lower inference latency while maintaining high motion quality. Notably, the hierarchical latent modeling afforded by the synergy of MoSA-VQ and the RQHC-Transformer equips MOGO for seamless, coherent infinite-length generation. By iteratively extending motion from any given frame and allowing control signals to be updated at arbitrary points, the model produces stable transitions and responds adaptively to new conditions, enabling real-time, controllable long-horizon synthesis with strong temporal consistency. Extensive experiments on HumanML3D and KIT-ML validate the quality and efficiency of our approach, while evaluation on the unseen CMP dataset demonstrates strong zero-shot generalization.
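The core idea behind MoSA-VQ, quantizing a motion latent hierarchically so that each level encodes the residual left over by the levels above it, can be illustrated with a minimal sketch. This is an assumption-laden toy, not the paper's implementation: the function names, the three fixed levels, and the simple per-level scalar scaling (a stand-in for the learnable scaling parameters described in the abstract) are all hypothetical.

```python
import numpy as np

def quantize_level(residual, codebook):
    """Nearest-neighbor lookup: map each latent vector to its closest code."""
    # residual: (T, D), codebook: (K, D) -> pairwise squared distances (T, K)
    d = ((residual[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d.argmin(axis=1)          # token index per frame
    return codebook[idx], idx

def residual_vq(x, codebooks, scales):
    """Hierarchical residual quantization with per-level scaling.

    Each level quantizes the residual the previous levels failed to
    capture; `scales` mimics the learnable parameters that regulate
    information flow per level (an assumption for this sketch).
    """
    residual = x
    recon = np.zeros_like(x)
    tokens = []
    for cb, s in zip(codebooks, scales):
        q, idx = quantize_level(residual / s, cb)
        q = q * s                   # undo scaling after lookup
        recon += q                  # coarse-to-fine reconstruction
        residual = residual - q     # pass the leftover to the next level
        tokens.append(idx)
    return recon, tokens

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))                              # 16 frames, 8-dim latents
codebooks = [rng.normal(size=(32, 8)) for _ in range(3)]  # 3 residual levels
scales = [1.0, 0.5, 0.25]                                 # learned in the real model
recon, tokens = residual_vq(x, codebooks, scales)
```

The multi-level token streams produced this way are what a hierarchy-aligned decoder like the RQHC-Transformer would consume, with one transformer block per level, so all levels can be emitted in a single forward pass rather than by re-running the model per refinement step.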
