Generating realistic, coordinated 3D human motion for multiple individuals within complex environments remains a significant challenge. Existing text-to-motion methods are often "blind" to the physical scene, leading to implausible motions, while scene-conditioned human-scene interaction (HSI) approaches demand cumbersome full 3D scene data and largely neglect multi-person dynamics. To address these limitations, we introduce the VL2Motion paradigm and its embodiment, MMG-VL, a hierarchical framework that generates coordinated multi-person motions from the most accessible inputs: a single 2D image and natural language. MMG-VL first employs a Scene-Aware Intent Planner (SAIP) to interpret the visual context and decompose the user's command into a set of spatially grounded, multi-person action blueprints. A Coordinated Motion Synthesizer (CMS) then translates these blueprints into high-fidelity 3D motion sequences. The synergy between the two stages is driven by two novel loss functions: a Spatial-Semantic Grounding Loss that keeps the planner's output grounded in visual reality, and a Coordinated Environmental Realism Loss that enforces physical constraints and coherent group dynamics during synthesis. To facilitate this research, we introduce HumanVL, the first large-scale dataset featuring multi-person activities in multi-room scenes, providing aligned images, text, blueprints, 3D motions, and scene geometry. Extensive experiments demonstrate that MMG-VL significantly outperforms existing methods in generating spatially coherent, physically realistic, and coordinated multi-person motions, paving the way for more scalable and intuitive creation of dynamic virtual worlds.
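
To make the two-stage design concrete, the following minimal Python sketch shows one way the planner-to-synthesizer pipeline could be wired together. It is an illustration only: the class and method names mirror the abstract's terminology (SAIP, CMS), but the blueprint fields, method signatures, and stub outputs are assumptions, not the paper's released implementation.

from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical blueprint record: one spatially grounded action per person.
# The exact field layout is assumed; the abstract only states that blueprints
# are "spatially grounded, multi-person action" plans.
@dataclass
class ActionBlueprint:
    person_id: int
    action_text: str                          # e.g. "sit on the sofa"
    target_xyz: Tuple[float, float, float]    # goal location in scene space

class SceneAwareIntentPlanner:
    """Stage 1 (SAIP): interpret the image and decompose the user command."""
    def plan(self, image, command: str) -> List[ActionBlueprint]:
        # A real planner would run a vision-language model here; this stub
        # returns a fixed two-person plan so the sketch runs end to end.
        return [
            ActionBlueprint(0, "walk to the table", (1.0, 0.0, 2.0)),
            ActionBlueprint(1, "sit on the sofa", (-0.5, 0.0, 1.2)),
        ]

class CoordinatedMotionSynthesizer:
    """Stage 2 (CMS): translate blueprints into 3D motion sequences."""
    def synthesize(self, blueprints: List[ActionBlueprint]) -> dict:
        # Stub: one (empty) motion track per person; a real synthesizer
        # would emit per-frame joint rotations or body-mesh parameters.
        return {bp.person_id: [] for bp in blueprints}

def generate_multi_person_motion(image, command: str) -> dict:
    """End-to-end pipeline: single 2D image + text -> coordinated motions."""
    blueprints = SceneAwareIntentPlanner().plan(image, command)
    return CoordinatedMotionSynthesizer().synthesize(blueprints)

The key design point this sketch reflects is the hierarchy itself: spatial reasoning about the scene is resolved once in the planning stage, so the synthesis stage only has to turn already-grounded blueprints into motion.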