Recent end-to-end robotic manipulation research increasingly adopts architectures inspired by large language models (LLMs) to enable robust manipulation. However, a critical challenge arises from severe distribution shifts in robotic action data, caused primarily by substantial numerical variations in action commands across diverse robotic platforms and tasks, which hinder the effective transfer of pretrained knowledge. To address this limitation, we use the language-modal action representation "motion" as the pretraining target. Unlike conventional discretized action representations, which are sensitive to numerical scale, the motion representation disregards scale effects and emphasizes directionality instead. This abstraction reduces the impact of distribution shifts and provides a more generalizable representation for pretraining. Moreover, the motion representation narrows the feature distance between action tokens and standard vocabulary tokens, mitigating the modality gap. Multi-task experiments on two benchmarks demonstrate that this method significantly improves generalization and transferability in robotic manipulation tasks.
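To make the idea of a scale-free, direction-only action representation concrete, the following is a minimal illustrative sketch. The abstract does not specify the actual motion vocabulary or tokenization used in the paper; the axis conventions, threshold, and token strings below are assumptions made purely for demonstration.

```python
import numpy as np

def delta_action_to_motion_tokens(delta_xyz, threshold=1e-3):
    """Map a numeric end-effector displacement to scale-free direction tokens.

    Instead of discretizing command magnitudes (which vary widely across
    robots and tasks), only the sign of each axis is kept, producing
    language-like tokens such as "forward" or "up".
    Hypothetical example, not the paper's exact scheme.
    """
    axis_tokens = [
        ("backward", "forward"),  # x axis (assumed convention)
        ("right", "left"),        # y axis
        ("down", "up"),           # z axis
    ]
    tokens = []
    for value, (neg_tok, pos_tok) in zip(delta_xyz, axis_tokens):
        if abs(value) < threshold:  # treat negligible motion as no motion
            continue
        tokens.append(pos_tok if value > 0 else neg_tok)
    return tokens or ["stay"]

# Two robots issuing numerically very different commands yield the same tokens,
# illustrating why a direction-based representation is robust to scale shifts.
print(delta_action_to_motion_tokens(np.array([0.002, 0.0, -0.05])))  # ['forward', 'down']
print(delta_action_to_motion_tokens(np.array([0.2, 0.0, -5.0])))     # ['forward', 'down']
```

Because such direction words already exist in an LLM's vocabulary, they sit closer to standard language tokens in feature space than numerically binned action codes, which is the modality-gap argument made above.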
