Vision-Language-Action (VLA) models often struggle to generalize to real-world scenarios due to a mismatch between observation and action spaces: while training data comes from diverse camera perspectives, the models predict end-effector poses in the robot base coordinate system, leading to inconsistencies across viewpoints. To address this issue, we propose an Observation-Centric VLA (OC-VLA) framework, which grounds action predictions directly in the camera's observation space. By using the camera's extrinsic matrix to transform end-effector poses from the robot base frame to the camera frame, our approach unifies prediction targets across viewpoints. This simple, plug-and-play method ensures consistent alignment between perception and action, improves robustness to camera viewpoint variations, and can be integrated into existing VLA models without significant architectural changes. Extensive experiments on both simulated and real-world robotic manipulation tasks demonstrate that OC-VLA achieves better convergence, higher task success rates, and stronger generalization across camera viewpoints. The code will be publicly available.
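The core transformation described above can be sketched as follows. This is a minimal illustration of re-expressing an end-effector pose in the camera frame via the extrinsic matrix; the function names, the 4x4 homogeneous-matrix convention, and the assumption that the extrinsics map base-frame coordinates to camera-frame coordinates are ours, not details taken from the paper's implementation:

```python
import numpy as np

def pose_to_matrix(position, rotation):
    """Pack a 3-vector position and 3x3 rotation into a 4x4 homogeneous transform."""
    T = np.eye(4)
    T[:3, :3] = rotation
    T[:3, 3] = position
    return T

def base_to_camera(T_ee_base, T_cam_base):
    """Express an end-effector pose given in the robot base frame in the
    camera frame, where T_cam_base is the camera extrinsic matrix
    (assumed convention: x_cam = T_cam_base @ x_base)."""
    return T_cam_base @ T_ee_base

# Example: end-effector at (1, 2, 3) in the base frame, no rotation.
T_ee_base = pose_to_matrix([1.0, 2.0, 3.0], np.eye(3))

# Hypothetical extrinsics: camera rotated 90 degrees about z relative
# to the base, and offset by 1 m along z.
R_cam = np.array([[0.0, -1.0, 0.0],
                  [1.0,  0.0, 0.0],
                  [0.0,  0.0, 1.0]])
T_cam_base = pose_to_matrix([0.0, 0.0, 1.0], R_cam)

# The pose the model would be trained to predict under OC-VLA's
# observation-centric target: the same pose, in the camera frame.
T_ee_cam = base_to_camera(T_ee_base, T_cam_base)
print(T_ee_cam[:3, 3])  # end-effector position as seen from the camera
```

Because the target is now tied to the observing camera rather than a fixed base frame, the same visual scene maps to the same prediction regardless of where the camera is mounted, which is the consistency the abstract refers to.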
