Vision-Language-Action (VLA) models enable robotic systems to perform embodied tasks, but deployment is hindered by the high computational cost of their dense Large Language Model (LLM) backbones; existing early-exit sparsification methods often overlook the critical semantic role that the final layers play in downstream tasks. Motivated by the recent Shallow Brain Hypothesis (SBH) from neuroscience and by Mixture-of-Experts approaches to model sparsification, we treat each LLM layer as an expert and propose the Mixture-of-LayEr Vision-Language-Action model (MoLe-VLA, or simply MoLe), an architecture for dynamic LLM layer activation. Specifically, we introduce a Spatial-Temporal Aware Router (STAR) that selectively activates only a subset of layers based on the robot's current state, mimicking the brain's distinct signal pathways specialized for cognition and causal reasoning. To compensate for the cognitive ability of the LLM lost during layer skipping, we further devise Cognitive self-Knowledge Distillation (CogKD), which leverages cognition features to enhance the understanding of task demands and to generate task-relevant action sequences. Extensive experiments in RLBench simulations and real-world environments demonstrate the superiority of MoLe-VLA in both efficiency and performance, improving the mean success rate by 9.7% across ten simulation tasks while accelerating inference by 36.8% compared with OpenVLA.
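The core mechanism described above, a router that treats each LLM layer as an expert and activates only some of them per input, can be illustrated with a minimal PyTorch sketch. Everything here is an assumption for illustration: the class names (`LayerRouter`, `MoLeBackbone`), the top-k gating, and the tiny transformer layers are not the paper's actual STAR implementation, which also conditions on spatial-temporal robot state.

```python
import torch
import torch.nn as nn

class LayerRouter(nn.Module):
    """Hypothetical router: scores every backbone layer from a pooled
    state vector and keeps only the top-k (a stand-in for STAR)."""
    def __init__(self, hidden_dim: int, num_layers: int, k: int):
        super().__init__()
        self.score = nn.Linear(hidden_dim, num_layers)  # one logit per layer
        self.k = k

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # state: (batch, hidden_dim) pooled features of the current input
        logits = self.score(state)                      # (batch, num_layers)
        topk = logits.topk(self.k, dim=-1).indices      # layers to activate
        mask = torch.zeros_like(logits).scatter_(-1, topk, 1.0)
        return mask                                     # 1 = run layer, 0 = skip

class MoLeBackbone(nn.Module):
    """Toy backbone where skipped layers act as identity pass-throughs.
    For clarity this sketch masks layer outputs; a real implementation
    would skip the computation entirely to realize the speedup."""
    def __init__(self, hidden_dim: int = 64, num_layers: int = 8, k: int = 4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(hidden_dim, nhead=4, batch_first=True)
            for _ in range(num_layers)
        )
        self.router = LayerRouter(hidden_dim, num_layers, k)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        mask = self.router(tokens.mean(dim=1))  # route on pooled token state
        for i, layer in enumerate(self.layers):
            keep = mask[:, i].view(-1, 1, 1)    # per-sample gate for layer i
            tokens = keep * layer(tokens) + (1 - keep) * tokens
        return tokens

x = torch.randn(2, 10, 64)        # (batch, sequence, hidden)
model = MoLeBackbone()
out = model(x)                    # same shape as input; half the layers gated off
```

In this sketch the router's hard top-k mask makes the activated-layer budget explicit (here 4 of 8 layers per sample), which is the source of the inference speedup the abstract reports; the distillation component (CogKD) that recovers the skipped layers' cognition is orthogonal and not shown.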