Vision Transformers (ViTs) and their variants have achieved remarkable success on a range of complex tasks, but at the cost of heavy computation and significant inference latency. Token pruning reduces this burden by removing redundant or unimportant tokens, which lowers resource consumption and inference time. Existing retraining-free token pruning algorithms accelerate inference well, but their pruning strategies are often limited to locally optimal mask configurations. They do not fully explore the interdependencies among intra-layer mask variables from a global perspective, which constrains the performance of the pruned model. To address these limitations, we propose V-Pruner, a fast and globally informed token pruning framework for Vision Transformers. The framework provides a fast, streamlined, end-to-end pruning workflow that runs without user intervention. It consists of three stages: Token Mask Search, Token Mask Rearrangement, and Token Mask Tuning. In the Token Mask Search stage, we use Fisher information to identify key and redundant tokens. In the Token Mask Rearrangement stage, we introduce a reinforcement learning algorithm that explores the global interactions among intra-layer mask variables, overcoming the limitation of traditional methods that consider only local information. Finally, in the Token Mask Tuning stage, we adjust the mask variables to recover the accuracy lost during pruning. We evaluate the approach on ViT-L, DeiT-B, DeiT-S, and DeiT-T; the experimental results show that V-Pruner achieves a better balance of accuracy, speed, and FLOPs than existing pruning methods.
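The Token Mask Search stage can be illustrated with a minimal sketch. It assumes that token importance is approximated by the diagonal of the empirical Fisher information, i.e. the mean squared gradient of the loss with respect to each token's mask variable, which is a common choice in retraining-free pruning. The abstract does not confirm that V-Pruner uses exactly this estimator. The function names, the top-k selection rule, and the gradient values below are hypothetical and serve only as illustration.

```python
from typing import List


def fisher_token_scores(grads: List[List[float]]) -> List[float]:
    """Diagonal empirical Fisher: mean squared gradient per token mask variable.

    grads[n][i] is the (assumed, precomputed) gradient of the loss on
    sample n with respect to the mask variable of token i.
    """
    num_samples = len(grads)
    num_tokens = len(grads[0])
    return [
        sum(g[i] ** 2 for g in grads) / num_samples
        for i in range(num_tokens)
    ]


def top_k_mask(scores: List[float], keep: int) -> List[int]:
    """Binary mask that keeps the `keep` highest-scoring tokens (1 = keep, 0 = prune)."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = set(ranked[:keep])
    return [1 if i in kept else 0 for i in range(len(scores))]


# Toy gradients for 3 samples and 4 tokens (made-up numbers for illustration).
grads = [
    [0.5, 0.0, -0.1, 0.2],
    [0.5, 0.1, 0.1, -0.2],
    [-0.5, 0.0, 0.1, 0.2],
]
scores = fisher_token_scores(grads)
mask = top_k_mask(scores, keep=2)
print(mask)  # [1, 0, 0, 1]
```

A per-token top-k rule like this scores each mask variable independently, which is the kind of local decision the Token Mask Rearrangement stage is described as improving on.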