The quadratic complexity of Multimodal Large Language Models (MLLMs) with respect to context length poses significant computational and memory challenges, hindering their real-world deployment.
In this paper, we devise a ``\textbf{\textit{filter-correlate-compress}}'' framework that accelerates MLLMs by systematically reducing multimodal context length during prefilling. The framework first implements \textbf{\textit{FiCoCo-V}}, a training-free method operating within the vision encoder.
It employs a redundancy-based token discard mechanism that uses a novel integrated metric to accurately \textit{filter} out redundant visual tokens.
To mitigate information loss, the framework introduces a correlation-based information recycling mechanism: preserved tokens selectively recycle information from \textit{correlate}d discarded tokens via a self-preserving \textit{compress}ion, preventing the dilution of their own core content. The framework's \textbf{\textit{FiCoCo-L}} variant further leverages task-aware textual priors to perform token reduction directly within the LLM decoder. Extensive experiments demonstrate that the \textit{FiCoCo} series effectively accelerates a range of MLLMs, achieving up to \textbf{14.7×} FLOPs reduction with \textbf{93.6\%} performance retention. Our methods consistently outperform state-of-the-art training-free approaches, demonstrating effectiveness and generalizability across model architectures, sizes, and tasks without requiring retraining. \textit{Code is available in the supplementary materials.}
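The filter-correlate-compress pipeline can be illustrated with a minimal sketch. This is an assumption-laden toy implementation, not the paper's exact formulation: the redundancy metric here is mean cosine similarity to other tokens, correlation is nearest-kept-token matching, and the self-preservation weight `alpha` is an illustrative stand-in for the paper's integrated metric and recycling weights.

```python
import numpy as np

def ficoco_style_reduce(tokens, keep_ratio=0.5, alpha=0.9):
    """Toy filter-correlate-compress token reduction (illustrative only).

    tokens: (N, D) array of visual token features.
    keep_ratio: fraction of tokens to preserve.
    alpha: assumed self-preservation weight for compression.
    """
    n, _ = tokens.shape
    # Pairwise cosine similarity between tokens.
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, 0.0)

    # FILTER: treat high mean similarity to all other tokens as redundancy,
    # and keep the least redundant tokens (a stand-in for the paper's metric).
    redundancy = sim.mean(axis=1)
    n_keep = max(1, int(n * keep_ratio))
    order = np.argsort(redundancy)
    keep_idx, drop_idx = order[:n_keep], order[n_keep:]

    # CORRELATE: assign each discarded token to its most similar kept token.
    assign = keep_idx[np.argmax(sim[np.ix_(drop_idx, keep_idx)], axis=1)]

    # COMPRESS: blend recycled information into each kept token, with the
    # kept token's own content dominating (self-preserving compression).
    out = tokens[keep_idx].copy()
    for pos, ki in enumerate(keep_idx):
        mates = drop_idx[assign == ki]
        if len(mates) > 0:
            out[pos] = alpha * out[pos] + (1 - alpha) * tokens[mates].mean(axis=0)
    return out, keep_idx
```

In practice such a step would run inside the vision encoder (FiCoCo-V) or the LLM decoder (FiCoCo-L, where text-conditioned scores would replace the purely visual redundancy metric); the sketch only conveys the three-stage structure.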
