Video-Language Models (VLMs) have demonstrated impressive multi-modal reasoning capabilities across diverse computer vision applications. However, these VLMs are task-specific and assume that both the video and language inputs are complete. In real-world deployments, VLMs may face deactivated sensors (e.g., cameras disabled for data privacy), yielding modality-incomplete data and creating an inconsistency between training and testing data. While naively feeding incomplete inputs can degrade generalization ability and even cause training to fail, the resulting risks to VLMs in terms of safety and trustworthiness have been largely neglected. To this end, we make the first attempt to propose a unified incomplete video-language model that processes incomplete multi-modal inputs. Specifically, given incomplete video-text pairs, we first design a multi-modal feature approximation module that constructs relational multi-modal graphs from available features with high cross-modal semantic similarity, approximating more reliable completion features for the missing modalities. We then propose a multi-modal knowledge distillation module to reduce over-reliance on the complete modality. Finally, we propose a multi-granularity multi-modal integration module that integrates semantically similar video-text pairs by mapping them more compactly into the common feature space. Extensive experimental results on several incomplete datasets demonstrate that our method can serve as a plug-and-play module for previous works, improving their performance on various multi-modal tasks.
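To illustrate the feature approximation idea described above, the following is a minimal sketch, not the authors' implementation: it approximates completion features for a missing modality by building a cross-modal similarity graph between incomplete samples and a memory bank of complete video-text pairs, then aggregating the neighbours' features of the missing modality. The function name `approximate_missing_features`, the memory-bank layout, and the top-k softmax weighting are assumptions made for illustration only.

```python
import torch
import torch.nn.functional as F

def approximate_missing_features(avail_feats, bank_avail, bank_missing,
                                 top_k=5, tau=0.1):
    """Sketch of similarity-graph-based completion for a missing modality.

    avail_feats:  (N, d) features of the available modality for the N
                  incomplete samples (e.g., text features when video is missing).
    bank_avail:   (M, d) same-modality features from complete pairs (memory bank).
    bank_missing: (M, d) paired missing-modality features from the same bank.
    Returns:      (N, d) approximated features for the missing modality.
    """
    q = F.normalize(avail_feats, dim=-1)          # (N, d)
    k = F.normalize(bank_avail, dim=-1)           # (M, d)

    # Relational graph edges: cross-modal semantic similarity between the
    # incomplete samples and the complete-pair memory bank.
    sim = q @ k.t()                               # (N, M)

    # Keep only the most similar neighbours (sparse graph) and turn their
    # similarities into aggregation weights.
    topv, topi = sim.topk(top_k, dim=-1)          # (N, k)
    weights = F.softmax(topv / tau, dim=-1)       # (N, k)

    # Approximate the missing features as a similarity-weighted sum of the
    # neighbours' missing-modality features.
    neigh = bank_missing[topi]                    # (N, k, d)
    return (weights.unsqueeze(-1) * neigh).sum(dim=1)
```

Under this reading, the approximated features would then stand in for the absent modality in downstream modules (e.g., as targets or inputs for the knowledge distillation step), though the exact interface is specific to the paper.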
