Multimodal Large Language Models (MLLMs) built on the Mixture-of-Experts (MoE) architecture show encouraging results on vision-language tasks. However, they suffer from catastrophic forgetting, owing to ineffective collaboration among experts and negative transfer across tasks. The router typically used in MoE to assign experts is inadequate when the data distribution shifts significantly between tasks. Meanwhile, negative transfer, which arises from conflicts in the knowledge shared between tasks, disturbs previously acquired knowledge and degrades performance on earlier tasks. To address these issues, we propose the Knowledge Space Synergy framework for Mixture of Experts (KSS-MoE) for Continual Visual Instruction Tuning (CVIT). KSS-MoE dynamically combines the experts' knowledge subspaces, improving the integration of fine-grained complementary knowledge and the experts' collaborative ability, thereby overcoming the limitations of the basic router. Furthermore, we introduce a general expert that maintains orthogonal subspaces for shared knowledge, enabling effective cross-task knowledge utilization while reducing negative transfer. Extensive experiments on eight CVIT tasks confirm the effectiveness of KSS-MoE, demonstrating state-of-the-art performance. Our code is available in the appendix.
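To make the two ideas in the abstract concrete, here is a minimal, hypothetical sketch (not the authors' implementation; all names and shapes are illustrative assumptions): a routed mixture of task experts combined with a shared "general" expert, plus a Frobenius-norm penalty that pushes the general expert's weight subspace toward orthogonality with each task expert's subspace, which is one common way to limit interference between shared and task-specific knowledge.

```python
# Illustrative sketch only -- NOT the KSS-MoE code from the paper.
# A tiny MoE layer: a soft router mixes task-specific experts, a shared
# "general" expert is always added, and an orthogonality penalty keeps the
# shared subspace separated from the task-specific ones.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_experts = 8, 4, 3

# Task-specific expert weights, a shared general-expert weight, and router weights.
experts = [rng.standard_normal((d_in, d_out)) for _ in range(n_experts)]
general = rng.standard_normal((d_in, d_out))
router_w = rng.standard_normal((d_in, n_experts))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def moe_forward(x):
    """Mix task experts by router gates, then add the shared general expert."""
    gates = softmax(x @ router_w)                       # (batch, n_experts)
    outs = np.stack([x @ w for w in experts], axis=1)   # (batch, n_experts, d_out)
    task_out = (gates[..., None] * outs).sum(axis=1)    # weighted combination
    return task_out + x @ general                       # shared knowledge path

def orthogonality_penalty():
    """Sum of squared Frobenius norms of cross-products between the general
    expert and each task expert. Driving this toward zero during training
    keeps shared knowledge in a subspace orthogonal to each task-specific
    subspace, one way to reduce negative transfer."""
    return sum(np.linalg.norm(general.T @ w, "fro") ** 2 for w in experts)

x = rng.standard_normal((2, d_in))
print(moe_forward(x).shape)        # (2, 4)
print(orthogonality_penalty())     # non-negative scalar regularizer
```

In a training loop, this penalty would be added to the task loss with a small coefficient; the abstract's "knowledge space synergy" (dynamically combining expert subspaces) would replace the plain softmax router shown here.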
