Pre-trained Vision-Language Models (VLMs), e.g. CLIP, have become essential tools in multimodal transfer learning. However, fine-tuning VLMs in few-shot scenarios makes it difficult to balance task-specific adaptation against generalization in the resulting model. Meanwhile, current research has focused predominantly on prompt-based adaptation methods, leaving adapter-based approaches underexplored and revealing notable performance gaps. To address these challenges, we introduce a novel Reconstruction-based Multimodal Adapter (RMAdapter), which leverages a dual-branch architecture. Unlike conventional single-branch adapters, RMAdapter consists of: (1) an adaptation branch that injects task-specific knowledge through parameter-efficient fine-tuning, and (2) a reconstruction branch that preserves general knowledge by reconstructing latent-space features back into the original feature space. This design facilitates a dynamic balance between general and task-specific knowledge. Importantly, although RMAdapter introduces an additional reconstruction branch, it is carefully optimized to remain lightweight. By computing the reconstruction loss locally at each layer and sharing projection modules, the overall computational overhead is kept minimal. A consistency constraint is also incorporated to better regulate the trade-off between discriminability and generalization. We comprehensively evaluate the effectiveness of RMAdapter on three representative tasks: generalization to new categories, generalization to new target datasets, and domain generalization. Without relying on data augmentation or duplicate prompt designs, our RMAdapter consistently outperforms state-of-the-art approaches across all evaluation metrics.
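
The dual-branch idea described above can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation, and every concrete choice in it is an assumption made for illustration: the class name `DualBranchAdapter`, the parameters `bottleneck` and `alpha`, the ReLU nonlinearity, the zero initialization of the adaptation branch, the mean-squared-error reconstruction loss, and the decision to share the down-projection between the two branches (the abstract says only that projection modules are shared). The consistency constraint is omitted because the abstract does not specify its form.

```python
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix; the init scale is an arbitrary illustrative choice."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class DualBranchAdapter:
    """Hypothetical adapter with an adaptation branch and a reconstruction branch."""

    def __init__(self, dim, bottleneck, alpha=0.1, seed=0):
        rng = random.Random(seed)
        # Assumption: both branches share this down-projection into the latent space.
        self.down = rand_matrix(bottleneck, dim, rng)
        # Adaptation branch: zero-initialized, so the adapter starts as an identity map.
        self.up_adapt = [[0.0] * bottleneck for _ in range(dim)]
        # Reconstruction branch: maps latent features back to the original space.
        self.up_recon = rand_matrix(dim, bottleneck, rng)
        self.alpha = alpha  # residual scaling factor (assumed)

    def forward(self, x):
        # Shared latent representation with a ReLU nonlinearity (assumed).
        z = [max(0.0, v) for v in matvec(self.down, x)]
        # Adaptation branch: residual update carrying task-specific knowledge.
        delta = matvec(self.up_adapt, z)
        adapted = [xi + self.alpha * di for xi, di in zip(x, delta)]
        # Reconstruction branch: loss computed locally at this layer (MSE assumed).
        recon = matvec(self.up_recon, z)
        recon_loss = sum((r - xi) ** 2 for r, xi in zip(recon, x)) / len(x)
        return adapted, recon_loss


adapter = DualBranchAdapter(dim=8, bottleneck=2)
features = [0.5, -1.0, 0.25, 2.0, -0.75, 1.5, 0.0, -0.5]
adapted, recon_loss = adapter.forward(features)
print(len(adapted), recon_loss)
```

In a training loop under these assumptions, the per-layer reconstruction losses would be summed and added to the task loss with a weighting coefficient, so that the adaptation branch learns task-specific updates while the reconstruction term discourages the latent space from drifting away from the pre-trained features.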
