Direct Preference Optimization (DPO) has emerged as a simple and effective approach for aligning models with human preferences. However, existing DPO-based methods suffer from three key drawbacks: they rely on only a single positive-negative preference pair per question, restricting the diversity and richness of feedback; they often emphasize minimizing negative preference scores while neglecting to strengthen positive preferences; and they depend on either human-annotated preferences or expert model outputs, both of which are expensive and difficult to scale. Moreover, the deterministic ranking assumptions of recent group-based preference optimization methods break down in open-ended tasks such as Visual Question Answering (VQA), where multiple answers can be equally plausible yet differ subtly in relevance or specificity. Given this subtle variance in preferences, we propose to rank groups of preferences rather than relying on fine-grained ranking of individual ones, which is often noisy and subjective. To address these challenges, we introduce Self-Supervised Visual Preference Alignment via Differentiable Multi-Preference Multi-Group Ranking (SMPRO), a novel framework that (1) self-generates rich, diverse preference groups, eliminating the need for external annotations; (2) employs a fully differentiable ranking objective based on sorting networks to capture nuanced preference gradients across arbitrary numbers of preferences, both within and across these groups; and (3) incorporates multiple positive preferences to enrich the positive preference group, capturing subtle distinctions among high-quality preferences. Extensive experiments across diverse visual tasks demonstrate that our approach achieves state-of-the-art performance in the self-supervised setting. Specifically, our model surpasses existing baselines, achieving notable improvements such as 82.4% on MMBench, 63.2% on MM-Star, 94.6% on LLaVA-W, and 81.9% on AI2D. These results underscore the effectiveness of our approach in capturing richer preference signals and demonstrate its scalability to open-ended, ambiguous VQA tasks.
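To make the group-level ranking idea concrete, the sketch below shows one way a differentiable loss can push an entire group of preferred responses above a group of dispreferred ones, using a smooth pairwise relaxation in place of hard ranks. This is an illustrative assumption, not the authors' SMPRO objective: the function name, the use of per-response scores (e.g., policy log-probabilities), and the softplus relaxation are all hypothetical stand-ins for the paper's sorting-network formulation.

```python
# Hypothetical sketch (not the authors' code): a differentiable group-level
# ranking loss over per-response scores, written in PyTorch.
import torch
import torch.nn.functional as F

def soft_group_rank_loss(pos_scores: torch.Tensor,
                         neg_scores: torch.Tensor,
                         tau: float = 1.0) -> torch.Tensor:
    """Encourage every score in the positive group to rank above every
    score in the negative group via a smooth pairwise comparison.

    pos_scores: shape (P,), scores of self-generated preferred responses
    neg_scores: shape (N,), scores of self-generated dispreferred responses
    tau: temperature controlling how sharp the soft comparison is
    """
    # Pairwise margins between each positive and each negative score: (P, N).
    margins = pos_scores.unsqueeze(1) - neg_scores.unsqueeze(0)
    # Softplus(-margin) penalizes every pair where a negative response
    # is not ranked below a positive one; averaging gives the group loss.
    return F.softplus(-margins / tau).mean()

# Toy usage with made-up scores (e.g., policy minus reference log-probs).
pos = torch.tensor([1.2, 0.8, 0.9], requires_grad=True)
neg = torch.tensor([0.1, -0.4])
loss = soft_group_rank_loss(pos, neg)
loss.backward()  # gradients flow to all responses in both groups
```

Because the comparison is applied across whole groups rather than a single preferred/dispreferred pair, gradients reach every generated response, which is the intuition behind ranking over groups of preferences described in the abstract.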
