
Given the inherent subjectivity of user satisfaction in dialogue systems, minority users may assign different satisfaction ratings than majority users for system responses due to varying intents and preferences. Existing methods for aligning language models with human preferences primarily focus on training universal systems that minimize controversy while overlooking the need for user-specific adaptation. We propose a unified framework that models both individual-specific and group-level preferences for user satisfaction estimation. First, we introduce Chain-of-Personalized-Reasoning (CoPeR) to capture individual preferences through interpretable reasoning chains. Second, to learn group preferences, we propose an EM-based Majority-Minority Preference-Aware Clustering (M²PC) algorithm that discovers distinct user groups without supervision. Finally, we integrate these components into a preference-adaptive reinforcement learning framework (PAda-PPO) that jointly optimizes alignment with both individual and group preferences. Experiments on the Emotional Support Conversation dataset show consistent improvement in user satisfaction estimation.
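
As a rough illustration of the unsupervised group-discovery idea, the sketch below runs a generic EM procedure for a mixture of categorical rating profiles over per-user satisfaction rating counts. It is not the M²PC algorithm from the paper, whose details the abstract does not give. The function name `em_cluster`, the count-vector input format, the two-group default, and the smoothing constants are all assumptions made for this example.

```python
import math


def em_cluster(counts, k=2, iters=50):
    """Generic EM for a mixture of categorical rating profiles (illustrative only).

    counts: one vector per user, where counts[i][r] is how often user i gave
            satisfaction rating level r (hypothetical input format).
    Returns (weights, profiles, resp): group proportions, per-group rating
    distributions, and per-user soft group memberships.
    """
    n, m = len(counts), len(counts[0])
    # Deterministic init: each group's profile leans toward a different band of ratings.
    profiles = []
    for j in range(k):
        raw = [2.0 if (r * k) // m == j else 1.0 for r in range(m)]
        total = sum(raw)
        profiles.append([v / total for v in raw])
    weights = [1.0 / k] * k
    resp = [[1.0 / k] * k for _ in range(n)]

    for _ in range(iters):
        # E-step: posterior probability of each group given a user's ratings.
        for i, c in enumerate(counts):
            logp = [
                math.log(weights[j])
                + sum(c[r] * math.log(profiles[j][r]) for r in range(m))
                for j in range(k)
            ]
            top = max(logp)  # subtract the max for numerical stability
            expd = [math.exp(v - top) for v in logp]
            z = sum(expd)
            resp[i] = [v / z for v in expd]
        # M-step: re-estimate group sizes and rating profiles (lightly smoothed).
        for j in range(k):
            nj = sum(resp[i][j] for i in range(n))
            weights[j] = (nj + 1e-9) / (n + k * 1e-9)
            raw = [
                1e-3 + sum(resp[i][j] * counts[i][r] for i in range(n))
                for r in range(m)
            ]
            total = sum(raw)
            profiles[j] = [v / total for v in raw]
    return weights, profiles, resp


if __name__ == "__main__":
    # Toy data (made up): six users who mostly rate high, two who mostly rate low.
    toy = [[0, 0, 1, 4, 5]] * 6 + [[5, 4, 1, 0, 0]] * 2
    w, p, r = em_cluster(toy)
    print("group weights:", [round(x, 2) for x in w])
    print("hard labels:", [max(range(2), key=lambda j: row[j]) for row in r])
```

In this toy setup, the group with the larger estimated weight plays the role of the majority and the smaller one the minority. The soft memberships in `resp` could in principle be used to condition a downstream satisfaction estimator, though how the paper actually couples clustering with PAda-PPO is not specified in the abstract.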
