Alignment of large language models (LLMs) with human preferences typically relies on supervised reward models or external judges, which in turn require abundant preference data. We propose a generative preference modeling approach for low-resource and domain-specific scenarios, reframing preference learning as an inverse reinforcement learning problem. Instead of training a discriminative reward model, we train the LLM itself to infer and maximize an implicit reward function underlying high-quality reasoning. Specifically, we leverage Chain-of-Thought (CoT) sampling to generate diverse candidate solutions for each query and derive fine-grained preferences from them without additional human labels. We also introduce an entropy-guided token scoring mechanism to rank and weight the sampled CoTs, boosting the importance of high-confidence answers and strategically important high-entropy tokens. Building on this, we train the model with our Self-Evaluated Group Advantage (SEGA) algorithm, which efficiently exploits the fine-grained preference information within each group of candidate solutions to update the policy. Our method eliminates dependence on external judges or reward classifiers, relying instead on the generative model’s own judgments. Experiments on general benchmarks and domain-specific tasks—such as mathematical reasoning and medical question answering—demonstrate that our generative preference model achieves significant improvements with limited data.
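The abstract does not include an implementation, but two of its ingredients can be sketched concretely: per-token entropy (used to weight sampled CoTs) and a group-relative advantage over candidate solutions (the core of a group-advantage update). The sketch below is our own illustration under stated assumptions; the function names, the softmax-entropy formulation, and the mean/std normalization are assumptions, not the paper's verbatim SEGA algorithm.

```python
import numpy as np

def token_entropy(logits):
    """Shannon entropy of the model's next-token distribution
    at each position. `logits` has shape (seq_len, vocab_size).
    High-entropy positions mark uncertain (potentially pivotal) tokens."""
    z = logits - logits.max(axis=-1, keepdims=True)  # stabilize softmax
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

def group_advantage(rewards):
    """Normalize each candidate's self-evaluated reward against its
    sampled group (assumed mean/std normalization, as in group-relative
    policy-gradient methods): positive for above-average candidates."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)
```

For a group of G sampled CoTs per query, one would score each candidate with the model's own judgment, modulate those scores by entropy-derived token weights, and use the resulting group advantages as per-candidate weights in the policy update.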