EMNLP 2025

November 07, 2025

Suzhou, China

Would you like to see your presentation here, made available to a global audience of researchers?
Add your own presentation or have us affordably record your next conference.

Large Language Models (LLMs) require alignment via reinforcement learning (RL) to effectively perform task-specific objectives, such as human preference alignment and enhanced reasoning. While Proximal Policy Optimization (PPO) is widely adopted, its computational overhead, stemming from additional value model requirements, limits applicability. Existing alternatives, like Group Relative Policy Optimization (GRPO), mitigate computational costs but remain sensitive to reward model quality. To address this, we introduce Group Preference Reward Shaping (GPRS), a novel method that leverages preference-based comparisons rather than precise numerical rewards. GPRS requires no extra model components and remains robust across varying reward model sizes and qualities. Extensive experiments demonstrate that GPRS consistently outperforms existing critic-model-free RL algorithms in Reinforcement Learning from Human Feedback (RLHF) and reasoning tasks, providing stable and good alignment performance.

Downloads

SlidesPaperTranscript English (automatic)

Next from EMNLP 2025

Mitigating Catastrophic Forgetting in Large Language Models with Forgetting-aware Pruning
poster

Mitigating Catastrophic Forgetting in Large Language Models with Forgetting-aware Pruning

EMNLP 2025

Anda Cheng and 2 other authors

07 November 2025

Stay up to date with the latest Underline news!

Select topic of interest (you can select more than one)

PRESENTATIONS

  • All Presentations
  • For Librarians
  • Resource Center
  • Free Trial
Underline Science, Inc.
1216 Broadway, 2nd Floor, New York, NY 10001, USA

© 2025 Underline - All rights reserved