Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks under the \textbf{example-driven learning paradigm}. However, in high-stakes domains such as emergency response or industrial safety, real incidents are scarce, confidential, or both, while concise \emph{rule books} are plentiful. We formalize this underexplored setting as \textbf{rule knowledge-driven reasoning} and ask: \emph{can an LLM reason reliably when rules are abundant but examples are nearly absent?} To answer this question, we introduce \textbf{RULER}, a fully automatic benchmark that derives 32K rigorously verified questions from 1K expert-curated emergency-response rules to probe three core abilities: \emph{rule memorization}, \emph{single-rule application}, and \emph{multi-rule complex reasoning}. RULER is supported by a hallucination-aware evaluation suite and novel relational metrics. A comprehensive empirical study of five open-source LLMs and five enhancement strategies shows that, although models perform reliably on rule memorization and single-rule application, multi-rule complex reasoning plateaus at 5.4 on a 10-point scale. We bridge this gap with \textbf{RAMPS}, a \textbf{R}ule-knowledge-\textbf{A}ware \textbf{M}onte-Carlo-tree-search \textbf{P}rocess-reward \textbf{S}upervision framework. RAMPS injects rule-knowledge priors into MCTS, distills 12K step-level reasoning traces without human annotation, and trains an advantage-based reward model that scores candidate reasoning paths during beam-search inference. RAMPS raises the complex-reasoning score from 5.4 to 7.7 (+2.3). Together, RULER and RAMPS provide an automatic benchmark and a strong baseline suite for rule knowledge-driven reasoning in LLMs.
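The three-stage pipeline the abstract describes (rule-knowledge priors injected into MCTS, distillation of step-level advantages without human labels, and a reward model scoring paths during beam search) can be illustrated with a self-contained toy sketch. Everything here is hypothetical: the names (`RULES`, `rule_prior`, `distill_prm`, `beam_search`), the three-step emergency scenario, and the tabular advantage lookup standing in for the paper's learned reward model are illustration-only assumptions, not the authors' implementation.

```python
import math

# Hypothetical toy setting (not the paper's code): at each reasoning depth one
# rule recommends the correct next action; the goal trace applies all three rules.
RULES = {0: "check_hazard", 1: "evacuate", 2: "report"}
STEPS = ["check_hazard", "evacuate", "report", "wait"]
DEPTH = 3
GOAL = ("check_hazard", "evacuate", "report")

def rule_prior(depth, step):
    """Rule-knowledge prior injected into MCTS: favor rule-recommended steps."""
    return 0.7 if step == RULES.get(depth) else 0.1

def outcome_reward(trace):
    """Sparse final reward: fraction of steps that comply with the rules."""
    return sum(a == b for a, b in zip(trace, GOAL)) / DEPTH

class Node:
    def __init__(self, trace=()):
        self.trace, self.children = trace, {}
        self.visits, self.value_sum = 0, 0.0
    def q(self):
        return self.value_sum / self.visits if self.visits else 0.0

def select_step(node, c=1.4):
    """PUCT-style selection, with the rule prior in place of a learned policy."""
    def score(step):
        child = node.children[step]
        u = c * rule_prior(len(node.trace), step) * math.sqrt(node.visits + 1) / (1 + child.visits)
        return child.q() + u
    return max(node.children, key=score)

def mcts(root, n_sims=400):
    for _ in range(n_sims):
        node, path = root, [root]
        while len(node.trace) < DEPTH:          # selection / expansion
            if not node.children:
                node.children = {s: Node(node.trace + (s,)) for s in STEPS}
            node = node.children[select_step(node)]
            path.append(node)
        r = outcome_reward(node.trace)           # terminal evaluation
        for n in path:                           # backpropagation
            n.visits += 1
            n.value_sum += r

def distill_prm(root):
    """Distill step-level advantages A = Q(child) - Q(parent) into a table,
    standing in for training an advantage-based process reward model."""
    prm, stack = {}, [root]
    while stack:
        node = stack.pop()
        for step, child in node.children.items():
            if child.visits:
                key = (len(node.trace), step)
                prm[key] = max(prm.get(key, -1.0), child.q() - node.q())
                stack.append(child)
    return prm

def beam_search(prm, width=2):
    """Inference-time beam search, scoring each candidate step with the PRM."""
    beams = [((), 0.0)]
    for depth in range(DEPTH):
        expanded = [(t + (s,), score + prm.get((depth, s), -1.0))
                    for t, score in beams for s in STEPS]
        beams = sorted(expanded, key=lambda b: b[1], reverse=True)[:width]
    return beams[0][0]

root = Node()
mcts(root)
prm = distill_prm(root)
print(beam_search(prm))  # should recover the rule-compliant trace in this toy setup
```

The design point the sketch tries to convey: the rule prior biases tree search toward rule-compliant branches, so the distilled step-level advantages reward individual steps (not just final answers), and those dense step scores are what guide beam search at inference.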
