Despite being successful in board games and reinforcement learning (RL), Monte-Carlo Tree Search (MCTS) combined with Multi-Armed Bandits (MABs) has seen limited success in domain-independent classical planning until recently. Previous work [Wissow and Asai, 2024] showed that UCB1, designed for bounded rewards, does not perform well when applied to cost-to-go estimates in classical planning, because cost-to-go estimates are unbounded, and showed improved performance using a Gaussian reward MAB instead. This paper further sharpens our understanding of ideal bandits for planning tasks. Existing work has two issues: first, Gaussian MABs under-specify the support of cost-to-go estimates as $(-\infty,\infty)$, which we can narrow down. Second, the Full-Bellman backup [Schulte and Keller, 2014], which backpropagates sample max/min, lacks theoretical justification. We use \emph{Peaks-Over-Threshold Extreme Value Theory} to resolve both issues at once and propose a new bandit algorithm (UCB1-Uniform). We formally prove its regret bound and empirically demonstrate its performance in classical planning.
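As background, the following is a minimal sketch of standard UCB1, whose exploration bonus is derived under the bounded-reward assumption that the abstract points out is violated by cost-to-go estimates. This is generic illustrative code, not the paper's UCB1-Uniform, whose details are not given here; the Bernoulli arms and the `pull` callback are hypothetical.

```python
import math
import random

def ucb1(pull, n_arms, horizon):
    """Standard UCB1: pull each arm once, then repeatedly pick the arm
    maximizing empirical mean reward plus an exploration bonus."""
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for arm in range(n_arms):            # initialization: one pull per arm
        sums[arm] += pull(arm)
        counts[arm] += 1
    for t in range(n_arms, horizon):
        # The bonus sqrt(2 ln t / n_i) assumes rewards bounded in [0, 1];
        # unbounded cost-to-go estimates break this assumption.
        scores = [sums[i] / counts[i]
                  + math.sqrt(2 * math.log(t) / counts[i])
                  for i in range(n_arms)]
        arm = max(range(n_arms), key=lambda i: scores[i])
        sums[arm] += pull(arm)
        counts[arm] += 1
    return counts

# Toy usage: two Bernoulli arms with success probabilities 0.2 and 0.8;
# after 500 pulls the better arm should dominate the pull counts.
random.seed(0)
probs = [0.2, 0.8]
counts = ucb1(lambda a: 1.0 if random.random() < probs[a] else 0.0, 2, 500)
```

With bounded rewards the bonus shrinks at the right rate and UCB1 concentrates its pulls on the better arm; the paper's contribution concerns what replaces this bonus when the reward support is instead characterized via Peaks-Over-Threshold Extreme Value Theory.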
