Content not yet available
This lecture has no active video or poster.
Would you like to see your presentation here, made available to a global audience of researchers?
Add your own presentation or have us affordably record your next conference.
To date, distributional reinforcement learning (distributional RL) methods have exclusively focused on the discounted setting, where an agent aims to optimize a potentially-discounted sum of rewards over time. In this work, we extend distributional RL to the average-reward setting, where an agent aims to optimize the reward received per time-step. In particular, we utilize a quantile-based approach to develop the first set of algorithms that can successfully learn and/or optimize the long-run per-step reward distribution, as well as the differential return distribution of an average-reward MDP. We derive proven-convergent tabular algorithms for both prediction and control, as well as a broader family of algorithms that have appealing scaling properties. Empirically, we find that these algorithms yield competitive and sometimes superior performance when compared to their non-distributional equivalents, while also capturing rich information about the long-run per-step reward and differential return distributions.