We study the problem of learning policy networks to optimize several related objectives simultaneously in reinforcement learning (RL). Given $n$ objectives in total, we consider finding a small set of $k$ policies, where $k$ is much smaller than $n$, that together cover all the objectives. This problem has broad applications in robotic control and language models. Learning a single policy for all the objectives does not scale as the number of objectives grows large. Instead, this work introduces a two-stage meta-training and adaptation procedure. We first train a meta policy on all the objectives, and then quickly adapt this meta policy to multiple randomly chosen subsets of the objectives. This fast adaptation is enabled by a gradient-based approximation property of actor-critic agents, which we empirically verify to hold within 2% error across a range of RL environments. The overall procedure, named PolicyGradEx, quickly estimates a task-affinity score between every pair of objectives from the scores estimated on each subset of objectives. Based on the estimated affinity scores, a grouping procedure then clusters similar objectives into $k$ groups. Extensive experiments on three classic control benchmarks and the Meta-World benchmark show that our method outperforms state-of-the-art baselines by 16% while being up to $26\times$ faster than full training. Ablation studies validate the design of each component of our method; for example, it outperforms both random grouping and gradient-similarity-based grouping by 19%.
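The pipeline the abstract describes, adapting a shared policy to random subsets of objectives, turning subset scores into pairwise affinities, and clustering into $k$ groups, can be sketched in toy form. Everything below is an illustrative assumption rather than the paper's implementation: the function names are hypothetical, a synthetic score function stands in for actual RL adaptation, and a simple greedy heuristic stands in for the paper's grouping procedure.

```python
import itertools
import random

import numpy as np

# Hypothetical sketch of the two-stage procedure: estimate pairwise task
# affinity from scores of random objective subsets, then cluster objectives.
# Not the paper's actual algorithm; score_fn is a stand-in for "adapt the
# meta policy to this subset and measure its performance".

def estimate_affinity(n, subset_size, num_subsets, score_fn, rng):
    """Credit each pair (i, j) with the average score of the random
    subsets that contain both objectives."""
    totals = np.zeros((n, n))
    counts = np.zeros((n, n))
    for _ in range(num_subsets):
        subset = rng.sample(range(n), subset_size)
        s = score_fn(subset)  # stand-in for post-adaptation performance
        for i, j in itertools.combinations(sorted(subset), 2):
            totals[i, j] += s
            counts[i, j] += 1
    aff = np.where(counts > 0, totals / np.maximum(counts, 1), 0.0)
    return aff + aff.T  # symmetric pairwise affinity matrix

def group_objectives(affinity, k):
    """Greedy clustering into k groups: farthest-point seeding, then attach
    each remaining objective to the group it is most affine with on average."""
    n = affinity.shape[0]
    seeds = [0]
    while len(seeds) < k:
        cands = [i for i in range(n) if i not in seeds]
        seeds.append(min(cands, key=lambda i: max(affinity[i, s] for s in seeds)))
    groups = [[s] for s in seeds]
    for i in range(n):
        if i in seeds:
            continue
        best = max(range(k),
                   key=lambda g: np.mean([affinity[i, j] for j in groups[g]]))
        groups[best].append(i)
    return groups

# Toy demo: 6 objectives with two planted clusters {0,1,2} and {3,4,5}.
true_aff = np.full((6, 6), 0.1)
for a, b in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    true_aff[a, b] = true_aff[b, a] = 1.0

def toy_score(subset):
    # A subset "adapts well" when its members are mutually similar.
    pairs = list(itertools.combinations(subset, 2))
    return float(np.mean([true_aff[i, j] for i, j in pairs]))

rng = random.Random(0)
aff = estimate_affinity(6, subset_size=3, num_subsets=300,
                        score_fn=toy_score, rng=rng)
groups = group_objectives(aff, k=2)
print(sorted(map(sorted, groups)))
```

In this synthetic setting the estimated affinities separate the two planted clusters, and the greedy grouping recovers them; in the actual method, the subset score would come from evaluating the adapted meta policy rather than from a known similarity matrix.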