Distributed Offline Policy Optimization Over Batch Data
Abstract
Federated learning (FL) has received increasing interest in recent years. However, most existing works focus on supervised learning, and federated learning for sequential decision making has not been fully explored. Part of the reason is that learning a policy for sequential decision making typically requires repeated interaction with the environment, which is costly in many FL applications. To overcome this issue, this work proposes a federated offline policy optimization method, abbreviated as FedOPO, that allows clients to jointly learn the optimal policy without interacting with the environment during training. Despite the nonconcave-convex-strongly concave nature of the resultant max-min-max problem, we establish both the local and global convergence of our FedOPO algorithm. Experiments on the OpenAI Gym demonstrate that our algorithm is able to find a near-optimal policy while enjoying various merits brought by FL, including training speedup and improved asymptotic performance.
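For concreteness, the max-min-max structure referenced above can be sketched generically as follows; the symbols $\theta$, $\lambda$, $\omega$, and the objective $L$ are illustrative placeholders, not the paper's exact notation:
\[
\max_{\theta \in \Theta} \; \min_{\lambda \in \Lambda} \; \max_{\omega \in \Omega} \; L(\theta, \lambda, \omega),
\]
where, per the structure stated in the abstract, $L$ is nonconcave in the outer variable $\theta$ (e.g., the policy parameters), convex in the inner minimization variable $\lambda$, and strongly concave in the innermost maximization variable $\omega$.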