BRPO: Batch Residual Policy Optimization

Authors: Sungryull Sohn, Yinlam Chow, Jayden Ooi, Ofir Nachum, Honglak Lee, Ed Chi, Craig Boutilier

IJCAI 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 6 Experimental Results: To illustrate the effectiveness of BRPO, we compare against six baselines: DQN [Mnih et al., 2013], discrete BCQ [Fujimoto et al., 2019], KL-regularized Q-learning (KL-Q) [Jaques et al., 2019], SPIBB [Laroche and Trichelair, 2017], Behavior Cloning (BC) [Kober and Peters, 2010], and BRPO-C, a simplified version of BRPO that uses a constant (tunable) parameter as the confidence weight. We do not consider ensemble models, and thus do not include methods like BEAR [Kumar et al., 2019] among our baselines. CPI is also excluded since it is subsumed by BRPO-C with a grid search on the confidence; it is also generally inferior to BRPO-C because its candidate policy learning does not optimize the performance of the final mixture policy. We evaluated on three discrete-action OpenAI Gym tasks [Brockman et al., 2016]: Cartpole-v1, Lunarlander-v2, and Acrobot-v1. The behavior policy in each environment is trained using standard DQN until it reaches 75% of optimal performance, similar to the process adopted in related work (e.g., [Fujimoto et al., 2018]). To assess how exploration and the quality of the behavior policy affect learning, we generate five sets of data for each task by injecting different amounts of random exploration into the same behavior policy. Specifically, we add ε-greedy exploration with ε = 1 (fully random), 0.5, 0.25, 0.15, and 0.05, generating 100K transitions each for batch RL training. All models use the same architecture for a given environment; details (architectures, hyper-parameters, etc.) are described in the appendix of the extended paper. While training is entirely offline, policy performance is evaluated online using the simulator every 1000 training iterations. Each measurement is the average return over 40 evaluation episodes and 5 random seeds, and results are averaged over a sliding window of size 10. Table 1 shows the average return of BRPO and the other baselines under the best hyper-parameter configurations in each task setting. Behavior policy performance decreases as ε increases, as expected, and BC matches it very closely. DQN performs poorly in the batch setting. Its performance improves as ε increases from 0.05 to 0.25, due to increased state-action coverage, but as ε grows further (0.5, 1.0), state-space coverage decreases again since the (near-)random policy is less likely to reach states far from the initial state. BCQ, KL-Q and SPIBB all follow the behavior policy to some degree, showing different performance characteristics over the data sets. The underperformance relative to BRPO is more prominent for very low or very high ε, suggesting deficiencies due to overly conservative updates or following the behavior policy too closely in regimes where BRPO is able to learn. Since BRPO exploits the statistics of each (s, a) pair in the batch data, it achieves good performance in almost all scenarios, outperforming the baselines. This stable performance and robustness across scenarios make BRPO an appealing algorithm for batch/offline RL in real-world settings, where it is usually difficult to estimate the amount of exploration required prior to training, given access only to batch data. [See the data-generation sketch after the table.]
Researcher Affiliation | Collaboration | Sungryull Sohn1,2, Yinlam Chow1, Jayden Ooi1, Ofir Nachum1, Honglak Lee1,2, Ed Chi1 and Craig Boutilier1; 1Google Research, 2University of Michigan; srsohn@umich.edu, {yinlamchow, jayden, ofirnachum, honglak, edchi, cboutilier}@google.com
Pseudocode | Yes | Algorithm 1: BRPO algorithm [a schematic mixture-policy sketch follows the table]
Open Source Code | No | The paper does not provide any explicit statements about making the source code available or links to a code repository.
Open Datasets | Yes | We evaluated on three discrete-action OpenAI Gym tasks [Brockman et al., 2016]: Cartpole-v1, Lunarlander-v2, and Acrobot-v1.
Dataset Splits | No | The paper describes generating 'five sets of data for each task' and using them for 'batch RL training', and evaluates policy performance online. However, it does not specify explicit training/validation/test dataset splits or mention a validation set.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments.
Software Dependencies | No | The paper mentions software such as OpenAI Gym, DQN, and BCQ, but does not provide specific version numbers for any software dependencies or libraries.
Experiment Setup | Yes | To assess how exploration and the quality of the behavior policy affect learning, we generate five sets of data for each task by injecting different amounts of random exploration into the same behavior policy. Specifically, we add ε-greedy exploration with ε = 1 (fully random), 0.5, 0.25, 0.15, and 0.05, generating 100K transitions each for batch RL training. All models use the same architecture for a given environment; details (architectures, hyper-parameters, etc.) are described in the appendix of the extended paper. While training is entirely offline, policy performance is evaluated online using the simulator every 1000 training iterations. Each measurement is the average return over 40 evaluation episodes and 5 random seeds, and results are averaged over a sliding window of size 10. [See the evaluation-protocol sketch after the table.]
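
Data-generation sketch. The following is a minimal sketch of the batch data-generation protocol described in the Research Type row: an ε-greedy wrapper around a pretrained behavior policy collects 100K transitions per dataset, one dataset per ε value. It assumes the classic (pre-0.26) gym reset/step API; the function and variable names are illustrative rather than taken from the paper's code, and the stand-in behavior policy below would in practice be the DQN trained to roughly 75% of optimal performance.

```python
import random
import gym

# Epsilon values from the paper: 1.0 is fully random, 0.05 is near-greedy.
EPSILONS = [1.0, 0.5, 0.25, 0.15, 0.05]
NUM_TRANSITIONS = 100_000  # transitions per dataset


def collect_dataset(env_id, epsilon, behavior_policy, num_transitions=NUM_TRANSITIONS):
    """Roll out an epsilon-greedy version of `behavior_policy` and record
    (s, a, r, s', done) tuples for offline/batch RL training."""
    env = gym.make(env_id)  # e.g. "CartPole-v1", "LunarLander-v2", "Acrobot-v1"
    data = []
    obs = env.reset()
    while len(data) < num_transitions:
        if random.random() < epsilon:
            action = env.action_space.sample()        # random exploration
        else:
            action = behavior_policy(obs)             # pretrained behavior policy
        next_obs, reward, done, _ = env.step(action)  # classic 4-tuple gym API
        data.append((obs, action, reward, next_obs, done))
        obs = env.reset() if done else next_obs
    return data


# Stand-in for the behavior policy; in the paper this would be the greedy
# action of a DQN trained to about 75% of optimal performance.
dummy_behavior = lambda obs: 0

# One dataset per epsilon, matching the paper's five data sets per task.
datasets = {eps: collect_dataset("CartPole-v1", eps, dummy_behavior)
            for eps in EPSILONS}
```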
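
Mixture-policy sketch. The paper's Algorithm 1 contains the actual BRPO pseudocode and should be consulted directly. As context for the BRPO-C baseline in the comparison above, the snippet below illustrates the generic idea of a confidence-weighted mixture policy over discrete actions: BRPO-C uses a single tunable constant as the confidence weight, whereas BRPO derives a state-action-dependent weight from the batch statistics of each (s, a) pair. All names here are illustrative, and placing the weight on the behavior policy is an assumption of this sketch, not a statement of the paper's exact formulation.

```python
import numpy as np


def mixture_policy(pi_behavior, pi_candidate, confidence):
    """Confidence-weighted mixture over discrete action probabilities.

    `confidence` in [0, 1] is the weight placed on the behavior policy in this
    illustration. BRPO-C fixes it to a single tunable constant; BRPO instead
    makes it depend on per-(s, a) statistics of the batch data (Algorithm 1).
    """
    probs = (confidence * np.asarray(pi_behavior, dtype=float)
             + (1.0 - confidence) * np.asarray(pi_candidate, dtype=float))
    return probs / probs.sum()  # renormalize against numerical drift


# With a high confidence weight the mixture stays close to the behavior policy.
behavior = [0.7, 0.2, 0.1]
candidate = [0.1, 0.1, 0.8]
print(mixture_policy(behavior, candidate, confidence=0.9))  # ~[0.64, 0.19, 0.17]
```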
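
Evaluation-protocol sketch. Below is a short sketch of the online evaluation protocol quoted in the Experiment Setup row, assuming a `policy` callable, an already-constructed Gym environment, and the classic 4-tuple step API: returns are averaged over 40 episodes per checkpoint, then over 5 seeds, and finally smoothed with a sliding window of size 10. The helper names are illustrative.

```python
import numpy as np


def evaluate_return(env, policy, num_episodes=40):
    """Average undiscounted return of `policy` over `num_episodes` online rollouts."""
    returns = []
    for _ in range(num_episodes):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            obs, reward, done, _ = env.step(policy(obs))  # classic 4-tuple gym API
            total += reward
        returns.append(total)
    return float(np.mean(returns))


def smooth(values, window=10):
    """Sliding-window average (size 10 in the paper) over per-checkpoint returns,
    each already averaged over evaluation episodes and random seeds."""
    values = np.asarray(values, dtype=float)
    return np.array([values[max(0, i - window + 1): i + 1].mean()
                     for i in range(len(values))])


# During offline training, evaluate_return(...) would be called every 1000
# iterations for each of the 5 random seeds; the per-checkpoint averages across
# seeds are then passed through smooth(...) to produce the reported curves.
```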