BRPO: Batch Residual Policy Optimization

Authors: Sungryull Sohn, Yinlam Chow, Jayden Ooi, Ofir Nachum, Honglak Lee, Ed Chi, Craig Boutilier

IJCAI 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 6 Experimental Results: To illustrate the effectiveness of BRPO, we compare against six baselines: DQN [Mnih et al., 2013], discrete BCQ [Fujimoto et al., 2019], KL-regularized Q-learning (KL-Q) [Jaques et al., 2019], SPIBB [Laroche and Trichelair, 2017], Behavior Cloning (BC) [Kober and Peters, 2010], and BRPO-C, a simplified version of BRPO that uses a constant (tunable) parameter as the confidence weight. We do not consider ensemble models, and thus do not include methods like BEAR [Kumar et al., 2019] among our baselines. CPI is also excluded since it is subsumed by BRPO-C with a grid search on the confidence; it is also generally inferior to BRPO-C because its candidate policy learning does not optimize the performance of the final mixture policy. We evaluated on three discrete-action OpenAI Gym tasks [Brockman et al., 2016]: Cartpole-v1, Lunarlander-v2, and Acrobot-v1. The behavior policy in each environment is trained using standard DQN until it reaches 75% of optimal performance, similar to the process adopted in related work (e.g., [Fujimoto et al., 2018]). To assess how exploration and the quality of the behavior policy affect learning, we generate five sets of data for each task by injecting different amounts of random exploration into the same behavior policy. Specifically, we add ε-greedy exploration with ε = 1 (fully random), 0.5, 0.25, 0.15, and 0.05, generating 100K transitions each for batch RL training. All models use the same architecture for a given environment; details (architectures, hyper-parameters, etc.) are described in the appendix of the extended paper. While training is entirely offline, policy performance is evaluated online using the simulator every 1000 training iterations. Each measurement is the average return over 40 evaluation episodes and 5 random seeds, and results are averaged over a sliding window of size 10. Table 1 shows the average return of BRPO and the other baselines under the best hyper-parameter configurations in each task setting. Behavior policy performance decreases as ε increases, as expected, and BC matches it very closely. DQN performs poorly in the batch setting. Its performance improves as ε increases from 0.05 to 0.25, due to increased state-action coverage, but as ε grows further (0.5, 1.0), state-space coverage decreases again since the (near-)random policy is less likely to reach states far from the initial state. BCQ, KL-Q and SPIBB all follow the behavior policy to some degree, showing different performance characteristics over the data sets. The underperformance relative to BRPO is more prominent for very low or very high ε, suggesting deficiencies due to overly conservative updates or following the behavior policy too closely in regimes where BRPO is able to learn. Since BRPO exploits the statistics of each (s, a) pair in the batch data, it achieves good performance in almost all scenarios, outperforming the baselines. This stable performance and robustness across scenarios make BRPO an appealing algorithm for batch/offline RL in real-world settings, where it is usually difficult to estimate the amount of exploration required prior to training, given access only to batch data. [See the data-generation sketch after the table.]
Researcher Affiliation | Collaboration | Sungryull Sohn1,2, Yinlam Chow1, Jayden Ooi1, Ofir Nachum1, Honglak Lee1,2, Ed Chi1 and Craig Boutilier1; 1Google Research, 2University of Michigan; srsohn@umich.edu, {yinlamchow, jayden, ofirnachum, honglak, edchi, cboutilier}@google.com
Pseudocode | Yes | Algorithm 1: BRPO algorithm [a schematic mixture-policy sketch follows the table]
Open Source Code | No | The paper does not provide any explicit statements about making the source code available or links to a code repository.
Open Datasets | Yes | We evaluated on three discrete-action OpenAI Gym tasks [Brockman et al., 2016]: Cartpole-v1, Lunarlander-v2, and Acrobot-v1.
Dataset Splits | No | The paper describes generating 'five sets of data for each task' and using them for 'batch RL training', and evaluates policy performance online. However, it does not specify explicit training/validation/test dataset splits or mention a validation set.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments.
Software Dependencies | No | The paper mentions software such as OpenAI Gym, DQN, and BCQ, but does not provide specific version numbers for any software dependencies or libraries.
Experiment Setup | Yes | To assess how exploration and the quality of the behavior policy affect learning, we generate five sets of data for each task by injecting different amounts of random exploration into the same behavior policy. Specifically, we add ε-greedy exploration with ε = 1 (fully random), 0.5, 0.25, 0.15, and 0.05, generating 100K transitions each for batch RL training. All models use the same architecture for a given environment; details (architectures, hyper-parameters, etc.) are described in the appendix of the extended paper. While training is entirely offline, policy performance is evaluated online using the simulator every 1000 training iterations. Each measurement is the average return over 40 evaluation episodes and 5 random seeds, and results are averaged over a sliding window of size 10. [See the evaluation-protocol sketch after the table.]
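
Data-generation sketch. The following is a minimal sketch of the batch data-generation protocol described in the Research Type row: an ε-greedy wrapper around a pretrained behavior policy collects 100K transitions per dataset, one dataset per ε value. It assumes the classic (pre-0.26) gym reset/step API; the function and variable names are illustrative rather than taken from the paper's code, and the stand-in behavior policy below would in practice be the DQN trained to roughly 75% of optimal performance.

```python
import random
import gym

# Epsilon values from the paper: 1.0 is fully random, 0.05 is near-greedy.
EPSILONS = [1.0, 0.5, 0.25, 0.15, 0.05]
NUM_TRANSITIONS = 100_000  # transitions per dataset


def collect_dataset(env_id, epsilon, behavior_policy, num_transitions=NUM_TRANSITIONS):
    """Roll out an epsilon-greedy version of `behavior_policy` and record
    (s, a, r, s', done) tuples for offline/batch RL training."""
    env = gym.make(env_id)  # e.g. "CartPole-v1", "LunarLander-v2", "Acrobot-v1"
    data = []
    obs = env.reset()
    while len(data) < num_transitions:
        if random.random() < epsilon:
            action = env.action_space.sample()        # random exploration
        else:
            action = behavior_policy(obs)             # pretrained behavior policy
        next_obs, reward, done, _ = env.step(action)  # classic 4-tuple gym API
        data.append((obs, action, reward, next_obs, done))
        obs = env.reset() if done else next_obs
    return data


# Stand-in for the behavior policy; in the paper this would be the greedy
# action of a DQN trained to about 75% of optimal performance.
dummy_behavior = lambda obs: 0

# One dataset per epsilon, matching the paper's five data sets per task.
datasets = {eps: collect_dataset("CartPole-v1", eps, dummy_behavior)
            for eps in EPSILONS}
```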
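
Mixture-policy sketch. The paper's Algorithm 1 contains the actual BRPO pseudocode and should be consulted directly. As context for the BRPO-C baseline in the comparison above, the snippet below illustrates the generic idea of a confidence-weighted mixture policy over discrete actions: BRPO-C uses a single tunable constant as the confidence weight, whereas BRPO derives a state-action-dependent weight from the batch statistics of each (s, a) pair. All names here are illustrative, and placing the weight on the behavior policy is an assumption of this sketch, not a statement of the paper's exact formulation.

```python
import numpy as np


def mixture_policy(pi_behavior, pi_candidate, confidence):
    """Confidence-weighted mixture over discrete action probabilities.

    `confidence` in [0, 1] is the weight placed on the behavior policy in this
    illustration. BRPO-C fixes it to a single tunable constant; BRPO instead
    makes it depend on per-(s, a) statistics of the batch data (Algorithm 1).
    """
    probs = (confidence * np.asarray(pi_behavior, dtype=float)
             + (1.0 - confidence) * np.asarray(pi_candidate, dtype=float))
    return probs / probs.sum()  # renormalize against numerical drift


# With a high confidence weight the mixture stays close to the behavior policy.
behavior = [0.7, 0.2, 0.1]
candidate = [0.1, 0.1, 0.8]
print(mixture_policy(behavior, candidate, confidence=0.9))  # ~[0.64, 0.19, 0.17]
```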
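
Evaluation-protocol sketch. Below is a short sketch of the online evaluation protocol quoted in the Experiment Setup row, assuming a `policy` callable, an already-constructed Gym environment, and the classic 4-tuple step API: returns are averaged over 40 episodes per checkpoint, then over 5 seeds, and finally smoothed with a sliding window of size 10. The helper names are illustrative.

```python
import numpy as np


def evaluate_return(env, policy, num_episodes=40):
    """Average undiscounted return of `policy` over `num_episodes` online rollouts."""
    returns = []
    for _ in range(num_episodes):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            obs, reward, done, _ = env.step(policy(obs))  # classic 4-tuple gym API
            total += reward
        returns.append(total)
    return float(np.mean(returns))


def smooth(values, window=10):
    """Sliding-window average (size 10 in the paper) over per-checkpoint returns,
    each already averaged over evaluation episodes and random seeds."""
    values = np.asarray(values, dtype=float)
    return np.array([values[max(0, i - window + 1): i + 1].mean()
                     for i in range(len(values))])


# During offline training, evaluate_return(...) would be called every 1000
# iterations for each of the 5 random seeds; the per-checkpoint averages across
# seeds are then passed through smooth(...) to produce the reported curves.
```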