Deployment-Efficient Reinforcement Learning via Model-Based Offline Optimization

Authors: Tatsuya Matsushima, Hiroki Furuta, Yutaka Matsuo, Ofir Nachum, Shixiang Gu

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate BREMEN on standard offline RL benchmarks of high-dimensional continuous control tasks, where only a single static dataset is used. In this fixed-batch setting, our experiments show that BREMEN can not only achieve performance competitive with state-of-the-art when using standard dataset sizes but also learn with 10-20 times smaller datasets, which previous methods are unable to attain. Enabled by such stable and sample-efficient offline learning, we show that BREMEN can learn successful policies with only 5-10 deployments in the online setting, significantly outperforming existing off-policy and offline RL algorithms in deployment efficiency while keeping sample efficiency.
Researcher Affiliation | Collaboration | Tatsuya Matsushima, Hiroki Furuta, Yutaka Matsuo (The University of Tokyo) {matsushima, furuta, matsuo}@weblab.t.u-tokyo.ac.jp; Ofir Nachum, Shixiang Shane Gu (Google Research) {ofirnachum, shanegu}@google.com
Pseudocode | Yes | Algorithm 1: BREMEN for Deployment-Efficient RL; Algorithm 2: BREMEN for Offline RL. (A minimal sketch of the deployment-efficient loop follows the table.)
Open Source Code | Yes | Code and pre-trained models are available at https://github.com/matsuolab/BREMEN.
Open Datasets | Yes | We also test BREMEN on the more recent D4RL benchmarks (Fu et al., 2020) and compare its performance with existing model-free and model-based methods.
Dataset Splits | Yes | We collect 3,000 steps through online interaction with the environment per 25 iterations and split these transitions at a 2-to-1 ratio into training and validation datasets for learning the dynamics models. In the batch-size-100,000 setting, we collect 2,000 steps and split at a 1-to-1 ratio. (A split sketch follows the table.)
Hardware Specification | Yes | BREMEN in deployment-efficient settings takes about two or three hours per deployment on an NVIDIA TITAN V.
Software Dependencies | No | The paper mentions using Adam as an optimizer and refers to open-source implementations of other algorithms (SAC, BC, BCQ, BRAC) and specific benchmarks (MuJoCo, OpenAI Gym) but does not provide specific version numbers for these software components.
Experiment Setup | Yes | Table 4 shows the hyper-parameters of BREMEN. The rollout length is searched over {250, 500, 1000}, and the max step size δ is searched over {0.001, 0.01, 0.05, 0.1, 1.0}. For the discount factor γ and GAE λ, we follow Wang et al. (2019). (A hedged config sketch of these values follows the table.)

Parameter | Ant | HalfCheetah | Hopper | Walker2d
Iterations per batch | 2,000 | 2,000 | 6,000 | 2,000
Deployments | 5 | 5 | 10 | 10
Rollout length | 250 | 250 | 1,000 | 1,000
Max step size δ | 0.05 | 0.1 | 0.05 | 0.05
Discount factor γ | 0.99 | 0.99 | 0.99 | 0.99
GAE λ | 0.97 | 0.95 | 0.95 | 0.95
Stationary noise σ | 0.1 | 0.1 | 0.1 | 0.1
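
The two pseudocode listings share one recipe: learn an ensemble of dynamics models from the current dataset, re-initialize the policy by behavior cloning, and run conservative trust-region updates on imagined rollouts. Below is a minimal, pseudocode-level Python sketch of the deployment-efficient loop (Algorithm 1); every helper (deploy, train_dynamics_model, behavior_clone, imagined_rollouts, trpo_update) is a hypothetical placeholder for a component the paper describes, not the authors' actual API.

    # Pseudocode-level sketch of BREMEN (Algorithm 1). Helper functions are
    # hypothetical placeholders; only the overall control flow follows the paper.
    def bremen_deployment_efficient(env, num_deployments, batch_size,
                                    iters_per_batch, ensemble_size=5):
        dataset = []  # all transitions gathered so far

        for _ in range(num_deployments):
            # 1. Deploy the current (initially random) policy to collect
            #    one batch of transitions from the real environment.
            dataset += deploy(env, policy, num_steps=batch_size)

            # 2. Re-train an ensemble of dynamics models on the full dataset.
            models = [train_dynamics_model(dataset) for _ in range(ensemble_size)]

            # 3. Estimate the behavior policy by behavior cloning and
            #    re-initialize the target policy from it; this initialization
            #    acts as implicit regularization toward the data.
            policy = behavior_clone(dataset)

            # 4. Conservative trust-region (TRPO-style) policy updates on
            #    imagined rollouts branched from states in the dataset.
            for _ in range(iters_per_batch):
                rollouts = imagined_rollouts(models, policy, start_states=dataset)
                policy = trpo_update(policy, rollouts)

        return policy

Algorithm 2 (BREMEN for Offline RL) is the special case with a single static dataset and no deployment step.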
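
To make the Dataset Splits row concrete, here is a runnable sketch of the 2-to-1 (and 1-to-1) transition split. The step counts and ratios come from the quote above; the shuffling step and the function name are our own assumptions.

    import random

    def split_transitions(transitions, train_fraction, seed=0):
        # Shuffle (an assumption; the paper does not say how transitions are
        # ordered before splitting) and cut into train/validation sets.
        rng = random.Random(seed)
        shuffled = list(transitions)
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * train_fraction)
        return shuffled[:cut], shuffled[cut:]

    # 3,000 freshly collected steps at a 2-to-1 ratio ...
    train, valid = split_transitions(range(3000), train_fraction=2/3)
    assert len(train) == 2000 and len(valid) == 1000

    # ... and 2,000 steps at a 1-to-1 ratio in the batch-size-100,000 setting.
    train, valid = split_transitions(range(2000), train_fraction=1/2)
    assert len(train) == 1000 and len(valid) == 1000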
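
Finally, the Table 4 values transcribe directly into a per-environment configuration. In this sketch the dictionary key names are our own; only the numbers come from the table above.

    # BREMEN hyper-parameters (Table 4), keyed by environment. Key names are
    # illustrative; the values are transcribed from the paper.
    BREMEN_HPARAMS = {
        "Ant":         dict(iters_per_batch=2000, deployments=5,  rollout_length=250,
                            max_step_size=0.05, gamma=0.99, gae_lambda=0.97, noise_sigma=0.1),
        "HalfCheetah": dict(iters_per_batch=2000, deployments=5,  rollout_length=250,
                            max_step_size=0.10, gamma=0.99, gae_lambda=0.95, noise_sigma=0.1),
        "Hopper":      dict(iters_per_batch=6000, deployments=10, rollout_length=1000,
                            max_step_size=0.05, gamma=0.99, gae_lambda=0.95, noise_sigma=0.1),
        "Walker2d":    dict(iters_per_batch=2000, deployments=10, rollout_length=1000,
                            max_step_size=0.05, gamma=0.99, gae_lambda=0.95, noise_sigma=0.1),
    }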