Deployment-Efficient Reinforcement Learning via Model-Based Offline Optimization

Authors: Tatsuya Matsushima, Hiroki Furuta, Yutaka Matsuo, Ofir Nachum, Shixiang Gu

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate BREMEN on standard offline RL benchmarks of high-dimensional continuous control tasks, where only a single static dataset is used. In this fixed-batch setting, our experiments show that BREMEN can not only achieve performance competitive with state-of-the-art when using standard dataset sizes but also learn with 10-20 times smaller datasets, which previous methods are unable to attain. Enabled by such stable and sample-efficient offline learning, we show that BREMEN can learn successful policies with only 5-10 deployments in the online setting, significantly outperforming existing off-policy and offline RL algorithms in deployment efficiency while keeping sample efficiency.
Researcher Affiliation | Collaboration | Tatsuya Matsushima, Hiroki Furuta, Yutaka Matsuo (The University of Tokyo) {matsushima, furuta, matsuo}@weblab.t.u-tokyo.ac.jp; Ofir Nachum, Shixiang Shane Gu (Google Research) {ofirnachum, shanegu}@google.com
Pseudocode | Yes | Algorithm 1: BREMEN for Deployment-Efficient RL; Algorithm 2: BREMEN for Offline RL. (A minimal sketch of the deployment-efficient loop follows the table.)
Open Source Code | Yes | Code and pre-trained models are available at https://github.com/matsuolab/BREMEN.
Open Datasets | Yes | We also test BREMEN on the more recent D4RL benchmarks (Fu et al., 2020) and compare its performance with existing model-free and model-based methods.
Dataset Splits | Yes | We collect 3,000 steps through online interaction with the environment per 25 iterations and split these transitions at a 2-to-1 ratio into training and validation datasets for learning the dynamics models. In the batch-size-100,000 setting, we collect 2,000 steps and split at a 1-to-1 ratio. (A split sketch follows the table.)
Hardware Specification | Yes | BREMEN in deployment-efficient settings takes about two or three hours per deployment on an NVIDIA TITAN V.
Software Dependencies | No | The paper mentions using Adam as an optimizer and refers to open-source implementations of other algorithms (SAC, BC, BCQ, BRAC) and specific benchmarks (MuJoCo, OpenAI Gym) but does not provide specific version numbers for these software components.
Experiment Setup | Yes | Table 4 shows the hyper-parameters of BREMEN. The rollout length is searched over {250, 500, 1000}, and the max step size δ is searched over {0.001, 0.01, 0.05, 0.1, 1.0}. For the discount factor γ and GAE λ, we follow Wang et al. (2019). (A hedged config sketch of these values follows the table.)

Parameter | Ant | HalfCheetah | Hopper | Walker2d
Iterations per batch | 2,000 | 2,000 | 6,000 | 2,000
Deployments | 5 | 5 | 10 | 10
Rollout length | 250 | 250 | 1,000 | 1,000
Max step size δ | 0.05 | 0.1 | 0.05 | 0.05
Discount factor γ | 0.99 | 0.99 | 0.99 | 0.99
GAE λ | 0.97 | 0.95 | 0.95 | 0.95
Stationary noise σ | 0.1 | 0.1 | 0.1 | 0.1
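
The two pseudocode listings share one recipe: learn an ensemble of dynamics models from the current dataset, re-initialize the policy by behavior cloning, and run conservative trust-region updates on imagined rollouts. Below is a minimal, pseudocode-level Python sketch of the deployment-efficient loop (Algorithm 1); every helper (deploy, train_dynamics_model, behavior_clone, imagined_rollouts, trpo_update) is a hypothetical placeholder for a component the paper describes, not the authors' actual API.

    # Pseudocode-level sketch of BREMEN (Algorithm 1). Helper functions are
    # hypothetical placeholders; only the overall control flow follows the paper.
    def bremen_deployment_efficient(env, num_deployments, batch_size,
                                    iters_per_batch, ensemble_size=5):
        dataset = []  # all transitions gathered so far

        for _ in range(num_deployments):
            # 1. Deploy the current (initially random) policy to collect
            #    one batch of transitions from the real environment.
            dataset += deploy(env, policy, num_steps=batch_size)

            # 2. Re-train an ensemble of dynamics models on the full dataset.
            models = [train_dynamics_model(dataset) for _ in range(ensemble_size)]

            # 3. Estimate the behavior policy by behavior cloning and
            #    re-initialize the target policy from it; this initialization
            #    acts as implicit regularization toward the data.
            policy = behavior_clone(dataset)

            # 4. Conservative trust-region (TRPO-style) policy updates on
            #    imagined rollouts branched from states in the dataset.
            for _ in range(iters_per_batch):
                rollouts = imagined_rollouts(models, policy, start_states=dataset)
                policy = trpo_update(policy, rollouts)

        return policy

Algorithm 2 (BREMEN for Offline RL) is the special case with a single static dataset and no deployment step.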
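
To make the Dataset Splits row concrete, here is a runnable sketch of the 2-to-1 (and 1-to-1) transition split. The step counts and ratios come from the quote above; the shuffling step and the function name are our own assumptions.

    import random

    def split_transitions(transitions, train_fraction, seed=0):
        # Shuffle (an assumption; the paper does not say how transitions are
        # ordered before splitting) and cut into train/validation sets.
        rng = random.Random(seed)
        shuffled = list(transitions)
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * train_fraction)
        return shuffled[:cut], shuffled[cut:]

    # 3,000 freshly collected steps at a 2-to-1 ratio ...
    train, valid = split_transitions(range(3000), train_fraction=2/3)
    assert len(train) == 2000 and len(valid) == 1000

    # ... and 2,000 steps at a 1-to-1 ratio in the batch-size-100,000 setting.
    train, valid = split_transitions(range(2000), train_fraction=1/2)
    assert len(train) == 1000 and len(valid) == 1000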
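
Finally, the Table 4 values transcribe directly into a per-environment configuration. In this sketch the dictionary key names are our own; only the numbers come from the table above.

    # BREMEN hyper-parameters (Table 4), keyed by environment. Key names are
    # illustrative; the values are transcribed from the paper.
    BREMEN_HPARAMS = {
        "Ant":         dict(iters_per_batch=2000, deployments=5,  rollout_length=250,
                            max_step_size=0.05, gamma=0.99, gae_lambda=0.97, noise_sigma=0.1),
        "HalfCheetah": dict(iters_per_batch=2000, deployments=5,  rollout_length=250,
                            max_step_size=0.10, gamma=0.99, gae_lambda=0.95, noise_sigma=0.1),
        "Hopper":      dict(iters_per_batch=6000, deployments=10, rollout_length=1000,
                            max_step_size=0.05, gamma=0.99, gae_lambda=0.95, noise_sigma=0.1),
        "Walker2d":    dict(iters_per_batch=2000, deployments=10, rollout_length=1000,
                            max_step_size=0.05, gamma=0.99, gae_lambda=0.95, noise_sigma=0.1),
    }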