Deployment-Efficient Reinforcement Learning via Model-Based Offline Optimization
Authors: Tatsuya Matsushima, Hiroki Furuta, Yutaka Matsuo, Ofir Nachum, Shixiang Gu
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate BREMEN on standard offline RL benchmarks of high-dimensional continuous control tasks, where only a single static dataset is used. In this fixed-batch setting, our experiments show that BREMEN can not only achieve performance competitive with state-of-the-art when using standard dataset sizes but also learn with 10-20 times smaller datasets, which previous methods are unable to attain. Enabled by such stable and sample-efficient offline learning, we show that BREMEN can learn successful policies with only 5-10 deployments in the online setting, significantly outperforming existing off-policy and offline RL algorithms in deployment efficiency while keeping sample efficiency. |
| Researcher Affiliation | Collaboration | Tatsuya Matsushima Hiroki Furuta Yutaka Matsuo The University of Tokyo {matsushima, furuta, matsuo}@weblab.t.u-tokyo.ac.jp Ofir Nachum Shixiang Shane Gu Google Research {ofirnachum, shanegu}@google.com |
| Pseudocode | Yes | Algorithm 1 BREMEN for Deployment-Efficient RL Algorithm 2 BREMEN for Offline RL |
| Open Source Code | Yes | Codes and pre-trained models are available at https://github.com/matsuolab/BREMEN. |
| Open Datasets | Yes | We also test BREMEN with more recent benchmarks of D4RL (Fu et al., 2020) and compare the performance with the existing model-free and model-based methods. |
| Dataset Splits | Yes | we collect 3,000 steps through online interaction with the environment per 25 iterations and split these transitions into a 2-to-1 ratio of training and validation dataset for learning dynamics models. In batch size 100,000 settings, we collect 2,000 steps and split with 1-to-1 ratio. |
| Hardware Specification | Yes | BREMEN in deployment-efficient settings takes about two or three hours per deployment on an NVIDIA TITAN V. |
| Software Dependencies | No | The paper mentions using Adam as an optimizer and refers to open-source implementations of other algorithms (SAC, BC, BCQ, BRAC) and specific benchmarks (MuJoCo, OpenAI Gym) but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | Table 4 shows the hyper-parameters of BREMEN. The rollout length is searched from {250, 500, 1000}, and max step size δ is searched from {0.001, 0.01, 0.05, 0.1, 1.0}. As for the discount factor γ and GAE λ, we follow Wang et al. (2019). Per-environment values (Ant / HalfCheetah / Hopper / Walker2d): Iterations per batch 2,000 / 2,000 / 6,000 / 2,000; Deployments 5 / 5 / 10 / 10; Rollout length 250 / 250 / 1,000 / 1,000; Max step size δ 0.05 / 0.1 / 0.05 / 0.05; Discount factor γ 0.99 for all; GAE λ 0.97 / 0.95 / 0.95 / 0.95; Stationary noise σ 0.1 for all. |
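The Dataset Splits row describes a concrete procedure: 3,000 transitions collected per deployment are divided 2-to-1 into training and validation sets for learning the dynamics models. A minimal sketch of that split is below; the function and variable names are illustrative assumptions, not taken from the BREMEN codebase.

```python
# Hypothetical sketch of the 2-to-1 train/validation split described in
# the "Dataset Splits" row. Names are illustrative, not from BREMEN's code.
import random

def split_transitions(transitions, train_ratio=2, val_ratio=1, seed=0):
    """Shuffle transitions and split them by the given ratio."""
    rng = random.Random(seed)
    shuffled = transitions[:]
    rng.shuffle(shuffled)
    n_train = len(shuffled) * train_ratio // (train_ratio + val_ratio)
    return shuffled[:n_train], shuffled[n_train:]

# 3,000 collected steps -> 2,000 training / 1,000 validation transitions.
transitions = list(range(3000))  # stand-in for (s, a, r, s') tuples
train, val = split_transitions(transitions)
```

For the batch-size-100,000 setting described in the same row, the paper instead collects 2,000 steps and splits 1-to-1, which the same helper expresses as `split_transitions(data, train_ratio=1, val_ratio=1)`.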