An Optimistic Perspective on Offline Reinforcement Learning
Authors: Rishabh Agarwal, Dale Schuurmans, Mohammad Norouzi
ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that recent off-policy deep RL algorithms, even when trained solely on this fixed dataset, outperform the fully-trained DQN agent. Ablation studies highlight the role of offline dataset size and diversity as well as the algorithm choice in our positive results. |
| Researcher Affiliation | Collaboration | ¹Google Research, Brain Team; ²University of Alberta. Correspondence to: Rishabh Agarwal <rishabhagarwal@google.com>, Mohammad Norouzi <mnorouzi@google.com>. |
| Pseudocode | No | No structured pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | Yes | To provide a testbed for offline RL and reproduce our results, the DQN Replay Dataset is released at offline-rl.github.io. Open-source code at github.com/google-research/batch_rl. |
| Open Datasets | Yes | To provide a testbed for offline RL and reproduce our results, the DQN Replay Dataset is released at offline-rl.github.io. |
| Dataset Splits | No | The paper describes the generation of the DQN Replay Dataset and its use for training, but does not provide specific train/validation/test splits (as percentages or sample counts) for the models trained. |
| Hardware Specification | No | The paper does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory amounts used for running the experiments. |
| Software Dependencies | No | The paper mentions software components like 'Dopamine baselines', 'RMSProp', and 'Adam' but does not provide specific version numbers for these or any other ancillary software dependencies. |
| Experiment Setup | Yes | We use the hyperparameters provided in Dopamine baselines (Castro et al., 2018) for a standardized comparison (Appendix A.4)... we use the same multi-head Q-network as QR-DQN with K = 200 heads... We also use Adam for optimization... For data collection, we use ϵ-greedy with a randomly sampled Q-estimate from the simplex for each episode, similar to Bootstrapped DQN. We follow the standard online RL protocol on Atari and use a fixed replay buffer of 1M frames. (A sketch of the episode-level simplex mixing appears below the table.) |
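
The Experiment Setup row quotes the paper's data-collection policy: ϵ-greedy acting on a Q-estimate formed from a randomly sampled point of the simplex over the K = 200 heads, resampled once per episode. Below is a minimal NumPy sketch of that episode-level mixing, not the released implementation; the function names (`sample_simplex_weights`, `epsilon_greedy_action`), the uniform-then-normalize weight sampling, and the placeholder Q-values standing in for the multi-head network are all assumptions for illustration.

```python
import numpy as np


def sample_simplex_weights(num_heads: int, rng: np.random.Generator) -> np.ndarray:
    """Draw a random convex combination over the K Q-heads.

    Drawing K uniform values and normalizing is one simple scheme for
    sampling a point of the simplex; the exact distribution used in the
    released code is an assumption here, not taken from the paper.
    """
    w = rng.uniform(size=num_heads)
    return w / w.sum()


def epsilon_greedy_action(q_heads: np.ndarray,
                          weights: np.ndarray,
                          epsilon: float,
                          rng: np.random.Generator) -> int:
    """Pick an action from the mixed Q-estimate with epsilon-greedy exploration.

    q_heads: array of shape (K, num_actions) holding the K head outputs for
             the current state (random placeholders in this sketch).
    weights: convex combination over heads, held fixed for the whole episode.
    """
    num_actions = q_heads.shape[1]
    if rng.uniform() < epsilon:
        return int(rng.integers(num_actions))
    mixed_q = weights @ q_heads  # shape: (num_actions,)
    return int(np.argmax(mixed_q))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    K, num_actions = 200, 18                      # 200 heads, full Atari action set
    weights = sample_simplex_weights(K, rng)      # resampled once per episode
    q_heads = rng.normal(size=(K, num_actions))   # stand-in for the multi-head network
    action = epsilon_greedy_action(q_heads, weights, epsilon=0.01, rng=rng)
    print("chosen action:", action)
```

Per the quoted setup, the mixture weights are fixed for an entire episode rather than resampled per step, which is what distinguishes this collection scheme from standard ϵ-greedy over a single Q-estimate.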