Principled Exploration via Optimistic Bootstrapping and Backward Induction
Authors: Chenjia Bai, Lingxiao Wang, Lei Han, Jianye Hao, Animesh Garg, Peng Liu, Zhaoran Wang
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments in the MNIST maze and Atari suite suggest that OB2I outperforms several state-of-the-art exploration approaches. We evaluate OB2I empirically by solving the MNIST maze and 49 Atari games. |
| Researcher Affiliation | Collaboration | ¹Harbin Institute of Technology, Harbin, China; ²Northwestern University, Evanston, USA; ³Tencent Robotics X; ⁴Tianjin University; ⁵University of Toronto, Vector Institute. |
| Pseudocode | Yes | Algorithm 1 LSVI-UCB in linear MDP (a hedged sketch of this backward-induction procedure appears after the table). |
| Open Source Code | Yes | The code is available at https://github.com/Baichenjia/OB2I. |
| Open Datasets | Yes | We evaluate the algorithms in high-dimensional image-based tasks, including MNIST Maze (Lee et al., 2019) and 49 Atari games. |
| Dataset Splits | No | The paper discusses training frames and evaluation protocols for the RL environments, but does not specify dataset splits (e.g., percentages or counts) for a separate validation set, as would be typical for experiments on static datasets. |
| Hardware Specification | Yes | BEBU, BEBU-UCB, BEBU-IDS and OB2I are trained for 20M frames with RTX-2080Ti GPU for 5 random seeds. |
| Software Dependencies | No | The paper mentions methods such as Bootstrapped DQN and deep reinforcement learning in general terms, but does not provide version numbers for any software libraries, frameworks, or languages used in the experiments. |
| Experiment Setup | Yes | For OB2I, we set both α₁ and α₂ to 0.5 × 10⁻⁴ by tuning over five popular tasks: Breakout, Freeway, Qbert, Seaquest, and Space Invaders. Generally, small α₁ and α₂ yield better performance empirically, since the bonus accumulates along an episode that usually contains thousands of steps in Atari. We use diffusion factor β = 0.5 for all methods, following Lee et al. (2019). (A hedged illustration of these hyperparameters appears after the table.) |
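
The pseudocode row above quotes Algorithm 1, LSVI-UCB in a linear MDP. The sketch below illustrates the core of that procedure as usually stated (Jin et al., 2020): backward induction over the horizon, ridge regression on a known feature map, and an elliptical-confidence UCB bonus. The function names and the `data`/`phi` interfaces are our own assumptions, not the paper's code.

```python
import numpy as np

def lsvi_ucb_backward(data, phi, d, H, num_actions, beta, lam=1.0):
    """Backward-induction value fitting in the spirit of Algorithm 1
    (LSVI-UCB in a linear MDP). A hedged sketch under assumed interfaces:
      data[h] -> list of past transitions (s, a, r, s_next) at step h
      phi(s, a) -> known d-dimensional feature map
    """
    weights = [np.zeros(d) for _ in range(H)]
    cov_inv = [np.eye(d) / lam for _ in range(H)]

    def q(h, s, a):
        f = phi(s, a)
        # Elliptical UCB bonus: beta * sqrt(phi^T Lambda_h^{-1} phi)
        bonus = beta * np.sqrt(f @ cov_inv[h] @ f)
        return min(weights[h] @ f + bonus, float(H))  # optimism, clipped at H

    # Backward induction: fit step h using the optimistic targets of h+1.
    for h in reversed(range(H)):
        Lam = lam * np.eye(d)          # regularized Gram matrix Lambda_h
        targets = np.zeros(d)
        for s, a, r, s_next in data[h]:
            f = phi(s, a)
            Lam += np.outer(f, f)
            v_next = 0.0 if h == H - 1 else max(
                q(h + 1, s_next, b) for b in range(num_actions))
            targets += f * (r + v_next)
        cov_inv[h] = np.linalg.inv(Lam)
        weights[h] = cov_inv[h] @ targets  # ridge-regression solution w_h
    return q  # act greedily: argmax over a of q(h, s, a)
```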
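
The experiment-setup row reports α₁ = α₂ = 0.5 × 10⁻⁴ and diffusion factor β = 0.5. The snippet below is a minimal sketch of how such a bonus scale and diffusion factor might interact in an episodic backward update: a small α matters because the propagated bonus compounds over the thousands of steps in an Atari episode. Everything here (`q_ensemble`, `backward_bonus`, the exact propagation rule) is our illustrative assumption, not OB2I's released implementation at the repository linked above.

```python
import numpy as np

# Hyperparameters quoted in the Experiment Setup row; the names are ours.
ALPHA = 0.5e-4   # bonus scale (alpha_1 / alpha_2 in the paper)
BETA = 0.5       # diffusion factor for the episodic backward update

def backward_bonus(q_ensemble, episode, alpha=ALPHA, beta=BETA):
    """Sketch of a UCB-style bonus propagated backward through one episode.
    Assumed interfaces (hypothetical, not the OB2I source):
      q_ensemble(s, a) -> array of Q-estimates, one per bootstrap head
      episode          -> list of (s, a) pairs in time order
    The per-step bonus is the ensemble standard deviation (epistemic
    spread); beta mixes it with the propagated future bonus.
    """
    propagated = 0.0
    bonuses = [0.0] * len(episode)
    for t in reversed(range(len(episode))):
        s, a = episode[t]
        step_bonus = alpha * np.std(q_ensemble(s, a))
        propagated = step_bonus + beta * propagated  # backward diffusion
        bonuses[t] = propagated
    return bonuses  # added to TD targets as optimistic exploration bonuses
```

With β = 0.5 the propagated bonus is at most a geometric sum of twice the per-step bonus, which is consistent with the paper's observation that small α₁ and α₂ work better because the bonus accumulates along long episodes.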