Behaviour Suite for Reinforcement Learning
Authors: Ian Osband, Yotam Doron, Matteo Hessel, John Aslanides, Eren Sezener, Andre Saraiva, Katrina McKinney, Tor Lattimore, Csaba Szepesvari, Satinder Singh, Benjamin Van Roy, Richard Sutton, David Silver, Hado Van Hasselt
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This paper introduces the Behaviour Suite for Reinforcement Learning, or bsuite for short. bsuite is a collection of carefully-designed experiments that investigate core capabilities of reinforcement learning (RL) agents with two objectives. First, to collect clear, informative and scalable problems that capture key issues in the design of general and efficient learning algorithms. Second, to study agent behaviour through their performance on these shared benchmarks. To complement this effort, we open source github.com/deepmind/bsuite, which automates evaluation and analysis of any agent on bsuite. |
| Researcher Affiliation | Industry | Ian Osband, Yotam Doron, Matteo Hessel, John Aslanides, Eren Sezener, Andre Saraiva, Katrina McKinney, Tor Lattimore, Csaba Szepesvari, Satinder Singh, Benjamin Van Roy, Richard Sutton, David Silver, Hado Van Hasselt. Corresponding author iosband@google.com. |
| Pseudocode | No | The paper describes the experiments and analysis but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | To complement this effort, we open source github.com/deepmind/bsuite, which automates evaluation and analysis of any agent on bsuite. |
| Open Datasets | Yes | Contextual bandit classification of MNIST with ±1 rewards (LeCun et al., 1998). |
| Dataset Splits | No | The paper describes running agents for a certain number of episodes (e.g., 10k episodes, 1k episodes) and using multiple seeds (e.g., 20 seeds, 4 seeds each), and varying problem parameters, but it does not specify explicit training, validation, or test dataset splits in terms of percentages or counts for reproducing data partitioning. |
| Hardware Specification | No | The paper mentions 'Launch scripts for Google cloud that automate large scale compute at low cost' and 'Fast: iteration from launch to results in under 30min on standard CPU', but it does not provide specific hardware details such as exact GPU/CPU models or memory specifications. |
| Software Dependencies | No | The paper mentions 'Our code is Python', 'Tensorflow (Abadi et al., 2015)', and includes examples with 'OpenAI Baselines, Dopamine', but it does not specify exact version numbers for these software dependencies, which are necessary for reproducibility. |
| Experiment Setup | Yes | "For the bsuite experiment we run the agent on sizes N = 1, .., 100 exponentially spaced and look at the average regret compared to optimal after 10k episodes. The summary score is the percentage of runs for which the average regret is less than 75% of that achieved by a uniformly random policy." and "In each case we tune a learning rate to optimize performance on basic tasks from {1e-1, 1e-2, 1e-3}, keeping all other parameters constant at default value." |
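The summary score quoted above can be sketched in a few lines of plain Python: the fraction of runs whose average regret falls below 75% of a uniform-random baseline. The function name and the sample regret values are illustrative assumptions, not code from the bsuite repository.

```python
def summary_score(avg_regrets, uniform_regret):
    """Fraction of runs whose average regret is below 75% of the
    regret achieved by a uniformly random policy (bsuite-style score)."""
    threshold = 0.75 * uniform_regret
    passing = sum(1 for r in avg_regrets if r < threshold)
    return passing / len(avg_regrets)

# Hypothetical per-run average regrets after 10k episodes,
# against a uniform-random baseline regret of 1.0.
regrets = [0.1, 0.5, 0.9, 0.2]
print(summary_score(regrets, uniform_regret=1.0))  # 0.75
```

Three of the four hypothetical runs beat the 0.75 threshold, so the score is 0.75; sweeping this over the exponentially spaced sizes N would yield the per-experiment score reported in the paper.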