Bootstrapped Representations in Reinforcement Learning

Authors: Charline Le Lan, Stephen Tu, Mark Rowland, Anna Harutyunyan, Rishabh Agarwal, Marc G. Bellemare, Will Dabney

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We complement our theoretical results with an empirical comparison of these learning rules for different cumulant functions on classic domains such as the four-room domain (Sutton et al., 1999) and Mountain Car (Moore, 1990). We present an empirical evaluation that supports our theoretical characterizations and show the importance of the choice of a learning rule to learn the value function in Section 5. Figure 4: Subspace distance... over the course of training. Figure 5: Subspace distance after 5 × 10^5 training steps. Figure 6: Comparing effects of offline pre-training on the Four Rooms (left) and sparse Mountain Car (right) domains for different cumulant generation methods. (A generic subspace-distance sketch appears after the table.)
Researcher Affiliation | Collaboration | ¹University of Oxford, ²Google DeepMind. Correspondence to: Charline Le Lan <charline.lelan@stats.ox.ac.uk>.
Pseudocode | No | The paper does not include any pseudocode or clearly labeled algorithm blocks.
Open Source Code | No | The paper states 'We also thank Jesse Farebrother and Joshua Greaves for help with the Proto-Value Networks codebase (Farebrother et al., 2023)', which refers to using an external codebase rather than releasing their own source code for the methodology described in the paper.
Open Datasets | Yes | We complement our theoretical results with an empirical comparison of these learning rules for different cumulant functions on classic domains such as the four-room domain (Sutton et al., 1999) and Mountain Car (Moore, 1990). These are well-known, publicly available benchmark environments. (See the environment-loading sketch after the table.)
Dataset Splits | No | The paper mentions 'The offline pre-training dataset contains 100000 and 200000 transitions for four-room and mountain car respectively' and training uses a 'replay buffer', but it does not specify explicit training, validation, or test splits with percentages or counts.
Hardware Specification | No | The paper does not specify any particular hardware details such as GPU models, CPU types, or memory used for the experiments.
Software Dependencies | No | The paper cites 'NumPy (Oliphant, 2006; Walt et al., 2011; Harris et al., 2020), SciPy (Jones et al., 2001), Matplotlib (Hunter, 2007) and JAX (Bradbury et al., 2018)'. While these libraries are mentioned, specific version numbers (e.g., NumPy 1.20) are not provided. (A sketch for recording installed versions follows the table.)
Experiment Setup | Yes | 'In this experiment, we selected a step size α = 0.08 for all the algorithms.' 'We use a step size α = 5e-3 and train the different learning rules for 500k steps with 3 seeds.' 'The learning rate for both offline and online training was the same as the standard DQN learning rate (0.00025), and similarly for the optimizer epsilon.' 'The network architecture is a simple fully connected MLP with ReLU activations (Nair and Hinton, 2010) and two hidden layers of size 512 (first) and 256 (second), followed by a linear layer to give action-values.' (A sketch of this network follows the table.)
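
The subspace distance reported in Figures 4 and 5 is not defined in this excerpt. Below is a minimal sketch of one standard projection-based distance between the column spaces of two feature matrices; the function names, normalization, and the assumption of full column rank are ours, and the paper's exact definition may differ.

```python
# Hedged sketch: a generic distance between the column spaces of two
# feature matrices, not necessarily the paper's exact definition.
import jax.numpy as jnp

def projector(phi):
    # Orthogonal projector onto the column space of phi, shape (num_states, d);
    # assumes phi has full column rank.
    q, _ = jnp.linalg.qr(phi)
    return q @ q.T

def subspace_distance(phi_a, phi_b):
    # Squared Frobenius distance between the two projectors, rescaled to lie
    # in [0, 1] when both matrices have d linearly independent columns.
    d = phi_a.shape[1]
    p_a, p_b = projector(phi_a), projector(phi_b)
    return (jnp.linalg.norm(p_a - p_b) ** 2) / (2 * d)

phi_a = jnp.eye(4)[:, :2]   # span of the first two basis vectors
phi_b = jnp.eye(4)[:, 1:3]  # overlaps with phi_a in one direction
print(subspace_distance(phi_a, phi_b))  # 0.5
```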
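For the Open Datasets row, here is a minimal sketch of loading the classic Mountain Car benchmark through Gymnasium. The paper's sparse-reward Mountain Car variant and the four-room gridworld are not standard Gymnasium environments, so this only illustrates how readily available the underlying benchmark is.

```python
# Hedged sketch: the classic Mountain Car control task via Gymnasium.
import gymnasium as gym

env = gym.make("MountainCar-v0")
obs, info = env.reset(seed=0)
for _ in range(10):
    # Random actions, just to show the interaction loop.
    obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
    if terminated or truncated:
        obs, info = env.reset()
env.close()
```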
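For the Software Dependencies row, a small sketch of how a reproduction could record the library versions it actually used, since the paper names NumPy, SciPy, Matplotlib and JAX without version numbers.

```python
# The authors' exact versions are unknown, so a reproduction should log its own.
import jax
import matplotlib
import numpy
import scipy

for module in (numpy, scipy, matplotlib, jax):
    print(f"{module.__name__}=={module.__version__}")
```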
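For the Experiment Setup row, a minimal JAX sketch of the described value network: two ReLU hidden layers of 512 and 256 units followed by a linear action-value head. The input and action dimensions, the initialization scheme, and the function names are illustrative assumptions; the excerpt does not specify them.

```python
# Hedged sketch of the described MLP value network, assuming a small
# observation (obs_dim=2, e.g. Mountain Car) and 3 discrete actions.
import jax
import jax.numpy as jnp

def init_params(key, obs_dim, num_actions, hidden=(512, 256)):
    # He-style initialization is an assumption, not taken from the paper.
    sizes = (obs_dim,) + hidden + (num_actions,)
    params = []
    for d_in, d_out in zip(sizes[:-1], sizes[1:]):
        key, sub = jax.random.split(key)
        w = jax.random.normal(sub, (d_in, d_out)) * jnp.sqrt(2.0 / d_in)
        params.append((w, jnp.zeros(d_out)))
    return params

def q_network(params, obs):
    x = obs
    for w, b in params[:-1]:
        x = jax.nn.relu(x @ w + b)  # ReLU hidden layers (512, then 256)
    w, b = params[-1]
    return x @ w + b                # linear head producing action-values

params = init_params(jax.random.PRNGKey(0), obs_dim=2, num_actions=3)
q_values = q_network(params, jnp.ones(2))
print(q_values.shape)  # (3,) -- one value per action
```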