Stop Regressing: Training Value Functions via Classification for Scalable Deep RL

Authors: Jesse Farebrother, Jordi Orbay, Quan Vuong, Adrien Ali Taïga, Yevgen Chebotar, Ted Xiao, Alex Irpan, Sergey Levine, Pablo Samuel Castro, Aleksandra Faust, Aviral Kumar, Rishabh Agarwal

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that training value functions with categorical cross-entropy significantly enhances performance and scalability across various domains, including single-task RL on Atari 2600 games, multi-task RL on Atari with large-scale ResNets, robotic manipulation with Q-transformers, playing Chess without search, and a language-agent Wordle task with high-capacity Transformers, achieving state-of-the-art results on these domains.
Researcher Affiliation | Collaboration | 1 Google DeepMind; 2 Mila, McGill University; 3 Mila, Université de Montréal.
Pseudocode | Yes | Listing 1: an implementation of HL-Gauss (Imani & White, 2018) in JAX (Bradbury et al., 2018). Listing 2: an implementation of HL-Gauss (Imani & White, 2018) in PyTorch (Paszke et al., 2019). (A minimal JAX sketch of this loss is given after the table.)
Open Source Code | No | The paper mentions extensive use of various open-source libraries (e.g., JAX, Flax, Optax) and building upon existing implementations (Dopamine), but it does not provide a direct link or explicit statement about its own source code being released or made available.
Open Datasets | Yes | We first evaluate the efficacy of HL-Gauss, Two-Hot, and C51 (Bellemare et al., 2017) on the Arcade Learning Environment (Bellemare et al., 2013), following the protocol in Kumar et al. (2021). We make use of the entire dataset of Wordle games compiled by Snell et al. (2023).
Dataset Splits | No | The paper describes various evaluation metrics and procedures (e.g., 'report the interquartile mean (IQM) normalized scores with 95% stratified bootstrap confidence intervals'), and it uses standard datasets that often have predefined splits, but it does not explicitly state the training, validation, and test splits (e.g., percentages or sample counts). (A sketch of the IQM/bootstrap protocol is given after the table.)
Hardware Specification | Yes | This research was supported by the TPU resources at Google DeepMind, and the authors are grateful to Doina Precup and Joelle Barral for their support.
Software Dependencies | No | The paper mentions extensive use of several software packages (e.g., NumPy, SciPy, JAX, Flax, Optax, Matplotlib, Seaborn) along with their corresponding citations. However, it does not provide specific version numbers for these software components.
Experiment Setup | Yes | Appendix C provides 'Experimental Methodology' and 'Table C.1. DQN+Adam Hyperparameters', which lists detailed hyperparameters such as 'Discount Factor γ = 0.99', 'Learning Rate = 6.25 × 10⁻⁵', and 'Batch Size = 32', among others. (These quoted values are collected into a small config sketch after the table.)
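
The paper's Listing 1 and Listing 2 give JAX and PyTorch implementations of HL-Gauss. As a complement, the following is a minimal JAX sketch of the same idea: project each scalar regression target onto a fixed histogram by integrating a Gaussian centered at the target over each bin, then train with cross-entropy against the predicted logits. The support range, number of bins, and σ below are illustrative placeholders rather than the paper's settings, and the code is a sketch, not the authors' implementation.

```python
import jax
import jax.numpy as jnp
from jax.scipy.special import erf


def hl_gauss_target(target: jnp.ndarray,
                    support: jnp.ndarray,
                    sigma: float) -> jnp.ndarray:
    """Project scalar targets onto a histogram via a Gaussian CDF.

    Args:
      target: [batch] scalar regression targets (e.g., TD targets).
      support: [num_bins + 1] bin edges spanning [v_min, v_max].
      sigma: standard deviation of the smoothing Gaussian.

    Returns:
      [batch, num_bins] probability mass assigned to each bin.
    """
    # CDF of N(target, sigma^2) evaluated at every bin edge.
    cdf = 0.5 * (1.0 + erf((support[None, :] - target[:, None])
                           / (jnp.sqrt(2.0) * sigma)))
    probs = cdf[:, 1:] - cdf[:, :-1]  # mass falling inside each bin
    # Renormalize to account for mass the Gaussian places outside the support.
    return probs / jnp.maximum(probs.sum(axis=-1, keepdims=True), 1e-8)


def hl_gauss_loss(logits: jnp.ndarray,
                  target: jnp.ndarray,
                  support: jnp.ndarray,
                  sigma: float) -> jnp.ndarray:
    """Cross-entropy between the Gaussian histogram target and predicted logits."""
    probs = hl_gauss_target(target, support, sigma)
    return -jnp.sum(probs * jax.nn.log_softmax(logits, axis=-1), axis=-1)


# Example usage with illustrative (non-paper) settings: 50 bins over [-10, 10].
support = jnp.linspace(-10.0, 10.0, 51)
logits = jnp.zeros((4, 50))                  # dummy network outputs
targets = jnp.array([-3.2, 0.0, 1.5, 7.9])   # dummy regression targets
loss = hl_gauss_loss(logits, targets, support, sigma=0.75)
```

Compared with a plain two-hot projection, spreading the target mass with a Gaussian before binning is what distinguishes HL-Gauss; the cross-entropy step against the predicted distribution is the same in both cases.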
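
The evaluation protocol quoted in the Dataset Splits row (interquartile mean of normalized scores with 95% stratified bootstrap confidence intervals) can be sketched in a few lines of NumPy. The score layout ([num_runs, num_tasks] normalized scores) and the resampling details below are assumptions made for illustration, not the paper's evaluation code.

```python
import numpy as np


def iqm(scores: np.ndarray) -> float:
    """Interquartile mean: mean of the middle 50% of the flattened scores."""
    flat = np.sort(scores.ravel())
    n = flat.size
    return flat[n // 4: n - n // 4].mean()


def stratified_bootstrap_ci(scores: np.ndarray,
                            reps: int = 2000,
                            alpha: float = 0.05,
                            seed: int = 0):
    """Point estimate and (1 - alpha) stratified bootstrap CI for the IQM.

    Runs are resampled with replacement independently within each task
    (column), so every task keeps its own number of runs in each replicate.
    """
    rng = np.random.default_rng(seed)
    num_runs, num_tasks = scores.shape
    stats = []
    for _ in range(reps):
        idx = rng.integers(num_runs, size=(num_runs, num_tasks))
        resampled = np.take_along_axis(scores, idx, axis=0)
        stats.append(iqm(resampled))
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return iqm(scores), (lo, hi)


# Example with dummy normalized scores: 5 runs × 3 tasks.
scores = np.random.default_rng(1).uniform(size=(5, 3))
point, (lo, hi) = stratified_bootstrap_ci(scores)
```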
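
Finally, the hyperparameter values quoted from Table C.1 in the Experiment Setup row can be collected into a small configuration snippet for quick reference; only the three values quoted above are included, and the remaining entries of the paper's table are omitted here.

```python
# The DQN+Adam settings quoted from Table C.1 (other table entries omitted).
dqn_adam_config = {
    "discount_factor": 0.99,   # γ
    "learning_rate": 6.25e-5,  # Adam learning rate
    "batch_size": 32,
}
```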