Stop Regressing: Training Value Functions via Classification for Scalable Deep RL

Authors: Jesse Farebrother, Jordi Orbay, Quan Vuong, Adrien Ali Taïga, Yevgen Chebotar, Ted Xiao, Alex Irpan, Sergey Levine, Pablo Samuel Castro, Aleksandra Faust, Aviral Kumar, Rishabh Agarwal

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that training value functions with categorical cross-entropy significantly enhances performance and scalability across various domains, including single-task RL on Atari 2600 games, multi-task RL on Atari with large-scale ResNets, robotic manipulation with Q-transformers, playing Chess without search, and a language-agent Wordle task with high-capacity Transformers, achieving state-of-the-art results on these domains.
Researcher Affiliation | Collaboration | 1 Google DeepMind; 2 Mila, McGill University; 3 Mila, Université de Montréal.
Pseudocode | Yes | Listing 1: an implementation of HL-Gauss (Imani & White, 2018) in JAX (Bradbury et al., 2018). Listing 2: an implementation of HL-Gauss (Imani & White, 2018) in PyTorch (Paszke et al., 2019). (A minimal JAX sketch of this loss is given after the table.)
Open Source Code | No | The paper mentions extensive use of various open-source libraries (e.g., JAX, Flax, Optax) and building upon existing implementations (Dopamine), but it does not provide a direct link or explicit statement about its own source code being released or made available.
Open Datasets | Yes | We first evaluate the efficacy of HL-Gauss, Two-Hot, and C51 (Bellemare et al., 2017) on the Arcade Learning Environment (Bellemare et al., 2013), following the protocol in Kumar et al. (2021). We make use of the entire dataset of Wordle games compiled by Snell et al. (2023).
Dataset Splits | No | The paper describes various evaluation metrics and procedures (e.g., 'report the interquartile mean (IQM) normalized scores with 95% stratified bootstrap confidence intervals'), and it uses standard datasets that often have predefined splits, but it does not explicitly state the training, validation, and test splits (e.g., percentages or sample counts). (A sketch of the IQM/bootstrap protocol is given after the table.)
Hardware Specification | Yes | This research was supported by the TPU resources at Google DeepMind, and the authors are grateful to Doina Precup and Joelle Barral for their support.
Software Dependencies | No | The paper mentions extensive use of several software packages (e.g., NumPy, SciPy, JAX, Flax, Optax, Matplotlib, Seaborn) along with their corresponding citations. However, it does not provide specific version numbers for these software components.
Experiment Setup | Yes | Appendix C provides 'Experimental Methodology' and 'Table C.1. DQN+Adam Hyperparameters', which lists detailed hyperparameters such as 'Discount Factor γ = 0.99', 'Learning Rate = 6.25 × 10⁻⁵', and 'Batch Size = 32', among others. (These quoted values are collected into a small config sketch after the table.)
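
The paper's Listing 1 and Listing 2 give JAX and PyTorch implementations of HL-Gauss. As a complement, the following is a minimal JAX sketch of the same idea: project each scalar regression target onto a fixed histogram by integrating a Gaussian centered at the target over each bin, then train with cross-entropy against the predicted logits. The support range, number of bins, and σ below are illustrative placeholders rather than the paper's settings, and the code is a sketch, not the authors' implementation.

```python
import jax
import jax.numpy as jnp
from jax.scipy.special import erf


def hl_gauss_target(target: jnp.ndarray,
                    support: jnp.ndarray,
                    sigma: float) -> jnp.ndarray:
    """Project scalar targets onto a histogram via a Gaussian CDF.

    Args:
      target: [batch] scalar regression targets (e.g., TD targets).
      support: [num_bins + 1] bin edges spanning [v_min, v_max].
      sigma: standard deviation of the smoothing Gaussian.

    Returns:
      [batch, num_bins] probability mass assigned to each bin.
    """
    # CDF of N(target, sigma^2) evaluated at every bin edge.
    cdf = 0.5 * (1.0 + erf((support[None, :] - target[:, None])
                           / (jnp.sqrt(2.0) * sigma)))
    probs = cdf[:, 1:] - cdf[:, :-1]  # mass falling inside each bin
    # Renormalize to account for mass the Gaussian places outside the support.
    return probs / jnp.maximum(probs.sum(axis=-1, keepdims=True), 1e-8)


def hl_gauss_loss(logits: jnp.ndarray,
                  target: jnp.ndarray,
                  support: jnp.ndarray,
                  sigma: float) -> jnp.ndarray:
    """Cross-entropy between the Gaussian histogram target and predicted logits."""
    probs = hl_gauss_target(target, support, sigma)
    return -jnp.sum(probs * jax.nn.log_softmax(logits, axis=-1), axis=-1)


# Example usage with illustrative (non-paper) settings: 50 bins over [-10, 10].
support = jnp.linspace(-10.0, 10.0, 51)
logits = jnp.zeros((4, 50))                  # dummy network outputs
targets = jnp.array([-3.2, 0.0, 1.5, 7.9])   # dummy regression targets
loss = hl_gauss_loss(logits, targets, support, sigma=0.75)
```

Compared with a plain two-hot projection, spreading the target mass with a Gaussian before binning is what distinguishes HL-Gauss; the cross-entropy step against the predicted distribution is the same in both cases.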
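
The evaluation protocol quoted in the Dataset Splits row (interquartile mean of normalized scores with 95% stratified bootstrap confidence intervals) can be sketched in a few lines of NumPy. The score layout ([num_runs, num_tasks] normalized scores) and the resampling details below are assumptions made for illustration, not the paper's evaluation code.

```python
import numpy as np


def iqm(scores: np.ndarray) -> float:
    """Interquartile mean: mean of the middle 50% of the flattened scores."""
    flat = np.sort(scores.ravel())
    n = flat.size
    return flat[n // 4: n - n // 4].mean()


def stratified_bootstrap_ci(scores: np.ndarray,
                            reps: int = 2000,
                            alpha: float = 0.05,
                            seed: int = 0):
    """Point estimate and (1 - alpha) stratified bootstrap CI for the IQM.

    Runs are resampled with replacement independently within each task
    (column), so every task keeps its own number of runs in each replicate.
    """
    rng = np.random.default_rng(seed)
    num_runs, num_tasks = scores.shape
    stats = []
    for _ in range(reps):
        idx = rng.integers(num_runs, size=(num_runs, num_tasks))
        resampled = np.take_along_axis(scores, idx, axis=0)
        stats.append(iqm(resampled))
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return iqm(scores), (lo, hi)


# Example with dummy normalized scores: 5 runs × 3 tasks.
scores = np.random.default_rng(1).uniform(size=(5, 3))
point, (lo, hi) = stratified_bootstrap_ci(scores)
```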
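
Finally, the hyperparameter values quoted from Table C.1 in the Experiment Setup row can be collected into a small configuration snippet for quick reference; only the three values quoted above are included, and the remaining entries of the paper's table are omitted here.

```python
# The DQN+Adam settings quoted from Table C.1 (other table entries omitted).
dqn_adam_config = {
    "discount_factor": 0.99,   # γ
    "learning_rate": 6.25e-5,  # Adam learning rate
    "batch_size": 32,
}
```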