Stop Regressing: Training Value Functions via Classification for Scalable Deep RL
Authors: Jesse Farebrother, Jordi Orbay, Quan Vuong, Adrien Ali Taiga, Yevgen Chebotar, Ted Xiao, Alex Irpan, Sergey Levine, Pablo Samuel Castro, Aleksandra Faust, Aviral Kumar, Rishabh Agarwal
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that training value functions with categorical cross-entropy significantly enhances performance and scalability across various domains, including single-task RL on Atari 2600 games, multi-task RL on Atari with large-scale ResNets, robotic manipulation with Q-Transformers, playing Chess without search, and a language-agent Wordle task with high-capacity Transformers, achieving state-of-the-art results on these domains. |
| Researcher Affiliation | Collaboration | 1Google DeepMind, 2Mila, McGill University, 3Mila, Université de Montréal. |
| Pseudocode | Yes | Listing 1 An implementation of HL-Gauss (Imani & White, 2018) in JAX (Bradbury et al., 2018). Listing 2 An implementation of HL-Gauss (Imani & White, 2018) in PyTorch (Paszke et al., 2019). A hedged sketch of this construction follows the table. |
| Open Source Code | No | The paper mentions extensive use of various open-source libraries (e.g., Jax, Flax, Optax) and building upon existing implementations (Dopamine), but it does not provide a direct link or explicit statement about its own source code being released or made available. |
| Open Datasets | Yes | We first evaluate the efficacy of HL-Gauss, Two-Hot, and C51 (Bellemare et al., 2017), on the Arcade Learning Environment (Bellemare et al., 2013). Following the protocol in Kumar et al. (2021). We make use of the entire dataset of Wordle games compiled by Snell et al. (2023). |
| Dataset Splits | No | The paper describes various evaluation metrics and procedures (e.g., 'report the interquartile mean (IQM) normalized scores with 95% stratified bootstrap confidence intervals'), and it uses standard datasets that often have predefined splits, but it does not explicitly state the training, validation, and test dataset splits (e.g., percentages or sample counts) within the paper. |
| Hardware Specification | Yes | This research was supported by the TPU resources at Google DeepMind, and the authors are grateful to Doina Precup and Joelle Barral for their support. |
| Software Dependencies | No | The paper mentions extensive use of several software packages (e.g., NumPy, SciPy, JAX, Flax, Optax, matplotlib, Seaborn) along with their corresponding citations. However, it does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | Appendix C provides 'Experimental Methodology' and 'Table C.1. DQN+Adam Hyperparameters', which lists detailed hyperparameters such as 'Discount Factor γ = 0.99', 'Learning Rate = 6.25 × 10⁻⁵', and 'Batch Size = 32', among others. |
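
The HL-Gauss listings referenced in the Pseudocode row convert a scalar regression target into a categorical target by integrating a Gaussian centered at the target over a fixed set of bins, then train the value network with cross-entropy against that soft distribution. Below is a minimal JAX sketch of that target construction, assuming the method description in Imani & White (2018); the function and argument names (`hl_gauss_transform`, `min_value`, `max_value`, `num_bins`, `sigma`) are illustrative and not taken verbatim from the paper's released code.

```python
import jax
import jax.numpy as jnp
from jax.scipy.special import erf


def hl_gauss_transform(min_value: float, max_value: float,
                       num_bins: int, sigma: float):
    """Build the HL-Gauss target/readout pair over a fixed support."""
    # num_bins + 1 bin edges spanning the value range.
    support = jnp.linspace(min_value, max_value, num_bins + 1, dtype=jnp.float32)

    def transform_to_probs(target: jax.Array) -> jax.Array:
        # Mass each bin receives from a Gaussian N(target, sigma^2),
        # renormalized over the truncated support.
        cdf_evals = erf((support - target) / (jnp.sqrt(2.0) * sigma))
        z = cdf_evals[-1] - cdf_evals[0]
        bin_probs = cdf_evals[1:] - cdf_evals[:-1]
        return bin_probs / z

    def transform_from_probs(probs: jax.Array) -> jax.Array:
        # Expected value under the predicted categorical distribution.
        centers = (support[:-1] + support[1:]) / 2.0
        return jnp.sum(probs * centers)

    return transform_to_probs, transform_from_probs
```

In this sketch, the value network outputs `num_bins` logits, the scalar TD target is mapped through `transform_to_probs`, the loss is softmax cross-entropy between the logits and that soft target, and `transform_from_probs` recovers a scalar value estimate for action selection.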