Normalization and effective learning rates in reinforcement learning
Authors: Clare Lyle, Zeyu Zheng, Khimya Khetarpal, James Martens, Hado P. van Hasselt, Razvan Pascanu, Will Dabney
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We now validate the utility of NaP empirically. Our goal in this section is to validate two key properties: first, that NaP does not hurt performance on stationary tasks; second, that NaP can mitigate plasticity loss under a variety of both synthetic and natural nonstationarities. Sections 5.1, 5.2, and 5.3 present empirical evaluations on various tasks and benchmarks. |
| Researcher Affiliation | Industry | Google DeepMind. Correspondence to clarelyle@google.com |
| Pseudocode | Yes | Algorithm 1 NaP: Normalize-and-Project (a hypothetical sketch of the projection step appears after the table) |
| Open Source Code | No | We include as many details of the experiments as we can, but have not obtained permission to open-source the code itself yet. |
| Open Datasets | Yes | We evaluate our approach on a variety of sources of nonstationarity, using two architectures: a small CNN, and a fully-connected MLP (see Appendix B.4 for details). ... Large-scale image classification. We begin by studying the effect of NaP on two well-established benchmarks: a VGG16-like network [Simonyan and Zisserman, 2014] on CIFAR-10, and a ResNet50 [He et al., 2016] on the ImageNet-1k dataset. ... Natural language: we evaluate the effect of NaP on a 400M-parameter transformer architecture (details in Appendix B.3) trained on the C4 dataset [Raffel et al., 2020]. ... RL on the Arcade Learning Environment. We conduct a full sweep over 57 Atari 2600 games comparing the effects of normalization, weight projection, and learning rate schedules on a Rainbow agent [Hessel et al., 2018]. |
| Dataset Splits | No | The paper describes training durations (e.g., 'train for 200M frames on the Atari 57 suite', 'train on each of 10 games for 20M frames', 'training for 30,000 steps'), but it does not explicitly state dataset splits (e.g., '80% training, 10% validation, 10% test') for reproducibility. While it mentions 'average online accuracy' and 'test sets', the specific partitioning methodology for train/validation/test sets is not provided in detail. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory, cloud instance types) used for running experiments. |
| Software Dependencies | No | The paper mentions various software components and optimizers like 'Adam optimizer', 'RMSProp', 'AdamW', 'SGD', and refers to implementations such as 'DQN Zoo' and 'Brax', but it does not specify version numbers for these or other key software dependencies. |
| Experiment Setup | Yes | We follow the default hyperparameters detailed in this codebase. In our implementation, we add normalization layers prior to each nonlinearity except for the final softmax. We train for 200M frames on the Atari 57 suite [Bellemare et al., 2013]. We also allow for a learning rate schedule, which we explicitly detail in cases where non-constant learning rates are used. ... Our cosine decay schedule uses an init value of 10^-8, a peak value of the default LR for Rainbow (0.000625), 1000 warmup steps after the optimizer is reset, and an end value equal to 10^-6. ... We use a batch size of 128. ... We use a weight decay parameter of 0.1 with the AdamW optimizer. (The quoted schedule is sketched after the table.) |
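
Since the authors' code is not open-sourced, the following is a hypothetical reconstruction of the "project" step of Algorithm 1 (NaP) as the table describes it: normalization layers sit before each nonlinearity, and after every optimizer update each weight matrix is rescaled back to the norm it had at initialization, keeping the effective learning rate under control. The function name `project_to_initial_norms` and all shapes are illustrative, not taken from the paper's codebase.

```python
import jax
import jax.numpy as jnp

def project_to_initial_norms(params, init_norms, eps=1e-8):
    """Rescale each weight array back to the norm it had at initialization."""
    return jax.tree_util.tree_map(
        lambda w, n: w * (n / (jnp.linalg.norm(w) + eps)),
        params, init_norms)

# Usage: record per-layer norms once at initialization, then apply the
# projection after every optimizer update.
params = {"w1": jnp.ones((4, 8)), "w2": jnp.ones((8, 2))}
init_norms = jax.tree_util.tree_map(jnp.linalg.norm, params)
grads = jax.tree_util.tree_map(lambda w: 0.1 * w, params)                  # placeholder gradients
params = jax.tree_util.tree_map(lambda w, g: w - 0.01 * g, params, grads)  # a plain SGD step
params = project_to_initial_norms(params, init_norms)                      # the "project" step
```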
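
The learning-rate schedule quoted in the Experiment Setup row (init value 10^-8, peak at Rainbow's default 0.000625, 1000 warmup steps after the optimizer reset, end value 10^-6) maps directly onto optax's built-in warmup-plus-cosine helper. A minimal sketch follows; `decay_steps` is an assumption, since the excerpt does not quote the total schedule length.

```python
import optax

schedule = optax.warmup_cosine_decay_schedule(
    init_value=1e-8,      # quoted init value
    peak_value=6.25e-4,   # Rainbow's default learning rate (0.000625)
    warmup_steps=1_000,   # quoted warmup after the optimizer is reset
    decay_steps=100_000,  # assumption: not specified in the excerpt
    end_value=1e-6,       # quoted end value
)
print(schedule(5_000))    # learning rate at an arbitrary training step
```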