Fine-tuning Reinforcement Learning Models is Secretly a Forgetting Mitigation Problem
Authors: Maciej Wolczyk, Bartłomiej Cupiał, Mateusz Ostaszewski, Michał Bortkiewicz, Michał Zając, Razvan Pascanu, Łukasz Kuciński, Piotr Miłoś
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through a detailed empirical analysis of the challenging NetHack and Montezuma's Revenge environments, we show that standard knowledge retention techniques mitigate the problem and thus allow us to take full advantage of the pre-trained capabilities. |
| Researcher Affiliation | Collaboration | 1 IDEAS NCBR, 2 University of Warsaw, 3 Warsaw University of Technology, 4 Jagiellonian University, 5 Google DeepMind, 6 Institute of Mathematics, Polish Academy of Sciences, 7 deepsense.ai. |
| Pseudocode | Yes | Algorithm 1 Robotic Sequence |
| Open Source Code | Yes | The code is available at https://github.com/BartekCupial/finetuning-RL-as-CL. |
| Open Datasets | Yes | NetHack Learning Environment (Küttler et al., 2020) is a complex game [...], Montezuma's Revenge is a popular video game [...] (Bellemare et al., 2013)., Robotic Sequence is a multi-stage robotic task based on the Meta-World benchmark (Yu et al., 2020)., We take the current state-of-the-art neural model (Tuyls et al., 2023) as our pre-trained policy π. (A minimal environment-loading sketch follows the table.) |
| Dataset Splits | No | No explicit train/validation/test dataset splits were provided. The paper describes continuous training processes and evaluation at specific checkpoints rather than distinct dataset splits. |
| Hardware Specification | Yes | In this setup, we can run over 500 million environment steps under 24 hours of training on A100 Nvidia GPU. For each experiment, we use 8 CPU cores and 30GB RAM. |
| Software Dependencies | No | No specific version numbers for software dependencies (e.g., PyTorch) were provided. The paper mentions 'We used PyTorch implementation by jcwleo from https://github.com/jcwleo/random-network-distillation-pytorch' and 'We set the learning rate to 10^-3 and use the Adam (Kingma & Ba, 2014) optimizer.' (The optimizer sketch below the table shows this configuration.) |
| Experiment Setup | Yes | More technical details, including the neural network architecture, can be found in Appendix B.1. Detailed hyperparameter values can be found in Table 2. The model hyperparameters are shown in Table 1 analogical to Table 6 from (Petrenko et al., 2020). The final hyperparameters are listed in Table 3. |
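
For reference, the sketch below shows one way to instantiate the environments named in the Open Datasets row. The environment IDs (`NetHackChallenge-v0`, `MontezumaRevengeNoFrameskip-v4`) and the use of the classic `gym` API are assumptions made for illustration; the released repository is the authoritative source for the exact setup.

```python
# Illustrative sketch only: the environment IDs and the classic gym API are
# assumptions, not taken from the paper or its repository.
import gym  # NLE and the Atari benchmarks register their tasks with gym
import nle  # noqa: F401 -- importing nle registers the NetHack* environment IDs

# NetHack Learning Environment (Küttler et al., 2020)
nethack_env = gym.make("NetHackChallenge-v0")  # assumed task variant
nethack_obs = nethack_env.reset()

# Montezuma's Revenge via the Arcade Learning Environment (Bellemare et al., 2013)
montezuma_env = gym.make("MontezumaRevengeNoFrameskip-v4")  # assumed variant
montezuma_obs = montezuma_env.reset()
```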
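
Likewise, since the Software Dependencies row quotes the optimizer setting but no library versions, here is a minimal PyTorch sketch of that configuration (Adam, learning rate 10^-3). The small policy network is a hypothetical placeholder, not the architecture described in the paper's Appendix B.1.

```python
# Minimal sketch of the quoted optimizer setting (Adam, lr = 1e-3).
# The policy network below is a hypothetical placeholder.
import torch
import torch.nn as nn

policy = nn.Sequential(      # stand-in for the pre-trained policy network
    nn.Linear(128, 256),
    nn.ReLU(),
    nn.Linear(256, 8),       # e.g. 8 discrete actions
)

optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
```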