Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
Fine-tuning Reinforcement Learning Models is Secretly a Forgetting Mitigation Problem
Authors: Maciej Wolczyk, Bartłomiej Cupiał, Mateusz Ostaszewski, Michał Bortkiewicz, Michał Zając, Razvan Pascanu, Łukasz Kuciński, Piotr Miłoś
ICML 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through a detailed empirical analysis of the challenging NetHack and Montezuma's Revenge environments, we show that standard knowledge retention techniques mitigate the problem and thus allow us to take full advantage of the pre-trained capabilities. |
| Researcher Affiliation | Collaboration | (1) IDEAS NCBR; (2) University of Warsaw; (3) Warsaw University of Technology; (4) Jagiellonian University; (5) Google DeepMind; (6) Institute of Mathematics, Polish Academy of Sciences; (7) deepsense.ai. |
| Pseudocode | Yes | Algorithm 1 Robotic Sequence |
| Open Source Code | Yes | The code is available at https://github.com/BartekCupial/finetuning-RL-as-CL. |
| Open Datasets | Yes | NetHack Learning Environment (Küttler et al., 2020) is a complex game [...]. Montezuma's Revenge is a popular video game [...] (Bellemare et al., 2013). Robotic Sequence is a multi-stage robotic task based on the Meta-World benchmark (Yu et al., 2020). We take the current state-of-the-art neural model (Tuyls et al., 2023) as our pre-trained policy π. |
| Dataset Splits | No | No explicit train/validation/test dataset splits were provided. The paper describes continuous training processes and evaluation at specific checkpoints rather than distinct dataset splits. |
| Hardware Specification | Yes | In this setup, we can run over 500 million environment steps under 24 hours of training on A100 Nvidia GPU. For each experiment, we use 8 CPU cores and 30GB RAM. |
| Software Dependencies | No | No specific version numbers for software dependencies (e.g., PyTorch, Adam optimizer) were provided. The paper mentions 'We used PyTorch implementation by jcwleo from https://github.com/jcwleo/random-network-distillation-pytorch' and 'We set the learning rate to 10^-3 and use the Adam (Kingma & Ba, 2014) optimizer.' |
| Experiment Setup | Yes | More technical details, including the neural network architecture, can be found in Appendix B.1. Detailed hyperparameter values can be found in Table 2. The model hyperparameters are shown in Table 1 analogical to Table 6 from (Petrenko et al., 2020). The final hyperparameters are listed in Table 3. |