reproducibilityindex.ai

Fine-tuning Reinforcement Learning Models is Secretly a Forgetting Mitigation Problem

Authors: Maciej Wołczyk, Bartłomiej Cupiał, Mateusz Ostaszewski, Michał Bortkiewicz, Michał Zając, Razvan Pascanu, Łukasz Kuciński, Piotr Miłoś

ICML 2024

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Through a detailed empirical analysis of the challenging NetHack and Montezuma's Revenge environments, we show that standard knowledge retention techniques mitigate the problem and thus allow us to take full advantage of the pre-trained capabilities.
Researcher Affiliation	Collaboration	(1) IDEAS NCBR; (2) University of Warsaw; (3) Warsaw University of Technology; (4) Jagiellonian University; (5) Google DeepMind; (6) Institute of Mathematics, Polish Academy of Sciences; (7) deepsense.ai.
Pseudocode	Yes	Algorithm 1 Robotic Sequence
Open Source Code	Yes	The code is available at https://github.com/BartekCupial/finetuning-RL-as-CL.
Open Datasets	Yes	NetHack Learning Environment (Küttler et al., 2020) is a complex game [...]. Montezuma's Revenge is a popular video game [...] (Bellemare et al., 2013). Robotic Sequence is a multi-stage robotic task based on the Meta-World benchmark (Yu et al., 2020). We take the current state-of-the-art neural model (Tuyls et al., 2023) as our pre-trained policy π.
Dataset Splits	No	No explicit train/validation/test dataset splits were provided. The paper describes continuous training processes and evaluation at specific checkpoints rather than distinct dataset splits.
Hardware Specification	Yes	In this setup, we can run over 500 million environment steps under 24 hours of training on A100 Nvidia GPU. For each experiment, we use 8 CPU cores and 30GB RAM.
Software Dependencies	No	No specific version numbers for software dependencies (e.g., PyTorch, Adam optimizer) were provided. The paper mentions 'We used PyTorch implementation by jcwleo from https://github.com/jcwleo/random-network-distillation-pytorch' and 'We set the learning rate to 10^-3 and use the Adam (Kingma & Ba, 2014) optimizer.'
Experiment Setup	Yes	More technical details, including the neural network architecture, can be found in Appendix B.1. Detailed hyperparameter values can be found in Table 2. The model hyperparameters are shown in Table 1 analogical to Table 6 from (Petrenko et al., 2020). The final hyperparameters are listed in Table 3.