Fine-tuning Reinforcement Learning Models is Secretly a Forgetting Mitigation Problem

Authors: Maciej Wolczyk, Bartłomiej Cupiał, Mateusz Ostaszewski, Michał Bortkiewicz, Michał Zając, Razvan Pascanu, Łukasz Kuciński, Piotr Miłoś

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through a detailed empirical analysis of the challenging NetHack and Montezuma's Revenge environments, we show that standard knowledge retention techniques mitigate the problem and thus allow us to take full advantage of the pre-trained capabilities. |
| Researcher Affiliation | Collaboration | 1 IDEAS NCBR, 2 University of Warsaw, 3 Warsaw University of Technology, 4 Jagiellonian University, 5 Google DeepMind, 6 Institute of Mathematics, Polish Academy of Sciences, 7 deepsense.ai |
| Pseudocode | Yes | Algorithm 1 Robotic Sequence |
| Open Source Code | Yes | The code is available at https://github.com/BartekCupial/finetuning-RL-as-CL. |
| Open Datasets | Yes | NetHack Learning Environment (Küttler et al., 2020) is a complex game [...]; Montezuma's Revenge is a popular video game [...] (Bellemare et al., 2013); Robotic Sequence is a multi-stage robotic task based on the Meta-World benchmark (Yu et al., 2020); We take the current state-of-the-art neural model (Tuyls et al., 2023) as our pre-trained policy π. (See the environment sketch after this table.) |
| Dataset Splits | No | No explicit train/validation/test dataset splits were provided. The paper describes continuous training processes and evaluation at specific checkpoints rather than distinct dataset splits. |
| Hardware Specification | Yes | In this setup, we can run over 500 million environment steps under 24 hours of training on A100 Nvidia GPU. For each experiment, we use 8 CPU cores and 30GB RAM. |
| Software Dependencies | No | No specific version numbers for software dependencies (e.g., PyTorch, Adam optimizer) were provided. The paper mentions "We used the PyTorch implementation by jcwleo from https://github.com/jcwleo/random-network-distillation-pytorch" and "We set the learning rate to 10^-3 and use the Adam (Kingma & Ba, 2014) optimizer." (See the optimizer sketch after this table.) |
| Experiment Setup | Yes | More technical details, including the neural network architecture, can be found in Appendix B.1. Detailed hyperparameter values can be found in Table 2. The model hyperparameters are shown in Table 1, analogical to Table 6 from (Petrenko et al., 2020). The final hyperparameters are listed in Table 3. |
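
The Open Datasets row refers to interactive RL environments rather than static datasets. As a point of reference, the sketch below shows one common way such environments are instantiated; the environment IDs and the classic Gym API used here are assumptions on our part and are not specified in the record above or verified against the authors' code.

```python
# Minimal sketch (assumption, not the authors' code): instantiating the two main
# benchmark environments named in the Open Datasets row. Assumes the classic Gym
# API (reset() -> obs; step() -> obs, reward, done, info) and standard IDs.
import gym
import nle  # noqa: F401  -- importing nle registers the NetHack* Gym environments

# NetHack Learning Environment (Küttler et al., 2020)
nethack_env = gym.make("NetHackChallenge-v0")
nethack_obs = nethack_env.reset()

# Montezuma's Revenge from the Arcade Learning Environment (Bellemare et al., 2013);
# requires the Atari extras (e.g., gym[atari]) and installed ROMs.
montezuma_env = gym.make("MontezumaRevengeNoFrameskip-v4")
montezuma_obs = montezuma_env.reset()

# A single random-policy step, just to show the interaction loop.
action = montezuma_env.action_space.sample()
obs, reward, done, info = montezuma_env.step(action)
```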
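
The Software Dependencies row quotes a learning rate of 10^-3 with the Adam optimizer but gives no library versions. The sketch below is a minimal, hypothetical PyTorch setup matching that quote; the placeholder network is illustrative only and is not the paper's architecture, which is described in its Appendix B.1.

```python
# Minimal sketch (assumption, not the authors' code): Adam with lr = 1e-3, as quoted
# in the Software Dependencies row, applied to a placeholder policy network.
import torch
import torch.nn as nn

policy = nn.Sequential(        # placeholder network, not the paper's architecture
    nn.Linear(128, 256),
    nn.ReLU(),
    nn.Linear(256, 8),         # e.g., 8 discrete actions (illustrative)
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

# One illustrative gradient step on random data (shapes are placeholders).
obs_batch = torch.randn(32, 128)
loss = policy(obs_batch).pow(2).mean()   # dummy loss standing in for the RL objective
optimizer.zero_grad()
loss.backward()
optimizer.step()
```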