Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Position: Lifetime tuning is incompatible with continual reinforcement learning

Authors: Golnaz Mesbahi, Parham Mohammad Panahi, Olya Mastikhina, Steven Tang, Martha White, Adam White

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We provide empirical evidence to support our position by testing DQN and SAC across several continuing and non-stationary environments, with two main findings: (1) lifetime tuning does not allow us to identify algorithms that work well for continual learning; all algorithms equally succeed; (2) recently developed continual RL algorithms outperform standard non-continual algorithms when tuning is limited to a fraction of the agent's lifetime.
Researcher Affiliation | Academia | 1 Department of Computing Science, University of Alberta, Edmonton, Canada; 2 Alberta Machine Intelligence Institute (Amii); 3 Canada CIFAR AI Chair.
Pseudocode | No | The paper describes various algorithms and methodologies (DQN, SAC, W0 regularization, PT-DQN) but does not present any of them in a structured pseudocode or algorithm block.
Open Source Code | No | The paper does not contain any explicit statement about releasing source code or a link to a code repository for the methodology described.
Open Datasets | Yes | We contrast using k-percent tuning and lifetime tuning to compare continual-learning mitigation strategies for DQN in Jelly Bean World, a testbed for never-ending, continual learning (Platanios et al., 2020).
Dataset Splits | No | The paper discusses tuning phases (e.g., "k-percent of its lifetime") and runs, but does not provide specific training/test/validation dataset splits with percentages, sample counts, or predefined split references.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU types, or memory amounts) used for running its experiments.
Software Dependencies | No | The paper does not provide specific software dependencies or library names with version numbers needed to replicate the experiments.
Experiment Setup | Yes | We consider a large set of hyperparameters for DQN, sweeping exploration (epsilon), batch size, buffer size, minimum buffer size, and the learning rate and β2 of the Adam optimizer. The ranges and chosen hyperparameters are listed in Tables 1 and 2, respectively. ... From Table 1 — Learning rate: 10^i for i ∈ [−1, …, −5], 0.08; Batch size: 32, 256; Buffer size: 1,000; 10,000; 100,000; Min buffer size: 0; 1,000; Exploration ε: 0.01, 0.1; Adam optimizer β2: 0.9, 0.999.
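The Table 1 ranges above define a hyperparameter grid for the DQN sweep. The sketch below is an illustrative reconstruction of that grid, not the authors' tuning code; the grouping of 0.08 with the learning-rate values follows the extracted table and is an assumption.

```python
from itertools import product

# Hyperparameter ranges as listed in Table 1 of the paper.
# Note: 0.08 appears alongside the powers of ten in the extracted table;
# it is treated here as an additional learning-rate candidate (assumption).
grid = {
    "learning_rate": [10**-i for i in range(1, 6)] + [0.08],
    "batch_size": [32, 256],
    "buffer_size": [1_000, 10_000, 100_000],
    "min_buffer_size": [0, 1_000],
    "exploration_epsilon": [0.01, 0.1],
    "adam_beta2": [0.9, 0.999],
}

def all_configs(grid):
    """Yield every configuration in the full cross-product sweep."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

configs = list(all_configs(grid))
print(len(configs))  # 6 * 2 * 3 * 2 * 2 * 2 = 288 configurations
```

Enumerating the cross product makes the cost of such a sweep concrete: even these modest per-parameter ranges yield 288 configurations, which is why the paper's restriction of tuning to a fraction of the agent's lifetime matters.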
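Several rows above refer to the paper's k-percent tuning protocol, in which hyperparameter tuning is restricted to the first k percent of the agent's lifetime. A minimal sketch of that budget split, as an illustrative reconstruction (the function name and step counts are hypothetical, not from the paper):

```python
def k_percent_budget(lifetime_steps: int, k: float) -> tuple[int, int]:
    """Split an agent's lifetime into a tuning budget (the first k percent
    of steps) and the remaining deployment steps, mirroring the k-percent
    tuning protocol. Illustrative only; not the authors' code."""
    tune = int(lifetime_steps * k / 100)
    return tune, lifetime_steps - tune

# Example: a 1M-step lifetime with a 10% tuning budget (hypothetical numbers).
tune, deploy = k_percent_budget(1_000_000, 10)
print(tune, deploy)  # 100000 900000
```

Lifetime tuning corresponds to k = 100, i.e. hyperparameters are selected using the agent's entire lifetime, which the paper argues is unrealistic for continual learning.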