Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Learn the Time to Learn: Replay Scheduling in Continual Learning
Authors: Marcus Klasson, Hedvig Kjellström, Cheng Zhang
TMLR 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we present the experimental results to show the importance of replay scheduling in CL. First, we demonstrate the benefits of replay scheduling by using MCTS for finding replay schedules in Section 4.1. Thereafter, we evaluate our RL-based framework using DQN (Mnih et al., 2013) and A2C (Mnih et al., 2016) for learning policies that generalize to new CL scenarios in Section 4.2. Full details on experimental settings and additional results are in Appendix C and D. We conduct experiments on several CL benchmark datasets: Split MNIST (LeCun et al., 1998; Zenke et al., 2017), Split FashionMNIST (Xiao et al., 2017), Split notMNIST (Bulatov, 2011), Permuted MNIST (Goodfellow et al., 2013), Split CIFAR-100 (Krizhevsky & Hinton, 2009), and Split miniImagenet (Vinyals et al., 2016). |
| Researcher Affiliation | Collaboration | Marcus Klasson (Aalto University), Hedvig Kjellström (KTH Royal Institute of Technology), Cheng Zhang (Microsoft Research) |
| Pseudocode | Yes | A Additional Methodology: In this section, we provide pseudo-code for MCTS to search for replay schedules in single CL environments (Section A.1), as well as pseudo-code for the RL-based framework for learning the replay scheduling policies (Section A.2). Algorithm 1: Monte Carlo Tree Search for Replay Scheduling; Algorithm 2: RL Framework for Learning Replay Scheduling Policy |
| Open Source Code | Yes | Code is publicly available under the MIT license: https://github.com/marcusklasson/replay_scheduling |
| Open Datasets | Yes | We conduct experiments on several CL benchmark datasets: Split MNIST (LeCun et al., 1998; Zenke et al., 2017), Split FashionMNIST (Xiao et al., 2017), Split notMNIST (Bulatov, 2011), Permuted MNIST (Goodfellow et al., 2013), Split CIFAR-100 (Krizhevsky & Hinton, 2009), and Split miniImagenet (Vinyals et al., 2016). |
| Dataset Splits | Yes | We let the network $f_\phi$, parameterized by $\phi$, learn $T$ tasks sequentially from the datasets $D_1, \dots, D_T$ arriving one at a time. The $t$-th dataset $D_t = \{(x_t^{(i)}, y_t^{(i)})\}_{i=1}^{N_t}$ consists of $N_t$ samples, where $x_t^{(i)}$ and $y_t^{(i)}$ are the $i$-th data point and class label, respectively. Furthermore, each dataset is split into a training, validation, and test set, i.e., $D_t = \{D_t^{(\text{train})}, D_t^{(\text{val})}, D_t^{(\text{test})}\}$. MCTS and Heur-GD randomly sample 15% of the training data of each task to use for validation. |
| Hardware Specification | Yes | All experiments were performed on one NVIDIA GeForce RTX 2080 Ti on an internal GPU cluster. |
| Software Dependencies | No | The code for DQN was adapted from OpenAI Baselines (Dhariwal et al., 2017) and the PyTorch (Paszke et al., 2019) tutorial on DQN: https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html. For A2C, we followed the implementations released by Kostrikov (2018) and Igl et al. (2021). Explanation: The paper mentions software libraries like PyTorch and refers to other implementations, but does not provide specific version numbers for these software dependencies, which is required for reproducibility. |
| Experiment Setup | Yes | CL Hyperparameters. We train all networks with the Adam optimizer (Kingma & Ba, 2015) with learning rate $\eta = 0.001$ and hyperparameters $\beta_1 = 0.9$ and $\beta_2 = 0.999$. Note that the learning rate for Adam is not reset before training on a new task. Next, we give the number of training epochs and batch sizes specific to each dataset: Split MNIST: 10 epochs/task, batch size 128. Split FashionMNIST: 30 epochs/task, batch size 128. Split notMNIST: 50 epochs/task, batch size 128. Permuted MNIST: 20 epochs/task, batch size 128. Split CIFAR-100: 25 epochs/task, batch size 256. Split miniImagenet: 1 epoch/task (task 1 trained for 5 epochs as warm-up), batch size 32. |
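The validation protocol (Dataset Splits row) and optimizer handling (Experiment Setup row) quoted above can be sketched together in PyTorch. This is a minimal illustration under stated assumptions, not the authors' released code: the dataset keys, `make_task_splits`, and `train_sequentially` are hypothetical names, and the loop omits the replay-scheduling mechanism itself. The key reproducibility details it encodes are the 15% random validation hold-out per task and the single Adam optimizer whose state is not reset between tasks.

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, random_split

# Per-dataset (epochs/task, batch size) transcribed from the Experiment Setup row.
# Split miniImagenet additionally trains task 1 for 5 warm-up epochs (not modeled here).
EPOCHS_BATCH = {
    "split_mnist": (10, 128),
    "split_fashion_mnist": (30, 128),
    "split_notmnist": (50, 128),
    "permuted_mnist": (20, 128),
    "split_cifar100": (25, 256),
    "split_miniimagenet": (1, 32),
}

def make_task_splits(train_data, val_fraction=0.15, seed=0):
    """Randomly hold out 15% of a task's training data for validation,
    as reported for MCTS and Heur-GD."""
    n_val = int(len(train_data) * val_fraction)
    gen = torch.Generator().manual_seed(seed)
    return random_split(train_data, [len(train_data) - n_val, n_val], generator=gen)

def train_sequentially(model, task_datasets, dataset_name):
    epochs, batch_size = EPOCHS_BATCH[dataset_name]
    # One Adam optimizer is created up front; its internal state (and hence the
    # effective learning rate) is NOT reset when a new task arrives.
    opt = optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
    loss_fn = nn.CrossEntropyLoss()
    for task_data in task_datasets:
        train_set, _val_set = make_task_splits(task_data)
        loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
        for _ in range(epochs):
            for x, y in loader:
                opt.zero_grad()
                loss_fn(model(x), y).backward()
                opt.step()
    return model
```

Keeping the optimizer alive across tasks matters for reproducibility: re-instantiating Adam per task would reset its moment estimates and change the effective step sizes early in each task.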