Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Reinforcement Teaching
Authors: Calarina Muslimani, Alex Lewandowski, Dale Schuurmans, Matthew E. Taylor, Jun Luo
TMLR 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To demonstrate the generality and effectiveness of Reinforcement Teaching, we conduct experiments in which a teacher learns to significantly improve both reinforcement and supervised learning algorithms. To demonstrate the generality and effectiveness of Reinforcement Teaching, we conduct experiments in both curriculum learning (Section 5.1) and step-size adaptation (Section 5.2). Results in discrete and continuous control environments show examples of Reinforcement Teaching, in which the teacher learns a policy that selects sub-tasks for an RL student. We report the area under the student's learning curve (AUC) when trained using the teacher's learned curriculum (see Tables 2 and 3). We also compare the teacher's own learning efficiency across the RL-teaching methods (see Figure 4-left). |
| Researcher Affiliation | Collaboration | Calarina Muslimani (1,2), Alex Lewandowski (1,2), Dale Schuurmans (1,3,4), Matthew E. Taylor (1,4), Jun Luo (2). 1: Department of Computing Science, University of Alberta; 2: Noah's Ark Lab, Huawei Technologies Canada Co., Ltd.; 3: Google Brain; 4: Alberta Machine Intelligence Institute (Amii) |
| Pseudocode | Yes | With the full Reinforcement Teaching framework outlined, see Algorithm 1 for the corresponding pseudocode of the teacher-student interaction. Algorithm 1: Reinforcement Teaching Framework. (A hedged sketch of this interaction loop follows the table.) |
| Open Source Code | Yes | A Code for Experiments: The source code to run our experiments can be found in this anonymized Dropbox link: https://www.dropbox.com/sh/hjkzzgctnqf6d8w/AAAYEycaDvPOeifz8FZbR3kLa?dl=0 |
| Open Datasets | Yes | Furthermore, we perform a policy-transfer experiment, where we demonstrate that with our approach, the teacher can learn a step-size adaptation policy that can be transferred to new students classifying different benchmark datasets (MNIST, Fashion-MNIST) and even new students with different architectures (see Figure 7). MNIST (LeCun et al., 2010) and Fashion-MNIST (Xiao et al., 2017). CIFAR-10: The student's neural network is a LeNet-5 CNN with a batch size of 128. Subsampled dataset to 10000 so that an episode covers one epoch of training. |
| Dataset Splits | No | The paper mentions subsampling datasets to 10000 for one epoch of training and provides batch sizes, but it does not explicitly state the training/test/validation splits (e.g., percentages or counts) needed to reproduce the data partitioning. For instance, it does not specify how the 10000 subsampled data points are divided into train, test, or validation sets. (A sketch illustrating this gap follows the table.) |
| Hardware Specification | No | No specific hardware details (like GPU models, CPU types, or memory) are provided in the paper. The paper refers to general terms like 'RL training loop' but does not specify the underlying hardware. |
| Software Dependencies | Yes | For the PPO student, we used the open-source implementation in (Willems & Karra, 2020). For the DDPG student, we used the OpenAI Baselines implementation (Dhariwal et al., 2017). PyTorch actor-critic deep reinforcement learning algorithms: A2C and PPO, 2020. URL https://github.com/lcswillems/torch-ac/tree/85d0b2b970ab402e3ab289a4b1f94572f9368dad. |
| Experiment Setup | Yes | See Table 10 for full specification of student hyperparameters. Table 5: Fixed teacher hyperparameters used across all methods. Table 6: Teacher agent hyperparameters for all methods (excluding ablation experiments). Table 9: Hyperparameters used in the teacher-student training procedure. Table 10: Student hyperparameters. The teacher in the supervised learning experiment used Double DQN with ϵ-greedy exploration and an ϵ value of 0.01. The batch size and hidden neural network size were 256. The action-value network had 1 hidden layer, but the state encoder had 2 hidden layers. (A network-shape sketch follows the table.) |
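
The Pseudocode row refers to the paper's Algorithm 1. As a reading aid, here is a minimal Python sketch of the teacher-student interaction it describes. All interface names (`teacher`, `student`, `encode`, `train_step`) are hypothetical placeholders, not identifiers from the authors' released code; consult the Dropbox link above for the real implementation.

```python
# Hypothetical sketch of the teacher-student loop in the paper's Algorithm 1.
# The teacher/student interfaces below are illustrative assumptions only.

def reinforcement_teaching_episode(teacher, student, horizon):
    """Run one teacher episode: each teacher step is one student update."""
    student.reset()                            # re-initialize the student
    state = teacher.encode(student)            # teacher's view of the student's learning state
    for _ in range(horizon):
        action = teacher.act(state)            # e.g. choose a sub-task or a step size
        metrics = student.train_step(action)   # student trains under that choice
        next_state = teacher.encode(student)
        reward = teacher.reward(metrics)       # e.g. improvement in student performance
        teacher.update(state, action, reward, next_state)
        state = next_state
```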
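The Dataset Splits row flags that the 10000-example subsampling is underspecified. The snippet below illustrates the gap: reproducing the subset requires a selection rule and a seed that the paper does not provide, so the random draw and seed here are arbitrary assumptions.

```python
import torch
from torchvision import datasets, transforms

# Subsampling MNIST to 10,000 examples, as the paper describes. The random
# selection and the seed (0) are assumptions; the paper does not specify how
# the subset was drawn or how it was split into train/validation/test.
full = datasets.MNIST("data", train=True, download=True,
                      transform=transforms.ToTensor())
perm = torch.randperm(len(full), generator=torch.Generator().manual_seed(0))
subset = torch.utils.data.Subset(full, perm[:10_000].tolist())
```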
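The Experiment Setup row gives the teacher's reported hyperparameters: Double DQN, ϵ = 0.01, hidden and batch size 256, a 2-hidden-layer state encoder, and a 1-hidden-layer action-value network. Below is a minimal PyTorch sketch of that network shape; `state_dim` and `n_actions` are placeholders, and this is only a reading of the quoted description, not the authors' code.

```python
import torch
import torch.nn as nn

HIDDEN = 256    # hidden width reported in the Experiment Setup row
EPSILON = 0.01  # epsilon-greedy value reported for the teacher

class TeacherQNetwork(nn.Module):
    """Teacher Q-network shape as described above: a 2-hidden-layer state
    encoder feeding a 1-hidden-layer action-value head. state_dim and
    n_actions are placeholders; see the paper's Tables 5-10 for the rest."""
    def __init__(self, state_dim: int, n_actions: int):
        super().__init__()
        self.encoder = nn.Sequential(              # 2 hidden layers
            nn.Linear(state_dim, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, HIDDEN), nn.ReLU(),
        )
        self.q_head = nn.Sequential(               # 1 hidden layer
            nn.Linear(HIDDEN, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.q_head(self.encoder(state))

def act_epsilon_greedy(q_net: TeacherQNetwork, state: torch.Tensor, n_actions: int) -> int:
    """Epsilon-greedy action selection with the reported epsilon."""
    if torch.rand(()) < EPSILON:
        return int(torch.randint(n_actions, (1,)).item())
    with torch.no_grad():
        return int(q_net(state).argmax(dim=-1).item())
```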