Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Reinforcement Teaching

Authors: Calarina Muslimani, Alex Lewandowski, Dale Schuurmans, Matthew E. Taylor, Jun Luo

TMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "To demonstrate the generality and effectiveness of Reinforcement Teaching, we conduct experiments in which a teacher learns to significantly improve both reinforcement and supervised learning algorithms." "To demonstrate the generality and effectiveness of Reinforcement Teaching, we conduct experiments in both curriculum learning (Section 5.1) and step-size adaptation (Section 5.2)." "Results in discrete and continuous control environments show examples of Reinforcement Teaching, in which the teacher learns a policy that selects sub-tasks for an RL student. We report the area under the student's learning curve (AUC) when trained using the teacher's learned curriculum (see Tables 2 and 3). We also compare the teacher's own learning efficiency across the RL-teaching methods (see Figure 4, left)."
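The evidence above reports the area under the student's learning curve (AUC) as the evaluation metric. A minimal sketch of how such an AUC might be computed over evenly spaced evaluation points (the function name and use of the trapezoidal rule are assumptions, not details from the paper):

```python
def learning_curve_auc(returns):
    """Area under a student's learning curve via the trapezoidal rule,
    assuming evenly spaced evaluation points. A higher AUC means the
    student reached high returns earlier in training."""
    return sum((a + b) / 2.0 for a, b in zip(returns, returns[1:]))

# A curriculum that speeds up learning yields a larger area:
fast = learning_curve_auc([0.0, 0.8, 1.0, 1.0])
slow = learning_curve_auc([0.0, 0.2, 0.5, 1.0])
```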
Researcher Affiliation | Collaboration | Calarina Muslimani (1,2), Alex Lewandowski (1,2), Dale Schuurmans (1,3,4), Matthew E. Taylor (1,4), Jun Luo (2). 1: Department of Computing Science, University of Alberta; 2: Noah's Ark Lab, Huawei Technologies Canada Co., Ltd.; 3: Google Brain; 4: Alberta Machine Intelligence Institute (Amii)
Pseudocode | Yes | "With the full Reinforcement Teaching framework outlined, see Algorithm 1 for the corresponding pseudocode of the teacher-student interaction." (Algorithm 1: Reinforcement Teaching Framework)
Open Source Code | Yes | "The source code to run our experiments can be found in this anonymized dropbox link: https://www.dropbox.com/sh/hjkzzgctnqf6d8w/AAAYEycaDvPOeifz8FZbR3kLa?dl=0" (Appendix A, Code for Experiments)
Open Datasets | Yes | "Furthermore, we perform a policy-transfer experiment, where we demonstrate that with our approach, the teacher can learn a step-size adaptation policy that can be transferred to new students classifying different benchmark datasets (MNIST, Fashion-MNIST) and even new students with different architectures (see Figure 7)." The datasets used are MNIST (LeCun et al., 2010), Fashion-MNIST (Xiao et al., 2017), and CIFAR-10. For CIFAR-10, the student's neural network is a LeNet-5 CNN with a batch size of 128, and the dataset is subsampled to 10,000 examples so that an episode covers one epoch of training.
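The subsampling described above (10,000 examples, batch size 128, one epoch per teacher episode) could be sketched as follows; the fixed seed and sampling without replacement are assumptions for illustration, not details stated in the paper:

```python
import numpy as np

BATCH_SIZE = 128    # CIFAR-10 student batch size from the paper
SUBSAMPLE = 10_000  # subsampled training-set size from the paper

def subsample_indices(n_total, n_keep=SUBSAMPLE, seed=0):
    """Draw a fixed random subset of the training set so that one
    epoch over it fits within a single teacher episode."""
    rng = np.random.default_rng(seed)
    return rng.choice(n_total, size=n_keep, replace=False)

# With these settings, one teacher episode spans 78 full batches:
steps_per_episode = SUBSAMPLE // BATCH_SIZE
```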
Dataset Splits | No | The paper mentions subsampling datasets to 10,000 examples for one epoch of training and provides batch sizes, but it does not explicitly state the training/validation/test splits (e.g., percentages or counts) needed to reproduce the data partitioning. For instance, it does not specify how the 10,000 subsampled data points are divided into train, validation, or test sets.
Hardware Specification | No | No specific hardware details (such as GPU models, CPU types, or memory) are provided in the paper. The paper uses general terms like "RL training loop" but does not specify the underlying hardware.
Software Dependencies | Yes | "For the PPO student, we used the open-source implementation in (Willems & Karra, 2020). For the DDPG student, we used the OpenAI Baselines implementation (Dhariwal et al., 2017)." Willems & Karra (2020): PyTorch actor-critic deep reinforcement learning algorithms: A2C and PPO. URL: https://github.com/lcswillems/torch-ac/tree/85d0b2b970ab402e3ab289a4b1f94572f9368dad
Experiment Setup | Yes | "See Table 10 for full specification of student hyperparameters." The paper also provides Table 5 (fixed teacher hyperparameters used across all methods), Table 6 (teacher agent hyperparameters for all methods, excluding ablation experiments), and Table 9 (hyperparameters used in the teacher-student training procedure). "The teacher in the supervised learning experiment used Double DQN with ϵ-greedy exploration and an ϵ value of 0.01. The batch size and hidden neural network size were 256. The action-value network had 1 hidden layer, but the state encoder had 2 hidden layers."
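The teacher's ϵ-greedy exploration with ϵ = 0.01, quoted above, can be sketched as a simple action-selection rule; the function name and NumPy-based implementation are assumptions for illustration, not the paper's code:

```python
import numpy as np

EPSILON = 0.01  # exploration value reported for the teacher's Double DQN

def epsilon_greedy(q_values, epsilon=EPSILON, rng=None):
    """With probability epsilon take a uniformly random action;
    otherwise take the greedy (highest-Q) action."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))
```

With ϵ this small, the teacher acts greedily on roughly 99% of steps, relying on Double DQN's decoupled action selection and evaluation to keep value estimates stable.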