Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Reinforcement Teaching
Authors: Calarina Muslimani, Alex Lewandowski, Dale Schuurmans, Matthew E. Taylor, Jun Luo
TMLR 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To demonstrate the generality and effectiveness of Reinforcement Teaching, we conduct experiments in which a teacher learns to significantly improve both reinforcement and supervised learning algorithms. To demonstrate the generality and effectiveness of Reinforcement Teaching, we conduct experiments in both curriculum learning (Section 5.1) and step-size adaptation (Section 5.2). Results in discrete and continuous control environments show examples of Reinforcement Teaching, in which the teacher learns a policy that selects sub-tasks for an RL student. We report the area under the student's learning curve (AUC) when trained using the teacher's learned curriculum (see Tables 2 and 3). We also compare the teacher's own learning efficiency across the RL-teaching methods (see Figure 4-left). |
| Researcher Affiliation | Collaboration | Calarina Muslimani (1,2), Alex Lewandowski (1,2), Dale Schuurmans (1,3,4), Matthew E. Taylor (1,4), Jun Luo (2). 1: Department of Computing Science, University of Alberta; 2: Noah's Ark Lab, Huawei Technologies Canada Co., Ltd.; 3: Google Brain; 4: Alberta Machine Intelligence Institute (Amii) |
| Pseudocode | Yes | With the full Reinforcement Teaching framework outlined, see Algorithm 1 for the corresponding pseudocode of the teacher-student interaction. Algorithm 1: Reinforcement Teaching Framework. (A hedged sketch of this interaction loop follows the table.) |
| Open Source Code | Yes | A Code for Experiments: The source code to run our experiments can be found in this anonymized Dropbox link: https://www.dropbox.com/sh/hjkzzgctnqf6d8w/AAAYEycaDvPOeifz8FZbR3kLa?dl=0 |
| Open Datasets | Yes | Furthermore, we perform a policy-transfer experiment, where we demonstrate that with our approach, the teacher can learn a step-size adaptation policy that can be transferred to new students classifying different benchmark datasets (MNIST, Fashion-MNIST) and even new students with different architectures (see Figure 7). MNIST (LeCun et al., 2010) and Fashion-MNIST (Xiao et al., 2017). CIFAR-10: The student's neural network is a LeNet-5 CNN with a batch size of 128. Subsampled dataset to 10000 so that an episode covers one epoch of training. |
| Dataset Splits | No | The paper mentions subsampling datasets to 10000 for one epoch of training and provides batch sizes, but it does not explicitly state the training/test/validation splits (e.g., percentages or counts) needed to reproduce the data partitioning. For instance, it does not specify how the 10000 subsampled data points are divided into train, test, or validation sets. (A sketch illustrating this gap follows the table.) |
| Hardware Specification | No | No specific hardware details (like GPU models, CPU types, or memory) are provided in the paper. The paper refers to general terms like 'RL training loop' but does not specify the underlying hardware. |
| Software Dependencies | Yes | For the PPO student, we used the open-source implementation in (Willems & Karra, 2020). For the DDPG student, we used the OpenAI Baselines implementation (Dhariwal et al., 2017). PyTorch actor-critic deep reinforcement learning algorithms: A2C and PPO, 2020. URL https://github.com/lcswillems/torch-ac/tree/85d0b2b970ab402e3ab289a4b1f94572f9368dad. |
| Experiment Setup | Yes | See Table 10 for full specification of student hyperparameters. Table 5: Fixed teacher hyperparameters used across all methods. Table 6: Teacher agent hyperparameters for all methods (excluding ablation experiments). Table 9: Hyperparameters used in the teacher-student training procedure. Table 10: Student hyperparameters. The teacher in the supervised learning experiment used Double DQN with ϵ-greedy exploration and an ϵ value of 0.01. The batch size and hidden neural network size were 256. The action-value network had 1 hidden layer, but the state encoder had 2 hidden layers. (A network-shape sketch follows the table.) |
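
The Pseudocode row refers to the paper's Algorithm 1. As a reading aid, here is a minimal Python sketch of the teacher-student interaction it describes. All interface names (`teacher`, `student`, `encode`, `train_step`) are hypothetical placeholders, not identifiers from the authors' released code; consult the Dropbox link above for the real implementation.

```python
# Hypothetical sketch of the teacher-student loop in the paper's Algorithm 1.
# The teacher/student interfaces below are illustrative assumptions only.

def reinforcement_teaching_episode(teacher, student, horizon):
    """Run one teacher episode: each teacher step is one student update."""
    student.reset()                            # re-initialize the student
    state = teacher.encode(student)            # teacher's view of the student's learning state
    for _ in range(horizon):
        action = teacher.act(state)            # e.g. choose a sub-task or a step size
        metrics = student.train_step(action)   # student trains under that choice
        next_state = teacher.encode(student)
        reward = teacher.reward(metrics)       # e.g. improvement in student performance
        teacher.update(state, action, reward, next_state)
        state = next_state
```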
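The Dataset Splits row flags that the 10000-example subsampling is underspecified. The snippet below illustrates the gap: reproducing the subset requires a selection rule and a seed that the paper does not provide, so the random draw and seed here are arbitrary assumptions.

```python
import torch
from torchvision import datasets, transforms

# Subsampling MNIST to 10,000 examples, as the paper describes. The random
# selection and the seed (0) are assumptions; the paper does not specify how
# the subset was drawn or how it was split into train/validation/test.
full = datasets.MNIST("data", train=True, download=True,
                      transform=transforms.ToTensor())
perm = torch.randperm(len(full), generator=torch.Generator().manual_seed(0))
subset = torch.utils.data.Subset(full, perm[:10_000].tolist())
```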
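The Experiment Setup row gives the teacher's reported hyperparameters: Double DQN, ϵ = 0.01, hidden and batch size 256, a 2-hidden-layer state encoder, and a 1-hidden-layer action-value network. Below is a minimal PyTorch sketch of that network shape; `state_dim` and `n_actions` are placeholders, and this is only a reading of the quoted description, not the authors' code.

```python
import torch
import torch.nn as nn

HIDDEN = 256    # hidden width reported in the Experiment Setup row
EPSILON = 0.01  # epsilon-greedy value reported for the teacher

class TeacherQNetwork(nn.Module):
    """Teacher Q-network shape as described above: a 2-hidden-layer state
    encoder feeding a 1-hidden-layer action-value head. state_dim and
    n_actions are placeholders; see the paper's Tables 5-10 for the rest."""
    def __init__(self, state_dim: int, n_actions: int):
        super().__init__()
        self.encoder = nn.Sequential(              # 2 hidden layers
            nn.Linear(state_dim, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, HIDDEN), nn.ReLU(),
        )
        self.q_head = nn.Sequential(               # 1 hidden layer
            nn.Linear(HIDDEN, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.q_head(self.encoder(state))

def act_epsilon_greedy(q_net: TeacherQNetwork, state: torch.Tensor, n_actions: int) -> int:
    """Epsilon-greedy action selection with the reported epsilon."""
    if torch.rand(()) < EPSILON:
        return int(torch.randint(n_actions, (1,)).item())
    with torch.no_grad():
        return int(q_net(state).argmax(dim=-1).item())
```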