On Lottery Tickets and Minimal Task Representations in Deep Reinforcement Learning
Authors: Marc Vischer, Robert Tjarko Lange, Henning Sprekeler
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We compare supervised behavioral cloning with DRL, putting a special emphasis on the resulting input representations used for prediction and control. Thereby, we connect the statistical perspective of sparse structure discovery (e.g. Hastie et al., 2019) with the iterative magnitude pruning (IMP, Han et al., 2015) procedure in the context of Markov decision processes (MDPs). The contributions of this work are summarized as follows: 1. We show that winning tickets exist in both high-dimensional visual and control tasks (continuous/discrete). A positive lottery ticket effect is robustly observed for both off-policy DRL algorithms, including Deep-Q-Networks (DQN, Mnih et al., 2015), and on-policy policy-gradient methods (PPO, Schulman et al., 2015; 2017), providing evidence that the lottery ticket effect is a universal phenomenon across optimization formulations in DRL. |
| Researcher Affiliation | Academia | Marc Vischer (Technical University Berlin); Robert Tjarko Lange (Technical University Berlin, Science of Intelligence); Henning Sprekeler (Technical University Berlin, Science of Intelligence) |
| Pseudocode | No | The paper describes the iterative magnitude pruning (IMP) procedure in prose but does not present it in a pseudocode block or algorithm format (a hedged sketch of a generic IMP loop is given after the table below). |
| Open Source Code | No | We will release the code after the publication of the paper. |
| Open Datasets | Yes | We scale our results to four PyBullet (Ellenberger, 2018) continuous control environments and a subset of the ALE benchmark (Bellemare et al., 2013) environments. To test the robustness of the lottery ticket phenomenon to different architectures and diverse tasks, we repeat the baseline comparison distillation experiments for the MinAtar environments (see figure 9). We trained MLP and CNN-based agents to distill expert value estimators, using the same architecture and hyperparameters for all considered games. For MLP agents, the mask consistently contributes most to the ticket. For CNN-based agents and selected games (Asterix and Space Invaders), the weight initialization contributes more. |
| Dataset Splits | No | The paper mentions 'evaluation episodes' for performance measurement, but it does not specify traditional training/validation/test dataset splits, which is common in reinforcement learning where data is generated dynamically during training. |
| Hardware Specification | No | The paper states: 'The simulations were conducted on a CPU cluster and no GPUs were used.' and 'Each individual IMP run required between 8 (Cart-Pole and Acrobot), 10 (Maze Grid, MinAtar) and 20 cores (PyBullet and ATARI environments).' It does not provide specific CPU models, GPU models, or detailed cloud instance specifications. |
| Software Dependencies | No | All simulations were implemented in Python using the rlpyt DRL training package (Stooke & Abbeel, 2019, MIT License) and PyTorch pruning utilities (Paszke et al., 2017). The environments were implemented via the OpenAI gym (Brockman et al., 2016, MIT License), MinAtar (Young & Tian, 2019, GPL-3.0 License) and PyBullet gym (Ellenberger, 2018) packages and the ALE benchmark environment (Bellemare et al., 2013). Furthermore, all visualizations were done using Matplotlib (Hunter, 2007) and Seaborn (Waskom, 2021, BSD-3-Clause License). Finally, the numerical analysis was supported by NumPy (Harris et al., 2020, BSD-3-Clause License). While the paper lists software packages and their original publication years, it does not provide specific version numbers (e.g., PyTorch 1.x, Python 3.y). |
| Experiment Setup | Yes | Appendix C (Hyperparameter Settings for Reproduction): All simulations were implemented in Python using the rlpyt DRL training package (Stooke & Abbeel, 2019, MIT License) and PyTorch pruning utilities (Paszke et al., 2017). The environments were implemented via the OpenAI gym (Brockman et al., 2016, MIT License), MinAtar (Young & Tian, 2019, GPL-3.0 License) and PyBullet gym (Ellenberger, 2018) packages and the ALE benchmark environment (Bellemare et al., 2013). Furthermore, all visualizations were done using Matplotlib (Hunter, 2007) and Seaborn (Waskom, 2021, BSD-3-Clause License). Finally, the numerical analysis was supported by NumPy (Harris et al., 2020, BSD-3-Clause License). We will release the code after the publication of the paper. The simulations were conducted on a CPU cluster and no GPUs were used. Each individual IMP run required between 8 (Cart-Pole and Acrobot), 10 (Maze Grid, MinAtar) and 20 cores (PyBullet and ATARI environments). Depending on the setting, a full lottery ticket experiment of 20 to 30 iterations lasts between 2 hours (Cart-Pole) and 5 days (ATARI games) of training time. Table 2, hyperparameters for the BC algorithm on Cart-Pole (results in figs. 2, 3 and 11): Student Network Size 128,128 and 256,256 units; Teacher Network Size 64,64 and 128,128 units; Learning Rate 0.001 (Adam); Training Environment Steps 10,000; Number of Workers 4; Distillation Loss: cross-entropy between expert and student policies. Table 3, hyperparameters for the PPO algorithm on Cart-Pole (results in figs. 2, 8 and 11): Optimizer Adam; Learning Rate 0.0005; Temporal Discount Factor 0.99; GAE λ 0.8; Value Loss Coeff. 0.5; Entropy Loss Coeff. 0.001; Likelihood Ratio Clip 0.2; Training Environment Steps 80,000; Number of Workers 4; Number of Epochs 4. (Hedged sketches of the distillation loss and these hyperparameter settings follow the table.) |
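
As noted in the Pseudocode row, the paper only describes iterative magnitude pruning in prose. Below is a minimal, hedged sketch of a generic IMP loop built on PyTorch's pruning utilities (which the paper reports using); the `train_agent` callback, the 20% per-round pruning rate, and the choice of prunable layers are assumptions for illustration, not the authors' implementation.

```python
# Hedged sketch of iterative magnitude pruning (IMP) with weight rewinding.
# Assumptions: `train_agent` is a user-supplied training loop (e.g. DQN/PPO/BC),
# only Linear/Conv2d weights are pruned, and 20% of the remaining weights are
# removed per round.
import copy

import torch
import torch.nn as nn
import torch.nn.utils.prune as prune


def iterative_magnitude_pruning(model: nn.Module, train_agent,
                                rounds: int = 20, rate: float = 0.2):
    """Train, prune low-magnitude weights, rewind survivors to their
    initial values, and repeat; returns the per-round pruning masks."""
    init_state = copy.deepcopy(model.state_dict())  # weights at initialization
    prunable = [(m, "weight") for m in model.modules()
                if isinstance(m, (nn.Linear, nn.Conv2d))]
    masks = []
    for _ in range(rounds):
        train_agent(model)  # assumed training loop for the current ticket

        # Globally prune a fraction `rate` of the remaining weights by L1 magnitude.
        prune.global_unstructured(prunable,
                                  pruning_method=prune.L1Unstructured,
                                  amount=rate)
        masks.append({name: buf.clone()
                      for name, buf in model.named_buffers()
                      if name.endswith("weight_mask")})

        # Rewind surviving weights to their original initialization; the
        # pruning masks are kept as buffers by the pruning reparametrization.
        with torch.no_grad():
            for name, param in model.named_parameters():
                key = name.replace("weight_orig", "weight")
                if key in init_state:
                    param.copy_(init_state[key])
    return masks
```

The returned masks, together with the rewound initial weights, form the candidate tickets whose mask and initialization contributions the paper compares in its baseline experiments.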
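The Cart-Pole behavioral cloning setup (Table 2) trains the student with a cross-entropy loss between the expert and student policies. The following is a minimal sketch of such a distillation objective, assuming both policies are available as action logits; the function name and shapes are illustrative and not taken from the authors' code.

```python
# Hedged sketch of a policy-distillation (behavioral cloning) loss:
# cross-entropy H(pi_expert, pi_student), averaged over a batch of states.
import torch
import torch.nn.functional as F


def distillation_loss(student_logits: torch.Tensor,
                      expert_logits: torch.Tensor) -> torch.Tensor:
    expert_probs = F.softmax(expert_logits, dim=-1)       # target (expert) policy
    student_logp = F.log_softmax(student_logits, dim=-1)  # student log-policy
    return -(expert_probs * student_logp).sum(dim=-1).mean()
```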
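For convenience, the reported Cart-Pole hyperparameters (Tables 2 and 3) can be collected into plain Python dictionaries. The key names below are illustrative assumptions and do not correspond to rlpyt's exact argument names.

```python
# Illustrative config dicts mirroring the reported hyperparameters; the key
# names are assumptions, not the rlpyt API.
bc_cartpole_config = {
    "student_hidden_sizes": [(128, 128), (256, 256)],  # both student sizes tested
    "teacher_hidden_sizes": [(64, 64), (128, 128)],
    "learning_rate": 1e-3,          # Adam
    "train_env_steps": 10_000,
    "num_workers": 4,
    "loss": "cross_entropy_expert_student_policies",
}

ppo_cartpole_config = {
    "optimizer": "Adam",
    "learning_rate": 5e-4,
    "discount": 0.99,               # temporal discount factor
    "gae_lambda": 0.8,
    "value_loss_coeff": 0.5,
    "entropy_loss_coeff": 0.001,
    "ratio_clip": 0.2,              # PPO likelihood-ratio clip
    "train_env_steps": 80_000,
    "num_workers": 4,
    "num_epochs": 4,                # optimization epochs per batch
}
```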