Efficient Transformers in Reinforcement Learning using Actor-Learner Distillation
Authors: Emilio Parisotto, Ruslan Salakhutdinov
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "As a case study, we develop this procedure in the context of partially-observable environments, where transformer models have had large improvements over LSTMs recently, at the cost of significantly higher computational complexity. With transformer models as the learner and LSTMs as the actor, we demonstrate in several challenging memory environments that using Actor-Learner Distillation recovers the clear sample-efficiency gains of the transformer learner model while maintaining the fast inference and reduced total training time of the LSTM actor model." (Abstract) and Section 5, "Experiments". (A hedged sketch of this actor/learner pairing is given below the table.) |
| Researcher Affiliation | Academia | Emilio Parisotto & Ruslan Salakhutdinov Machine Learning Department Carnegie Mellon University Pittsburgh, PA 15213, USA {eparisot,rsalakhu}@cs.cmu.edu |
| Pseudocode | No | The paper does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statements or links indicating that the source code for the described methodology is open-source or publicly available. |
| Open Datasets | No | The paper uses custom environments 'I-Maze' and 'Meta-Fetch' for its experiments, but does not provide concrete access information (links, DOIs, repositories, or citations with author/year for public versions) for these environments/datasets. |
| Dataset Splits | Yes | For all models, we sweep over the V-MPO target network update frequency KL {1, 10, 100}. In initial experiments, we also sweeped the Initial α setting over values {0.1, 0.5, 1.0, 5.0}. All experiment runs have 3 unique seeds. For each model, we choose the hyperparameter setting that achieved highest mean return over all seeds. |
| Hardware Specification | Yes | Reference Machine A: Reference Machine A has a 36-thread Intel(R) Core(TM) i9-7980XE CPU @ 2.60GHz, 64GB of RAM, and 2 GPUs: a GeForce GTX 1080 Ti and a TITAN V. Reference Machine B: Reference Machine B has a 40-thread Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz, 256GB of RAM and 2 GPUs: a Tesla P40 and a Tesla V100-PCIE-16GB. |
| Software Dependencies | No | The paper mentions several algorithms and tools, such as 'V-MPO', 'IMPALA', 'PopArt', and the Adam optimizer, but it does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | Table 1 (Common hyperparameters across experiments): Optimizer = Adam; Learning Rate = 0.0001; NA = 30; ND = 8; Batch Size = 64; TU = 20; Discount Factor (γ) = 0.99; Grad. Norm. Clipping = Disabled; Initial η = 1.0; Initial α ∈ {0.1, 0.5, 1.0, 5.0}; ϵη = 0.1; ϵα = 0.004; KL ∈ {1, 10, 100}; PopArt Decay = 0.0003. Algorithm Details: "For experiments, we use the V-MPO algorithm (Song et al., 2020) as the RL algorithm underlying each of the procedures we test. For ALD, we use V-MPO with V-trace corrections (Espeholt et al., 2018) which worked slightly better in preliminary experiments." (These values are restated as a config sketch below the table.) |
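
The flattened Table 1 values quoted in the Experiment Setup row can be read as a single configuration. Below is a minimal sketch of that configuration as a Python dict; the key names are our own, and the expansions of the short tokens NA, ND, and TU (e.g. number of actors, unroll length) are assumptions rather than the paper's notation.

```python
# Hypothetical reconstruction of Table 1 ("Common hyperparameters across
# experiments") as a config dict. Only the values come from the paper;
# the key names are assumptions made for readability.
common_hparams = {
    "optimizer": "Adam",
    "learning_rate": 1e-4,
    "NA": 30,                      # as listed in Table 1 (meaning assumed)
    "ND": 8,                       # as listed in Table 1 (meaning assumed)
    "batch_size": 64,
    "TU": 20,                      # as listed in Table 1 (meaning assumed)
    "discount_gamma": 0.99,
    "grad_norm_clipping": None,    # "Disabled"
    "initial_eta": 1.0,
    "initial_alpha": [0.1, 0.5, 1.0, 5.0],   # swept in initial experiments
    "eps_eta": 0.1,
    "eps_alpha": 0.004,
    "target_update_freq_KL": [1, 10, 100],   # swept per model
    "popart_decay": 0.0003,
}
```

For orientation, the Actor-Learner Distillation setup quoted in the Research Type row pairs a large transformer "learner" (trained with RL, V-MPO in the paper) with a small LSTM "actor" used for fast inference. The sketch below is a hypothetical illustration of that pairing using a plain policy-distillation (KL) objective on actor-collected observations; the module sizes, class names, and the specific loss are assumptions and this is not the authors' V-MPO-based implementation.

```python
# Minimal sketch (not the paper's code): distill a transformer learner's
# policy into a cheaper LSTM actor on a batch of observation sequences.
import torch
import torch.nn as nn
import torch.nn.functional as F

OBS_DIM, NUM_ACTIONS, HIDDEN = 16, 4, 64  # hypothetical sizes


class LSTMActor(nn.Module):
    """Cheap recurrent policy used for fast acting."""
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(OBS_DIM, HIDDEN, batch_first=True)
        self.pi = nn.Linear(HIDDEN, NUM_ACTIONS)

    def forward(self, obs_seq):               # obs_seq: [B, T, OBS_DIM]
        h, _ = self.lstm(obs_seq)
        return self.pi(h)                     # action logits, [B, T, A]


class TransformerLearner(nn.Module):
    """Larger transformer policy trained with RL (e.g. V-MPO in the paper)."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Linear(OBS_DIM, HIDDEN)
        layer = nn.TransformerEncoderLayer(HIDDEN, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.pi = nn.Linear(HIDDEN, NUM_ACTIONS)

    def forward(self, obs_seq):
        h = self.encoder(self.embed(obs_seq))
        return self.pi(h)


def distill_actor(actor, learner, obs_seq, optimizer):
    """One distillation step: KL(learner || actor) on actor-collected data."""
    with torch.no_grad():
        teacher_logits = learner(obs_seq)     # learner is the distillation target
    student_logits = actor(obs_seq)
    loss = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.log_softmax(teacher_logits, dim=-1),
        log_target=True, reduction="batchmean",
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    actor, learner = LSTMActor(), TransformerLearner()
    opt = torch.optim.Adam(actor.parameters(), lr=1e-4)  # lr from Table 1
    fake_batch = torch.randn(8, 20, OBS_DIM)             # [B, T, OBS_DIM], dummy data
    print("distillation loss:", distill_actor(actor, learner, fake_batch, opt))
```

In the paper, the learner continues to be trained with RL in parallel while the actor both collects experience and is periodically distilled toward the learner; the sketch above only shows the distillation step in isolation.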