Efficient Transformers in Reinforcement Learning using Actor-Learner Distillation

Authors: Emilio Parisotto, Russ Salakhutdinov

ICLR 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "As a case study, we develop this procedure in the context of partially-observable environments, where transformer models have had large improvements over LSTMs recently, at the cost of significantly higher computational complexity. With transformer models as the learner and LSTMs as the actor, we demonstrate in several challenging memory environments that using Actor-Learner Distillation recovers the clear sample-efficiency gains of the transformer learner model while maintaining the fast inference and reduced total training time of the LSTM actor model." and Section 5, "EXPERIMENTS". (A minimal code sketch of this actor/learner split appears after the table.)
Researcher Affiliation | Academia | Emilio Parisotto & Ruslan Salakhutdinov, Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA; {eparisot,rsalakhu}@cs.cmu.edu
Pseudocode | No | The paper does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain any explicit statement or link indicating that the source code for the described methodology is open-source or publicly available.
Open Datasets | No | The paper uses the custom environments 'I-Maze' and 'Meta-Fetch' for its experiments, but does not provide concrete access information (links, DOIs, repositories, or author/year citations to public versions) for these environments/datasets.
Dataset Splits | Yes | "For all models, we sweep over the V-MPO target network update frequency K_L ∈ {1, 10, 100}. In initial experiments, we also sweeped the Initial α setting over values {0.1, 0.5, 1.0, 5.0}. All experiment runs have 3 unique seeds. For each model, we choose the hyperparameter setting that achieved highest mean return over all seeds." (This selection rule is reflected in the configuration sketch after the table.)
Hardware Specification | Yes | "Reference Machine A has a 36-thread Intel(R) Core(TM) i9-7980XE CPU @ 2.60GHz, 64GB of RAM, and 2 GPUs: a GeForce GTX 1080 Ti and a TITAN V. Reference Machine B has a 40-thread Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz, 256GB of RAM, and 2 GPUs: a Tesla P40 and a Tesla V100-PCIE-16GB."
Software Dependencies | No | The paper mentions several algorithms and tools, such as V-MPO, IMPALA, PopArt, and the Adam optimizer, but does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | Table 1 (common hyperparameters across experiments): Optimizer = Adam; Learning Rate = 0.0001; N_A = 30; N_D = 8; Batch Size = 64; T_U = 20; Discount Factor (γ) = 0.99; Grad. Norm. Clipping = Disabled; Initial η = 1.0; Initial α ∈ {0.1, 0.5, 1.0, 5.0}; ϵ_η = 0.1; ϵ_α = 0.004; K_L ∈ {1, 10, 100}; PopArt Decay = 0.0003. Algorithm details: "For experiments, we use the V-MPO algorithm (Song et al., 2020) as the RL algorithm underlying each of the procedures we test. For ALD, we use V-MPO with V-trace corrections (Espeholt et al., 2018), which worked slightly better in preliminary experiments." (See the configuration sketch following the table.)
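
The Research Type row quotes the paper's core setup: a large transformer "learner" is trained with RL while a small LSTM "actor" provides fast inference and is trained by distillation from the learner. Since the Pseudocode and Open Source Code rows note that no code or algorithm blocks are provided, the following is only a minimal PyTorch sketch of that actor/learner split and one policy/value distillation step. The module sizes, loss weighting, and all names are assumptions for illustration, and the learner's own V-MPO update is omitted; this is not the authors' released implementation.

```python
# Minimal sketch of the actor/learner split described in the abstract (assumptions:
# PyTorch, toy sizes, KL + MSE distillation loss; the learner's RL update is omitted).
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, n_actions, hidden = 16, 4, 64

class LSTMActor(nn.Module):
    """Small, fast model used to act in the environment."""
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden, batch_first=True)
        self.pi = nn.Linear(hidden, n_actions)
        self.v = nn.Linear(hidden, 1)

    def forward(self, obs_seq):
        h, _ = self.lstm(obs_seq)                  # [batch, time, hidden]
        return self.pi(h), self.v(h).squeeze(-1)   # policy logits, values

class TransformerLearner(nn.Module):
    """Large, sample-efficient model trained with the RL loss (V-MPO in the paper)."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Linear(obs_dim, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.pi = nn.Linear(hidden, n_actions)
        self.v = nn.Linear(hidden, 1)

    def forward(self, obs_seq):
        h = self.encoder(self.embed(obs_seq))
        return self.pi(h), self.v(h).squeeze(-1)

def distillation_loss(actor, learner, obs_seq, value_coef=0.5):
    """Train the actor to match the learner's policy and value on actor-collected data."""
    with torch.no_grad():
        teacher_logits, teacher_values = learner(obs_seq)
    student_logits, student_values = actor(obs_seq)
    # KL(teacher || student) over actions, plus a value-matching term (weighting assumed).
    policy_kl = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )
    value_mse = F.mse_loss(student_values, teacher_values)
    return policy_kl + value_coef * value_mse

# Usage: one distillation step on a dummy batch of actor-collected trajectories.
actor, learner = LSTMActor(), TransformerLearner()
opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
batch = torch.randn(8, 20, obs_dim)   # [batch, unroll length, obs_dim]
loss = distillation_loss(actor, learner, batch)
opt.zero_grad(); loss.backward(); opt.step()
```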
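
For reference, the hyperparameters quoted in the Experiment Setup row and the sweep-selection rule quoted in the Dataset Splits row can be written out as a configuration sketch. Only the numeric values, the swept sets, and the "highest mean return over 3 unique seeds" rule come from the quoted text; the dictionary keys, seed values, and the `run_experiment` stub are illustrative assumptions.

```python
# Configuration sketch assembled from Table 1 as quoted above (labels N_A, N_D, T_U,
# K_L kept as reported) plus the sweep-selection rule from the Dataset Splits row.
from itertools import product
from statistics import mean

COMMON_HPARAMS = {
    "optimizer": "Adam",
    "learning_rate": 1e-4,
    "N_A": 30,
    "N_D": 8,
    "batch_size": 64,
    "T_U": 20,
    "discount": 0.99,
    "grad_norm_clipping": None,   # "Disabled" in Table 1
    "initial_eta": 1.0,
    "eps_eta": 0.1,
    "eps_alpha": 0.004,
    "popart_decay": 3e-4,
}

SWEEP = {
    "K_L": [1, 10, 100],          # V-MPO target network update frequency
    "initial_alpha": [0.1, 0.5, 1.0, 5.0],
}
SEEDS = [0, 1, 2]                 # 3 unique seeds per run; seed values are assumptions

def run_experiment(hparams, seed):
    """Placeholder for a full training run; should return final mean episode return."""
    raise NotImplementedError

def select_best_setting():
    """Pick the setting with the highest mean return over all seeds, as quoted above."""
    best_setting, best_score = None, float("-inf")
    for kl, alpha in product(SWEEP["K_L"], SWEEP["initial_alpha"]):
        hparams = {**COMMON_HPARAMS, "K_L": kl, "initial_alpha": alpha}
        score = mean(run_experiment(hparams, s) for s in SEEDS)
        if score > best_score:
            best_setting, best_score = hparams, score
    return best_setting
```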