Efficient Transformers in Reinforcement Learning using Actor-Learner Distillation
Authors: Emilio Parisotto, Ruslan Salakhutdinov
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "As a case study, we develop this procedure in the context of partially-observable environments, where transformer models have had large improvements over LSTMs recently, at the cost of significantly higher computational complexity. With transformer models as the learner and LSTMs as the actor, we demonstrate in several challenging memory environments that using Actor-Learner Distillation recovers the clear sample-efficiency gains of the transformer learner model while maintaining the fast inference and reduced total training time of the LSTM actor model." (Abstract) and Section 5, "Experiments". (A hedged sketch of this actor/learner pairing is given below the table.) |
| Researcher Affiliation | Academia | Emilio Parisotto & Ruslan Salakhutdinov Machine Learning Department Carnegie Mellon University Pittsburgh, PA 15213, USA {eparisot,rsalakhu}@cs.cmu.edu |
| Pseudocode | No | The paper does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statements or links indicating that the source code for the described methodology is open-source or publicly available. |
| Open Datasets | No | The paper uses custom environments 'I-Maze' and 'Meta-Fetch' for its experiments, but does not provide concrete access information (links, DOIs, repositories, or citations with author/year for public versions) for these environments/datasets. |
| Dataset Splits | Yes | For all models, we sweep over the V-MPO target network update frequency KL {1, 10, 100}. In initial experiments, we also sweeped the Initial α setting over values {0.1, 0.5, 1.0, 5.0}. All experiment runs have 3 unique seeds. For each model, we choose the hyperparameter setting that achieved highest mean return over all seeds. |
| Hardware Specification | Yes | Reference Machine A: Reference Machine A has a 36-thread Intel(R) Core(TM) i9-7980XE CPU @ 2.60GHz, 64GB of RAM, and 2 GPUs: a GeForce GTX 1080 Ti and a TITAN V. Reference Machine B: Reference Machine B has a 40-thread Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz, 256GB of RAM and 2 GPUs: a Tesla P40 and a Tesla V100-PCIE-16GB. |
| Software Dependencies | No | The paper mentions several algorithms and tools, such as 'V-MPO', 'IMPALA', 'PopArt', and the Adam optimizer, but it does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | Table 1 (Common hyperparameters across experiments): Optimizer = Adam; Learning Rate = 0.0001; NA = 30; ND = 8; Batch Size = 64; TU = 20; Discount Factor (γ) = 0.99; Grad. Norm. Clipping = Disabled; Initial η = 1.0; Initial α ∈ {0.1, 0.5, 1.0, 5.0}; ϵη = 0.1; ϵα = 0.004; KL ∈ {1, 10, 100}; PopArt Decay = 0.0003. Algorithm Details: "For experiments, we use the V-MPO algorithm (Song et al., 2020) as the RL algorithm underlying each of the procedures we test. For ALD, we use V-MPO with V-trace corrections (Espeholt et al., 2018) which worked slightly better in preliminary experiments." (These values are restated as a config sketch below the table.) |
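
The flattened Table 1 values quoted in the Experiment Setup row can be read as a single configuration. Below is a minimal sketch of that configuration as a Python dict; the key names are our own, and the expansions of the short tokens NA, ND, and TU (e.g. number of actors, unroll length) are assumptions rather than the paper's notation.

```python
# Hypothetical reconstruction of Table 1 ("Common hyperparameters across
# experiments") as a config dict. Only the values come from the paper;
# the key names are assumptions made for readability.
common_hparams = {
    "optimizer": "Adam",
    "learning_rate": 1e-4,
    "NA": 30,                      # as listed in Table 1 (meaning assumed)
    "ND": 8,                       # as listed in Table 1 (meaning assumed)
    "batch_size": 64,
    "TU": 20,                      # as listed in Table 1 (meaning assumed)
    "discount_gamma": 0.99,
    "grad_norm_clipping": None,    # "Disabled"
    "initial_eta": 1.0,
    "initial_alpha": [0.1, 0.5, 1.0, 5.0],   # swept in initial experiments
    "eps_eta": 0.1,
    "eps_alpha": 0.004,
    "target_update_freq_KL": [1, 10, 100],   # swept per model
    "popart_decay": 0.0003,
}
```

For orientation, the Actor-Learner Distillation setup quoted in the Research Type row pairs a large transformer "learner" (trained with RL, V-MPO in the paper) with a small LSTM "actor" used for fast inference. The sketch below is a hypothetical illustration of that pairing using a plain policy-distillation (KL) objective on actor-collected observations; the module sizes, class names, and the specific loss are assumptions and this is not the authors' V-MPO-based implementation.

```python
# Minimal sketch (not the paper's code): distill a transformer learner's
# policy into a cheaper LSTM actor on a batch of observation sequences.
import torch
import torch.nn as nn
import torch.nn.functional as F

OBS_DIM, NUM_ACTIONS, HIDDEN = 16, 4, 64  # hypothetical sizes


class LSTMActor(nn.Module):
    """Cheap recurrent policy used for fast acting."""
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(OBS_DIM, HIDDEN, batch_first=True)
        self.pi = nn.Linear(HIDDEN, NUM_ACTIONS)

    def forward(self, obs_seq):               # obs_seq: [B, T, OBS_DIM]
        h, _ = self.lstm(obs_seq)
        return self.pi(h)                     # action logits, [B, T, A]


class TransformerLearner(nn.Module):
    """Larger transformer policy trained with RL (e.g. V-MPO in the paper)."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Linear(OBS_DIM, HIDDEN)
        layer = nn.TransformerEncoderLayer(HIDDEN, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.pi = nn.Linear(HIDDEN, NUM_ACTIONS)

    def forward(self, obs_seq):
        h = self.encoder(self.embed(obs_seq))
        return self.pi(h)


def distill_actor(actor, learner, obs_seq, optimizer):
    """One distillation step: KL(learner || actor) on actor-collected data."""
    with torch.no_grad():
        teacher_logits = learner(obs_seq)     # learner is the distillation target
    student_logits = actor(obs_seq)
    loss = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.log_softmax(teacher_logits, dim=-1),
        log_target=True, reduction="batchmean",
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    actor, learner = LSTMActor(), TransformerLearner()
    opt = torch.optim.Adam(actor.parameters(), lr=1e-4)  # lr from Table 1
    fake_batch = torch.randn(8, 20, OBS_DIM)             # [B, T, OBS_DIM], dummy data
    print("distillation loss:", distill_actor(actor, learner, fake_batch, opt))
```

In the paper, the learner continues to be trained with RL in parallel while the actor both collects experience and is periodically distilled toward the learner; the sketch above only shows the distillation step in isolation.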