Improving Generalization in Meta Reinforcement Learning using Learned Objectives
Authors: Louis Kirsch, Sjoerd van Steenkiste, Juergen Schmidhuber
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Unlike recent meta-RL algorithms, MetaGenRL can generalize to new environments that are entirely different from those used for meta-training. In some cases, it even outperforms human-engineered RL algorithms. |
| Researcher Affiliation | Academia | Louis Kirsch, Sjoerd van Steenkiste, Jürgen Schmidhuber; The Swiss AI Lab IDSIA, USI, SUPSI; {louis, sjoerd, juergen}@idsia.ch |
| Pseudocode | Yes | Algorithm 1 MetaGenRL: Meta-Training. Require: p(e), a distribution of environments. Randomly initialize a population of agents P ← {(e₁ ∼ p(e), φ₁, θ₁, B₁), …} and the objective function L_α. While L_α has not converged: for each (e, φ, θ, B) ∈ P (each agent i in parallel): if extending the replay buffer B, extend B using π_φ in e; sample trajectories from B; update the critic Q_θ using the TD-error; update the policy by following ∇_φ L_α; compute the objective-function gradient Δᵢ for agent i according to Equation 6. Sum the gradients Σᵢ Δᵢ to update L_α. (A runnable toy sketch of this loop is given after the table.) |
| Open Source Code | Yes | Code is available at http://louiskirsch.com/code/metagenrl |
| Open Datasets | Yes | We investigate the learning and generalization capabilities of MetaGenRL on several continuous control benchmarks, including HalfCheetah (Cheetah) and Hopper from MuJoCo (Todorov et al., 2012), and LunarLanderContinuous (Lunar) from OpenAI Gym (Brockman et al., 2016). |
| Dataset Splits | Yes | Mean return across multiple seeds (MetaGenRL: 6 meta-train / 2 meta-test seeds; RL²: 6 meta-train / 2 meta-test seeds; EPG: 3 meta-train / 2 meta-test seeds), obtained by training randomly initialized agents during meta-test time on previously seen environments (cyan) and on unseen environments (brown). |
| Hardware Specification | No | No specific hardware details (GPU models, CPU models, memory) are mentioned in the paper. The 'ACKNOWLEDGEMENTS' section mentions 'computational resources by the Swiss National Supercomputing Centre (CSCS, project: s978)' and donations of 'a DGX-1' and 'a Minsky machine', but no detailed specifications are tied to the experiments reported in the paper. |
| Software Dependencies | No | The paper mentions software components like 'ReLU activations', 'layer normalization (Ba et al., 2016)', 'LSTM', 'tanh activations', 'sigmoid activations', and 'Adam (Kingma & Ba, 2015)'. However, it does not provide specific version numbers for these components or for the programming language/libraries used to implement them (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | We will allow for 600K environment interactions per agent during meta-training and then meta-test the objective function for 1M interactions. Further details are available in Appendix B. [...] Truncated episode length: 20; global-norm gradient clipping: 1.0; critic learning rate λ₁: 1e-3; policy learning rate λ₂: 1e-3; second-order learning rate λ₃: 1e-3; objective-function learning rate λ₄: 1e-3; critic noise: 0.2; critic noise clip: 0.5; target network update speed: 0.005; discount factor: 0.99; batch size: 100; random exploration timesteps: 10000; policy Gaussian noise std: 0.1; timesteps per agent: 1M. (These values are collected into a config dictionary after the table.) |
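
For convenience, the hyperparameters quoted in the Experiment Setup row are gathered below into a single Python dictionary. The values are as reported in the paper's Appendix B; the key names are ours and do not come from the released code.

```python
# Hyperparameters as reported in Appendix B of the paper; key names are ours.
METAGENRL_HPARAMS = {
    "truncated_episode_length": 20,
    "global_norm_gradient_clipping": 1.0,
    "critic_learning_rate": 1e-3,            # lambda_1
    "policy_learning_rate": 1e-3,            # lambda_2
    "second_order_learning_rate": 1e-3,      # lambda_3
    "objective_function_learning_rate": 1e-3,  # lambda_4
    "critic_noise": 0.2,
    "critic_noise_clip": 0.5,
    "target_network_update_speed": 0.005,
    "discount_factor": 0.99,
    "batch_size": 100,
    "random_exploration_timesteps": 10_000,
    "policy_gaussian_noise_std": 0.1,
    "timesteps_per_agent": 1_000_000,
    # Interaction budget stated in the text:
    "meta_train_interactions_per_agent": 600_000,
    "meta_test_interactions": 1_000_000,
}
```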
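The Pseudocode row summarizes Algorithm 1 (MetaGenRL meta-training). To make its nested structure concrete, here is a minimal, runnable JAX sketch under heavy simplifying assumptions: the environment, replay buffer, and critic Q_θ are replaced by a toy differentiable "return", and the learned objective L_α is a tiny two-layer network. All function and variable names are ours, not from the paper or its released code; the sketch only illustrates the inner policy update followed by a summed second-order meta-gradient on α, not the actual method or hyperparameters.

```python
import jax
import jax.numpy as jnp

# Toy stand-in for the critic Q_theta: each "environment" is summarized by a
# target vector, and a policy phi is better the closer it is to that target.
def critic_value(phi, env_target):
    return -jnp.sum((phi - env_target) ** 2)

# Learned objective L_alpha: a small parameterized function of the policy
# parameters and a "batch" (here a stand-in for replay-buffer samples).
def learned_objective(alpha, phi, batch):
    features = jnp.concatenate([phi, batch])
    hidden = jnp.tanh(features @ alpha["w1"])
    return jnp.sum(hidden @ alpha["w2"])

def inner_update(alpha, phi, batch, lr=1e-2):
    # Policy update: follow the gradient of the learned objective w.r.t. phi.
    grad_phi = jax.grad(learned_objective, argnums=1)(alpha, phi, batch)
    return phi - lr * grad_phi

def meta_loss_single_agent(alpha, phi, batch, env_target):
    # Meta-objective: value of the *updated* policy under the (toy) critic,
    # differentiated through the inner update (second-order gradient w.r.t. alpha).
    phi_new = inner_update(alpha, phi, batch)
    return -critic_value(phi_new, env_target)

# Population of agents, each paired with its own toy environment.
dim, n_agents = 4, 3
k1, k2, k3, k4 = jax.random.split(jax.random.PRNGKey(0), 4)
alpha = {
    "w1": 0.1 * jax.random.normal(k1, (2 * dim, 8)),
    "w2": 0.1 * jax.random.normal(k2, (8, 1)),
}
phis = jax.random.normal(k3, (n_agents, dim))
env_targets = jax.random.normal(k4, (n_agents, dim))

meta_lr = 1e-2
for step in range(100):
    # Per-agent meta-gradients, then summed across the population to update alpha.
    grads = [
        jax.grad(meta_loss_single_agent)(alpha, phis[i], env_targets[i], env_targets[i])
        for i in range(n_agents)
    ]
    summed = jax.tree_util.tree_map(lambda *g: sum(g), *grads)
    alpha = jax.tree_util.tree_map(lambda a, g: a - meta_lr * g, alpha, summed)
    # Each agent's policy follows the current learned objective.
    phis = jnp.stack([inner_update(alpha, phis[i], env_targets[i]) for i in range(n_agents)])
```

In this toy setup the meta-gradient trains α so that one step along ∇_φ L_α increases the critic's value of the updated policy, which mirrors the role of Equation 6 in the paper at a purely structural level.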