Improving Generalization in Meta Reinforcement Learning using Learned Objectives

Authors: Louis Kirsch, Sjoerd van Steenkiste, Juergen Schmidhuber

ICLR 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Unlike recent meta-RL algorithms, MetaGenRL can generalize to new environments that are entirely different from those used for meta-training. In some cases, it even outperforms human-engineered RL algorithms.
Researcher Affiliation | Academia | Louis Kirsch, Sjoerd van Steenkiste, Jürgen Schmidhuber; The Swiss AI Lab IDSIA, USI, SUPSI; {louis, sjoerd, juergen}@idsia.ch
Pseudocode | Yes | Algorithm 1 MetaGenRL: Meta-Training
  Require: p(e), a distribution of environments
  Randomly initialize population of agents P = {(e_1 ~ p(e), φ_1, θ_1, B_1), ...}
  Randomly initialize objective function L_α
  while L_α has not converged do
      for (e, φ, θ, B) in P do    ▷ for each agent i in parallel
          if extending replay buffer B then
              Extend B using π_φ in e
          Sample trajectories from B
          Update critic Q_θ using the TD-error
          Update policy π_φ by following ∇_φ L_α
          Compute objective function gradient Δ_i for agent i according to Equation 6
      Sum gradients Σ_i Δ_i to update L_α
  (A runnable Python sketch of this loop follows the table.)
Open Source Code | Yes | Code is available at http://louiskirsch.com/code/metagenrl
Open Datasets | Yes | We investigate the learning and generalization capabilities of MetaGenRL on several continuous control benchmarks including HalfCheetah (Cheetah) and Hopper from MuJoCo (Todorov et al., 2012), and LunarLanderContinuous (Lunar) from OpenAI gym (Brockman et al., 2016). (See the environment-instantiation sketch after the table.)
Dataset Splits | Yes | Mean return across multiple seeds (MetaGenRL: 6 meta-train / 2 meta-test seeds; RL2: 6 meta-train / 2 meta-test seeds; EPG: 3 meta-train / 2 meta-test seeds) obtained by training randomly initialized agents during meta-test time on previously seen environments (cyan) and on unseen environments (brown). (These seed counts are collected into a small mapping after the table.)
Hardware Specification | No | No specific hardware details (GPU models, CPU models, memory) are given in the paper. The Acknowledgements section mentions 'computational resources by the Swiss National Supercomputing Centre (CSCS, project: s978)' and donations of 'a DGX-1' and 'a Minsky machine', but these are not explicitly tied to the reported experiments, and no detailed specifications are provided.
Software Dependencies | No | The paper mentions software components like 'ReLU activations', 'layer normalization (Ba et al., 2016)', 'LSTM', 'tanh activations', 'sigmoid activations', and 'Adam (Kingma & Ba, 2015)'. However, it does not provide specific version numbers for these components or for the programming language/libraries used to implement them (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | We will allow for 600K environment interactions per agent during meta-training and then meta-test the objective function for 1M interactions. Further details are available in Appendix B. [...] Truncated episode length: 20; global norm gradient clipping: 1.0; critic learning rate λ1: 1e-3; policy learning rate λ2: 1e-3; second-order learning rate λ3: 1e-3; objective function learning rate λ4: 1e-3; critic noise: 0.2; critic noise clip: 0.5; target network update speed: 0.005; discount factor: 0.99; batch size: 100; random exploration timesteps: 10000; policy Gaussian noise std: 0.1; timesteps per agent: 1M. (These settings are collected into a config dict after the table.)
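
The Pseudocode row above quotes Algorithm 1 (MetaGenRL meta-training). Below is a minimal structural sketch of that loop. It is not the authors' implementation: the Agent and ObjectiveFunction classes and all update rules are hypothetical placeholders that only mirror the control flow (extend buffer, critic update, policy update through the learned objective, summed objective-function gradients).

```python
# Structural sketch of Algorithm 1 (MetaGenRL meta-training).
# All names and updates are hypothetical stand-ins, not the authors' code.
import random

import numpy as np


class ObjectiveFunction:
    """Learned objective L_alpha, represented here by a plain parameter vector."""
    def __init__(self, dim=8):
        self.alpha = np.zeros(dim)

    def apply_gradient(self, grad, lr=1e-3):
        self.alpha -= lr * grad


class Agent:
    """One population member: environment, policy phi, critic theta, buffer B."""
    def __init__(self, env_id):
        self.env_id = env_id
        self.phi = np.random.randn(4)    # policy parameters (placeholder)
        self.theta = np.random.randn(4)  # critic parameters (placeholder)
        self.buffer = []                 # replay buffer B

    def extend_buffer(self):
        # Placeholder for acting with pi_phi in the environment.
        self.buffer.append(np.random.randn(4))

    def update_critic(self):
        # Placeholder for a TD-error update of Q_theta.
        self.theta -= 1e-3 * np.random.randn(4)

    def update_policy(self, objective):
        # Placeholder for following grad_phi L_alpha.
        self.phi -= 1e-3 * np.random.randn(4)

    def objective_gradient(self, objective):
        # Placeholder for the per-agent gradient Delta_i (Equation 6 in the paper).
        return np.random.randn(objective.alpha.shape[0])


def meta_train(env_ids, num_agents=4, meta_iterations=10):
    population = [Agent(random.choice(env_ids)) for _ in range(num_agents)]
    L_alpha = ObjectiveFunction()
    for _ in range(meta_iterations):       # until L_alpha has converged
        grads = []
        for agent in population:           # done in parallel in the real system
            agent.extend_buffer()
            agent.update_critic()
            agent.update_policy(L_alpha)
            grads.append(agent.objective_gradient(L_alpha))
        L_alpha.apply_gradient(np.sum(grads, axis=0))  # sum gradients over agents
    return L_alpha


if __name__ == "__main__":
    meta_train(["Cheetah", "Hopper", "Lunar"])
```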
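The benchmarks named in the Open Datasets row can be instantiated through OpenAI gym. The sketch below assumes the standard gym registrations contemporaneous with the paper ("HalfCheetah-v2", "Hopper-v2", "LunarLanderContinuous-v2"); the exact environment versions used by the authors are not stated, and the MuJoCo environments additionally require mujoco-py.

```python
# Hedged sketch: instantiating the benchmark environments with OpenAI gym.
# The "-v2" suffixes are assumptions; the paper does not state exact versions.
import gym

ENV_IDS = {
    "Cheetah": "HalfCheetah-v2",          # MuJoCo (needs mujoco-py)
    "Hopper": "Hopper-v2",                # MuJoCo (needs mujoco-py)
    "Lunar": "LunarLanderContinuous-v2",  # Box2D, OpenAI gym
}

envs = {name: gym.make(env_id) for name, env_id in ENV_IDS.items()}
for name, env in envs.items():
    print(name, env.observation_space.shape, env.action_space.shape)
```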
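For quick reference, the seed counts quoted in the Dataset Splits row, expressed as a small mapping (the key names are mine, not the paper's):

```python
# Seed counts per method as stated in the Dataset Splits row.
SEEDS = {
    "MetaGenRL": {"meta_train": 6, "meta_test": 2},
    "RL2":       {"meta_train": 6, "meta_test": 2},
    "EPG":       {"meta_train": 3, "meta_test": 2},
}
```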
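The hyperparameters quoted in the Experiment Setup row, collected into a single config dict. The values are as reported; the key names are my own labels, not identifiers from the authors' code.

```python
# Reported MetaGenRL hyperparameters (Experiment Setup row / Appendix B).
METAGENRL_HPARAMS = {
    "truncated_episode_length": 20,
    "global_norm_gradient_clipping": 1.0,
    "critic_learning_rate": 1e-3,        # lambda_1
    "policy_learning_rate": 1e-3,        # lambda_2
    "second_order_learning_rate": 1e-3,  # lambda_3
    "objective_fn_learning_rate": 1e-3,  # lambda_4
    "critic_noise": 0.2,
    "critic_noise_clip": 0.5,
    "target_network_update_speed": 0.005,
    "discount_factor": 0.99,
    "batch_size": 100,
    "random_exploration_timesteps": 10_000,
    "policy_gaussian_noise_std": 0.1,
    "timesteps_per_agent": 1_000_000,
    # Interaction budget described in the quoted text:
    "meta_train_env_interactions_per_agent": 600_000,
    "meta_test_env_interactions": 1_000_000,
}
```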