Discovered Policy Optimisation

Authors: Chris Lu, Jakub Grudzien Kuba, Alistair Letcher, Luke Metz, Christian Schroeder de Witt, Jakob Foerster

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments in Brax environments confirm state-of-the-art performance of LPO and DPO, as well as their transfer to unseen settings.
Researcher Affiliation | Collaboration | Chris Lu (FLAIR, University of Oxford) christopher.lu@exeter.ox.ac.uk; Jakub Grudzien Kuba (BAIR, UC Berkeley) kuba@berkeley.edu; Alistair Letcher (aletcher.github.io) ahp.letcher@gmail.com; Luke Metz (Google Brain) luke.s.metz@gmail.com; Christian Schroeder de Witt (FLAIR, University of Oxford) cs@robots.ox.ac.uk; Jakob Foerster (FLAIR, University of Oxford) jakob.foerster@eng.ox.ac.uk
Pseudocode | No | The paper does not contain any section explicitly labeled 'Pseudocode' or 'Algorithm', nor are there any structured code-like blocks detailing the methods.
Open Source Code | Yes | 3. If you ran experiments... (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] They are attached in the supplementary material
Open Datasets | Yes | We evaluate LPO and DPO in the Brax [8] continuous control environments and MinAtar [41, 20] environments, where they obtain superior performance compared to PPO.
Dataset Splits | No | The paper discusses meta-training across environments and evaluating policies, but does not provide specific training, validation, and test dataset splits in terms of percentages or sample counts for reproduction.
Hardware Specification | Yes | 3. If you ran experiments... (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] They are in Appendix A
Software Dependencies | No | We implement our method on top of the Brax version of PPO, which provides a Mirror Learning-friendly code template, keeping the policy architecture and training hyperparameters unchanged. For meta-training we use both evosax [19] and the Learned_optimization [22] libraries. (The paper mentions these libraries but does not provide specific version numbers for the key dependencies; a hedged sketch of the Mirror Learning template follows this table.)
Experiment Setup | Yes | For full details of meta-training see Appendix A. ... The Brax PPO implementation uses different hyperparameters, such as the number of update epochs and the total number of timesteps, for each of the tasks.
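
The Software Dependencies row above describes the implementation as sitting on top of the Brax version of PPO, used as a "Mirror Learning-friendly code template". As a rough, hypothetical illustration of what such a template means (this is not the authors' supplementary code, and every function and variable name below is ours), the JAX sketch rewrites the PPO clipped surrogate as an importance-weighted advantage minus a pluggable, non-negative drift penalty; a learned or discovered drift function would slot in where example_drift does.

import jax
import jax.numpy as jnp

def ppo_clip_objective(ratio, adv, eps=0.2):
    # Standard PPO clipped surrogate, shown for comparison.
    return jnp.minimum(ratio * adv, jnp.clip(ratio, 1.0 - eps, 1.0 + eps) * adv)

def drift_penalised_objective(ratio, adv, drift_fn):
    # Mirror-Learning-style surrogate: importance-weighted advantage minus a
    # non-negative "drift" penalty that discourages overly large policy updates.
    # Swapping drift_fn is the only change a different objective needs here.
    return ratio * adv - drift_fn(ratio, adv)

def example_drift(ratio, adv, eps=0.2):
    # Hypothetical hand-written drift: penalise exactly the part of the update
    # that PPO's clipping would discard (the ReLU keeps the penalty non-negative),
    # which recovers the clipped objective above.
    return jax.nn.relu((ratio - jnp.clip(ratio, 1.0 - eps, 1.0 + eps)) * adv)

# Tiny usage example on dummy data.
ratio = jnp.exp(0.3 * jax.random.normal(jax.random.PRNGKey(0), (5,)))  # pi_new / pi_old
adv = jax.random.normal(jax.random.PRNGKey(1), (5,))                   # advantage estimates
print(ppo_clip_objective(ratio, adv))
print(drift_penalised_objective(ratio, adv, example_drift))

The two printed arrays coincide, reflecting the Mirror Learning view used by the paper that PPO's clipping is itself one particular drift function; LPO parameterises this slot with a small network meta-trained with evolution strategies (hence the evosax dependency), and DPO is a closed-form objective derived from the learned one.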