Discovered Policy Optimisation
Authors: Chris Lu, Jakub Grudzien Kuba, Alistair Letcher, Luke Metz, Christian Schroeder de Witt, Jakob Foerster
NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments in Brax environments confirm state-of-the-art performance of LPO and DPO, as well as their transfer to unseen settings. |
| Researcher Affiliation | Collaboration | Chris Lu, FLAIR, University of Oxford (christopher.lu@exeter.ox.ac.uk); Jakub Grudzien Kuba, BAIR, UC Berkeley (kuba@berkeley.edu); Alistair Letcher (aletcher.github.io, ahp.letcher@gmail.com); Luke Metz, Google Brain (luke.s.metz@gmail.com); Christian Schroeder de Witt, FLAIR, University of Oxford (cs@robots.ox.ac.uk); Jakob Foerster, FLAIR, University of Oxford (jakob.foerster@eng.ox.ac.uk) |
| Pseudocode | No | The paper does not contain any section explicitly labeled 'Pseudocode' or 'Algorithm', nor are there any structured code-like blocks detailing the methods. |
| Open Source Code | Yes | 3. If you ran experiments... (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] They are attached in the supplementary material |
| Open Datasets | Yes | We evaluate LPO and DPO in the Brax [8] continuous control environments and Minatar [41, 20] environments, where they obtain superior performance compared to PPO. |
| Dataset Splits | No | The paper discusses meta-training across environments and evaluating policies, but does not provide specific training, validation, and test dataset splits in terms of percentages or sample counts for reproduction. |
| Hardware Specification | Yes | 3. If you ran experiments... (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] They are in Appendix A |
| Software Dependencies | No | We implement our method on top of the Brax version of PPO, which provides a Mirror Learning-friendly code template, keeping the policy architecture and training hyperparameters unchanged. For meta-training we use both the evosax [19] and learned_optimization [22] libraries. (The paper names these libraries but does not pin version numbers for the key dependencies; illustrative setup sketches follow the table.) |
| Experiment Setup | Yes | For full details of meta-training see Appendix A. ... The Brax PPO implementation uses different hyperparameters, such as the number of update epochs and the total number of timesteps, for each of the tasks. |
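
The Open Datasets and Software Dependencies rows refer to the Brax continuous-control tasks. As a hedged illustration of what instantiating one of those tasks looks like in JAX, the sketch below creates a Brax environment and takes a single step; the task name `ant` and the episode length are placeholder choices, not values taken from the paper.

```python
# Illustrative only: instantiate a Brax task and take one step with a zero
# action. The task name and episode length are placeholder choices.
import jax
import jax.numpy as jnp
from brax import envs

env = envs.create(env_name="ant", episode_length=1000)
rng = jax.random.PRNGKey(0)
state = env.reset(rng)                                # initial State pytree
state = env.step(state, jnp.zeros(env.action_size))   # one environment step
```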
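
The Software Dependencies and Experiment Setup rows describe meta-training on top of Brax PPO using evosax and learned_optimization, but give neither versions nor code. The sketch below is a minimal, self-contained stand-in for that outer loop: an OpenAI-style evolution-strategies update over the parameters of a small drift network, with a dummy fitness function in place of the Brax inner training loop. Every name here (`drift_mlp`, `fitness_fn`, `es_step`) and every hyperparameter is hypothetical; this is not the authors' code and does not use the evosax API.

```python
# Hypothetical sketch of an ES outer loop over drift-network parameters.
# The fitness function is a dummy stand-in; in the paper's pipeline it would
# train a Mirror-Learning PPO variant in Brax with the candidate drift and
# report the resulting return.
import jax
import jax.numpy as jnp
from jax.flatten_util import ravel_pytree


def init_drift_params(rng, hidden=16):
    """Tiny MLP over (ratio - 1, advantage); sizes are illustrative."""
    k1, k2 = jax.random.split(rng)
    return {
        "w1": 0.1 * jax.random.normal(k1, (2, hidden)),
        "b1": jnp.zeros(hidden),
        "w2": 0.1 * jax.random.normal(k2, (hidden, 1)),
        "b2": jnp.zeros(1),
    }


def drift_mlp(params, ratio, advantage):
    """Candidate drift value for one (ratio, advantage) pair."""
    h = jnp.tanh(jnp.stack([ratio - 1.0, advantage]) @ params["w1"] + params["b1"])
    return jax.nn.relu(h @ params["w2"] + params["b2"])[0]


rng = jax.random.PRNGKey(0)
flat_init, unravel = ravel_pytree(init_drift_params(rng))


def fitness_fn(flat_params):
    """Dummy fitness so the sketch runs end to end; the real objective would
    be the return of an agent trained in Brax with this drift function."""
    params = unravel(flat_params)
    ratios = jnp.linspace(0.8, 1.2, 8)
    advs = jnp.linspace(-1.0, 1.0, 8)
    drift_vals = jax.vmap(lambda r, a: drift_mlp(params, r, a))(ratios, advs)
    return -jnp.mean(drift_vals ** 2)


def es_step(rng, mean, sigma=0.03, lr=0.01, popsize=64):
    """One antithetic OpenAI-ES ascent step on the flattened parameters."""
    eps = jax.random.normal(rng, (popsize // 2, mean.shape[0]))
    eps = jnp.concatenate([eps, -eps], axis=0)
    fitness = jax.vmap(fitness_fn)(mean + sigma * eps)
    fitness = (fitness - fitness.mean()) / (fitness.std() + 1e-8)
    return mean + lr * (eps * fitness[:, None]).mean(axis=0) / sigma


mean = flat_init
for step in range(10):
    rng, key = jax.random.split(rng)
    mean = es_step(key, mean)
```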