Discovered Policy Optimisation

Authors: Chris Lu, Jakub Grudzien Kuba, Alistair Letcher, Luke Metz, Christian Schroeder de Witt, Jakob Foerster

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments in Brax environments confirm state-of-the-art performance of LPO and DPO, as well as their transfer to unseen settings.
Researcher Affiliation | Collaboration | Chris Lu (FLAIR, University of Oxford) christopher.lu@exeter.ox.ac.uk; Jakub Grudzien Kuba (BAIR, UC Berkeley) kuba@berkeley.edu; Alistair Letcher (aletcher.github.io) ahp.letcher@gmail.com; Luke Metz (Google Brain) luke.s.metz@gmail.com; Christian Schroeder de Witt (FLAIR, University of Oxford) cs@robots.ox.ac.uk; Jakob Foerster (FLAIR, University of Oxford) jakob.foerster@eng.ox.ac.uk
Pseudocode | No | The paper does not contain any section explicitly labeled 'Pseudocode' or 'Algorithm', nor are there any structured code-like blocks detailing the methods.
Open Source Code | Yes | 3. If you ran experiments... (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] They are attached in the supplementary material
Open Datasets | Yes | We evaluate LPO and DPO in the Brax [8] continuous control environments and MinAtar [41, 20] environments, where they obtain superior performance compared to PPO.
Dataset Splits | No | The paper discusses meta-training across environments and evaluating policies, but does not provide specific training, validation, and test dataset splits in terms of percentages or sample counts for reproduction.
Hardware Specification | Yes | 3. If you ran experiments... (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] They are in Appendix A
Software Dependencies | No | We implement our method on top of the Brax version of PPO, which provides a Mirror Learning-friendly code template, keeping the policy architecture and training hyperparameters unchanged. For meta-training we use both evosax [19] and the Learned_optimization [22] libraries. (The paper mentions these libraries but does not provide specific version numbers for the key dependencies; a hedged sketch of the Mirror Learning template follows this table.)
Experiment Setup | Yes | For full details of meta-training see Appendix A. ... The Brax PPO implementation uses different hyperparameters, such as the number of update epochs and the total number of timesteps, for each of the tasks.
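
The Software Dependencies row above describes the implementation as sitting on top of the Brax version of PPO, used as a "Mirror Learning-friendly code template". As a rough, hypothetical illustration of what such a template means (this is not the authors' supplementary code, and every function and variable name below is ours), the JAX sketch rewrites the PPO clipped surrogate as an importance-weighted advantage minus a pluggable, non-negative drift penalty; a learned or discovered drift function would slot in where example_drift does.

import jax
import jax.numpy as jnp

def ppo_clip_objective(ratio, adv, eps=0.2):
    # Standard PPO clipped surrogate, shown for comparison.
    return jnp.minimum(ratio * adv, jnp.clip(ratio, 1.0 - eps, 1.0 + eps) * adv)

def drift_penalised_objective(ratio, adv, drift_fn):
    # Mirror-Learning-style surrogate: importance-weighted advantage minus a
    # non-negative "drift" penalty that discourages overly large policy updates.
    # Swapping drift_fn is the only change a different objective needs here.
    return ratio * adv - drift_fn(ratio, adv)

def example_drift(ratio, adv, eps=0.2):
    # Hypothetical hand-written drift: penalise exactly the part of the update
    # that PPO's clipping would discard (the ReLU keeps the penalty non-negative),
    # which recovers the clipped objective above.
    return jax.nn.relu((ratio - jnp.clip(ratio, 1.0 - eps, 1.0 + eps)) * adv)

# Tiny usage example on dummy data.
ratio = jnp.exp(0.3 * jax.random.normal(jax.random.PRNGKey(0), (5,)))  # pi_new / pi_old
adv = jax.random.normal(jax.random.PRNGKey(1), (5,))                   # advantage estimates
print(ppo_clip_objective(ratio, adv))
print(drift_penalised_objective(ratio, adv, example_drift))

The two printed arrays coincide, reflecting the Mirror Learning view used by the paper that PPO's clipping is itself one particular drift function; LPO parameterises this slot with a small network meta-trained with evolution strategies (hence the evosax dependency), and DPO is a closed-form objective derived from the learned one.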