Fingerprint Policy Optimisation for Robust Reinforcement Learning
Authors: Supratik Paul, Michael A. Osborne, Shimon Whiteson
ICML 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To evaluate the empirical performance of FPO, we start by applying it to a simple problem: a modified version of the cliff walker task (Sutton & Barto, 1998), with one-dimensional state and action spaces. We then move on to simulated robotics problems based on the MuJoCo simulator (Brockman et al., 2016) with much higher dimensionalities. These were modified to include SREs. |
| Researcher Affiliation | Academia | Supratik Paul (1), Michael A. Osborne (2), Shimon Whiteson (1). (1) Department of Computer Science, University of Oxford, UK; (2) Department of Engineering Science, University of Oxford, UK. Correspondence to: Supratik Paul <supratik.paul@cs.ox.ac.uk>. |
| Pseudocode | Yes | Algorithm 1 Fingerprint Policy Optimisation. Input: initial policy π_0, original distribution p(θ), randomly initialised q_{ψ_0}(θ), policy optimisation method POLOPT, number of policy iterations N, dataset D_0 = {}. 1: for n = 1, 2, ..., N do 2: Sample θ_{1:k} from q_{ψ_{n-1}}(θ), and with π_{n-1} sample trajectories τ_{1:k} corresponding to each θ_{1:k}. 3: Compute π_n = POLOPT(τ_{1:k}) = POLOPT(ψ_{n-1}, π_{n-1}). 4: Compute J(π_n) using numerical quadrature as described in Section 3.2; use the sampled trajectories to compute the policy fingerprint as described in Section 3.3. 5: Set D_n = D_{n-1} ∪ {((ψ_{n-1}, π_{n-1}), J(π_n))} and update the GP to condition on D_n. 6: Use either the UCB (3) or FITBO (4) acquisition function to select ψ_n. 7: end for (A hedged code sketch of this loop is given below the table.) |
| Open Source Code | No | No, the paper does not provide explicit statements or links indicating that the source code for FPO (the authors' own methodology) is open-source or publicly available. It only mentions thanking Binxin Ru for sharing code for FITBO, which is a component used in their work. |
| Open Datasets | Yes | We then move on to simulated robotics problems based on the MuJoCo simulator (Brockman et al., 2016) with much higher dimensionalities. [...] In the original OpenAI Gym HalfCheetah task, the objective is to maximise forward velocity. We modify the original problem such that in 98% of the cases the objective is to achieve a target velocity of 2, with rewards decreasing linearly with the distance from the target. |
| Dataset Splits | No | No, the paper does not specify explicit training, validation, or test dataset splits. It describes modifications to continuous control tasks (e.g., Half Cheetah, Ant) and the use of a simulator, but does not provide quantitative details on how data was partitioned for different phases of model development or evaluation. |
| Hardware Specification | No | No, the paper mentions receiving 'a generous equipment grant from NVIDIA' but does not specify any particular GPU models, CPU types, memory, or other detailed hardware specifications used for running the experiments. |
| Software Dependencies | Yes | OpenAI Gym and MuJoCo are installed on a Linux server running Python 3.8 and CUDA 10.1. |
| Experiment Setup | Yes | We repeat all our experiments across 10 random starts. [...] In the original OpenAI Gym HalfCheetah task, the objective is to maximise forward velocity. We modify the original problem such that in 98% of the cases the objective is to achieve a target velocity of 2, with rewards decreasing linearly with the distance from the target. In the remaining 2%, the target velocity is set to 4, with a large bonus reward, which acts as an SRE. [...] Further experimental details are provided in Appendix A.1. (A hedged sketch of this modified reward appears below the table.) |
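
The Algorithm 1 pseudocode above compresses several moving parts: the sampling distribution q_ψ(θ), the POLOPT step, the quadrature estimate of J(π_n), the policy fingerprint, and a GP with a UCB acquisition function. The following is a minimal sketch of that loop, not the authors' implementation: the helper stubs (`sample_env_params`, `polopt`, `expected_return`, `fingerprint`), the candidate grid, and the value of `kappa` are illustrative assumptions, and the GP is an off-the-shelf scikit-learn `GaussianProcessRegressor` rather than the paper's model.

```python
# Minimal sketch of the FPO loop (Algorithm 1) with toy stand-ins for the RL
# components; this is NOT the authors' implementation.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)

# --- Hypothetical stand-ins for the RL machinery --------------------------
def sample_env_params(psi, k):
    """theta_{1:k} ~ q_psi(theta); here q_psi is assumed Gaussian with mean psi."""
    return rng.normal(loc=psi, scale=0.1, size=k)

def polopt(policy, thetas):
    """One policy-optimisation step on the sampled environments.
    Stub: nudge a scalar 'policy' parameter towards the mean sampled theta."""
    return policy + 0.1 * (thetas.mean() - policy)

def expected_return(policy):
    """J(pi): expected return under the *original* p(theta). Toy surrogate: a
    quadratic plus a narrow high-reward bump standing in for an SRE."""
    return -(policy - 1.0) ** 2 + 5.0 * np.exp(-50.0 * (policy - 2.0) ** 2)

def fingerprint(policy, thetas):
    """Low-dimensional policy fingerprint; stub: policy param and mean theta."""
    return np.array([policy, thetas.mean()])

# --- FPO loop --------------------------------------------------------------
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
X, y = [], []                      # dataset D_n of ((psi, fingerprint), J(pi_n))
psi, policy, kappa = 0.0, 0.0, 2.0

for n in range(30):
    thetas = sample_env_params(psi, k=8)      # sample theta_{1:k} from q_{psi_{n-1}}
    policy = polopt(policy, thetas)           # pi_n = POLOPT(tau_{1:k})
    J = expected_return(policy)               # quadrature estimate of J(pi_n)
    X.append(np.r_[psi, fingerprint(policy, thetas)])
    y.append(J)
    gp.fit(np.array(X), np.array(y))          # condition the GP on D_n
    # UCB acquisition: choose the next psi maximising mu + kappa * sigma
    cands = np.linspace(-1.0, 3.0, 200)
    feats = np.array([np.r_[c, fingerprint(policy, thetas)] for c in cands])
    mu, sd = gp.predict(feats, return_std=True)
    psi = float(cands[np.argmax(mu + kappa * sd)])

print(f"final policy parameter: {policy:.3f}, final psi: {psi:.3f}")
```

The structure mirrors Algorithm 1: each iteration samples θ from the current q_ψ, takes one policy-optimisation step, scores the new policy under the original p(θ), conditions a GP on the (ψ, fingerprint) → J pairs, and selects the next ψ by UCB; the paper's FITBO alternative would replace the acquisition line.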
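
The Experiment Setup row describes the SRE-modified HalfCheetah reward only in words. The sketch below restates it as code under explicit assumptions: the linear penalty slope, the bonus magnitude of 50, and per-episode sampling of the target velocity are illustrative choices, since the paper (as quoted) specifies only the probabilities, the two target velocities, and that the rare-case bonus is "large".

```python
# Hedged sketch of the SRE-modified HalfCheetah reward described above; slope,
# bonus magnitude, and per-episode sampling are assumptions, not paper values.
import numpy as np

def sample_target_velocity(rng):
    """Per-episode environment parameter: target 2 with prob. 0.98, otherwise
    the rare target of 4 (the significant rare event, SRE)."""
    return 2.0 if rng.random() < 0.98 else 4.0

def step_reward(forward_velocity, target, bonus=50.0):
    """Reward decreases linearly with distance from the target velocity; the
    rare target adds a large bonus (magnitude assumed)."""
    r = -abs(forward_velocity - target)
    if target == 4.0:
        r += bonus
    return r

rng = np.random.default_rng(0)
target = sample_target_velocity(rng)   # drawn once per episode
print(step_reward(forward_velocity=1.5, target=target))
```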