Fingerprint Policy Optimisation for Robust Reinforcement Learning
Authors: Supratik Paul, Michael A. Osborne, Shimon Whiteson
ICML 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To evaluate the empirical performance of FPO, we start by applying it to a simple problem: a modified version of the cliff walker task (Sutton & Barto, 1998), with one-dimensional state and action spaces. We then move on to simulated robotics problems based on the MuJoCo simulator (Brockman et al., 2016) with much higher dimensionalities. These were modified to include SREs. |
| Researcher Affiliation | Academia | Supratik Paul (1), Michael A. Osborne (2), Shimon Whiteson (1). (1) Department of Computer Science, University of Oxford, UK; (2) Department of Engineering Science, University of Oxford, UK. Correspondence to: Supratik Paul <supratik.paul@cs.ox.ac.uk>. |
| Pseudocode | Yes | Algorithm 1 Fingerprint Policy Optimisation. Input: initial policy π_0, original distribution p(θ), randomly initialised q_{ψ_0}(θ), policy optimisation method POLOPT, number of policy iterations N, dataset D_0 = {}. 1: for n = 1, 2, ..., N do 2: Sample θ_{1:k} from q_{ψ_{n-1}}(θ), and with π_{n-1} sample trajectories τ_{1:k} corresponding to each θ_{1:k}. 3: Compute π_n = POLOPT(τ_{1:k}) = POLOPT(ψ_{n-1}, π_{n-1}). 4: Compute J(π_n) using numerical quadrature as described in Section 3.2; use the sampled trajectories to compute the policy fingerprint as described in Section 3.3. 5: Set D_n = D_{n-1} ∪ {((ψ_{n-1}, π_{n-1}), J(π_n))} and update the GP to condition on D_n. 6: Use either the UCB (3) or FITBO (4) acquisition function to select ψ_n. 7: end for (A hedged code sketch of this loop is given below the table.) |
| Open Source Code | No | No, the paper does not provide explicit statements or links indicating that the source code for FPO (the authors' own methodology) is open-source or publicly available. It only mentions thanking Binxin Ru for sharing code for FITBO, which is a component used in their work. |
| Open Datasets | Yes | We then move on to simulated robotics problems based on the MuJoCo simulator (Brockman et al., 2016) with much higher dimensionalities. [...] In the original OpenAI Gym HalfCheetah task, the objective is to maximise forward velocity. We modify the original problem such that in 98% of the cases the objective is to achieve a target velocity of 2, with rewards decreasing linearly with the distance from the target. |
| Dataset Splits | No | No, the paper does not specify explicit training, validation, or test dataset splits. It describes modifications to continuous control tasks (e.g., Half Cheetah, Ant) and the use of a simulator, but does not provide quantitative details on how data was partitioned for different phases of model development or evaluation. |
| Hardware Specification | No | No, the paper mentions receiving 'a generous equipment grant from NVIDIA' but does not specify any particular GPU models, CPU types, memory, or other detailed hardware specifications used for running the experiments. |
| Software Dependencies | Yes | OpenAI Gym and MuJoCo are installed on a Linux server running Python 3.8 and CUDA 10.1. |
| Experiment Setup | Yes | We repeat all our experiments across 10 random starts. [...] In the original OpenAI Gym HalfCheetah task, the objective is to maximise forward velocity. We modify the original problem such that in 98% of the cases the objective is to achieve a target velocity of 2, with rewards decreasing linearly with the distance from the target. In the remaining 2%, the target velocity is set to 4, with a large bonus reward, which acts as an SRE. [...] Further experimental details are provided in Appendix A.1. (A hedged sketch of this modified reward appears below the table.) |
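
The Algorithm 1 pseudocode above compresses several moving parts: the sampling distribution q_ψ(θ), the POLOPT step, the quadrature estimate of J(π_n), the policy fingerprint, and a GP with a UCB acquisition function. The following is a minimal sketch of that loop, not the authors' implementation: the helper stubs (`sample_env_params`, `polopt`, `expected_return`, `fingerprint`), the candidate grid, and the value of `kappa` are illustrative assumptions, and the GP is an off-the-shelf scikit-learn `GaussianProcessRegressor` rather than the paper's model.

```python
# Minimal sketch of the FPO loop (Algorithm 1) with toy stand-ins for the RL
# components; this is NOT the authors' implementation.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)

# --- Hypothetical stand-ins for the RL machinery --------------------------
def sample_env_params(psi, k):
    """theta_{1:k} ~ q_psi(theta); here q_psi is assumed Gaussian with mean psi."""
    return rng.normal(loc=psi, scale=0.1, size=k)

def polopt(policy, thetas):
    """One policy-optimisation step on the sampled environments.
    Stub: nudge a scalar 'policy' parameter towards the mean sampled theta."""
    return policy + 0.1 * (thetas.mean() - policy)

def expected_return(policy):
    """J(pi): expected return under the *original* p(theta). Toy surrogate: a
    quadratic plus a narrow high-reward bump standing in for an SRE."""
    return -(policy - 1.0) ** 2 + 5.0 * np.exp(-50.0 * (policy - 2.0) ** 2)

def fingerprint(policy, thetas):
    """Low-dimensional policy fingerprint; stub: policy param and mean theta."""
    return np.array([policy, thetas.mean()])

# --- FPO loop --------------------------------------------------------------
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
X, y = [], []                      # dataset D_n of ((psi, fingerprint), J(pi_n))
psi, policy, kappa = 0.0, 0.0, 2.0

for n in range(30):
    thetas = sample_env_params(psi, k=8)      # sample theta_{1:k} from q_{psi_{n-1}}
    policy = polopt(policy, thetas)           # pi_n = POLOPT(tau_{1:k})
    J = expected_return(policy)               # quadrature estimate of J(pi_n)
    X.append(np.r_[psi, fingerprint(policy, thetas)])
    y.append(J)
    gp.fit(np.array(X), np.array(y))          # condition the GP on D_n
    # UCB acquisition: choose the next psi maximising mu + kappa * sigma
    cands = np.linspace(-1.0, 3.0, 200)
    feats = np.array([np.r_[c, fingerprint(policy, thetas)] for c in cands])
    mu, sd = gp.predict(feats, return_std=True)
    psi = float(cands[np.argmax(mu + kappa * sd)])

print(f"final policy parameter: {policy:.3f}, final psi: {psi:.3f}")
```

The structure mirrors Algorithm 1: each iteration samples θ from the current q_ψ, takes one policy-optimisation step, scores the new policy under the original p(θ), conditions a GP on the (ψ, fingerprint) → J pairs, and selects the next ψ by UCB; the paper's FITBO alternative would replace the acquisition line.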
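
The Experiment Setup row describes the SRE-modified HalfCheetah reward only in words. The sketch below restates it as code under explicit assumptions: the linear penalty slope, the bonus magnitude of 50, and per-episode sampling of the target velocity are illustrative choices, since the paper (as quoted) specifies only the probabilities, the two target velocities, and that the rare-case bonus is "large".

```python
# Hedged sketch of the SRE-modified HalfCheetah reward described above; slope,
# bonus magnitude, and per-episode sampling are assumptions, not paper values.
import numpy as np

def sample_target_velocity(rng):
    """Per-episode environment parameter: target 2 with prob. 0.98, otherwise
    the rare target of 4 (the significant rare event, SRE)."""
    return 2.0 if rng.random() < 0.98 else 4.0

def step_reward(forward_velocity, target, bonus=50.0):
    """Reward decreases linearly with distance from the target velocity; the
    rare target adds a large bonus (magnitude assumed)."""
    r = -abs(forward_velocity - target)
    if target == 4.0:
        r += bonus
    return r

rng = np.random.default_rng(0)
target = sample_target_velocity(rng)   # drawn once per episode
print(step_reward(forward_velocity=1.5, target=target))
```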