Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Non-parametric Policy Search with Limited Information Loss
Authors: Herke van Hoof, Gerhard Neumann, Jan Peters
JMLR 2017 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiments, we show the strong performance of the proposed method, and how it can be approximated efficiently. Finally, we show that our algorithm can learn a real-robot under-powered swing-up task directly from image data. |
| Researcher Affiliation | Academia | Herke van Hoof EMAIL School of Computer Science, McGill University, McConnell Eng Bldg, Room 318, 3480 University St, Montreal, Quebec, Canada Gerhard Neumann EMAIL Lincoln Centre for Autonomous Systems, Lincoln University, Lincoln, United Kingdom Jan Peters EMAIL Intelligent Autonomous Systems Institute, Technische Universität Darmstadt, Darmstadt, Germany Robot Learning Lab, Max Planck Institute for Intelligent Systems, Tübingen, Germany |
| Pseudocode | Yes | Algorithm 1 (Policy iteration with relative entropy policy search, REPS): repeat — generate roll-outs according to π_{i−1}; minimize dual: (η\*, α\*) ← arg min g(η, α) (Eq. 11); calculate Bellman errors for each sample: δ_j ← R_j + α^T(φ(s'_j) − φ(s_j)) (Eq. 13); calculate the sample weights: w_j ← exp(δ_j/η\*) (Sec. 2.4); fit a generalizing policy π_i(a\|s) = N(µ(s), σ²(s)) (Sec. 2.4) — until convergence |
| Open Source Code | No | The paper does not provide an explicit statement of code release for the methodology described, nor does it provide a direct link to a code repository. It mentions using 'the reference implementation provided in the RLlab framework (Duan et al., 2016)' which refers to a third-party tool, not their own implementation's code. A video link is provided for an experiment, not source code. |
| Open Datasets | No | The paper uses a modified version of the puddle-world task, following 'Sutton (1996)' for its description, which is a conceptual task definition rather than a specific dataset. For the real-robot experiment, the data is generated by the robot itself ('The camera provides video frames'). There is no mention of a specific, publicly available dataset with a link, DOI, repository, or formal citation that provides access to the raw data used in the experiments. |
| Dataset Splits | No | The paper mentions generating 'roll-outs' and a 'forgetting mechanism that only keeps the latest 30 roll-outs'. It also states, 'In each iteration, 20 roll-outs were performed, retaining the roll-outs from the last 3 iterations in memory'. While cross-validation is mentioned for hyper-parameter optimization ('two-fold cross-validation'), the dynamic, online nature of data collection and retention does not describe a fixed, reproducible split of a static dataset into training, validation, or test sets by percentages or counts. |
| Hardware Specification | Yes | An indication of the time requirement of the different methods is given in Figure 5d. ... Time requirement of different approximation methods on a 2.7 GHz processor running in single threads, log scale. |
| Software Dependencies | No | The paper mentions using 'the RLlab framework (Duan et al., 2016)', and an implementation in Python is implied rather than stated, but it does not specify version numbers for these or any other software libraries or dependencies. For example, it does not state 'RLlab X.Y.Z' or 'Python 3.X'. |
| Experiment Setup | Yes | For REPS, we use a KL bound ϵ of 0.5 in our experiments... The exploration parameter ϵ of the NPALP method was set to 0.1, with the standard deviation of Gaussian noise set to 30Nm2. The Lipschitz constant was set to 1... The greediness parameter c for on-policy value iteration was set to 2... For NP-REPS, we set the bandwidth of the value function to half the maximal pixel intensity and added an ℓ2 regularizer 10⁻⁹ α^T α... For DDPG, we obtained a learning rate of 10⁻⁴ for both the Q function and policy, a minimum replay memory size of 1500, a maximum memory size of 5000, and a batch size of 64 (the relatively high batch size made the method more sample efficient). For DDPG, we used a Gaussian exploration strategy with a decay period of 5000... best performance for two layers with 4 hidden units each for TRPO and two layers with 8 hidden units for DDPG. ... The bound on the KL divergence was set to ϵ = 0.8. We used 1000 random basis features... these three bandwidth parameters were set by hand to 0.5, 1.4, and 4.0, respectively... |
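The Algorithm 1 steps quoted in the Pseudocode row (minimize the dual g(η, α), compute Bellman errors δ_j, weight samples by exp(δ_j/η)) can be sketched as follows. This is a minimal reconstruction using the standard REPS dual form, not the authors' implementation; for brevity α is held fixed at zero and η is found by grid search rather than joint optimization, and all function and variable names are ours.

```python
import numpy as np

def reps_dual(eta, alpha, rewards, phi_s, phi_s_next, epsilon):
    """Standard REPS dual g(eta, alpha) under a KL bound epsilon
    (reconstruction of Eq. 11's role; stabilized with log-sum-exp)."""
    # Bellman errors delta_j = R_j + alpha^T (phi(s'_j) - phi(s_j))  (Eq. 13)
    delta = rewards + (phi_s_next - phi_s) @ alpha
    d_max = np.max(delta)
    return eta * epsilon + d_max + eta * np.log(np.mean(np.exp((delta - d_max) / eta)))

def reps_weights(rewards, phi_s, phi_s_next, epsilon=0.5):
    """Pick eta minimizing the dual (alpha fixed at 0 for brevity) and
    return normalized sample weights w_j proportional to exp(delta_j / eta)."""
    etas = np.logspace(-2, 2, 200)
    alpha = np.zeros(phi_s.shape[1])
    vals = [reps_dual(e, alpha, rewards, phi_s, phi_s_next, epsilon) for e in etas]
    eta_star = etas[int(np.argmin(vals))]
    delta = rewards + (phi_s_next - phi_s) @ alpha
    w = np.exp((delta - delta.max()) / eta_star)  # shift for numerical stability
    return w / w.sum(), eta_star

# Toy roll-out data standing in for sampled transitions
rng = np.random.default_rng(0)
rewards = rng.normal(size=50)
phi_s = rng.normal(size=(50, 3))
phi_s_next = rng.normal(size=(50, 3))
weights, eta_star = reps_weights(rewards, phi_s, phi_s_next, epsilon=0.5)
```

In the full algorithm these weights would then be used to fit the generalizing Gaussian policy π_i(a|s); here they simply illustrate the exponential reweighting that the KL bound ϵ induces.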