Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Hierarchical Relative Entropy Policy Search
Authors: Christian Daniel, Gerhard Neumann, Oliver Kroemer, Jan Peters
JMLR 2016 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present three different variants of our algorithm, designed to be suitable for a wide variety of real world robot learning tasks and evaluate our algorithms in two real robot learning scenarios as well as several simulations and comparisons. Keywords: Reinforcement Learning, Policy Search, Hierarchical Learning, Robot Learning, Motor Skill Learning, Robust Learning, Structured Learning, Temporal Correlation, HiREPS, REPS |
| Researcher Affiliation | Academia | Christian Daniel1 EMAIL Gerhard Neumann1 EMAIL Oliver Kroemer1 EMAIL Jan Peters1,2 EMAIL 1Technische Universität Darmstadt, Fachbereich Informatik, Fachgruppe Intelligente Autonome Systeme, Hochschulstraße 10, 64289 Darmstadt, Germany 2Max-Planck-Institut für Intelligente Systeme, Spemannstraße 38, 72076 Tübingen, Germany |
| Pseudocode | Yes | Table 1: Episodic HiREPS. In each iteration the algorithm starts by sampling a sub-policy o from the gating policy π(o|s) given the initial state s, and an action a from the sub-policy π(a|s, o). Subsequently, the action is executed to generate the reward r(s, a). The parameters η, ξ and θ are determined by minimizing the dual function g. Table 2: Time-indexed HiREPS. In each iteration the algorithm starts by sampling from the policy π1 given the initial state s1 and executes the sampled action to generate the next state s2. From this state, the next action is sampled with policy π2. This procedure is repeated until the final time step is reached. The algorithm observes state transitions and rewards for each step k and the final reward signal r(s). The parameters η1:K+1, ξ1:K+1 and θ1:K+1 are determined by minimizing the dual function g, where η1:K+1 and ξ1:K+1 are vectors containing the Lagrangian parameters ηk and ξk for each decision step. Table 3: Infinite Horizon HiREPS. The algorithm follows the form of the episodic implementation; however, one iteration produces state, action, reward triples for each time step in each rollout of an iteration. In the infinite horizon case, a transition model P^a_{ss'} is required to compute E[V(s')]. |
| Open Source Code | No | The paper does not contain an explicit statement or link indicating that the source code for the described methodology is publicly available or released. It mentions third-party software like DMPs but not its own implementation code. |
| Open Datasets | No | The paper describes various tasks (e.g., Puddle World, Tetherball, Robot Hockey) which appear to be custom-designed simulations or real-world experimental setups, rather than referring to external, publicly available datasets with specific access information (links, DOIs, or standard citations). For instance, for the Puddle World experiment, it states: "In a first toy task, we test HiREPS on a variation of the puddle world (Sutton, 1996). While this task is of limited difficulty it is interesting as it is a well known setting which exhibits the averaging problem of interest to us. Additionally, the simplicity of the problem allows us to thoroughly assess the quality of the solutions found by the RL agent, which is often difficult in real robot tasks. Our version differs from the standard version by having a continuous action space instead of a discrete one." This indicates a modified, custom task setup rather than a public dataset. |
| Dataset Splits | No | The paper describes collecting 'samples per iteration' or 'trajectories per iteration' in the context of reinforcement learning, where data is generated dynamically. For example, 'we use ten samples per iteration', '50 samples per iteration', '15 trajectories per iteration', '20 samples per iteration', '30 samples per iteration'. This refers to the number of episodes or rollouts collected during the learning process, not a static partitioning into training, validation, or test sets as typically found in supervised learning with fixed datasets. |
| Hardware Specification | Yes | For the robot experiment, we mounted a table-tennis paddle to the end-effector of the robot arm. In order to track the ball, a Kinect RGBD camera was set up to look at the robot from the opposite side of the pole. We used a DLR-Kuka lightweight arm with 7 degrees of freedom as depicted in Figure 15a. The ball's position and velocity were determined using a Microsoft Kinect which was mounted above the robot. |
| Software Dependencies | No | The paper mentions Dynamic Movement Primitives (DMPs) and other general concepts, but does not specify any particular software libraries, frameworks, or solvers with their version numbers that would be necessary for reproduction (e.g., Python 3.x, TensorFlow x.x, PyTorch x.x, or a specific solver version). |
| Experiment Setup | Yes | The number of options can usually be chosen generously, i.e., around 20 seems to be reasonable for a wide range of problems. The entropy bound κ is probably the most important parameter to consider since it does not have a clear equivalent in existing approaches. However, our experiments showed that a value of 0.9 seems to work well in almost all cases and no major tuning was necessary. The parameter ϵ is probably the parameter that is the most tuning-intensive in the proposed method, especially if the total number of episodes is crucial, e.g., in real robot experiments. In our experience values for ϵ between 0.5 and 1.5 are reasonable and, most often, we would start a new task with ϵ = 1. For HiREPS, just two sub-policies were used. REPS takes a longer time to reliably find good solutions, as the algorithm averages over both modes. HiREPS without bounded entropy performs slightly better than REPS. We initialize HiREPS with 30 randomly located sub-policies and use ten samples per iteration. We run the algorithms with 50 samples per iteration and always keep the last 400 samples. We initialize our algorithm with 30 sub-policies and stop deleting sub-policies if only 5 sub-policies are left. We initialized the algorithm with 15 sub-policies and sampled 15 trajectories per iteration. In this task, the pendulum starts hanging down with a random perturbation. The goal of the robot is to find a solution that first swings up the pendulum and then stabilizes the pendulum at the top. The pendulum has a mass of 10 kg, a length of 0.5 m and a friction coefficient of 0.2. The robot can exert at most 30 Nm of torque and uses 20 samples per learning iteration. The internal robot control runs at 100 Hz and the restart probability in the base setting is given as (1 − γ) = 0.02/d, where d determines the control frequency of the learned policy. The reward function punishes deviation from the desired upright position with a factor of 500 and punishes velocities with a factor of 10. These punishments are subtracted from a base value of 500. Every policy was evaluated for 20 s and in each iteration the agent collected three episodes before updating the policy. |
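The Pseudocode row describes updates driven by minimizing a dual function g over the temperature η. The sketch below is not the authors' HiREPS implementation; it illustrates only the flat episodic REPS reweighting pattern that HiREPS builds on, under assumed simple forms: exponential sample weights exp(R/η) with η chosen by minimizing the standard episodic dual g(η) = ηϵ + η log mean exp(R/η), here via a crude grid search. The gating policy, entropy bound κ, and the parameters ξ and θ are omitted.

```python
import math
import random

def reps_weights(rewards, epsilon=1.0):
    """Exponential REPS weights with temperature eta chosen by
    minimizing the dual g(eta) = eta*epsilon + eta*log mean exp(R/eta).
    A log-spaced grid search stands in for a proper convex optimizer."""
    r_max = max(rewards)  # shift rewards for numerical stability
    def dual(eta):
        return eta * epsilon + eta * math.log(
            sum(math.exp((r - r_max) / eta) for r in rewards) / len(rewards)
        ) + r_max
    eta = min((10 ** (i / 10) for i in range(-20, 21)), key=dual)
    return [math.exp((r - r_max) / eta) for r in rewards]

# Toy usage: improve the mean of a 1-D Gaussian "policy" on r(a) = -(a - 2)^2.
random.seed(0)
mean, std = 0.0, 1.0
for _ in range(30):
    actions = [random.gauss(mean, std) for _ in range(50)]
    rewards = [-(a - 2.0) ** 2 for a in actions]
    w = reps_weights(rewards, epsilon=1.0)
    # Weighted maximum-likelihood update of the policy mean.
    mean = sum(wi * ai for wi, ai in zip(w, actions)) / sum(w)
```

After a few dozen iterations the policy mean settles near the optimum a = 2; the per-iteration KL bound ϵ controls how greedily the weights concentrate on the best samples.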
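The pendulum reward in the Experiment Setup row is described only by its weighting factors; the functional form of each penalty is not quoted. A hedged reconstruction, assuming quadratic penalties (a common choice, not confirmed by the quoted text):

```python
def pendulum_reward(theta_err, theta_dot):
    """Hedged reconstruction of the swing-up reward quoted above:
    a base value of 500, minus a position penalty weighted by 500 and
    a velocity penalty weighted by 10. The quadratic form of the
    penalties is an assumption; the paper states only the factors."""
    return 500.0 - 500.0 * theta_err ** 2 - 10.0 * theta_dot ** 2

r_top = pendulum_reward(0.0, 0.0)  # upright and at rest: full base value
r_off = pendulum_reward(0.5, 1.0)  # perturbed state: penalized
```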