Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Hierarchical Relative Entropy Policy Search
Authors: Christian Daniel, Gerhard Neumann, Oliver Kroemer, Jan Peters
JMLR 2016 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present three different variants of our algorithm, designed to be suitable for a wide variety of real world robot learning tasks and evaluate our algorithms in two real robot learning scenarios as well as several simulations and comparisons. Keywords: Reinforcement Learning, Policy Search, Hierarchical Learning, Robot Learning, Motor Skill Learning, Robust Learning, Structured Learning, Temporal Correlation, HiREPS, REPS |
| Researcher Affiliation | Academia | Christian Daniel1 EMAIL Gerhard Neumann1 EMAIL Oliver Kroemer1 EMAIL Jan Peters1,2 EMAIL 1Technische Universität Darmstadt, Fachbereich Informatik, Fachgruppe Intelligente Autonome Systeme, Hochschulstraße 10, 64289 Darmstadt, Germany 2Max-Planck-Institut für Intelligente Systeme, Spemannstraße 38, 72076 Tübingen, Germany |
| Pseudocode | Yes | Table 1: Episodic HiREPS. In each iteration the algorithm starts by sampling a sub-policy o from the gating policy π(o|s) given the initial state s, and an action a from the sub-policy π(a|s, o). Subsequently, the action is executed to generate the reward r(s, a). The parameters η, ξ and θ are determined by minimizing the dual function g. Table 2: Time-indexed HiREPS. In each iteration the algorithm starts by sampling from the policy π1 given the initial state s1 and executes the sampled action to generate the next state s2. From this state, the next action is sampled with policy π2. This procedure is repeated until the final time step is reached. The algorithm observes state transitions and rewards for each step k and the final reward signal r(s). The parameters η1:K+1, ξ1:K+1 and θ1:K+1 are determined by minimizing the dual function g, where η1:K+1 and ξ1:K+1 are vectors containing the Lagrangian parameters ηk and ξk for each decision step. Table 3: Infinite Horizon HiREPS. The algorithm follows the form of the episodic implementation; however, one iteration produces state, action, reward triples for each time step in each rollout of an iteration. In the infinite horizon case, a transition model P^a_{ss'} is required to compute E[V(s')]. |
| Open Source Code | No | The paper does not contain an explicit statement or link indicating that the source code for the described methodology is publicly available or released. It mentions third-party software like DMPs but not its own implementation code. |
| Open Datasets | No | The paper describes various tasks (e.g., Puddle World, Tetherball, Robot Hockey) which appear to be custom-designed simulations or real-world experimental setups, rather than referring to external, publicly available datasets with specific access information (links, DOIs, or standard citations). For instance, for the Puddle World experiment, it states: "In a first toy task, we test HiREPS on a variation of the puddle world (Sutton, 1996). While this task is of limited difficulty it is interesting as it is a well known setting which exhibits the averaging problem of interest to us. Additionally, the simplicity of the problem allows us to thoroughly assess the quality of the solutions found by the RL agent, which is often difficult in real robot tasks. Our version differs from the standard version by having a continuous action space instead of a discrete one." This indicates a modified, custom task setup rather than a public dataset. |
| Dataset Splits | No | The paper describes collecting 'samples per iteration' or 'trajectories per iteration' in the context of reinforcement learning, where data is generated dynamically. For example, 'we use ten samples per iteration', '50 samples per iteration', '15 trajectories per iteration', '20 samples per iteration', '30 samples per iteration'. This refers to the number of episodes or rollouts collected during the learning process, not a static partitioning into training, validation, or test sets as typically found in supervised learning with fixed datasets. |
| Hardware Specification | Yes | For the robot experiment, we mounted a table-tennis paddle to the end-effector of the robot arm. In order to track the ball, a Kinect RGBD camera was set up to look at the robot from the opposite side of the pole. We used a DLR-Kuka lightweight arm with 7 degrees of freedom as depicted in Figure 15a. The ball's position and velocity were determined using a Microsoft Kinect which was mounted above the robot. |
| Software Dependencies | No | The paper mentions Dynamic Movement Primitives (DMPs) and other general concepts, but does not specify any particular software libraries, frameworks, or solvers with their version numbers that would be necessary for reproduction (e.g., Python 3.x, TensorFlow x.x, PyTorch x.x, or a specific solver version). |
| Experiment Setup | Yes | The number of options can usually be chosen generously, i.e., around 20 seems to be reasonable for a wide range of problems. The entropy bound κ is probably the most important parameter to consider since it does not have a clear equivalent in existing approaches. However, our experiments showed that a value of 0.9 seems to work well in almost all cases and no major tuning was necessary. The parameter ϵ is probably the parameter that is the most tuning-intensive in the proposed method, especially if the total number of episodes is crucial, e.g., in real robot experiments. In our experience values for ϵ between 0.5 and 1.5 are reasonable and, most often, we would start a new task with ϵ = 1. For HiREPS, just two sub-policies were used. REPS takes a longer time to reliably find good solutions, as the algorithm averages over both modes. HiREPS without bounded entropy performs slightly better than REPS. We initialize HiREPS with 30 randomly located sub-policies and use ten samples per iteration. We run the algorithms with 50 samples per iteration and always keep the last 400 samples. We initialize our algorithm with 30 sub-policies and stop deleting sub-policies if only 5 sub-policies are left. We initialized the algorithm with 15 sub-policies and sampled 15 trajectories per iteration. In this task, the pendulum starts hanging down with a random perturbation. The goal of the robot is to find a solution that first swings up the pendulum and then stabilizes the pendulum at the top. The pendulum has a mass of 10 kg, a length of 0.5 m and a friction coefficient of 0.2. The robot can exert at most 30 Nm of torque and uses 20 samples per learning iteration. The internal robot control runs at 100 Hz and the restart probability in the base setting is given as (1 − γ) = 0.02/d, where d determines the control frequency of the learned policy. The reward function punishes deviation from the desired upright position with a factor of 500 and punishes velocities with a factor of 10. These punishments are subtracted from a base value of 500. Every policy was evaluated for 20 s and in each iteration the agent collected three episodes before updating the policy. |
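The Pseudocode row describes updates driven by minimizing a dual function g over the temperature η. The sketch below is not the authors' HiREPS implementation; it illustrates only the flat episodic REPS reweighting pattern that HiREPS builds on, under assumed simple forms: exponential sample weights exp(R/η) with η chosen by minimizing the standard episodic dual g(η) = ηϵ + η log mean exp(R/η), here via a crude grid search. The gating policy, entropy bound κ, and the parameters ξ and θ are omitted.

```python
import math
import random

def reps_weights(rewards, epsilon=1.0):
    """Exponential REPS weights with temperature eta chosen by
    minimizing the dual g(eta) = eta*epsilon + eta*log mean exp(R/eta).
    A log-spaced grid search stands in for a proper convex optimizer."""
    r_max = max(rewards)  # shift rewards for numerical stability
    def dual(eta):
        return eta * epsilon + eta * math.log(
            sum(math.exp((r - r_max) / eta) for r in rewards) / len(rewards)
        ) + r_max
    eta = min((10 ** (i / 10) for i in range(-20, 21)), key=dual)
    return [math.exp((r - r_max) / eta) for r in rewards]

# Toy usage: improve the mean of a 1-D Gaussian "policy" on r(a) = -(a - 2)^2.
random.seed(0)
mean, std = 0.0, 1.0
for _ in range(30):
    actions = [random.gauss(mean, std) for _ in range(50)]
    rewards = [-(a - 2.0) ** 2 for a in actions]
    w = reps_weights(rewards, epsilon=1.0)
    # Weighted maximum-likelihood update of the policy mean.
    mean = sum(wi * ai for wi, ai in zip(w, actions)) / sum(w)
```

After a few dozen iterations the policy mean settles near the optimum a = 2; the per-iteration KL bound ϵ controls how greedily the weights concentrate on the best samples.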
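The pendulum reward in the Experiment Setup row is described only by its weighting factors; the functional form of each penalty is not quoted. A hedged reconstruction, assuming quadratic penalties (a common choice, not confirmed by the quoted text):

```python
def pendulum_reward(theta_err, theta_dot):
    """Hedged reconstruction of the swing-up reward quoted above:
    a base value of 500, minus a position penalty weighted by 500 and
    a velocity penalty weighted by 10. The quadratic form of the
    penalties is an assumption; the paper states only the factors."""
    return 500.0 - 500.0 * theta_err ** 2 - 10.0 * theta_dot ** 2

r_top = pendulum_reward(0.0, 0.0)  # upright and at rest: full base value
r_off = pendulum_reward(0.5, 1.0)  # perturbed state: penalized
```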