Plan-Based Relaxed Reward Shaping for Goal-Directed Tasks

Authors: Ingmar Schubert, Ozgur S. Oguz, Marc Toussaint

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Here, we use examples of simulated robotic manipulation tasks to demonstrate that plan-based FV-RS can indeed significantly improve the sample efficiency of RL over plan-based PB-RS.
Researcher Affiliation | Academia | Ingmar Schubert (1), Ozgur S. Oguz (2, 3), and Marc Toussaint (1, 2); (1) Learning and Intelligent Systems Group, TU Berlin, Germany; (2) Max Planck Institute for Intelligent Systems, Stuttgart, Germany; (3) Machine Learning and Robotics Lab, University of Stuttgart, Germany
Pseudocode | No | The paper does not include pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | Implementation details can be found in the supplementary code material.
Open Datasets | No | The paper describes setting up a simulated environment similar to FetchPush-v1 in OpenAI Gym using the NVIDIA PhysX engine, but it does not specify a publicly available pre-collected dataset for training. Data is generated through interaction in the simulation (see the environment sketch after the table).
Dataset Splits | No | The paper mentions collecting data in rollout episodes and running test episodes, but it does not specify explicit training, validation, or test dataset splits in the traditional sense, as data is generated dynamically through simulation.
Hardware Specification | No | The paper mentions using the NVIDIA PhysX engine for simulation but does not specify the hardware (CPU or GPU models, etc.) used to run the experiments.
Software Dependencies | No | The paper mentions using DDPG, PPO, and TensorFlow, as well as an open-source implementation from Barhate (2020), but it does not provide specific version numbers for any of these software components.
Experiment Setup | Yes | We use γ = 0.9 as the discount factor. The agent collects data in rollout episodes of random length sampled uniformly between 0 and 300. Each of these episodes starts at the same initial position indicated in figure 2a; we do not assume that the system can be reset to arbitrary states. The exploration policy acts ϵ-greedily with respect to the current actor, where ϵ = 0.2. After every rollout, actor and critic are updated using the replay buffer. Both actor and critic are implemented as neural networks in TensorFlow (Abadi et al., 2016). We use A = 40 and M = 30 in all experiments reported. The actor network outputs mean values between -0.1 and 0.1 for the Gaussian policy, while the standard deviation of the policy output is fixed to 0.015. We use the clipping parameter ϵ = 0.1 (see Schulman et al. (2017)). A sketch of this setup follows the table.
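
The Open Datasets row above refers to a simulation similar to FetchPush-v1 from OpenAI Gym. The snippet below is only a minimal sketch of that reference task: the paper's actual experiments run in a custom NVIDIA PhysX-based simulation, so the Gym environment name and the random placeholder policy here are illustrative assumptions, not the authors' setup. It merely illustrates that training data is generated on the fly through interaction rather than loaded from a pre-collected dataset.

```python
import gym  # assumes a gym version with the robotics (MuJoCo) environments installed

# Illustration only: the paper uses its own NVIDIA PhysX simulation that is
# merely similar to FetchPush-v1; this Gym environment stands in for it here.
env = gym.make("FetchPush-v1")

obs = env.reset()
for _ in range(100):
    action = env.action_space.sample()          # placeholder random policy
    obs, reward, done, info = env.step(action)  # transitions are generated on the fly
    if done:
        obs = env.reset()
env.close()
```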
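
As referenced in the Experiment Setup row, the following is a minimal sketch of the described exploration and rollout scheme. Only the numeric values (γ = 0.9, ε = 0.2, episode lengths sampled uniformly from [0, 300], actor means in [-0.1, 0.1], fixed policy standard deviation 0.015) come from the quoted text; the names actor_mean and collect_rollout and the replay-buffer interface are assumptions, and the actor/critic updates after every rollout (TensorFlow networks, per the paper) are omitted.

```python
import numpy as np

GAMMA = 0.9            # discount factor (quoted from the paper)
EPSILON = 0.2          # epsilon-greedy exploration probability
MAX_EPISODE_LEN = 300  # rollout length sampled uniformly from [0, 300]
ACTION_LOW, ACTION_HIGH = -0.1, 0.1  # range of the actor's mean output
POLICY_STD = 0.015     # fixed standard deviation of the Gaussian policy


def exploration_action(actor_mean, state, rng):
    """Epsilon-greedy action around the current actor's Gaussian policy."""
    mean = np.clip(actor_mean(state), ACTION_LOW, ACTION_HIGH)
    if rng.random() < EPSILON:
        # Exploratory branch: uniformly random action within the action bounds.
        return rng.uniform(ACTION_LOW, ACTION_HIGH, size=mean.shape)
    # Greedy branch: sample from the Gaussian policy centred on the actor mean.
    return rng.normal(mean, POLICY_STD)


def collect_rollout(env, actor_mean, replay_buffer, rng):
    """One rollout episode of random length, starting from the fixed initial state."""
    episode_len = rng.integers(0, MAX_EPISODE_LEN + 1)
    state = env.reset()  # always the same initial position; no arbitrary resets
    for _ in range(episode_len):
        action = exploration_action(actor_mean, state, rng)
        next_state, reward, done, info = env.step(action)
        replay_buffer.append((state, action, reward, next_state, done))
        state = next_state
        if done:
            break
```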