Plan-Based Relaxed Reward Shaping for Goal-Directed Tasks

Authors: Ingmar Schubert, Ozgur S. Oguz, Marc Toussaint

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Here, we use examples of simulated robotic manipulation tasks to demonstrate that plan-based FV-RS can indeed significantly improve the sample efficiency of RL over plan-based PB-RS.
Researcher Affiliation | Academia | Ingmar Schubert (1), Ozgur S. Oguz (2, 3), and Marc Toussaint (1, 2); (1) Learning and Intelligent Systems Group, TU Berlin, Germany; (2) Max Planck Institute for Intelligent Systems, Stuttgart, Germany; (3) Machine Learning and Robotics Lab, University of Stuttgart, Germany
Pseudocode | No | The paper does not include pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | Implementation details can be found in the supplementary code material.
Open Datasets | No | The paper describes setting up a simulated environment similar to FetchPush-v1 in OpenAI Gym using the NVIDIA PhysX engine, but it does not specify a publicly available pre-collected dataset for training. Data is generated through interaction in the simulation (see the environment sketch after the table).
Dataset Splits | No | The paper mentions collecting data in rollout episodes and running test episodes, but it does not specify explicit training, validation, or test dataset splits in the traditional sense, as data is generated dynamically through simulation.
Hardware Specification | No | The paper mentions using the NVIDIA PhysX engine for simulation but does not specify the hardware (CPU or GPU models, etc.) used to run the experiments.
Software Dependencies | No | The paper mentions using DDPG, PPO, and TensorFlow, as well as an open-source implementation from Barhate (2020), but it does not provide specific version numbers for any of these software components.
Experiment Setup | Yes | We use γ = 0.9 as the discount factor. The agent collects data in rollout episodes of random length sampled uniformly between 0 and 300. Each of these episodes starts at the same initial position indicated in figure 2a; we do not assume that the system can be reset to arbitrary states. The exploration policy acts ϵ-greedily with respect to the current actor, where ϵ = 0.2. After every rollout, actor and critic are updated using the replay buffer. Both actor and critic are implemented as neural networks in TensorFlow (Abadi et al., 2016). We use A = 40 and M = 30 in all experiments reported. The actor network outputs mean values between -0.1 and 0.1 for the Gaussian policy, while the standard deviation of the policy output is fixed to 0.015. We use the clipping parameter ϵ = 0.1 (see Schulman et al. (2017)). A sketch of this setup follows the table.
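
The Open Datasets row above refers to a simulation similar to FetchPush-v1 from OpenAI Gym. The snippet below is only a minimal sketch of that reference task: the paper's actual experiments run in a custom NVIDIA PhysX-based simulation, so the Gym environment name and the random placeholder policy here are illustrative assumptions, not the authors' setup. It merely illustrates that training data is generated on the fly through interaction rather than loaded from a pre-collected dataset.

```python
import gym  # assumes a gym version with the robotics (MuJoCo) environments installed

# Illustration only: the paper uses its own NVIDIA PhysX simulation that is
# merely similar to FetchPush-v1; this Gym environment stands in for it here.
env = gym.make("FetchPush-v1")

obs = env.reset()
for _ in range(100):
    action = env.action_space.sample()          # placeholder random policy
    obs, reward, done, info = env.step(action)  # transitions are generated on the fly
    if done:
        obs = env.reset()
env.close()
```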
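
As referenced in the Experiment Setup row, the following is a minimal sketch of the described exploration and rollout scheme. Only the numeric values (γ = 0.9, ε = 0.2, episode lengths sampled uniformly from [0, 300], actor means in [-0.1, 0.1], fixed policy standard deviation 0.015) come from the quoted text; the names actor_mean and collect_rollout and the replay-buffer interface are assumptions, and the actor/critic updates after every rollout (TensorFlow networks, per the paper) are omitted.

```python
import numpy as np

GAMMA = 0.9            # discount factor (quoted from the paper)
EPSILON = 0.2          # epsilon-greedy exploration probability
MAX_EPISODE_LEN = 300  # rollout length sampled uniformly from [0, 300]
ACTION_LOW, ACTION_HIGH = -0.1, 0.1  # range of the actor's mean output
POLICY_STD = 0.015     # fixed standard deviation of the Gaussian policy


def exploration_action(actor_mean, state, rng):
    """Epsilon-greedy action around the current actor's Gaussian policy."""
    mean = np.clip(actor_mean(state), ACTION_LOW, ACTION_HIGH)
    if rng.random() < EPSILON:
        # Exploratory branch: uniformly random action within the action bounds.
        return rng.uniform(ACTION_LOW, ACTION_HIGH, size=mean.shape)
    # Greedy branch: sample from the Gaussian policy centred on the actor mean.
    return rng.normal(mean, POLICY_STD)


def collect_rollout(env, actor_mean, replay_buffer, rng):
    """One rollout episode of random length, starting from the fixed initial state."""
    episode_len = rng.integers(0, MAX_EPISODE_LEN + 1)
    state = env.reset()  # always the same initial position; no arbitrary resets
    for _ in range(episode_len):
        action = exploration_action(actor_mean, state, rng)
        next_state, reward, done, info = env.step(action)
        replay_buffer.append((state, action, reward, next_state, done))
        state = next_state
        if done:
            break
```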