Plan-Based Relaxed Reward Shaping for Goal-Directed Tasks
Authors: Ingmar Schubert, Ozgur S. Oguz, Marc Toussaint
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Here, we use examples of simulated robotic manipulation tasks to demonstrate that plan-based FV-RS can indeed significantly improve the sample efficiency of RL over plan-based PB-RS. |
| Researcher Affiliation | Academia | Ingmar Schubert (1), Ozgur S. Oguz (2, 3), and Marc Toussaint (1, 2); 1: Learning and Intelligent Systems Group, TU Berlin, Germany; 2: Max Planck Institute for Intelligent Systems, Stuttgart, Germany; 3: Machine Learning and Robotics Lab, University of Stuttgart, Germany |
| Pseudocode | No | The paper does not include pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | Implementation details can be found in the supplementary code material. |
| Open Datasets | No | The paper describes setting up a simulated environment similar to FetchPush-v1 in OpenAI Gym and using the NVIDIA PhysX engine, but it does not specify a publicly available pre-collected dataset for training. Data is generated through interaction with the simulation. |
| Dataset Splits | No | The paper mentions collecting data in rollout episodes and running test episodes, but it does not specify explicit training, validation, or test dataset splits in the traditional sense, as data is generated dynamically through simulation. |
| Hardware Specification | No | The paper mentions using the NVIDIA PhysX engine for simulation but does not specify the hardware (CPU, GPU models, etc.) used to run the experiments. |
| Software Dependencies | No | The paper mentions using DDPG, PPO, and TensorFlow, as well as an open-source implementation from Barhate (2020), but it does not provide specific version numbers for any of these software components. |
| Experiment Setup | Yes | We use γ = 0.9 as discount factor. The agent collects data in rollout episodes of random length sampled uniformly between 0 and 300. Each of these episodes starts at the same initial position indicated in figure 2a; we do not assume that the system can be reset to arbitrary states. The exploration policy acts ϵ-greedily with respect to the current actor, where ϵ = 0.2. After every rollout, actor and critic are updated using the replay buffer. Both actor and critic are implemented as neural networks in TensorFlow (Abadi et al., 2016). We use A = 40 and M = 30 in all experiments reported. The actor network outputs mean values between −0.1 and 0.1 for the Gaussian policy, while the standard deviation of the policy output is fixed to 0.015. We use the clipping parameter ϵ = 0.1 (see Schulman et al. (2017)). |
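For context on the FV-RS versus PB-RS comparison quoted in the Research Type row: potential-based reward shaping (PB-RS, Ng et al., 1999) restricts the shaping term to a discounted potential difference, which guarantees that the set of optimal policies is preserved, while FV-RS relaxes this restriction (the paper's exact FV-RS conditions are given in its method section and are not reproduced here). As a reference point, a minimal statement of the standard PB-RS shaped reward is:

```latex
% Potential-based reward shaping (PB-RS), Ng et al. (1999):
% the shaped reward \tilde{r} adds a discounted potential difference,
% which leaves the set of optimal policies unchanged.
\tilde{r}(s, a, s') = r(s, a, s') + \gamma \, \Phi(s') - \Phi(s)
```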
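The Experiment Setup row above quotes the paper's hyperparameters (γ = 0.9, ϵ-greedy exploration with ϵ = 0.2, a Gaussian policy with fixed standard deviation 0.015, rollout lengths sampled uniformly up to 300 steps). Below is a minimal sketch of that exploration setup; the names (EPSILON, ACTION_STD, exploration_action) and the uniform-random fallback action are illustrative assumptions, not taken from the paper's supplementary code.

```python
# Minimal sketch of the exploration setup described in the Experiment Setup row.
# Constants mirror the values quoted from the paper; function and variable
# names are illustrative, not from the authors' implementation.
import numpy as np

GAMMA = 0.9                      # discount factor reported in the paper
EPSILON = 0.2                    # epsilon-greedy exploration parameter
ACTION_STD = 0.015               # fixed std. dev. of the Gaussian policy
ACTION_MEAN_RANGE = (-0.1, 0.1)  # actor mean output range (as quoted)
MAX_EPISODE_STEPS = 300          # rollout length upper bound

rng = np.random.default_rng(0)

def episode_length():
    # Rollout length sampled uniformly between 0 and 300 steps (inclusive).
    return rng.integers(0, MAX_EPISODE_STEPS + 1)

def exploration_action(actor_mean, action_low, action_high):
    """Epsilon-greedy Gaussian exploration around the actor's mean output.

    With probability EPSILON a uniformly random action is taken; otherwise
    the action is sampled from a Gaussian centred on the actor output with
    fixed standard deviation ACTION_STD. This mirrors the paper's description;
    the exact implementation in the supplementary code may differ.
    """
    if rng.random() < EPSILON:
        return rng.uniform(action_low, action_high)
    return rng.normal(loc=actor_mean, scale=ACTION_STD)
```

Algorithm-specific constants from the quote, such as A = 40 and M = 30, are defined in the paper and are omitted from this sketch.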