Guided Exploration with Proximal Policy Optimization using a Single Demonstration
Authors: Gabriele Libardi, Gianni De Fabritiis, Sebastian Dittert
ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We train an agent on a combination of demonstrations and its own experience to solve problems with variable initial conditions, and we integrate it with proximal policy optimization (PPO). We finally compare variations of this algorithm to different imitation learning algorithms on a set of hard-exploration tasks in the Animal-AI Olympics environment. To test this new algorithm we created a benchmark of hard-exploration problems of varying levels of difficulty using the Animal-AI Olympics challenge environment (Beyret et al., 2019; Crosby et al., 2019). We also included some experiments on tasks that have already been extensively studied in the literature: the ReacherPyBulletEnv-v0 (Coumans & Bai, 2017) and LunarLander-v2 (Brockman et al., 2016) tasks. |
| Researcher Affiliation | Academia | 1 Computational Science Laboratory, Universitat Pompeu Fabra (UPF); 2 ICREA. Correspondence to: Gabriele Libardi <gabrielelibardi@yahoo.it>, Gianni De Fabritiis <gianni.defabritiis@upf.edu>. |
| Pseudocode | Yes | Algorithm 1 PPO+D |
| Open Source Code | Yes | The source code is available at https://github.com/compsciencelab/ppo_D. |
| Open Datasets | Yes | To test this new algorithm we created a benchmark of hard-exploration problems of varying levels of difficulty using the Animal-AI Olympics challenge environment (Beyret et al., 2019; Crosby et al., 2019). We also included some experiments on tasks that have already been extensively studied in the literature: the ReacherPyBulletEnv-v0 (Coumans & Bai, 2017) and LunarLander-v2 (Brockman et al., 2016) tasks. |
| Dataset Splits | No | The paper describes using different initial conditions and training with varying hyperparameters but does not provide explicit train/validation/test dataset splits (e.g., percentages or counts) or refer to standard predefined splits for specific datasets beyond using the environments/tasks themselves. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments (e.g., GPU models, CPU types, or memory specifications). |
| Software Dependencies | No | The paper mentions 'Pytorch implementations of reinforcement learning algorithms' and cites a GitHub repository (Kostrikov, 2018), but it does not specify the version numbers for PyTorch or any other software dependencies used in their own experiments. |
| Experiment Setup | Yes | For Sparse ReacherPyBulletEnv-v0 we chose the hyperparameters ρ = 0.1, φ = 0.0, and for Sparse LunarLander-v2 ρ = 0.3, φ = 0.0, which we found to be optimal for both PPO+D and PPO+BC. For behavioral cloning we trained for 3000 learner steps (updates of the policy) with learning rate 10^-5. (A hedged sketch of how ρ and φ enter PPO+D's rollout sampling appears below the table.) |
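
The ρ and φ values reported above control how PPO+D mixes replayed demonstrations and high-value agent trajectories into the on-policy rollout collection (the mechanism summarized as Algorithm 1 in the paper). The following is a minimal Python sketch of that sampling step only, assuming a gym-style environment; the names `collect_rollout`, `demo_buffer`, and `value_buffer` are illustrative and are not taken from the released `ppo_D` code.

```python
import random

def collect_rollout(policy, env, demo_buffer, value_buffer, rho=0.1, phi=0.0):
    """Sketch of PPO+D rollout sampling: with probability rho replay a stored
    demonstration, with probability phi replay a high-value agent trajectory,
    otherwise roll out the current policy in the environment."""
    u = random.random()
    if u < rho and demo_buffer:
        # Replay a demonstration so PPO regularly sees rewarded transitions.
        return random.choice(demo_buffer), "demo"
    if u < rho + phi and value_buffer:
        # Replay a past agent trajectory kept for its high estimated value.
        return random.choice(value_buffer), "value"
    # Standard on-policy rollout with the current policy
    # (gym-style step returning obs, reward, done, info is assumed).
    trajectory = []
    obs, done = env.reset(), False
    while not done:
        action = policy(obs)
        next_obs, reward, done, _ = env.step(action)
        trajectory.append((obs, action, reward))
        obs = next_obs
    return trajectory, "on_policy"
```

Under this reading, the settings quoted in the Experiment Setup row would correspond to `rho=0.1, phi=0.0` for Sparse ReacherPyBulletEnv-v0 and `rho=0.3, phi=0.0` for Sparse LunarLander-v2.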