Guided Exploration with Proximal Policy Optimization using a Single Demonstration
Authors: Gabriele Libardi, Gianni De Fabritiis, Sebastian Dittert
ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We train an agent on a combination of demonstrations and its own experience to solve problems with variable initial conditions, and we integrate it with proximal policy optimization (PPO). We finally compare variations of this algorithm to different imitation learning algorithms on a set of hard-exploration tasks in the Animal-AI Olympics environment. To test this new algorithm we created a benchmark of hard-exploration problems of varying levels of difficulty using the Animal-AI Olympics challenge environment (Beyret et al., 2019; Crosby et al., 2019). We also included some experiments on tasks that have already been extensively studied in the literature: the ReacherPyBulletEnv-v0 (Coumans & Bai, 2017) and LunarLander-v2 (Brockman et al., 2016) tasks. |
| Researcher Affiliation | Academia | 1 Computational Science Laboratory, Universitat Pompeu Fabra (UPF); 2 ICREA. Correspondence to: Gabriele Libardi <gabrielelibardi@yahoo.it>, Gianni De Fabritiis <gianni.defabritiis@upf.edu>. |
| Pseudocode | Yes | Algorithm 1 PPO+D |
| Open Source Code | Yes | The source code is available at https://github.com/compsciencelab/ppo_D. |
| Open Datasets | Yes | To test this new algorithm we created a benchmark of hard-exploration problems of varying levels of difficulty using the Animal-AI Olympics challenge environment (Beyret et al., 2019; Crosby et al., 2019). We also included some experiments on tasks that have already been extensively studied in the literature: the ReacherPyBulletEnv-v0 (Coumans & Bai, 2017) and LunarLander-v2 (Brockman et al., 2016) tasks. |
| Dataset Splits | No | The paper describes using different initial conditions and training with varying hyperparameters but does not provide explicit train/validation/test dataset splits (e.g., percentages or counts) or refer to standard predefined splits for specific datasets beyond using the environments/tasks themselves. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments (e.g., GPU models, CPU types, or memory specifications). |
| Software Dependencies | No | The paper mentions 'Pytorch implementations of reinforcement learning algorithms' and cites a GitHub repository (Kostrikov, 2018), but it does not specify the version numbers for PyTorch or any other software dependencies used in their own experiments. |
| Experiment Setup | Yes | For Sparse ReacherPyBulletEnv-v0 we chose the hyperparameters ρ = 0.1, φ = 0.0, and for Sparse LunarLander-v2 ρ = 0.3, φ = 0.0, which we found to be optimal for both PPO+D and PPO+BC. For behavioral cloning we trained for 3000 learner steps (updates of the policy) with learning rate 10^-5. (A hedged sketch of how ρ and φ enter PPO+D's rollout sampling appears below the table.) |
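
The ρ and φ values reported above control how PPO+D mixes replayed demonstrations and high-value agent trajectories into the on-policy rollout collection (the mechanism summarized as Algorithm 1 in the paper). The following is a minimal Python sketch of that sampling step only, assuming a gym-style environment; the names `collect_rollout`, `demo_buffer`, and `value_buffer` are illustrative and are not taken from the released `ppo_D` code.

```python
import random

def collect_rollout(policy, env, demo_buffer, value_buffer, rho=0.1, phi=0.0):
    """Sketch of PPO+D rollout sampling: with probability rho replay a stored
    demonstration, with probability phi replay a high-value agent trajectory,
    otherwise roll out the current policy in the environment."""
    u = random.random()
    if u < rho and demo_buffer:
        # Replay a demonstration so PPO regularly sees rewarded transitions.
        return random.choice(demo_buffer), "demo"
    if u < rho + phi and value_buffer:
        # Replay a past agent trajectory kept for its high estimated value.
        return random.choice(value_buffer), "value"
    # Standard on-policy rollout with the current policy
    # (gym-style step returning obs, reward, done, info is assumed).
    trajectory = []
    obs, done = env.reset(), False
    while not done:
        action = policy(obs)
        next_obs, reward, done, _ = env.step(action)
        trajectory.append((obs, action, reward))
        obs = next_obs
    return trajectory, "on_policy"
```

Under this reading, the settings quoted in the Experiment Setup row would correspond to `rho=0.1, phi=0.0` for Sparse ReacherPyBulletEnv-v0 and `rho=0.3, phi=0.0` for Sparse LunarLander-v2.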