The Successful Ingredients of Policy Gradient Algorithms
Authors: Sven Gronauer, Martin Gottwald, Klaus Diepold
IJCAI 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To allow an equitable assessment, we conduct our experiments based on a unified and modular implementation. Our results underline the significance of recent algorithmic advances and demonstrate that reaching state-of-the-art performance may not need sophisticated algorithms but can also be accomplished by the combination of a few simple ingredients. In this paper, we empirically study the significance of specific ingredients... |
| Researcher Affiliation | Academia | Sven Gronauer, Martin Gottwald, Klaus Diepold Technical University of Munich, Germany {sven.gronauer, martin.gottwald, kldi}@tum.de |
| Pseudocode | No | The paper describes algorithms and methods using prose and mathematical equations, but it does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | For the supplemental materials and the implementation see: https://github.com/SvenGronauer/successful-ingredients-paper |
| Open Datasets | Yes | To benchmark the performance in continuous control problems, we use the five locomotion environments HalfCheetah, Hopper, Ant, Walker2D, and Humanoid as well as the three robotic manipulation tasks Reacher, Pusher, and Kuka. All eight tasks are evaluated in the PyBullet physics engine [Coumans and Bai, 2016]. |
| Dataset Splits | No | The paper describes collecting batches of environment interactions for training, as is typical in reinforcement learning, but it does not specify a fixed train/validation split with percentages or sample counts in the way supervised learning commonly does. |
| Hardware Specification | No | The paper does not explicitly provide specific hardware details, such as GPU or CPU models, used for running the experiments. |
| Software Dependencies | No | The paper mentions the 'PyBullet physics engine [Coumans and Bai, 2016]' and 'Adam [Kingma and Ba, 2015]' as an optimizer, but it does not provide specific version numbers for PyBullet or any other software dependencies crucial for replication. |
| Experiment Setup | Yes | We aligned the hyper-parameters to the ones suggested in Henderson et al. [2018]. We applied a single-learner setup, used a discount factor γ = 0.99, collected batches of size 32000 for each policy iteration, and ran each seed over a total of 10^7 environment interactions. For the neural networks, we used the same structure for both policy and value networks, i.e. multi-layer perceptrons with two hidden layers consisting of 64 neurons each followed by tanh non-linearities. The default optimizer was Adam [Kingma and Ba, 2015] for both the policy and value networks. Our studied ingredients are only applied to the policy network but not to the value network. |
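To make the reported setup concrete, the snippet below is a minimal PyTorch sketch of the described networks and hyper-parameters, using one of the PyBullet tasks listed under "Open Datasets". The environment id, learning rates, and helper names are illustrative assumptions, not taken from the authors' implementation.

```python
import gym
import pybullet_envs  # registers the Bullet locomotion/manipulation tasks with Gym
import torch
import torch.nn as nn

# Hyper-parameters as reported under "Experiment Setup"
GAMMA = 0.99             # discount factor
BATCH_SIZE = 32_000      # environment steps collected per policy iteration
TOTAL_STEPS = 10 ** 7    # environment interactions per seed
HIDDEN_SIZES = (64, 64)  # two hidden layers for both policy and value network

def build_mlp(in_dim: int, out_dim: int, hidden=HIDDEN_SIZES) -> nn.Sequential:
    """Multi-layer perceptron with tanh non-linearities, matching the paper's description."""
    layers, last = [], in_dim
    for h in hidden:
        layers += [nn.Linear(last, h), nn.Tanh()]
        last = h
    layers.append(nn.Linear(last, out_dim))
    return nn.Sequential(*layers)

# One of the eight PyBullet tasks named in the paper (environment id assumed to
# follow the standard pybullet_envs naming scheme).
env = gym.make("HalfCheetahBulletEnv-v0")
obs_dim = env.observation_space.shape[0]
act_dim = env.action_space.shape[0]

policy_net = build_mlp(obs_dim, act_dim)  # e.g. mean of a Gaussian policy
value_net = build_mlp(obs_dim, 1)         # state-value estimate

# Adam for both networks; the learning rates below are placeholders, not values
# reported in the paper. The studied ingredients are applied to the policy
# network only, not to the value network.
policy_optim = torch.optim.Adam(policy_net.parameters(), lr=3e-4)
value_optim = torch.optim.Adam(value_net.parameters(), lr=3e-4)
```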