The Successful Ingredients of Policy Gradient Algorithms
Authors: Sven Gronauer, Martin Gottwald, Klaus Diepold
IJCAI 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To allow an equitable assessment, we conduct our experiments based on a unified and modular implementation. Our results underline the significance of recent algorithmic advances and demonstrate that reaching state-of-the-art performance may not need sophisticated algorithms but can also be accomplished by the combination of a few simple ingredients. In this paper, we empirically study the significance of specific ingredients... |
| Researcher Affiliation | Academia | Sven Gronauer, Martin Gottwald, Klaus Diepold Technical University of Munich, Germany {sven.gronauer, martin.gottwald, kldi}@tum.de |
| Pseudocode | No | The paper describes algorithms and methods using prose and mathematical equations, but it does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | For the supplemental materials and the implementation see: https://github.com/SvenGronauer/successful-ingredients-paper |
| Open Datasets | Yes | To benchmark the performance in continuous control problems, we use the five locomotion environments HalfCheetah, Hopper, Ant, Walker2D, and Humanoid as well as the three robotic manipulation tasks Reacher, Pusher, and Kuka. All eight tasks are evaluated in the PyBullet physics engine [Coumans and Bai, 2016]. |
| Dataset Splits | No | The paper describes collecting batches of environment interactions for training, as is typical in reinforcement learning, but it does not specify a fixed train/validation split with percentages or sample counts in the way supervised learning commonly does. |
| Hardware Specification | No | The paper does not explicitly provide specific hardware details, such as GPU or CPU models, used for running the experiments. |
| Software Dependencies | No | The paper mentions the 'PyBullet physics engine [Coumans and Bai, 2016]' and 'Adam [Kingma and Ba, 2015]' as an optimizer, but it does not provide specific version numbers for PyBullet or any other software dependencies crucial for replication. |
| Experiment Setup | Yes | We aligned the hyper-parameters to the ones suggested in Henderson et al. [2018]. We applied a single-learner setup, used a discount factor γ = 0.99, collected batches of size 32000 for each policy iteration, and ran each seed over a total of 10^7 environment interactions. For the neural networks, we used the same structure for both policy and value networks, i.e. multi-layer perceptrons with two hidden layers consisting of 64 neurons each followed by tanh non-linearities. The default optimizer was Adam [Kingma and Ba, 2015] for both the policy and value networks. Our studied ingredients are only applied to the policy network but not to the value network. |
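To make the reported setup concrete, the snippet below is a minimal PyTorch sketch of the described networks and hyper-parameters, using one of the PyBullet tasks listed under "Open Datasets". The environment id, learning rates, and helper names are illustrative assumptions, not taken from the authors' implementation.

```python
import gym
import pybullet_envs  # registers the Bullet locomotion/manipulation tasks with Gym
import torch
import torch.nn as nn

# Hyper-parameters as reported under "Experiment Setup"
GAMMA = 0.99             # discount factor
BATCH_SIZE = 32_000      # environment steps collected per policy iteration
TOTAL_STEPS = 10 ** 7    # environment interactions per seed
HIDDEN_SIZES = (64, 64)  # two hidden layers for both policy and value network

def build_mlp(in_dim: int, out_dim: int, hidden=HIDDEN_SIZES) -> nn.Sequential:
    """Multi-layer perceptron with tanh non-linearities, matching the paper's description."""
    layers, last = [], in_dim
    for h in hidden:
        layers += [nn.Linear(last, h), nn.Tanh()]
        last = h
    layers.append(nn.Linear(last, out_dim))
    return nn.Sequential(*layers)

# One of the eight PyBullet tasks named in the paper (environment id assumed to
# follow the standard pybullet_envs naming scheme).
env = gym.make("HalfCheetahBulletEnv-v0")
obs_dim = env.observation_space.shape[0]
act_dim = env.action_space.shape[0]

policy_net = build_mlp(obs_dim, act_dim)  # e.g. mean of a Gaussian policy
value_net = build_mlp(obs_dim, 1)         # state-value estimate

# Adam for both networks; the learning rates below are placeholders, not values
# reported in the paper. The studied ingredients are applied to the policy
# network only, not to the value network.
policy_optim = torch.optim.Adam(policy_net.parameters(), lr=3e-4)
value_optim = torch.optim.Adam(value_net.parameters(), lr=3e-4)
```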