Direct Advantage Estimation
Authors: Hsiao-Ru Pan, Nico Gürtler, Alexander Neitz, Bernhard Schölkopf
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate DAE empirically on three discrete control domains and show that it can outperform generalized advantage estimation (GAE), a strong baseline for advantage estimation, on a majority of the environments when applied to policy optimization. ... We test our method empirically on three discrete domains including (1) a synthetic environment, (2) the MinAtar suite [Young and Tian, 2019] and (3) the Arcade Learning Environment (ALE) [Bellemare et al., 2013], and demonstrate that DAE outperforms Generalized Advantage Estimation (GAE) [Schulman et al., 2015b] on most of them. |
| Researcher Affiliation | Collaboration | Hsiao-Ru Pan¹, Nico Gürtler¹, Alexander Neitz², Bernhard Schölkopf¹ (¹Max Planck Institute for Intelligent Systems, Tübingen; ²DeepMind) |
| Pseudocode | Yes | Algorithm 1 PPO with DAE (shared network) |
| Open Source Code | Yes | Code is available at https://github.com/hrpan/dae. |
| Open Datasets | Yes | the MinAtar [Young and Tian, 2019] suite, a set of environments inspired by Atari games with similar dynamics but simpler observation space, and (3) the Arcade Learning Environment (ALE) [Bellemare et al., 2013] |
| Dataset Splits | No | The paper describes training and evaluation but does not specify explicit dataset split percentages or counts for train, validation, and test sets. It mentions 'training episodes' and 'last 100 training episodes' for metrics, but not data partitioning. |
| Hardware Specification | No | The paper does not provide specific details regarding the hardware used for experiments, such as CPU or GPU models, or cloud computing specifications. |
| Software Dependencies | No | The paper mentions using 'PPO implementation and the tuned hyperparameters from Raffin et al. [2021], Raffin [2020]' (referring to Stable-Baselines3 and RL-Baselines3-Zoo), but it does not specify version numbers for these or other software dependencies. |
| Experiment Setup | Yes | For both methods, we use the same network architectures and hyperparameters to train the agents, see Appendix C for more details. ... For DAE, we tune two of the hyperparameters, namely the scaling coefficient of the value function loss and the number of epochs in each PPO iteration using the MinAtar environments, which are then fixed for the ALE experiments. ... See Appendix C for a more detailed description of the network architectures and hyperparameters. |
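
The "Pseudocode" and "Experiment Setup" rows above reference Algorithm 1 (PPO with DAE on a shared network) and name the two tuned hyperparameters: the scaling coefficient of the value function loss and the number of epochs per PPO iteration. The snippet below is a minimal, hypothetical sketch of what such an update step could look like. The class `SharedActorAdvantage`, the function `ppo_dae_update`, the argument names `vf_coef` and `n_epochs`, and the one-step regression target are illustrative assumptions, not the authors' implementation; the actual objective and architecture are in the paper and the linked repository.

```python
import torch
import torch.nn as nn


class SharedActorAdvantage(nn.Module):
    """Shared-body network with policy, advantage, and value heads (illustrative)."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.policy_head = nn.Linear(hidden, n_actions)  # action logits
        self.adv_head = nn.Linear(hidden, n_actions)     # raw per-action advantages
        self.value_head = nn.Linear(hidden, 1)           # state value V(s)

    def forward(self, obs: torch.Tensor):
        h = self.body(obs)
        logits = self.policy_head(h)
        raw_adv = self.adv_head(h)
        probs = torch.softmax(logits, dim=-1)
        # Centre advantages so that sum_a pi(a|s) * A(s, a) = 0 (the DAE constraint).
        adv = raw_adv - (probs * raw_adv).sum(dim=-1, keepdim=True)
        value = self.value_head(h).squeeze(-1)
        return logits, adv, value


def ppo_dae_update(net, optimizer, batch, n_epochs=4, clip_eps=0.2, vf_coef=0.5):
    """One PPO iteration; n_epochs and vf_coef are the two hyperparameters the table cites."""
    obs, actions, returns, old_log_probs = batch  # tensors of shape (B, obs_dim) / (B,)
    for _ in range(n_epochs):
        logits, adv, value = net(obs)
        dist = torch.distributions.Categorical(logits=logits)
        log_probs = dist.log_prob(actions)
        adv_a = adv.gather(1, actions.unsqueeze(1)).squeeze(1)
        # Clipped PPO policy loss using the learned (centred) advantages.
        ratio = torch.exp(log_probs - old_log_probs)
        clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
        policy_loss = -torch.min(ratio * adv_a.detach(), clipped * adv_a.detach()).mean()
        # Simplified one-step stand-in for the advantage/value regression: fit
        # A(s, a) + V(s) to the observed return. The paper's objective is a
        # multi-step version of this idea; this line is an assumption for brevity.
        value_loss = ((adv_a + value - returns) ** 2).mean()
        loss = policy_loss + vf_coef * value_loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

In this sketch the DAE constraint (the policy-weighted mean of the advantages is zero at every state) is satisfied by construction, because the policy-weighted mean is subtracted from the raw advantage head, so no extra penalty term is needed in the loss.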