Direct Advantage Estimation
Authors: Hsiao-Ru Pan, Nico Gürtler, Alexander Neitz, Bernhard Schölkopf
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate DAE empirically on three discrete control domains and show that it can outperform generalized advantage estimation (GAE), a strong baseline for advantage estimation, on a majority of the environments when applied to policy optimization. ... We test our method empirically on three discrete domains including (1) a synthetic environment, (2) the MinAtar suite [Young and Tian, 2019] and (3) the Arcade Learning Environment (ALE) [Bellemare et al., 2013], and demonstrate that DAE outperforms Generalized Advantage Estimation (GAE) [Schulman et al., 2015b] on most of them. |
| Researcher Affiliation | Collaboration | Hsiao-Ru Pan¹, Nico Gürtler¹, Alexander Neitz², Bernhard Schölkopf¹ (¹Max Planck Institute for Intelligent Systems, Tübingen; ²DeepMind) |
| Pseudocode | Yes | Algorithm 1 PPO with DAE (shared network) |
| Open Source Code | Yes | Code is available at https://github.com/hrpan/dae. |
| Open Datasets | Yes | the MinAtar [Young and Tian, 2019] suite, a set of environments inspired by Atari games with similar dynamics but simpler observation space, and (3) the Arcade Learning Environment (ALE) [Bellemare et al., 2013] |
| Dataset Splits | No | The paper describes training and evaluation but does not specify explicit dataset split percentages or counts for train, validation, and test sets. It mentions 'training episodes' and 'last 100 training episodes' for metrics, but not data partitioning. |
| Hardware Specification | No | The paper does not provide specific details regarding the hardware used for experiments, such as CPU or GPU models, or cloud computing specifications. |
| Software Dependencies | No | The paper mentions using 'PPO implementation and the tuned hyperparameters from Raffin et al. [2021], Raffin [2020]' (referring to Stable-Baselines3 and RL-Baselines3-Zoo), but it does not specify version numbers for these or other software dependencies. |
| Experiment Setup | Yes | For both methods, we use the same network architectures and hyperparameters to train the agents, see Appendix C for more details. ... For DAE, we tune two of the hyperparameters, namely the scaling coefficient of the value function loss and the number of epochs in each PPO iteration using the MinAtar environments, which are then fixed for the ALE experiments. ... See Appendix C for a more detailed description of the network architectures and hyperparameters. |
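
The "Pseudocode" and "Experiment Setup" rows above reference Algorithm 1 (PPO with DAE on a shared network) and name the two tuned hyperparameters: the scaling coefficient of the value function loss and the number of epochs per PPO iteration. The snippet below is a minimal, hypothetical sketch of what such an update step could look like. The class `SharedActorAdvantage`, the function `ppo_dae_update`, the argument names `vf_coef` and `n_epochs`, and the one-step regression target are illustrative assumptions, not the authors' implementation; the actual objective and architecture are in the paper and the linked repository.

```python
import torch
import torch.nn as nn


class SharedActorAdvantage(nn.Module):
    """Shared-body network with policy, advantage, and value heads (illustrative)."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.policy_head = nn.Linear(hidden, n_actions)  # action logits
        self.adv_head = nn.Linear(hidden, n_actions)     # raw per-action advantages
        self.value_head = nn.Linear(hidden, 1)           # state value V(s)

    def forward(self, obs: torch.Tensor):
        h = self.body(obs)
        logits = self.policy_head(h)
        raw_adv = self.adv_head(h)
        probs = torch.softmax(logits, dim=-1)
        # Centre advantages so that sum_a pi(a|s) * A(s, a) = 0 (the DAE constraint).
        adv = raw_adv - (probs * raw_adv).sum(dim=-1, keepdim=True)
        value = self.value_head(h).squeeze(-1)
        return logits, adv, value


def ppo_dae_update(net, optimizer, batch, n_epochs=4, clip_eps=0.2, vf_coef=0.5):
    """One PPO iteration; n_epochs and vf_coef are the two hyperparameters the table cites."""
    obs, actions, returns, old_log_probs = batch  # tensors of shape (B, obs_dim) / (B,)
    for _ in range(n_epochs):
        logits, adv, value = net(obs)
        dist = torch.distributions.Categorical(logits=logits)
        log_probs = dist.log_prob(actions)
        adv_a = adv.gather(1, actions.unsqueeze(1)).squeeze(1)
        # Clipped PPO policy loss using the learned (centred) advantages.
        ratio = torch.exp(log_probs - old_log_probs)
        clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
        policy_loss = -torch.min(ratio * adv_a.detach(), clipped * adv_a.detach()).mean()
        # Simplified one-step stand-in for the advantage/value regression: fit
        # A(s, a) + V(s) to the observed return. The paper's objective is a
        # multi-step version of this idea; this line is an assumption for brevity.
        value_loss = ((adv_a + value - returns) ** 2).mean()
        loss = policy_loss + vf_coef * value_loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

In this sketch the DAE constraint (the policy-weighted mean of the advantages is zero at every state) is satisfied by construction, because the policy-weighted mean is subtracted from the raw advantage head, so no extra penalty term is needed in the loss.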