DNA: Proximal Policy Optimization with a Dual Network Architecture

Authors: Matthew Aitchison, Penny Sweetser

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This paper explores the problem of simultaneously learning a value function and policy in deep actor-critic reinforcement learning models. We find that the common practice of learning these functions jointly is sub-optimal due to an order-of-magnitude difference in noise levels between the two tasks. Instead, we show that learning these tasks independently, but with a constrained distillation phase, significantly improves performance. Furthermore, we find that policy gradient noise levels decrease when using a lower-variance return estimate, whereas value-learning noise levels decrease with a lower-bias estimate. Together, these insights inform an extension to Proximal Policy Optimization we call Dual Network Architecture (DNA), which significantly outperforms its predecessor. DNA also exceeds the performance of the popular Rainbow DQN algorithm on four of the five environments tested, even under more difficult stochastic control settings.
Researcher Affiliation | Academia | Matthew Aitchison, The Australian National University (matthew.aitchison@anu.edu.au); Penny Sweetser, The Australian National University.
Pseudocode | Yes | Algorithm 1: Proximal Policy Optimization with Dual Network Architecture.
Open Source Code | Yes | The source code used to generate the results in this paper is provided in the supplementary material. We also provide an implementation of our algorithm at https://github.com/maitchison/PPO/tree/DNA.
Open Datasets | Yes | To evaluate our algorithm's performance, we used the Atari-5 benchmark [2]. Scores in Atari-5 are generated using a weighted geometric average over five specific games and correlate well with the median score that would have been obtained had all 57 games been evaluated. This allowed us to perform multiple seeded runs and defined a clear training and test split between the games. In all cases, we fit hyperparameters to the 3-game validation set and only used the 5-game test set for final evaluations.
Dataset Splits | Yes | In all cases, we fit hyperparameters to the 3-game validation set and only used the 5-game test set for final evaluations.
Hardware Specification | No | The paper does not specify the hardware (GPU/CPU models, memory, etc.) used for the experiments.
Software Dependencies | No | The paper mentions using Adam [21] for optimization but does not provide version numbers for software dependencies or libraries (e.g., Python, PyTorch, or TensorFlow).
Experiment Setup | Yes | A coarse hyperparameter sweep found initial hyperparameters for our model on the Atari-3 validation set. Notably, we found the optimal mini-batch size for value and distillation to be the minimum tested (256), while the optimal mini-batch size for policy was the largest tested (2048). For optimization, we used Adam [21] over the standard 200 million frames. Full hyperparameter details for our experiments are given in Appendix B.
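The Research Type and Pseudocode rows above summarise DNA's core idea: update the policy network and a separate value network on differently tuned return estimates, then distill the value network's estimates back into the policy network while a constraint keeps the policy distribution itself from moving. The sketch below is a minimal, hypothetical rendering of those three phases in PyTorch; the network interface (a policy network returning logits and a value head), the epoch counts, and the distillation weight `beta` are illustrative assumptions, not the authors' implementation (see Algorithm 1 and the linked repository for that).

```python
import torch
import torch.nn.functional as F


def policy_phase(policy_net, opt, obs, actions, old_log_probs, advantages,
                 clip_eps=0.2, mini_batch=2048, epochs=2):
    """PPO clipped-objective update of the policy network (large mini-batches)."""
    for _ in range(epochs):
        for idx in torch.randperm(len(obs)).split(mini_batch):
            logits, _ = policy_net(obs[idx])   # assumed interface: net returns (logits, value)
            dist = torch.distributions.Categorical(logits=logits)
            ratio = torch.exp(dist.log_prob(actions[idx]) - old_log_probs[idx])
            adv = advantages[idx]
            loss = -torch.min(ratio * adv,
                              torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()


def value_phase(value_net, opt, obs, value_targets, mini_batch=256, epochs=2):
    """Fit the separate value network to its (lower-bias) return targets."""
    for _ in range(epochs):
        for idx in torch.randperm(len(obs)).split(mini_batch):
            loss = F.mse_loss(value_net(obs[idx]).squeeze(-1), value_targets[idx])
            opt.zero_grad()
            loss.backward()
            opt.step()


def distillation_phase(policy_net, opt, obs, old_logits, value_estimates,
                       beta=1.0, mini_batch=256, epochs=2):
    """Distil the value network's estimates into the policy network's value head,
    with a KL penalty that keeps the policy distribution (approximately) unchanged."""
    for _ in range(epochs):
        for idx in torch.randperm(len(obs)).split(mini_batch):
            logits, values = policy_net(obs[idx])
            value_loss = F.mse_loss(values.squeeze(-1), value_estimates[idx])
            kl = F.kl_div(F.log_softmax(logits, dim=-1),
                          F.log_softmax(old_logits[idx], dim=-1),
                          log_target=True, reduction="batchmean")
            opt.zero_grad()
            (value_loss + beta * kl).backward()
            opt.step()
```

The mini-batch defaults mirror the asymmetry reported in the Experiment Setup row (2048 for the policy phase, 256 for the value and distillation phases); everything else in the sketch is a guess at a reasonable shape for the update, not the paper's exact loss weighting or scheduling.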
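The Open Datasets row states that Atari-5 reports a single number computed as a weighted geometric average over five games, chosen so that it tracks the median over the full 57-game suite. The sketch below shows that aggregation under stated assumptions: the game keys and weights are placeholders only, since the actual five-game subset and fitted coefficients are defined in the Atari-5 paper [2].

```python
import math

# Placeholder games and weights: the real five-game subset and its fitted
# coefficients come from the Atari-5 paper [2], not from this sketch.
ATARI5_WEIGHTS = {"game_1": 1.0, "game_2": 1.0, "game_3": 1.0,
                  "game_4": 1.0, "game_5": 1.0}


def atari5_score(normalised_scores):
    """Weighted geometric mean: exp( sum_i w_i * log(s_i) / sum_i w_i )."""
    log_sum = sum(w * math.log(max(normalised_scores[g], 1e-8))
                  for g, w in ATARI5_WEIGHTS.items())
    return math.exp(log_sum / sum(ATARI5_WEIGHTS.values()))


# With equal placeholder weights this reduces to the plain geometric mean.
print(atari5_score({f"game_{i}": s for i, s in enumerate([1.2, 0.8, 2.0, 0.5, 1.5], 1)}))
```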
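Finally, the Experiment Setup row pins down only a handful of settings. The snippet below collects just those quoted values into a hypothetical configuration dictionary; every other hyperparameter (learning rate, rollout length, the two lambda values, and so on) is deferred to Appendix B of the paper and deliberately not guessed at here.

```python
# Only the settings quoted in the Experiment Setup row; everything else is in
# Appendix B of the paper and intentionally omitted.
DNA_CONFIG = {
    "optimizer": "adam",             # Adam [21]
    "total_frames": 200_000_000,     # the standard 200 million Atari frames
    "policy_mini_batch": 2048,       # largest size tested was best for the policy
    "value_mini_batch": 256,         # smallest size tested was best for value...
    "distill_mini_batch": 256,       # ...and for distillation
}
```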