DNA: Proximal Policy Optimization with a Dual Network Architecture

Authors: Matthew Aitchison, Penny Sweetser

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This paper explores the problem of simultaneously learning a value function and policy in deep actor-critic reinforcement learning models. We find that the common practice of learning these functions jointly is sub-optimal due to an order-of-magnitude difference in noise levels between the two tasks. Instead, we show that learning these tasks independently, but with a constrained distillation phase, significantly improves performance. Furthermore, we find that policy gradient noise levels decrease when using a lower-variance return estimate, whereas value-learning noise levels decrease with a lower-bias estimate. Together, these insights inform an extension to Proximal Policy Optimization we call Dual Network Architecture (DNA), which significantly outperforms its predecessor. DNA also exceeds the performance of the popular Rainbow DQN algorithm on four of the five environments tested, even under more difficult stochastic control settings.
Researcher Affiliation | Academia | Matthew Aitchison, The Australian National University (matthew.aitchison@anu.edu.au); Penny Sweetser, The Australian National University.
Pseudocode | Yes | Algorithm 1: Proximal Policy Optimization with Dual Network Architecture.
Open Source Code | Yes | The source code used to generate the results in this paper is provided in the supplementary material. We also provide an implementation of our algorithm at https://github.com/maitchison/PPO/tree/DNA.
Open Datasets | Yes | To evaluate our algorithm's performance, we used the Atari-5 benchmark [2]. Scores in Atari-5 are generated using a weighted geometric average over five specific games and correlate well with the median score that would have been obtained had all 57 games been evaluated. This allowed us to perform multiple seeded runs and defined a clear training and test split between the games. In all cases, we fit hyperparameters to the 3-game validation set and only used the 5-game test set for final evaluations.
Dataset Splits | Yes | In all cases, we fit hyperparameters to the 3-game validation set and only used the 5-game test set for final evaluations.
Hardware Specification | No | The paper does not specify the hardware (GPU/CPU models, memory, etc.) used for the experiments.
Software Dependencies | No | The paper mentions using Adam [21] for optimization but does not provide version numbers for software dependencies or libraries (e.g., Python, PyTorch, or TensorFlow).
Experiment Setup | Yes | A coarse hyperparameter sweep found initial hyperparameters for our model on the Atari-3 validation set. Notably, we found the optimal mini-batch size for value and distillation to be the minimum tested (256), while the optimal mini-batch size for policy was the largest tested (2048). For optimization, we used Adam [21] over the standard 200 million frames. Full hyperparameter details for our experiments are given in Appendix B.
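The Research Type and Pseudocode rows above summarise DNA's core idea: update the policy network and a separate value network on differently tuned return estimates, then distill the value network's estimates back into the policy network while a constraint keeps the policy distribution itself from moving. The sketch below is a minimal, hypothetical rendering of those three phases in PyTorch; the network interface (a policy network returning logits and a value head), the epoch counts, and the distillation weight `beta` are illustrative assumptions, not the authors' implementation (see Algorithm 1 and the linked repository for that).

```python
import torch
import torch.nn.functional as F


def policy_phase(policy_net, opt, obs, actions, old_log_probs, advantages,
                 clip_eps=0.2, mini_batch=2048, epochs=2):
    """PPO clipped-objective update of the policy network (large mini-batches)."""
    for _ in range(epochs):
        for idx in torch.randperm(len(obs)).split(mini_batch):
            logits, _ = policy_net(obs[idx])   # assumed interface: net returns (logits, value)
            dist = torch.distributions.Categorical(logits=logits)
            ratio = torch.exp(dist.log_prob(actions[idx]) - old_log_probs[idx])
            adv = advantages[idx]
            loss = -torch.min(ratio * adv,
                              torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()


def value_phase(value_net, opt, obs, value_targets, mini_batch=256, epochs=2):
    """Fit the separate value network to its (lower-bias) return targets."""
    for _ in range(epochs):
        for idx in torch.randperm(len(obs)).split(mini_batch):
            loss = F.mse_loss(value_net(obs[idx]).squeeze(-1), value_targets[idx])
            opt.zero_grad()
            loss.backward()
            opt.step()


def distillation_phase(policy_net, opt, obs, old_logits, value_estimates,
                       beta=1.0, mini_batch=256, epochs=2):
    """Distil the value network's estimates into the policy network's value head,
    with a KL penalty that keeps the policy distribution (approximately) unchanged."""
    for _ in range(epochs):
        for idx in torch.randperm(len(obs)).split(mini_batch):
            logits, values = policy_net(obs[idx])
            value_loss = F.mse_loss(values.squeeze(-1), value_estimates[idx])
            kl = F.kl_div(F.log_softmax(logits, dim=-1),
                          F.log_softmax(old_logits[idx], dim=-1),
                          log_target=True, reduction="batchmean")
            opt.zero_grad()
            (value_loss + beta * kl).backward()
            opt.step()
```

The mini-batch defaults mirror the asymmetry reported in the Experiment Setup row (2048 for the policy phase, 256 for the value and distillation phases); everything else in the sketch is a guess at a reasonable shape for the update, not the paper's exact loss weighting or scheduling.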
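The Open Datasets row states that Atari-5 reports a single number computed as a weighted geometric average over five games, chosen so that it tracks the median over the full 57-game suite. The sketch below shows that aggregation under stated assumptions: the game keys and weights are placeholders only, since the actual five-game subset and fitted coefficients are defined in the Atari-5 paper [2].

```python
import math

# Placeholder games and weights: the real five-game subset and its fitted
# coefficients come from the Atari-5 paper [2], not from this sketch.
ATARI5_WEIGHTS = {"game_1": 1.0, "game_2": 1.0, "game_3": 1.0,
                  "game_4": 1.0, "game_5": 1.0}


def atari5_score(normalised_scores):
    """Weighted geometric mean: exp( sum_i w_i * log(s_i) / sum_i w_i )."""
    log_sum = sum(w * math.log(max(normalised_scores[g], 1e-8))
                  for g, w in ATARI5_WEIGHTS.items())
    return math.exp(log_sum / sum(ATARI5_WEIGHTS.values()))


# With equal placeholder weights this reduces to the plain geometric mean.
print(atari5_score({f"game_{i}": s for i, s in enumerate([1.2, 0.8, 2.0, 0.5, 1.5], 1)}))
```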
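Finally, the Experiment Setup row pins down only a handful of settings. The snippet below collects just those quoted values into a hypothetical configuration dictionary; every other hyperparameter (learning rate, rollout length, the two lambda values, and so on) is deferred to Appendix B of the paper and deliberately not guessed at here.

```python
# Only the settings quoted in the Experiment Setup row; everything else is in
# Appendix B of the paper and intentionally omitted.
DNA_CONFIG = {
    "optimizer": "adam",             # Adam [21]
    "total_frames": 200_000_000,     # the standard 200 million Atari frames
    "policy_mini_batch": 2048,       # largest size tested was best for the policy
    "value_mini_batch": 256,         # smallest size tested was best for value...
    "distill_mini_batch": 256,       # ...and for distillation
}
```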