Divergence-Augmented Policy Optimization

Authors: Qing Wang, Yingru Li, Jiechao Xiong, Tong Zhang

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Empirical experiments on Atari games show that in the data-scarce scenario where the reuse of off-policy data becomes necessary, our method can achieve better performance than other state-of-the-art deep reinforcement learning algorithms."
Researcher Affiliation | Collaboration | Qing Wang (Huya AI, Guangzhou, China); Yingru Li (The Chinese University of Hong Kong, Shenzhen, China); Jiechao Xiong (Tencent AI Lab, Shenzhen, China); Tong Zhang (The Hong Kong University of Science and Technology, Hong Kong, China)
Pseudocode | Yes | Algorithm 1: Divergence-Augmented Policy Optimization (DAPO). See the illustrative loss sketch after this table.
Open Source Code | No | The paper contains no statement or link indicating that source code for the described method is publicly available.
Open Datasets | Yes | "We experiment with the proposed method on the commonly used Atari 2600 environment from Arcade Learning Environment (ALE) (Bellemare et al., 2013)."
Dataset Splits | No | The paper reports training parameters such as batch size and number of roll-outs and notes that experiments were run multiple times, but it gives no explicit training/validation/test splits as percentages or sample counts.
Hardware Specification | No | The paper mentions running experiments with "the same environmental settings and computational resources" but provides no specific hardware details such as GPU or CPU models, memory, or cloud instances.
Software Dependencies | No | The paper states "The algorithm is implemented with TensorFlow (Abadi et al., 2016)" but does not specify a version for TensorFlow or any other software dependency.
Experiment Setup | Yes | "The learning rate is linearly scaled from 1e-3 to 0. The parameters are updated according to a mixture of policy loss and value loss, with the loss scaling coefficient c = 0.5. In calculating multi-step λ-returns R_{s,a} and divergence D_{s,a}, we use fixed λ = 0.9 and γ = 0.99. The batch size is set to 1024, with roll-out length set to 32, resulting in 1024/32 = 32 roll-outs in a batch. The policy π_t and value V_t are updated every 100 iterations (M = 100 in Algorithm 1)." See the hyperparameter sketch after this table.
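
The table only names Algorithm 1, so the snippet below is a minimal sketch of the general idea behind a divergence-augmented policy objective, not the authors' exact Algorithm 1: an importance-weighted policy-gradient surrogate penalized by a per-sample KL-style divergence between the behavior policy and the current policy. All names here (`divergence_augmented_policy_loss`, `logp_new`, `logp_old`, `advantages`, `div_coef`) are illustrative assumptions, not notation from the paper.

```python
import numpy as np

def divergence_augmented_policy_loss(logp_new, logp_old, advantages, div_coef=0.1):
    """Minimal sketch of a divergence-augmented surrogate loss (illustrative only).

    logp_new   : log pi_theta(a|s) under the current policy, shape (batch,)
    logp_old   : log mu(a|s) under the behavior policy that generated the data
    advantages : advantage estimates (e.g. built from lambda-returns), shape (batch,)
    div_coef   : weight on the divergence penalty (hypothetical coefficient)
    """
    ratio = np.exp(logp_new - logp_old)   # importance weights for off-policy data
    surrogate = ratio * advantages        # standard off-policy policy-gradient surrogate
    kl_penalty = logp_old - logp_new      # per-sample estimate of KL(mu || pi_theta)
    # Minimize the negative surrogate plus the divergence penalty; the paper works with
    # a Bregman divergence, of which this KL term is only the most familiar special case.
    return float(np.mean(-surrogate + div_coef * kl_penalty))
```

In the data-scarce regime described in the abstract, a penalty of this kind is what keeps the updated policy close to the behavior policy whose roll-outs are being reused.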
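
To make the quoted hyperparameters concrete, here is a minimal sketch in plain Python/NumPy (rather than the TensorFlow implementation the paper mentions) of how they could fit together: the linear learning-rate decay from 1e-3 to 0, a multi-step λ-return with λ = 0.9 and γ = 0.99 over roll-outs of length 32, and the loss mixture with c = 0.5. Only the numeric constants come from the quoted setup; the helper names, the total iteration count `TOTAL_ITERS`, and the use of the plain λ-return (rather than the paper's divergence-corrected R_{s,a}) are assumptions.

```python
import numpy as np

# Constants quoted from the paper's experiment setup.
LAMBDA, GAMMA = 0.9, 0.99             # fixed lambda and discount factor
BATCH_SIZE, ROLLOUT_LEN = 1024, 32    # 1024 / 32 = 32 roll-outs per batch
VALUE_LOSS_COEF = 0.5                 # c in: loss = policy_loss + c * value_loss
TARGET_UPDATE_PERIOD = 100            # M = 100: refresh pi_t and V_t every 100 iterations
LR_START = 1e-3
TOTAL_ITERS = 100_000                 # assumed horizon for the linear decay (not stated)

def linear_lr(step):
    """Learning rate linearly scaled from 1e-3 down to 0 over training."""
    return LR_START * max(0.0, 1.0 - step / TOTAL_ITERS)

def lambda_returns(rewards, values, bootstrap_value):
    """Plain multi-step lambda-return for one roll-out of length ROLLOUT_LEN.

    rewards, values : arrays of shape (ROLLOUT_LEN,), with values[t] = V(s_t)
    bootstrap_value : V(s_T) for the state following the roll-out
    """
    returns = np.zeros(len(rewards), dtype=np.float64)
    next_return, next_value = bootstrap_value, bootstrap_value
    for t in reversed(range(len(rewards))):
        # G_t = r_t + gamma * ((1 - lambda) * V(s_{t+1}) + lambda * G_{t+1})
        returns[t] = rewards[t] + GAMMA * ((1 - LAMBDA) * next_value + LAMBDA * next_return)
        next_return, next_value = returns[t], values[t]
    return returns

def total_loss(policy_loss, value_loss):
    """Mixture of policy and value losses with scaling coefficient c = 0.5."""
    return policy_loss + VALUE_LOSS_COEF * value_loss
```

With these settings, one training batch of 1024 transitions corresponds to 32 roll-outs of 32 steps each, and the policy π_t and value V_t used in the returns are refreshed every 100 iterations.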