Divergence-Augmented Policy Optimization

Authors: Qing Wang, Yingru Li, Jiechao Xiong, Tong Zhang

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Empirical experiments on Atari games show that in the data-scarce scenario where the reuse of off-policy data becomes necessary, our method can achieve better performance than other state-of-the-art deep reinforcement learning algorithms."
Researcher Affiliation | Collaboration | Qing Wang (Huya AI, Guangzhou, China); Yingru Li (The Chinese University of Hong Kong, Shenzhen, China); Jiechao Xiong (Tencent AI Lab, Shenzhen, China); Tong Zhang (The Hong Kong University of Science and Technology, Hong Kong, China)
Pseudocode | Yes | Algorithm 1: Divergence-Augmented Policy Optimization (DAPO). See the illustrative loss sketch after this table.
Open Source Code | No | The paper contains no statement or link indicating that source code for the described method is publicly available.
Open Datasets | Yes | "We experiment with the proposed method on the commonly used Atari 2600 environment from Arcade Learning Environment (ALE) (Bellemare et al., 2013)."
Dataset Splits | No | The paper reports training parameters such as batch size and number of roll-outs and notes that experiments were run multiple times, but it gives no explicit training/validation/test splits as percentages or sample counts.
Hardware Specification | No | The paper mentions running experiments with "the same environmental settings and computational resources" but provides no specific hardware details such as GPU or CPU models, memory, or cloud instances.
Software Dependencies | No | The paper states "The algorithm is implemented with TensorFlow (Abadi et al., 2016)" but does not specify a version for TensorFlow or any other software dependency.
Experiment Setup | Yes | "The learning rate is linearly scaled from 1e-3 to 0. The parameters are updated according to a mixture of policy loss and value loss, with the loss scaling coefficient c = 0.5. In calculating multi-step λ-returns R_{s,a} and divergence D_{s,a}, we use fixed λ = 0.9 and γ = 0.99. The batch size is set to 1024, with roll-out length set to 32, resulting in 1024/32 = 32 roll-outs in a batch. The policy π_t and value V_t are updated every 100 iterations (M = 100 in Algorithm 1)." See the hyperparameter sketch after this table.
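
The table only names Algorithm 1, so the snippet below is a minimal sketch of the general idea behind a divergence-augmented policy objective, not the authors' exact Algorithm 1: an importance-weighted policy-gradient surrogate penalized by a per-sample KL-style divergence between the behavior policy and the current policy. All names here (`divergence_augmented_policy_loss`, `logp_new`, `logp_old`, `advantages`, `div_coef`) are illustrative assumptions, not notation from the paper.

```python
import numpy as np

def divergence_augmented_policy_loss(logp_new, logp_old, advantages, div_coef=0.1):
    """Minimal sketch of a divergence-augmented surrogate loss (illustrative only).

    logp_new   : log pi_theta(a|s) under the current policy, shape (batch,)
    logp_old   : log mu(a|s) under the behavior policy that generated the data
    advantages : advantage estimates (e.g. built from lambda-returns), shape (batch,)
    div_coef   : weight on the divergence penalty (hypothetical coefficient)
    """
    ratio = np.exp(logp_new - logp_old)   # importance weights for off-policy data
    surrogate = ratio * advantages        # standard off-policy policy-gradient surrogate
    kl_penalty = logp_old - logp_new      # per-sample estimate of KL(mu || pi_theta)
    # Minimize the negative surrogate plus the divergence penalty; the paper works with
    # a Bregman divergence, of which this KL term is only the most familiar special case.
    return float(np.mean(-surrogate + div_coef * kl_penalty))
```

In the data-scarce regime described in the abstract, a penalty of this kind is what keeps the updated policy close to the behavior policy whose roll-outs are being reused.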
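
To make the quoted hyperparameters concrete, here is a minimal sketch in plain Python/NumPy (rather than the TensorFlow implementation the paper mentions) of how they could fit together: the linear learning-rate decay from 1e-3 to 0, a multi-step λ-return with λ = 0.9 and γ = 0.99 over roll-outs of length 32, and the loss mixture with c = 0.5. Only the numeric constants come from the quoted setup; the helper names, the total iteration count `TOTAL_ITERS`, and the use of the plain λ-return (rather than the paper's divergence-corrected R_{s,a}) are assumptions.

```python
import numpy as np

# Constants quoted from the paper's experiment setup.
LAMBDA, GAMMA = 0.9, 0.99             # fixed lambda and discount factor
BATCH_SIZE, ROLLOUT_LEN = 1024, 32    # 1024 / 32 = 32 roll-outs per batch
VALUE_LOSS_COEF = 0.5                 # c in: loss = policy_loss + c * value_loss
TARGET_UPDATE_PERIOD = 100            # M = 100: refresh pi_t and V_t every 100 iterations
LR_START = 1e-3
TOTAL_ITERS = 100_000                 # assumed horizon for the linear decay (not stated)

def linear_lr(step):
    """Learning rate linearly scaled from 1e-3 down to 0 over training."""
    return LR_START * max(0.0, 1.0 - step / TOTAL_ITERS)

def lambda_returns(rewards, values, bootstrap_value):
    """Plain multi-step lambda-return for one roll-out of length ROLLOUT_LEN.

    rewards, values : arrays of shape (ROLLOUT_LEN,), with values[t] = V(s_t)
    bootstrap_value : V(s_T) for the state following the roll-out
    """
    returns = np.zeros(len(rewards), dtype=np.float64)
    next_return, next_value = bootstrap_value, bootstrap_value
    for t in reversed(range(len(rewards))):
        # G_t = r_t + gamma * ((1 - lambda) * V(s_{t+1}) + lambda * G_{t+1})
        returns[t] = rewards[t] + GAMMA * ((1 - LAMBDA) * next_value + LAMBDA * next_return)
        next_return, next_value = returns[t], values[t]
    return returns

def total_loss(policy_loss, value_loss):
    """Mixture of policy and value losses with scaling coefficient c = 0.5."""
    return policy_loss + VALUE_LOSS_COEF * value_loss
```

With these settings, one training batch of 1024 transitions corresponds to 32 roll-outs of 32 steps each, and the policy π_t and value V_t used in the returns are refreshed every 100 iterations.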