Policy Optimization with Model-Based Explorations
Authors: Feiyang Pan, Qingpeng Cai, An-Xiang Zeng, Chun-Xiang Pan, Qing Da, Hualin He, Qing He, Pingzhong Tang (pp. 4675-4682)
AAAI 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We verify POME on the Atari 2600 game playing benchmarks. We compare our POME with the original model-free PPO algorithm and a model-based extended version of PPO. Experimental results show that POME outperforms the original PPO on 33 Atari games out of 49. |
| Researcher Affiliation | Collaboration | (1) Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China. {panfeiyang, heqing}@ict.ac.cn (2) University of Chinese Academy of Sciences, Beijing 100049, China. (3) IIIS, Tsinghua University. cqp14@mails.tsinghua.edu.cn, kenshinping@gmail.com (4) Alibaba Group. {renzhong, xuanran}@taobao.com, {daqing.dq, hualin.hhl}@alibaba-inc.com |
| Pseudocode | Yes | Algorithm 1 Policy Optimization with Model-based Explorations (single worker) |
| Open Source Code | No | The paper mentions using 'a standard open-sourced PPO implementation (Dhariwal et al. 2017)' with a GitHub link provided for it, but does not state that the code for their proposed method (POME) is open-source or provide a link to it. |
| Open Datasets | Yes | We verify POME on the Atari 2600 game playing benchmarks. We use the Arcade Learning Environment (Bellemare et al. 2013) benchmarks along with a standard open-sourced PPO implementation (Dhariwal et al. 2017). |
| Dataset Splits | No | The paper mentions 'training for 10M timesteps' and discusses 'minibatch size' and 'rollout trajectory', but does not provide specific training/validation/test dataset splits (percentages, counts, or references to predefined splits). |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions 'a standard open-sourced PPO implementation (Dhariwal et al. 2017)' and 'Adam gradient descent optimizer (Kingma and Ba 2014)', but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | For all the algorithms, the discount factor γ is set to 0.99 and the advantage values are estimated by the k-step error (Eq. 16), where the horizon k is set to be 128. We use 8 actors (workers) to simultaneously run the algorithm and the minibatch size is 128 * 8. ... L = L^POME(θ) + c_v L^V(φ) + c_T L^T(θ^T), which is optimized by the Adam gradient descent optimizer (Kingma and Ba 2014) with learning rate 2.5e-4 · f, where f is a fraction linearly annealed from 1 to 0 over the course of learning, and c_v, c_T are coefficients for tuning the learning rate of the value function and the transition function. In our experiments we set these coefficients to c_v = 1 and c_T = 2. |
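
The quoted setup pins down the loss composition, coefficients, minibatch size, and learning-rate schedule, but not the network code itself. Below is a minimal, hypothetical PyTorch-style sketch of that optimization setup. The placeholder modules and the three loss terms are stand-ins (the paper's actual policy, value, and transition-model definitions are not quoted in the table); only the constants and the linearly annealed Adam learning rate follow the description above.

```python
# Hypothetical sketch of the quoted optimization setup; not the authors' code.
import torch

GAMMA = 0.99            # discount factor (enters the k-step advantage, unused in this sketch)
HORIZON_K = 128         # k-step horizon for advantage estimation
NUM_WORKERS = 8         # parallel actors
MINIBATCH = HORIZON_K * NUM_WORKERS   # 128 * 8, as stated
C_V, C_T = 1.0, 2.0     # coefficients c_v and c_T
BASE_LR = 2.5e-4
TOTAL_TIMESTEPS = 10_000_000          # "training for 10M timesteps"
TOTAL_UPDATES = TOTAL_TIMESTEPS // MINIBATCH


def lr_fraction(update_idx: int) -> float:
    """Fraction f, linearly annealed from 1 to 0 over the course of learning."""
    return max(0.0, 1.0 - update_idx / TOTAL_UPDATES)


# Placeholder modules: the paper's actual architectures are not quoted here.
policy_value_net = torch.nn.Linear(4, 2)
transition_net = torch.nn.Linear(4, 4)

optimizer = torch.optim.Adam(
    list(policy_value_net.parameters()) + list(transition_net.parameters()),
    lr=BASE_LR,
)


def training_step(update_idx, l_pome, l_value, l_transition):
    """One update of L = L^POME + c_v * L^V + c_T * L^T with the annealed rate."""
    for group in optimizer.param_groups:
        group["lr"] = BASE_LR * lr_fraction(update_idx)
    loss = l_pome + C_V * l_value + C_T * l_transition
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()


# Dummy invocation with stand-in losses derived from the placeholder modules.
x = torch.randn(MINIBATCH, 4)
l_pome = -policy_value_net(x).mean()             # stand-in for the POME policy loss
l_value = policy_value_net(x).pow(2).mean()      # stand-in for the value loss L^V
l_trans = (transition_net(x) - x).pow(2).mean()  # stand-in for the transition loss L^T
training_step(update_idx=0, l_pome=l_pome, l_value=l_value, l_transition=l_trans)
```

Rewriting the optimizer's `param_groups` each update is one common way to realize a linear learning-rate schedule; the Baselines PPO implementation the paper builds on likewise recomputes the annealed rate once per update.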